
MapReduce — a Major Step Backwards?

ScuttleMonkey posted more than 6 years ago | from the angry-dbas-are-never-a-good-thing dept.

Databases 157

The Database Column has an interesting, if negative, look at MapReduce and what it means for the database community. MapReduce is a software framework developed by Google to handle parallel computations over large data sets on cheap or unreliable clusters of computers. "As both educators and researchers, we are amazed at the hype that the MapReduce proponents have spread about how it represents a paradigm shift in the development of scalable, data-intensive applications. MapReduce may be a good idea for writing certain types of general-purpose computations, but to the database community, it is: a giant step backward in the programming paradigm for large-scale data intensive applications; a sub-optimal implementation, in that it uses brute force instead of indexing; not novel at all -- it represents a specific implementation of well known techniques developed nearly 25 years ago; missing most of the features that are routinely included in current DBMS; incompatible with all of the tools DBMS users have come to depend on."


157 comments


may be missing the (data)points (5, Insightful)

yagu (721525) | more than 6 years ago | (#22100086)

I don't know why this article is so harshly critical of MapReduce. They base their critique and criticism on the following five tenets, which they further elaborate in detail in the article:

  1. A giant step backward in the programming paradigm for large-scale data intensive applications
  2. A sub-optimal implementation, in that it uses brute force instead of indexing
  3. Not novel at all -- it represents a specific implementation of well known techniques developed nearly 25 years ago
  4. Missing most of the features that are routinely included in current DBMS
  5. Incompatible with all of the tools DBMS users have come to depend on

If you take the time to read the article you'll find they use axiomatic arguments with lemmas like: "schemas are good", and "Separation of the schema from the application is good", etc. First, they make the assumption that these points are relevant and germane to MapReduce. But they mostly aren't.

Also taking the five tenets listed, here are my observations:

  1. A giant step backward in the programming paradigm for large-scale data intensive applications

    They don't offer any proof, merely their view... However, the fact that Google used this technique to re-generate their entire internet index leads me to believe that if this were indeed a giant step backward, we must have been pretty darned evolved to step "back" into such a backwards approach.

  2. A sub-optimal implementation, in that it uses brute force instead of indexing

    Not sure why brute force is such a poor choice, especially given what this technique is used for. From wikipedia:

    MapReduce is useful in a wide range of applications, including: "distributed grep, distributed sort, web link-graph reversal, term-vector per host, web access log stats, inverted index construction, document clustering, machine learning, statistical machine translation..." Most significantly, when MapReduce was finished, it was used to completely regenerate Google's index of the World Wide Web, and replaced the old ad hoc programs that updated the index and ran the various analyses.

  3. Not novel at all -- it represents a specific implementation of well known techniques developed nearly 25 years ago

    Again, not sure why something "old" represents something "bad". The most reliable rockets for getting our space satellites into orbit are the oldest ones.

    I would also argue their bold approach to applying these techniques in such a massively aggregated architecture is at least a little novel, and based on results of how Google has used it, effective.

  4. Missing most of the features that are routinely included in current DBMS

    They're mistakenly assuming this is for database programming

  5. Incompatible with all of the tools DBMS users have come to depend on

    See previous bullet

Are these guys just trying to stake a reputation based on being critical of Google?

Re:may be missing the (data)points (3, Insightful)

CajunArson (465943) | more than 6 years ago | (#22100220)

> Are these guys just trying to stake a reputation based on being critical of Google?

I tend to agree. I could probably write a nice article about how map-reduce would be a terrible system to use in making a 3D game. Could an article like that be technically true? Sure. Would it be anything more than a logical non sequitur? Not unless Google all of a sudden came out and claimed mapreduce is the new platform for all 3D game development (not likely).

Re:may be missing the (data)points (1)

MajinBlayze (942250) | more than 6 years ago | (#22101624)

This article comes from a website called "databasecolumn.com"; they are going to look at this from a database perspective (assuredly a *relational* database perspective).

The first sentence states

On January 8, a Database Column reader asked for our views on new distributed database research efforts, and we'll begin here with our views on MapReduce.

This indicates that they are responding to a fairly specific question, or rather a series of specific questions made by individuals who (misguidedly) asked if MapReduce (for example) seemed like an advancement beyond RDBMS's.

To me, it seems a little more like responding to your example by saying MapReduce won't work for 3D gaming development for these reasons instead of saying MapReduce has implications in other fields, but probably won't affect 3D gaming development.

If you re-read the article with this in mind, it seems more correct if still a little troll-ish.

Re:may be missing the (data)points (4, Funny)

abscondment (672321) | more than 6 years ago | (#22102862)

It's also terrible for painting.

  1. Since the bucket doesn't enforce any schema, you never know what color paint the bucket might hold. Heck, it could even be full of honey. You just can't know, and not being able to know is, well, like programming assembly.
  2. Buckets aren't indexed, so you're not able to find that one ounce of paint that you really want to use next. You've got to split up all of the paint into ounce cups each time and examine every cup. It's very intensive, and really slows down your painting. If you stored the paint in a B-tree of ounce cups, your search for the right ounce of paint would be much more efficient.
  3. Painting is so old. I mean, get with the program. Gold plate your house, or something newer (since newer is always better!). In fact, decades of research into titanium has determined that it'll hold up better to the elements, anyway, so you should just get titanium siding instead of painting.
  4. Painting is an incomplete process. What if you want a window? Yeah, you can't paint a window for yourself, now can you? Did you need a jacuzzi? A fireplace? A new car? Sorry! Painting doesn't support those features yet. You'd better not paint at all if you want those things.
  5. Painting, believe it or not, is incompatible with tennis. There's no racket, there's no court, and there's no ball. There's not even a net (unless you're working from a really tall building, in which case you might fall and so a net is often used). I mean, you don't even need to paint with another person. It's so... incompatible.

Re:may be missing the (data)points (4, Informative)

starwed (735423) | more than 6 years ago | (#22100336)

I thought that this blog post [typicalprogrammer.com] was a pretty good-sounding critique of the article in question. (Of course, I don't know a damn thing about DB, relational or otherwise.)

Re:may be missing the (data)points (3, Insightful)

dezert_fox (740322) | more than 6 years ago | (#22100342)

> If you take the time to read the article you'll find they use axiomatic arguments with lemmas like: "schemas are good", and "Separation of the schema from the application is good", etc.

Actually, it says: "The database community has learned the following three lessons from the 40 years that have unfolded since IBM first released IMS in 1968. Schemas are good. Separation of the schema from the application is good. High-level access languages are good." Way to conveniently drop important contextual information. Axioms like these, derived from 40 years of experience, carry a lot of weight for me.

Re:may be missing the (data)points (1)

MagikSlinger (259969) | more than 6 years ago | (#22101714)

Way to conveniently drop important contextual information. Axioms like these, derived from 40 years of experience, carry a lot of weight for me.

While a good point, it's still irrelevant to Google and Map-Reduce because Google's search engine is NOT an RDBMS. It's almost pure indexing, and what they are doing is comparing, say, Oracle to a specific B+-Tree implementation. They are seeing the Map-Reduce algorithm purely from an RDBMS perspective, not a "let's solve this specific problem" perspective.

I'm reminded of a co-worker who accused me of being a database-bigot: you want to solve everything with a [RDBMS]. He was right. :-)

Re:may be missing the (data)points (5, Funny)

Anonymous Coward | more than 6 years ago | (#22100414)

You missed points 6 through 9:

6. New things are scary.
7. Google is on their lawn.
8. Matlock is the best television show ever.

Re:may be missing the (data)points (1)

MorpheousMarty (1094907) | more than 6 years ago | (#22102172)

I am still missing point 9.

Re:may be missing the (data)points (1, Funny)

Anonymous Coward | more than 6 years ago | (#22102432)

9. Profit!

Re:may be missing the (data)points (1)

Otter (3800) | more than 6 years ago | (#22100522)

Are these guys just trying to stake a reputation based on being critical of Google?

I don't know much about database theory, but do know that Michael Stonebraker already has a reputation.

Re:may be missing the (data)points (1)

oldhack (1037484) | more than 6 years ago | (#22100882)

I'm guessing MapReduce schemes are eating into traditional RDBMS market? If so, are there concrete products that implement MapReduce algorithm?

Re:may be missing the (data)points (2, Interesting)

mini me (132455) | more than 6 years ago | (#22101728)

CouchDB, ThruDB, RDDB, and SimpleDB, to name a few.

Re:may be missing the (data)points (3, Insightful)

samkass (174571) | more than 6 years ago | (#22101040)

Speaking as someone who works for a company whose product uses a database that is neither relational nor object-oriented, I can say from experience that folks who have devoted a significant amount of their lives to mastering that methodology see anything else as a threat. There are definitely use-cases for non-relational databases-- they're used at both Google and Amazon, as well as many other places. You can either burn significant effort defending your decision to go non-relational, or you can move on and ignore these folks and produce great products. The problem is that sometimes they make good points (especially about some aspects of indexing), but it's almost always lost in the "but... but... but... you're not relational!" argument.

Re:may be missing the (data)points (1)

NewbieProgrammerMan (558327) | more than 6 years ago | (#22101282)

> Speaking as someone who works for a company whose product uses a database that is neither relational nor object-oriented, I can say from experience that folks who have devoted a significant amount of their lives to mastering that methodology see anything else as a threat.

I've bumped into this attitude in the little bit of time I spent as a developer: people who think that every last bit of configuration and data can (and must!) be crammed into a relational model, whether it belongs there or not. Performance and complexity be damned... it's relational! It's got to be good!

Re:may be missing the (data)points (1)

Larry Lightbulb (781175) | more than 6 years ago | (#22101542)

Pick?

Wait, are they "experts", though? (1)

SanityInAnarchy (655584) | more than 6 years ago | (#22103088)

Note how their blog represents the post as having a single author, when, in fact, it has multiple authors?

That does not sound at all like a database expert to me. It's a simple many-to-many relationship!

Re:may be missing the (data)points (0)

Anonymous Coward | more than 6 years ago | (#22101052)

So it is a tool from google for google, the rest of us who are not interested in building search engines can safely ignore this framework and use and build better tools for the job

Re:may be missing the (data)points (3, Interesting)

DragonWriter (970822) | more than 6 years ago | (#22101060)

I don't know why this article is so harshly critical of MapReduce.


The primary grounds for complaint seems to be "this isn't the way we do things in the database world". Each of the complaints (except #3) boils down to this (#1: The database community had arguments a few decades back and developed, at the time, a set of conventions; Map Reduce doesn't follow them and is, therefore, bad; #2: All databases use one of two kinds of indexes to accelerate data access; MapReduce doesn't and is, therefore, bad; #3: Databases do something like MapReduce, so MapReduce isn't necessary; #4: Modern databases tend to offer a variety of support utilities and features that MapReduce doesn't, so MapReduce is bad; #5: MapReduce isn't out-of-the-box compatible with existing tools designed to work with existing databases and is, therefore, bad.)

And it's from The Database Column, a blog that from its own "About" page is comprised of experts from the database industry.

I suspect part of the reason they are harshly critical is that this is a technology whose adoption and use in large, data-centric tasks is (regardless of efficiency) a threat to the market value of the skills in which they've invested years and $$ developing expertise.

At the end, they note (as an afterthought) that they recognize that MapReduce is an underlying approach, and that there are projects ongoing to build DBMSs on top of MapReduce. That fact, if considered for more than a second, explodes all of their criticism, which is entirely premised on the idea that MapReduce is intended as a general-purpose replacement for existing DBMSs, rather than a lower-level technology which is currently used stand-alone for applications for which current RDBMSs do not provide adequate performance (regardless of their other features), and on which DBMS implementations (with all the features they complain about MapReduce lacking) might, in the future, be built.

Re:may be missing the (data)points (1, Funny)

Anonymous Coward | more than 6 years ago | (#22101420)

> And it's from The Database Column, a blog that from its own "About" page is comprised of experts from the database industry
 
Yes, I'm sure they are, but notice that they were unable to resolve a many to many relationship for authors and articles on their own website's db:
 
  [Note: Although the system attributes this post to a single author, it was written by David J. DeWitt and Michael Stonebraker]
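
For readers who have not met the term, the many-to-many relationship being joked about is usually modelled with a junction table. The sketch below uses Python's built-in sqlite3 module; the table names (author, post, post_author) and the post title are hypothetical placeholders, not the blog's actual schema.

```python
import sqlite3

# In-memory database; all table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE post   (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    -- Junction table: one row per (post, author) pair.
    CREATE TABLE post_author (
        post_id   INTEGER NOT NULL REFERENCES post(id),
        author_id INTEGER NOT NULL REFERENCES author(id),
        PRIMARY KEY (post_id, author_id)
    );
    INSERT INTO author VALUES (1, 'David J. DeWitt'), (2, 'Michael Stonebraker');
    INSERT INTO post VALUES (1, 'Placeholder post title');
    INSERT INTO post_author VALUES (1, 1), (1, 2);
""")

# All authors of post 1, resolved through the junction table.
rows = conn.execute("""
    SELECT a.name
    FROM author a
    JOIN post_author pa ON pa.author_id = a.id
    WHERE pa.post_id = 1
    ORDER BY a.id
""").fetchall()
authors = [name for (name,) in rows]
```

With this layout a post can have any number of authors and an author any number of posts, so no "attributed to a single author" note is needed.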

Re:may be missing the (data)points (2, Insightful)

ShakaUVM (157947) | more than 6 years ago | (#22101090)

Map/Reduce is a very common operation in parallel processing. From my very quick look, it does seem as if the authors are right -- it looks like a quick and dirty implementation of a common operation, and not a "paradigm shift" in the slightest.

Re:may be missing the (data)points (2)

Splab (574204) | more than 6 years ago | (#22101116)

Did an assignment on map reduce some time ago. While I wasn't really impressed with it as a "Database", they did some really cool stuff with distributing the calculations. I did note back then that it wasn't really useful for the general industry, but it was still a very nice piece of software.

Re:may be missing the (data)points (1)

datablaster (999781) | more than 6 years ago | (#22101354)

Kinda looks like somebody who didn't really get it saw Google's paper, told a DBA half the details of what they didn't understand anyway, the DBA heard the word "data", read half a paragraph of Google's paper, and next thing you know all hell broke loose in the DBA's office. The DBAs called their friends who didn't actually read the paper, etc etc.

Indexing is useless here. (4, Insightful)

SharpFang (651121) | more than 6 years ago | (#22101570)

Indexing works by picking a small slice of the data you have (as a list of hashes), and changing it into a much smaller table mapping the data onto a group of records matching it. The index is smaller and conforms to a certain strict standard, so it's very fast to brute force. Then as you get the list of indices, you brute force them, and this way you get the record.

This works well if you can create such a slice - a piece of data you will match against. It becomes increasingly unwieldy if there are many ways to match the data - multiple columns mean multiple indices. And then if you remove columns entirely, making records just long strings, and start matching random words in the record, the index becomes useless - hashes become bigger than the chunks of data they match against, indexing all possible combinations of words you can match against results in an index bigger than the database, and generally... bummer. An index doesn't work well against freestyle data searchable in random form.

Imagine a database with its main column being VARCHAR(255) and using about full length of it, then search using a lot of LIKE and AND, picking various short pieces out of that column, and the database being terabytes big. Try to invent a way to index it.
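
A minimal sketch of the brute-force alternative for that kind of query, assuming the table is already split into shards. Threads stand in for separate machines here, and the function names (scan_shard, search) and the sample rows are made up for illustration; this is not how any particular MapReduce implementation is written.

```python
from concurrent.futures import ThreadPoolExecutor

def scan_shard(shard, needles):
    """Map step: scan one shard and keep rows containing every needle,
    roughly col LIKE '%a%' AND col LIKE '%b%'."""
    return [row for row in shard if all(n in row for n in needles)]

def search(shards, needles):
    """Scan all shards in parallel, then concatenate the partial results
    (a trivial reduce step)."""
    with ThreadPoolExecutor() as pool:
        parts = pool.map(lambda s: scan_shard(s, needles), shards)
        return [row for part in parts for row in part]

# Two hypothetical shards of free-text rows.
shards = [
    ["cheap red paint in a bucket", "honey in a bucket"],
    ["red car for sale", "bucket of red honey"],
]
hits = search(shards, ["red", "bucket"])
```

No index is consulted at any point: every row is read, and the work grows with the data but divides across however many shards are scanned at once.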

Re:Indexing is useless here. (3, Funny)

Timothy Brownawell (627747) | more than 6 years ago | (#22103190)

> Imagine a database with its main column being VARCHAR(255) and using about full length of it, then search using a lot of LIKE and AND, picking various short pieces out of that column, and the database being terabytes big. Try to invent a way to index it.

Convert it to an HTML table and put it where googlebot can see it.

Re:may be missing the (data)points (0)

Anonymous Coward | more than 6 years ago | (#22101594)

Where does anyone say Map reduce is a general purpose DBMS?
Sure, it's good for 'G's problem, just like my indexed recipe card file is good in the kitchen.

Google = statistical database? (2, Insightful)

gnuman99 (746007) | more than 6 years ago | (#22101816)

I thought Google searches weren't exact. You know, they were more statistical in nature. The entire algorithm is probably not based on absolute numbers (guessing, but otherwise it would not make sense).

The thing is if Google uses this to create their index-like structure of the internet for their search engine, and it is not exactly like a RDBMS, well, so what? The MapReduce thing seems to be targeted at large sets of data and semi-accurate data mining, not exact results. No one really cares if there are 3,000,000,000 sites or 3,000,000,002 sites with Linux in it somewhere.

Comparing RDBMS to MapReduce is like comparing math function to a paper graph of that function. The first one gives you exact results for all data in its domain. The second gives out quick, pain-free and semi-accurate results for some parts of the domain.

Now, I will not be using MapReduce but then I don't see why Google should not. It is their business.

Re:may be missing the (data)points (5, Interesting)

mishabear (1222844) | more than 6 years ago | (#22101948)

> I don't know why this article is so harshly critical of MapReduce.
> Are these guys just trying to stake a reputation based on being critical of Google?

Um... yes?

The Database Column is being coy about being a corporate blog for Vertica, a high performance database product, but in fact it is. Vertica is a commercial implementation of C-Store and was founded by Michael Stonebraker, the most prominent proponent of column based databases (get it? the database column). So yes, they have a very good reason to be hostile to Google.

http://www.vertica.com/company/leadership [vertica.com]
http://en.wikipedia.org/wiki/C-Store [wikipedia.org]
http://en.wikipedia.org/wiki/Michael_Stonebraker [wikipedia.org]
http://www.databasecolumn.com/2007/09/contributors.html [databasecolumn.com]

Re:may be missing the (data)points (3, Insightful)

einhverfr (238914) | more than 6 years ago | (#22102082)

Hmmm.... ISTM that the basic critiques come down to:

1) No indexing.

Which means

2) Certain types of constraints probably don't work (such as UNIQUE constraints)

Which also means

3) Referential integrity checking and other things don't work.

This leads to the conclusion that the idea is good for certain types of data-intensive but not integrity-intensive applications (think Ruby on Rails-type apps) but *not* good for anything Edgar Codd had in mind....

Re:may be missing the (data)points (2, Interesting)

Blakey Rat (99501) | more than 6 years ago | (#22103134)

What bothers me the most is how much hype it gets. I work for a company that has had a "MapReduce" implementation (used internally) for as long as Google has, and we're not getting drooled over by the tech press. I'm sure tons of companies that have had to solve similar problems have already made this tool. Even though the languages and syntax involved might change between implementations, it's nothing all that great.

Just watch. (1, Insightful)

jonnythan (79727) | more than 6 years ago | (#22100174)

It's a technical step backwards, they're doing it all wrong, experts say you should do it this other way....

And watch. It'll be massively successful because it works.

Blink blink (4, Funny)

Thelasko (1196535) | more than 6 years ago | (#22100226)

Once I saw the word paradigm in the summary I just glazed over like I do whenever our CEO gives a speech.

Re:Blink blink (4, Funny)

spun (1352) | more than 6 years ago | (#22101186)

Ah, the old "eyes glazing over" paradigm. Definitely no synergy in that. Here's an action item: leverage your value added intellectual capital to architect a new scenario.

Re:Blink blink (1)

MetalPhalanx (1044938) | more than 6 years ago | (#22101468)

You sound like my boss.... Except he's not joking when he does that. :(

Re:Blink blink (1)

putch (469506) | more than 6 years ago | (#22101452)

[obligatory simpsons quote] Excuse me, but "proactive" and "paradigm"? Aren't these just buzzwords that dumb people use to sound important?

Not that I'm accusing you of anything like that.

I'm fired, aren't I?[/obligatory simpsons quote]

Fnord? (1)

SanityInAnarchy (655584) | more than 6 years ago | (#22102984)

Paradigm.

Does that mean Paradigm is a Fnord? As in, I can now say stuff you won't be able to consciously read, because it has the Fnord Paradigm in it?

Databases? WTF? (4, Insightful)

mrchaotica (681592) | more than 6 years ago | (#22100228)

Since when did MapReduce have anything to do with databases? It's actually about parallel computations, which are entirely different.

Re:Databases? WTF? (1)

mini me (132455) | more than 6 years ago | (#22100460)

Those newfangled [couchdb.org] document [google.com] databases [thefreedictionary.com] utilize MapReduce to gather records. I'm guessing that's what the article is about.

Re:Databases? WTF? (1)

dezert_fox (740322) | more than 6 years ago | (#22100588)

From TFA:

'The map program reads a set of "records" from an input file, does any desired filtering and/or transformations, and then outputs a set of records of the form (key, data). As the map program produces output records, a "split" function partitions the records into M disjoint buckets by applying a function to the key of each output record. This split function is typically a hash function, though any deterministic function will suffice. When a bucket fills, it is written to disk. The map program terminates with M output files, one for each bucket.'

(key, data) = database output, dude. RTFA. And by the way, databases can be used for computation.
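
The map and split steps quoted from TFA can be sketched in a few lines. This is an illustration under assumptions, not Google's code: the word-count job, the function names (map_records, split, run_map, reduce_bucket), and the in-memory buckets are all made up here, whereas the quoted text describes buckets being written to disk. The reduce step is not part of the quoted passage and is included only to show why the partitioning matters.

```python
import zlib
from collections import defaultdict

def map_records(lines):
    """Map step: read input records and emit (key, data) pairs.
    Here the job is a word count, so each word is emitted with a 1."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def split(key, m):
    """Split function: a deterministic hash of the key, modulo m buckets."""
    return zlib.crc32(key.encode("utf-8")) % m

def run_map(lines, m):
    """Partition the map output into m disjoint buckets (kept in memory)."""
    buckets = [[] for _ in range(m)]
    for key, data in map_records(lines):
        buckets[split(key, m)].append((key, data))
    return buckets

def reduce_bucket(bucket):
    """Reduce step (an addition to the quoted text): sum the data per key."""
    totals = defaultdict(int)
    for key, data in bucket:
        totals[key] += data
    return dict(totals)

buckets = run_map(["the quick fox", "The lazy dog"], m=4)
counts = {}
for bucket in buckets:
    counts.update(reduce_bucket(bucket))
```

zlib.crc32 is used rather than Python's built-in hash() because string hashing in Python is randomized per process, and the split function has to send the same key to the same bucket on every machine.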

Re:Databases? WTF? (1)

Temporal (96070) | more than 6 years ago | (#22101582)

Um... Nope, sorry, the OP is right. MapReduce is a framework for batch processing of gigantic data sets where you intend to do something with every item in the set, or at least a large fraction of them. Relational databases are better for quickly looking up subsets of the items in a database based on query terms, and can be used for serving real-time queries.

Re:Databases? WTF? (1)

DragonWriter (970822) | more than 6 years ago | (#22101236)

Since when did MapReduce have anything to do with databases?


MapReduce is a tool, one of whose principal applications is conducting queries on large bodies of data consisting of records of similar structure. It, therefore, competes with traditional DBMSs to a degree.

Now, (largely because of the limitations the authors note), it generally is only used currently for the kind of applications where setting up a traditional RDBMS to handle them would be impractical: Google developed their implementation of MapReduce to handle a task which was at the time straining their existing RDBMS resources.

And, yeah, it requires more custom programming to make work at all than a DBMS, and probably isn't suitable for typical tasks. OTOH, If you are Google, what you are doing isn't a typical DBMS task or load, even if it is within the scope of things a DBMS could do, abstractly, performance considerations aside.

Sometimes rolling-your-own with just the features you need, and an implementation tailored to your particular challenges, is more efficient than taking something off the shelf.

And given that there are now open implementations of MapReduce, for people with similar challenges, there isn't as much roll-your-own involved as there were for the people implementing MapReduce the first time, reducing the cost. Yeah, this means that traditional DBMSs have a challenge, though no doubt as MapReduce implementations mature, more of the traditional DBMS features and interfaces (including SQL or something like it) will be bolted on to the successful implementations.

Re:Databases? WTF? (1)

DragonWriter (970822) | more than 6 years ago | (#22101588)

MapReduce is a tool, one of whose principal applications is conducting queries on large bodies of data consisting of records of similar structure. It, therefore, competes with traditional DBMSs to a degree.


Responding to myself is bad form, but:

Obviously, this only makes sense taking "similar structure" extremely loosely; still, the point is that MapReduce was developed to fill a niche for which RDBMSs were being used previously, in the absence of a more specialized tool, so it clearly competes with them, to a degree, even while it isn't a DBMS. So there is a relationship.

Re:Databases? WTF? (1)

einhverfr (238914) | more than 6 years ago | (#22102210)

Not really.

And no, it is not really a step backwards for databases. It is actually something which offers a niche solution for large-scale single-purpose, semi-accurate databases.

This is almost but not entirely unlike what Codd had in mind when he wrote his seminal paper: "A Relational Model of Data for Large Shared Data Banks."

If it were a paradigm shift, it would be a step backward. However, as "one more tool in the toolbox" it is useful in some cases where RDBMS's are not.

Re:Databases? WTF? (1)

mg_729 (677864) | more than 6 years ago | (#22101612)

I recently helped a team implement a distributed MapReduce system for a very large dataset. The team had previously attempted to use a database to solve their business problem, but found performance to be unacceptable.

The MapReduce implementation was simple and exceeded all performance requirements. However, their DBA threw fits every step of the way. To him, everything involving data could and should be solved with a SQL statement.

More and more systems use databases simply as a data archive, not for primary work. I think the DBA's are starting to be concerned that they will no longer be necessary. Obviously that isn't true, there will always be bigger and tougher problems to solve.

Re:Databases? WTF? (0)

Anonymous Coward | more than 6 years ago | (#22103126)

My guess is the DBA was throwing a fit because DBAs are really *really* uptight about the data being right. MapReduce isn't ACID compliant, which means data might get trashed. DBAs hate corrupted data, and don't tend to trust it unless it's been handled by a system that has data integrity as a top priority.

Google is a rare case. If there is a glitch which trashes some of their records, the worst thing likely to happen is you get 107,000 URLs back from a search rather than 109,500 URLs. No big deal. But the vast, vast majority of DBAs out there work with databases that have to be exact all the time (finance, inventory, etc.), where if you don't get the exactly right and complete answer back, the shit hits the fan. When you get used to all the data being precious, using a system where it isn't (and which treats it accordingly) can cause a bit of anxiety.

Huh??? (0)

LWATCDR (28044) | more than 6 years ago | (#22100234)

5. MapReduce is incompatible with the DBMS tools
A modern SQL DBMS has available all of the following classes of tools:
        * Report writers (e.g., Crystal reports) to prepare reports for human visualization

Perl? Really, Perl was made for doing reports. I am sure that somebody will create a report writer for it. I am just amazed that Crystal Reports has become the universal solution for so many things.

This is a pretty new bit of kit. If it catches on then people will start porting tools to it. When it comes to database tech I tend to believe that IBM really knows what they are doing. If this interests them I bet there is something to it.

Money, meet mouth (3, Insightful)

tietokone-olmi (26595) | more than 6 years ago | (#22100242)

Perhaps the traditional RDBMS experts will return when they can scale their paradigms to datasets that are measured in the tens of terabytes and stored on thousands of computers. Following the airplane rule the solution needs to be able to withstand a crash in a bunch of those hosts without coming unglued.
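
One common way to survive crashed hosts, sketched here as an assumption rather than as a description of any specific system, is simply to re-run a failed task on another machine. That only works when tasks are deterministic and free of side effects. The worker functions and the run_with_retries helper below are hypothetical.

```python
def run_with_retries(task, workers, max_attempts=3):
    """Try the task on successive workers until one of them succeeds."""
    last_error = None
    for attempt in range(max_attempts):
        worker = workers[attempt % len(workers)]
        try:
            return worker(task)
        except RuntimeError as err:
            # Treat the failure as a dead host and move on to the next one.
            last_error = err
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error

def crashed_worker(task):
    """Stand-in for a host that has gone down."""
    raise RuntimeError("host down")

def healthy_worker(task):
    """Stand-in for a working host; the 'task' is just a list to sum."""
    return sum(task)

result = run_with_retries([1, 2, 3], [crashed_worker, healthy_worker])
```

Because the task carries no state between attempts, losing a host costs only the time spent on the failed attempt.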

Now, this is not to say that a more sophisticated approach wouldn't work. It's just that when you have thousands of boxes in a few ethernet segments, communication overhead becomes really quite large, so large in fact that whatever communication can be saved with brute-force computation will usually be worth it. Consider that from what I've heard, at Google these thousands of boxes are mostly containers for RAM modules, so there's rather a lot of computation power per gigabyte available to throw away with a brute force system.

Also, I would like to point out that map/reduce is demonstrated to work. Apparently quite well too. Certainly better than any hypothetical "better" massively parallel RDBMS available in a production quality implementation today.

Re:Money, meet mouth (2, Interesting)

StarfishOne (756076) | more than 6 years ago | (#22102678)

Agreed.

I recently read somewhere (if only I could recall the link...) that on average Google's MapReduce jobs process something in the order of 100 GB/second, 24/7/365

I've got nothing against RDBMS... but how can you be critical about a tool that scales and performs so well? It's just a matter of selecting and using the right tool for the job.

As one of the comments on the blog ... (3, Insightful)

tcopeland (32225) | more than 6 years ago | (#22100270)

...entry says;

"You seem to not have noticed that mapreduce is not a DBMS."

Exactly. These are the same sort of criticisms that you hear around memcached [danga.com] - the feature set is smaller, etc - and they make the same mistake. It's not a DBMS, and it's not supposed to be. But it does what it does quite well nonetheless!

Even if it was .... (0)

Anonymous Coward | more than 6 years ago | (#22100564)

Even if it was a RDBMS, there are damn good reasons for violating the "rules" in certain situations. If the only tool in your toolbox is a hammer, everything looks like a nail. Knowing the rules and guidelines goes hand in hand with knowing the situations where they don't work or work against you... academics are big on the former and short on the latter, which is a real thing in the real world outside of academia.

I had to write a DB application once to handle about 80 full CDs of telephone records from a RDBMS. I was able to reduce it so it all fit on one CD and was blazingly fast, but I had to violate several "rules" of proper database programming and layout. It happens.

Summary of reaction... (1)

R2.0 (532027) | more than 6 years ago | (#22100272)

"MapReduce is a software framework developed by Google to handle parallel computations over large data sets on cheap or unreliable clusters of computers."

It ought to be a database, but since it isn't a database, it sucks.

Also: How's a DBM supposed to profit off that? (1)

SteeldrivingJon (842919) | more than 6 years ago | (#22101414)


And, if mapreduce doesn't generate vast license income for Oracle, it must suck. Imagine the per-processor charges Google would be paying!

not so fast db snobs... (1)

sneakyimp (1161443) | more than 6 years ago | (#22100298)

I'm not at all certain about this but I'd bet that indexes can't solve every problem. I was working on a search routine that would attempt to pick 5 records at random from a database containing potentially a billion records. The search criteria were quite complex and included full-text search of a TEXT field and geographic proximity to a given zip code among other things. The client wanted this done in a fraction of a second.

Personally, I'm amazed at what the various google search engines do and would bet that this technique they describe is what ties together their 200,000 servers. I wouldn't dismiss it so quickly.

Re:not so fast db snobs... (1)

e4g4 (533831) | more than 6 years ago | (#22101092)

a search routine that would attempt to pick 5 records at random from a database containing potentially a billion records
Yeah, I'd say an index wouldn't be much help in that situation. A monkey with a keyboard could probably handle it, though.

Re:not so fast db snobs... (0)

Anonymous Coward | more than 6 years ago | (#22101226)

since you could index the geographic location, that actually would beat a brute force map reduce. Using the index to narrow down potential tuples and then going massively multithreaded (as needed) would probably be faster, of course.

distributed indexes? (1)

magarity (164372) | more than 6 years ago | (#22100344)

in that it uses brute force instead of indexing
 
Isn't the overhead of a distributed index usually not worth the bother? This scheme sounds similar to the way Teradata handles its distribution and it manages to get a lot done with hardly any secondary indexes. I think the thinking in the article indicates standalone database server box thinking.

Whew! (0)

Anonymous Coward | more than 6 years ago | (#22100384)

I'm glad someone finally had the nerve to put MapReduce into real perspective. MapReduce has absolutely none of the "why didn't I think of that" factor.

Ideas ahead of their time? (4, Insightful)

dazedNconfuzed (154242) | more than 6 years ago | (#22100388)

it represents a specific implementation of well known techniques developed nearly 25 years ago

There are many classic/old techniques which are only now being used - and very successfully - precisely because the hardware simply wasn't there. A recent /. post told of ray-tracing being soon used for real-time 3D gaming, and how it beats the socks off "rasterized" methods when a critical mass of polygons is involved; the techniques were well known and developed nearly 25 years ago, but only now do we have the CPU horsepower and vast fast memory capacities available for those "old" techniques to really shine. Likewise "old" "brute force" database techniques: they may not be clever and efficient like what we've been using for highly stable processing of relatively small-to-medium databases, but they work marvelously well when involving big unreliable networks of processors working on vast somewhat-incoherent databases - systems where modern shiny techniques just crumble and can't handle the scaling.

Sometimes the "old" methods are best - you just need the horsepower to pull it off. Clever improvements only scale so long.

Re:Ideas ahead of their time? (1)

Duncan3 (10537) | more than 6 years ago | (#22101470)

Exactly, no one except Google claims the MapReduce methods are new in any way. And given their lots-of-junk-machines, it's the way to do it. Anyone in the distributed computing space over the last _35_ years would have done it exactly the same way, just in older programming languages :)

The rest of the article is just DB-centric whining.

Re:Ideas ahead of their time? (0)

Anonymous Coward | more than 6 years ago | (#22101620)

In this video http://channel9.msdn.com/Showpost.aspx?postid=314874 [msdn.com] (590MB WMV to download) of Brian Beckman explaining the difficult physics of driving games, he demonstrates that it's now possible to achieve something similar by simulating some of the particles (in a sense) in a car rather than using abstract models of how wheels work. He shows a slide in the video with "Future = better simulations through simpler physics and more horsepower" on it.

Bad Perspective (1)

Evets (629327) | more than 6 years ago | (#22100430)

This article was written from the perspective that map-reduce based architectures are in competition with common relational database architecture. They're not.

Certainly if you were to implement map-reduce within the confines of the relational database world, there are implementation methodologies that would need to be taken to make it easier for the RDBMS developer to work with the storage and querying mechanisms.

The article implies that map-reduce is bad because it doesn't place restrictions common to the database world on developers. When you get down to programming anything at a basic level, the implementation of standards is an optional step to take.

I would agree that abstraction and structure would be good things because developers would be able to concentrate on higher level problems, but I would strongly disagree that anybody learning about map-reduce algorithms should be confined to a particular implementation methodology.

A completely uninformed analysis (2, Insightful)

abes (82351) | more than 6 years ago | (#22100462)

Well, INDBE, but MapReduce seems like a pretty cool idea (even if it is old [which in my book does not equate to bad]). A similar argument could be made against SQL -- it's not appropriate to all solutions. It's used for most nowadays, in part because it's the simplest to use, but that doesn't make it necessarily better. It (of course) depends on what data you want to represent.

Even more importantly, you can create schemas with MapReduce by how you write your Map/Reduce functions. This is a matter of the datafunction exchange (all data can be represented as a function, likewise all functions can be represented as data). I admit ignorance to how this MapReduce system works, but I would be surprised if you couldn't get a relational database back out.

The advantage you get with MapReduce is that you aren't necessarily tied to a single representation of data. Especially for companies like Google, which may want to create dynamic groups of data, this could be a big win. Again, this is all speculative, as I have very little experience with these systems.

A Very Human Response (3, Insightful)

Anonymous Coward | more than 6 years ago | (#22100478)

The reaction seems straightforward enough. The MapReduce paradigm has proved to be very effective for a company that lives and breathes scalability, while it apparently ignores a whole bunch of database work that's been going on in academia. The fact that industry was able to produce something so effective without making use of all this knowledge base at least implicitly undercuts the importance of that work, and is thus threatening to the community which produced that work. Is it any surprise that the researchers whose work was completely side-stepped by this approach aren't happy with the current situation?

Re:A Very Human Response (0)

Anonymous Coward | more than 6 years ago | (#22101058)

Or for another view: Yahoo is the database-research-community's company, and Google the systems-research-community's company ;-)

Re:A Very Human Response (1)

DragonWriter (970822) | more than 6 years ago | (#22102208)

The reaction seems straightforward enough. The MapReduce paradigm has proved to be very effective for a company that lives and breathes scalability, while it apparently ignores a whole bunch of database work that's been going on in academia.


That's not the "problem" (from the perspective of the authors of TFA), really.

The problem is that it provides an alternative to work that has been going on in industry, and in particular that it provides a way to end-run some of the limitations of traditional databases that may reduce demand for the particular alternative-model (column-based) database that the company that launched the blog on which TFA is posted is trying to sell as the way to work around the limitations of traditional row-based databases.

Of course, mostly MapReduce isn't about being a database, so the criticism in the article seems bizarre, but it's mostly intended for the audience who might have a need for which MapReduce might seem like a viable approach and Vertica's "revolutionary" column-oriented database also might seem to be a viable approach. While the two tools don't target the exact same needs, the places where they might be useful do overlap.

Which is why a blog whose initial post heralded the demise of the one-size-fits-all database is criticizing MapReduce for not being a traditional database. The problem isn't that it's different, it's that it's not the version of "different" that Vertica is selling.

Try (0)

Anonymous Coward | more than 6 years ago | (#22100516)


Lisp

belly acres (1)

rodentia (102779) | more than 6 years ago | (#22100572)

A sub-optimal implementation, in that it uses brute force instead of indexing

As though these are the exclusive choices. TFA goes on to complain about implementing 25 year old ideas, though they are actually rather older than that--they just didn't strike the RDB types until the eighties. They proceed to insist that the system cannot scale. Arguing google's scalability is like arguing gravity.

FTFA (4, Insightful)

smcdow (114828) | more than 6 years ago | (#22100598)

Given the experimental evaluations to date, we have serious doubts about how well MapReduce applications can scale.


That's a joke, right?

I think Google's already taken care of all the experimental evaluations you'd need.

Re:FTFA (1)

sammy baby (14909) | more than 6 years ago | (#22101358)

That's a joke, right?


I know, that's what I thought.

But then again... a few weeks ago I was involved in a phone call with one of our clients. They're a huge client for us, to a degree that they can significantly influence the future direction of our product by complaining loud enough, and our first client to use some new "high-availability" features we're gradually rolling out.

In the course of our conversation, one of the client's guys essentially pooped on a large part of our product roadmap, basically because it involved a load balancer. And when we asked why, he said, "Because nobody in the world has been able to demonstrate a working, network-based load balancing solution."

And that's it. Seriously. As if the entire notion of network based load balancing was a hoax perpetrated on the IT industry, and Google and Yahoo were just having a laugh on us while relying on plain old round-robin DNS or something. I mean, this client has two whole nodes to load balance, and that's clearly out of the reach of, say, F5...

(Okay. Tangent. Sorry.)

A step from where? (3, Funny)

644bd346996 (1012333) | more than 6 years ago | (#22100624)

If you are starting with a good database, MapReduce is definitely a step backwards. But that isn't what MapReduce is designed to replace. In reality, MapReduce replaces the for loop [joelonsoftware.com] , and viewed from that perspective, it is a major step forward. Most languages (C, C++, Java, etc.) define the for loop and other iteration facilities in such a way that the compiler can seldom safely parallelize the loop. MapReduce gives the programmer an easy way to convert probably 90% of their for loops into highly scalable code.
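To make the for-loop comparison concrete, here is a minimal Python sketch (not Google's MapReduce API; the function names are made up for illustration). The loop version threads one running total through every iteration, while the map/reduce version applies an independent function to each element and then combines the results with an associative operation, which is the property that lets a framework split the work across machines.

```python
from functools import reduce


def square(x):
    # "Map" step: depends only on its own input, so calls can run anywhere.
    return x * x


def add(a, b):
    # "Reduce" step: associative, so partial sums can be combined in any grouping.
    return a + b


def sum_of_squares_loop(values):
    # Classic for loop: each iteration updates shared state (total).
    total = 0
    for v in values:
        total += v * v
    return total


def sum_of_squares_mapreduce(values):
    # Same computation expressed as map followed by reduce.
    return reduce(add, map(square, values), 0)
```

Both functions give the same answer on one machine; the difference is that only the second form tells the runtime which pieces are independent.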

Re:A step from where? (1)

Duncan3 (10537) | more than 6 years ago | (#22101386)

Unless you're using one of the dozens of compilers that can do just that, or FORTRAN, or OpenMP, or...

Re:A step from where? (1)

Rakishi (759894) | more than 6 years ago | (#22102128)

Your compiler will parallelize a for loop across 1000 machines AND split the input data across them before you even run the program?

Translation: (1)

Chris Mattern (191822) | more than 6 years ago | (#22100652)

"We spent all these years making these complex, elegant algorithms--see how intricate this wonderful indexing algorithm is?--and then they solve things by simply throwing cheap hardware at it. It's not *fair!*"

Re:Translation: (0)

Anonymous Coward | more than 6 years ago | (#22101994)

It's not smart. Anytime you have a dumb algorithm and make it solve your problem by throwing more hardware at it, you're losing. You could use a smart algorithm on more hardware and get more work done. I don't know very many businesses where developer time is worth more than the millions of dollars it takes to build out massively parallel systems to compensate for stupid algorithms.

Duh.

Missing the forest for the trees... (3, Insightful)

brundlefly (189430) | more than 6 years ago | (#22100720)

The point of MapReduce is that It Works. Cheaply. Reliably. It's not a solution for the Cathedral, it's one for the Bazaar.

Comparing it to a DBMS on fanciness is pointless, because the DBMS solution fails where MapReduce succeeds.

Step backward? (1)

gmuslera (3436) | more than 6 years ago | (#22100776)

The first thing that came to my mind when I read that was the evolution of a programmer [nus.edu.sg] : when the evolving "program" started to get thin in lines again, that didn't mean it was a step backwards.

Huh? (1)

Black Parrot (19622) | more than 6 years ago | (#22100802)

we are amazed at the hype that the MapReduce proponents have spread about how it represents a paradigm shift in the development of scalable, data-intensive applications.
So much hype that I never even heard of it before their complaint hit Slashdot...

so soon? (1)

ImTheDarkcyde (759406) | more than 6 years ago | (#22100902)

I wasn't expecting Google to seize control of the world databases and force people to use their software till at least 2012.

Vertica (3, Interesting)

QuietLagoon (813062) | more than 6 years ago | (#22101010)

The column was copyright by Vertica [vertica.com] . Wouldn't they be concerned about the type of competition that MapReduce presents?

Re:Vertica (1)

SteeldrivingJon (842919) | more than 6 years ago | (#22101562)

Maybe they're getting customers asking about mapreduce, and are tired of trying to convince customers that a conventional system is the way to go.

Information and knowledge management (1)

thomp (56629) | more than 6 years ago | (#22101050)

Data management is becoming so much more than just the data stored in a DBMS. As a data management geek, it's sad that the authors, experts in my field, fail to put MapReduce in its proper context and recognize its value. My bread and butter is DBMS, and even I could see the potential of MapReduce and the failure of the authors' arguments.

tap

They are afraid... (1)

mini me (132455) | more than 6 years ago | (#22101080)

I gather this is a publication for DBAs. It seems they are worried about their jobs more than anything. With the map-reduce-style databases there isn't a need for any kind of special database expert. The business logic all happens in the application. There is no need for tuning indexes. You don't even need to define a schema. When things get slow any monkey can drop in another computer and you're back up to speed and ready to go.

Traditional RDBMSes have their place, but we're going to see a lot more applications built on this technology in the near future. The big players (Google, Amazon, etc.) have been doing it for quite some time and we're now finally seeing the technology available to the average Joe. It's a very interesting shift in how data is stored and should lead to some interesting applications that we can only dream of today.

like Spider Robinson sang.. (2, Funny)

hmaon (11619) | more than 6 years ago | (#22101082)

"...I taped twenty cents to my transmission
So I could shift my pair 'a dimes..."

Article really misses the point (4, Insightful)

steveha (103154) | more than 6 years ago | (#22101098)

I read through the whole article, and was just bemused. According to the article, MapReduce isn't as good as a real database at doing the sorts of things real databases do well. Um, okay, I guess, but MapReduce can do quite a lot of other things that they seem to have missed.

Also, I had a major WTF moment when I read this:

Given the experimental evaluations to date, we have serious doubts about how well MapReduce applications can scale.

Empirical evidence to date suggests that MapReduce scales insanely well. Exhibit A: Google, which uses MapReduce running on literally thousands of servers at a time to chew through literally hundreds of terabytes of data. (Google uses MapReduce to index the entire World Wide Web!)

This in turn suggests that the authors of TFA are firmly ensconced in the ivory tower.

They complained that brute-force is slower than indexed searches. Well, nothing about MapReduce rules out the use of indexes; and for common problems, Google can add indexes as desired. (Google uses MapReduce to build their index to the Web in the first place.) And because Google adds servers by the rackful, they have quite a lot of CPU power just waiting to be used. Brute force might not be slower if you split it across thousands of servers!

Likewise, they complain that one can't use standard database report-generating tools with MapReduce; but if the Reduce tasks insert their results into a standard database, one could then use any standard report-generating tools.
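As a rough illustration of that hand-off, here is a sketch using Python's built-in sqlite3 module as a stand-in for "a standard database". The table name `word_counts` and the helper names are assumptions for the example, and the input dictionary stands in for whatever the Reduce tasks produced.

```python
import sqlite3


def store_reduce_output(conn, counts):
    # Load the (key, value) pairs emitted by the Reduce step into a table.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS word_counts (word TEXT PRIMARY KEY, n INTEGER)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO word_counts (word, n) VALUES (?, ?)",
        counts.items(),
    )
    conn.commit()


def top_words(conn, limit):
    # From here on it is plain SQL, so ordinary reporting tools can take over.
    rows = conn.execute(
        "SELECT word, n FROM word_counts ORDER BY n DESC, word ASC LIMIT ?",
        (limit,),
    )
    return rows.fetchall()
```

Once the results are in a table like this, nothing downstream needs to know they came from a MapReduce job.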

MapReduce lets Google folks do crazy one-off jobs like ask every single server they own to check through their system logs for a particular error, and if it's found, return a bunch of config files and log files. Even if you had some sort of distributed database that could run on thousands of machines, any of which might die at any moment, and if you planned ahead and set the machines to copy their system logs into the database, I don't see how a database would be better for that task. That's just a single task I just invented as an example; there are many others, and MapReduce can do them all.

And one of the coolest things about MapReduce is how well it copes with failure. Inevitably some servers will respond very slowly, or will die and not respond; the MapReduce scheduler detects this and sends the Map tasks out to other servers so the job still finishes quickly. And Google keeps statistics on how often a computer is slow. At a lecture, I heard a Google guy explain how there was a BIOS bug that made one server in 50 disable some cache memory, thus greatly slowing down server performance; the MapReduce statistics helped them notice they had a problem, and isolate which computers had the problem.
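A toy sketch of that reassignment idea in Python follows. It is not how Google's scheduler is implemented (I don't know those details); it runs tasks one at a time, treats a raised RuntimeError as "the server died", and uses hypothetical worker functions. It does show the two behaviours described: failed tasks get handed to another worker, and the scheduler keeps per-worker failure counts.

```python
def run_with_reassignment(tasks, workers):
    """Run each task on the first worker that completes it.

    tasks:   dict of task id -> payload
    workers: dict of worker name -> callable(payload)
    Returns (results, failures), where failures counts errors per worker.
    """
    results = {}
    failures = {}
    for task_id, payload in tasks.items():
        for name, worker in workers.items():
            try:
                results[task_id] = worker(payload)
                break  # task finished, move on to the next one
            except RuntimeError:
                # Record the failure and fall through to the next worker.
                failures[name] = failures.get(name, 0) + 1
        else:
            raise RuntimeError(f"no worker could finish task {task_id}")
    return results, failures


def healthy(payload):
    # Hypothetical worker that does its map task normally.
    return sum(payload)


def dead(payload):
    # Hypothetical worker that never answers.
    raise RuntimeError("worker did not respond")
```

The failure counts are the interesting by-product: a worker that keeps showing up there is the kind of machine the statistics would flag.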

MapReduce lets you run arbitrary jobs across thousands of machines at once, and all the authors of the article seem to be able to see is that it's not as database-oriented as a real database.

steveha

A better Google? (1)

pH7.0 (3799) | more than 6 years ago | (#22101204)

They should implement their own Google using "modern techniques" and make billions!!!

Article misses the point of MapReduce/RDBMS (1)

duffbeer703 (177751) | more than 6 years ago | (#22101512)

Sounds like the rumblings of grumpy DBAs.

The whole point of a relational DBMS is to store, link and maintain the integrity of data in tables based on the relationships among the data.

MapReduce is about processing data... it's not focused on maintaining integrity, and the kinds of datasets suitable for MapReduce probably don't have well defined relationships.

Re:Article misses the point of MapReduce/RDBMS (1)

DragonWriter (970822) | more than 6 years ago | (#22101886)

Sounds like the rumblings of grumpy DBAs.


Or maybe talking down the (free!) competition from a blog launched by a company trying to sell a different alternative to traditional databases (but one which outwardly looks more like a traditional RDBMS) for an overlapping problem domain (that is, column-oriented databases, which address some of the same distribution and parallelization issues that MapReduce addresses, and target some of the same areas [e.g., "big science"] where it has been suggested that MapReduce might be useful.)

Re:Article misses the point of MapReduce/RDBMS (1)

duffbeer703 (177751) | more than 6 years ago | (#22102110)

Thanks for pointing that out -- I hadn't realized that the article was part of a corporate blog!

Re:Article misses the point of MapReduce/RDBMS (1)

IvyKing (732111) | more than 6 years ago | (#22102092)

On the other hand TFA may be more of a caution to those who think that Google has solved the "Database on Clusters" problem. The key point is integrity or lack thereof (as you pointed out): Google doesn't need to maintain 100% integrity, unlike the typical user of a DBMS.

Authors went off topic (1)

pontificator (1160147) | more than 6 years ago | (#22101698)

In the intro they mention that

"a few select universities to teach students how to program such clusters using a software tool called MapReduce [1]. Berkeley has gone so far as to plan on teaching their freshman how to program using the MapReduce framework"

and you would assume that the article would argue why this is a bad trend. They may be right that MapReduce might be getting more attention than it deserves, but their article doesn't discuss this at all. Their editor should have pointed out to them that they went way off topic.

In related news: Screwdrivers suck because... (5, Funny)

DragonWriter (970822) | more than 6 years ago | (#22101812)

1) They don't look like hammers,
2) They don't work like hammers,
3) You can already drive in a screw with a hammer,
4) They aren't good at ripping out nails, and
5) They aren't good at driving nails.

Brought to you by The Hammer Column, a blog written by experts in the hammer industry, and launched by Hammertron, makers of a revolutionary new kind of hammer [vertica.com] .

They have a point. And it matters (1)

Animats (122034) | more than 6 years ago | (#22101862)

I understand what they're getting at. What makes modern SQL-driven databases so useful is that they optimize queries. If you're asking for every entry in A that's also in B, any modern database will check whether it's faster to look up every A in B, every B in A, or do a match where both databases are read through sequentially by the same key. The best choice depends on the database record counts, available indices, and key types and lengths. The database system figures that out; it's not in the SQL query.

So the user says what they want, and the system figures out how to do it. It's "do what I mean" that really works. We don't see enough of that in programming.
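Here is a toy version of that decision in Python, just to show its shape. The cost formulas are rough textbook-style estimates I picked for illustration (a logarithmic cost per index lookup, n log n to sort), not what any real optimizer uses, and the function and plan names are made up.

```python
import math


def choose_join_plan(rows_a, rows_b, index_on_a, index_on_b):
    # Estimate a cost for each available strategy and pick the cheapest.
    plans = {}
    if index_on_b:
        # Probe B's index once for every row of A.
        plans["look up each A in B"] = rows_a * math.log2(rows_b + 1)
    if index_on_a:
        # Probe A's index once for every row of B.
        plans["look up each B in A"] = rows_b * math.log2(rows_a + 1)
    # Always possible: sort both inputs by the key, then read them in step.
    plans["sort-merge"] = (
        rows_a * math.log2(rows_a + 1)
        + rows_b * math.log2(rows_b + 1)
        + rows_a
        + rows_b
    )
    return min(plans, key=plans.get)
```

With 10 rows in A and a million indexed rows in B this picks the lookup plan; with two unindexed tables of equal size it falls back to the merge. The query text is the same in both cases.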

Google search itself works much more like a database than a map/reduce system. Think about what has to happen when you search for multiple keywords. That's a join, and joins on big data sets take forever if you don't have the right data structures and an optimizer.

Maybe it is a step backwards, but I use it a lot.. (1)

10537 (699839) | more than 6 years ago | (#22102010)

...at work. We use it to aggregate millions of dumped events every day, and while it may be missing features that are common in RDBMSes or use brute force rather than special magic, the fact is that we can point it at a cluster of machines and get aggregated stuff out with a lot less computational overhead than if we used anything else. It's not an RDBMS, and we don't use it as one, and therefore don't give a rat's ass if it's any good as one -- it does one thing, and it does it at a good price/scalability/performance/modifiability/ease-of-use multiratio. (And at the risk of being redundant: Photoshop is a crap word-processor, but the problem there isn't Photoshop, it's the fucktard who uses it to write letters.)

This coming from the DB Community? (1)

Qbertino (265505) | more than 6 years ago | (#22102098)

Seriously, the DB Community calling something 'backwards' is a joke. Before going after others the DB people maybe should get up to date with their technology and maybe just get rid of that ancient, crappy POS PL called SQL. They should spend their time migrating to some up-to-date LGPLd solution for connection and glue-code. 'Them' using an early 70s interactive terminal hack as cornerstone of their work and calling others 'backwards' is just plain silly.
When rotating hard disks are replaced by SSDs and start going the way of the dodo, then we'll see who's backwards and outdated. Until then I'd tune low on any wisecracking about something being 'backwards' compared to DB technology.

The only thing wrong with map-reduce... (1)

frank_adrian314159 (469671) | more than 6 years ago | (#22102138)

... is that they misspelled xapping [homeunix.net] .

Stream processing. (1)

mypalmike (454265) | more than 6 years ago | (#22102302)

The whole point of MapReduce is to take an unindexed stream of data and shrink it down based on some criteria where numerous records can be associated (Map) and aggregated (Reduce). It is a process. The *result* of the process is an indexed database, which is often inserted into a relational or time-series database.

It's an apples and oranges comparison, and the author's never eaten an orange.
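For anyone who hasn't seen the pattern, a single-process Python sketch of the Map, group, Reduce pipeline looks roughly like this. It is not Google's API; the names `run_job`, `count_words` and `total` are invented for the example, and the in-memory grouping stands in for the distributed shuffle.

```python
from collections import defaultdict


def map_phase(records, mapper):
    # Map: turn each raw record into zero or more (key, value) pairs.
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    # Group all values that share a key (the "associate" step).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    # Reduce: collapse each key's values into one aggregated result.
    return {key: reducer(key, values) for key, values in groups.items()}


def run_job(records, mapper, reducer):
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)


def count_words(line):
    # Example mapper: emit (word, 1) for every word in a line of text.
    for word in line.split():
        yield word.lower(), 1


def total(key, values):
    # Example reducer: add up the counts for one word.
    return sum(values)
```

The output of `run_job` is a small keyed table, which is exactly the sort of thing you would then load into a database.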

reminiscent...(philosophical digression) (1)

cjonslashdot (904508) | more than 6 years ago | (#22102730)

All your comments bring back to my mind the criticisms of XML-based messaging technologies (SOAP, Web services). "A huge step backwards", "incompatible with existing technologies and approaches" (BNF, parsers, languages), "inefficient" (compared to binary formats), etc. Those complaints were right, but they fell on deaf ears, just as these will.... IT is driven by fads and the availability of high-productivity gizmos. Ironically, productivity often suffers in the long run, as people then have to deal with the mess that gets created using approaches that are fundamentally wrong.

Mapreduce is not a database. (0)

Anonymous Coward | more than 6 years ago | (#22102824)

So of course rating it like one will fail.

I see map reduce as a really great way to take 10,000,000,000,000 bytes of raw data, map it to a set of computers and reduce the data to a set of tables that could then be placed in a regular database and queried.

Or is that not how google is using it?

Index Every Column? (1)

Tablizer (95088) | more than 6 years ago | (#22103242)

a sub-optimal implementation, in that it uses brute force instead of indexing;

For Query-by-Example-like tools, often you cannot predict which columns need indexing: they ALL do. At some point it just seems easier to split the data sets up onto dozens or hundreds of hard drives and just do a sequential search on each one in parallel. I cannot say whether it is clearly faster than indexing every column, but it is certainly simpler from a technical standpoint. And, it would possibly require less disk space because there would only be one copy of each cell, unlike indexing, which replicates the contents of the indexed column into the index.

What seems conceptually simpler: maintaining 300 indexes, or simply sequentially scanning tables split across many hard drives? (I've thought about this because I've been kicking around how to build a truly dynamic relational database with auto-columns proof-of-concept because the current "Oracle clones" are too stiff for many kinds of nimble apps.)
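A small Python sketch of the scan-everything approach, assuming the table has already been split into partitions (plain lists of dicts here, standing in for separate drives). Threads are used only to show the structure; the helper names are made up, and this says nothing about whether it beats indexes in practice.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_partition(rows, predicate):
    # Brute force: test every row in this partition, no index needed.
    return [row for row in rows if predicate(row)]


def parallel_scan(partitions, predicate):
    # One worker per partition; results come back in partition order.
    with ThreadPoolExecutor(max_workers=len(partitions) or 1) as pool:
        chunks = pool.map(lambda part: scan_partition(part, predicate), partitions)
        return [row for chunk in chunks for row in chunk]
```

The predicate can mention any combination of columns, which is the point: no column has to be chosen for indexing ahead of time.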
         