
The Big Promise of 'Big Data'

CmdrTaco posted about 4 years ago | from the hope-they-do-direct-withdrawl dept.


snydeq writes "InfoWorld's Frank Ohlhorst discusses how virtualization, commodity hardware, and 'Big Data' tools like Hadoop are enabling IT organizations to mine vast volumes of corporate and external data — a trend fueled increasingly by companies' desire to finally unlock critical insights from thus far largely untapped data stores. 'As costs fall and companies think of new ways to correlate data, Big Data analytics will become more commonplace, perhaps providing the growth mechanism for a small company to become a large one. Consider that Google, Yahoo, and Facebook were all once small companies that leveraged their data and understanding of the relationships in that data to grow significantly. It's no accident that many of the underpinnings of Big Data came from the methods these very businesses developed. But today, these methods are widely available through Hadoop and other tools for enterprises such as yours.'"


78 comments


Hadoop de poop (-1, Troll)

Anonymous Coward | about 4 years ago | (#33578202)

There is brown stuff coming out of my poophole. Hadoop de poop?

Big Data? (0, Funny)

Anonymous Coward | about 4 years ago | (#33578222)

I don't think Brent Spiner is any fatter than when he played that android on Star Trek.

Re:Big Data? (2, Funny)

abigor (540274) | about 4 years ago | (#33578282)

It's actually his hip-hop name.

Re:Big Data? (-1, Offtopic)

Anonymous Coward | about 4 years ago | (#33578386)

It's actually his hip-hop name.

/. mods are gay and lame and since they are gay and lame AND live in their parents basements they also have no sense of humor.

Re:Big Data? (0)

Anonymous Coward | about 4 years ago | (#33578408)

My parents live in my basement, your sweeping argument is invalidated.

Re:Big Data? (1)

ooshna (1654125) | about 4 years ago | (#33579458)

Damn dude, couldn't you use your dad's disability check to put him and your mom in a nice retirement home instead of using it to build your own server farm in the garage?

Re:Big Data? (1)

nacturation (646836) | about 4 years ago | (#33578798)

It's actually his hip-hop name.

I thought "Big Data" was his Italian Gangsta name from when he played Furio on The Sopranos: http://www.imdb.com/media/rm597203712/nm0144843 [imdb.com]

Re:Big Data? (1)

Gilmoure (18428) | about 4 years ago | (#33580276)

Furry gangsters?

Japan does rule the world.

fp (-1, Offtopic)

Anonymous Coward | about 4 years ago | (#33578266)

1st

HDF (-1, Offtopic)

Anonymous Coward | about 4 years ago | (#33578302)

Look it up

LiveSQL (3, Interesting)

ka9dgx (72702) | about 4 years ago | (#33578432)

I think that the real innovation will be a variation of SQL that allows for the persistence of queries, such that they continue to yield new results as new data matching them arrives in the database. If you have a database of a trillion web pages, and you continue to put more in, it doesn't make sense to re-scan all of the existing records each time you need the results of the query again. It should be possible, and far more computationally efficient, to have a LiveSQL query feed a continuous stream of results instead of running in batch mode.

I've registered the domain name livesql.org as a first step to helping to organize this idea and perhaps set up a standard.

Re:LiveSQL (1)

PPH (736903) | about 4 years ago | (#33578658)

This sounds like some sort of agent or RSS feed. You tell it to message you every time some event matching your query occurs. Not a big deal for a single database or enterprise-wide data store.

This could get a bit involved for an Internet-wide system with the scope of Google. How many Slashdotters would submit queries like "Tell me whenever new p0rn appears"?

Re:LiveSQL (1)

Anonymous Coward | about 4 years ago | (#33578726)

That query is peanuts. How about "tell me whenever new p0rn about pink haired furry hermaphrodite with big tits being raped by purple tentacles appears".

Re:LiveSQL (2, Funny)

Anonymous Coward | about 4 years ago | (#33578934)

You are posting AC, but I bet your IP address resolves to Japan

Re:LiveSQL (1)

davester666 (731373) | about 4 years ago | (#33583188)

or Texas.

Re:LiveSQL (2, Insightful)

starsky51 (959750) | about 4 years ago | (#33578710)

Couldn't this be done using regular SQL and an indexed timestamp column?
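For what it's worth, a minimal sketch of that timestamp-polling idea (an editor's illustration, not something from the thread; the crawl.db file and the pages table/columns are invented for the example):

    import sqlite3
    import time

    # Assumed schema for the sketch: pages(id, url, fetched_at), with an index
    # on fetched_at so each poll only touches new rows.
    conn = sqlite3.connect("crawl.db")
    conn.execute("CREATE TABLE IF NOT EXISTS pages (id INTEGER PRIMARY KEY, url TEXT, fetched_at REAL)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_pages_fetched_at ON pages(fetched_at)")

    last_seen = 0.0
    while True:
        # Only rows newer than the last timestamp handled -- the "persistent query".
        rows = conn.execute(
            "SELECT id, url, fetched_at FROM pages WHERE fetched_at > ? ORDER BY fetched_at",
            (last_seen,),
        ).fetchall()
        for row_id, url, fetched_at in rows:
            print("new result:", row_id, url)
            last_seen = max(last_seen, fetched_at)
        time.sleep(5)  # crude polling; a push-based version would use triggers or a queue

Polling is the crude form of what the grandparent calls a LiveSQL stream; replacing the sleep with a trigger or notification would make it push-based.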

Re:LiveSQL (2, Funny)

oldspewey (1303305) | about 4 years ago | (#33578960)

Too late, I already registered regularsqlwithanindexedtimestampcolumn.com

Re:LiveSQL (0)

Anonymous Coward | about 4 years ago | (#33580380)

You, sir, made me laugh deeply and loudly on a very very very bad day.

Thank you so much!

Re:LiveSQL (1)

tigre (178245) | about 4 years ago | (#33578790)

Interesting idea. Basically would need to establish event triggers on relevant tables. Should also be able to invalidate results that were previously found, and provide updates as well. Would require a lot of memory to persist enough information about the previous results that you don't end up with duplicates. I'll try and check in when you actually have a site.

Re:LiveSQL (1)

Drumpig (13514) | about 4 years ago | (#33578868)

whois livesql.org
NOT FOUND

Liar.

Re:LiveSQL (1)

ka9dgx (72702) | about 4 years ago | (#33579932)

one has to wait a bit for these things, as DNS updates are still batch oriented. 8)

Re:LiveSQL (4, Informative)

Laxitive (10360) | about 4 years ago | (#33579266)

There are some serious technical challenges to overcome when you think about actually implementing something like this.

Take something like "select stddev(column) from table" - there's no way to get an incremental update on that expression given the original data state and a point mutation to one of the entries for the column. Any change cascades globally, and is hard to recompute on the fly without scanning all the values again.

This issue is also present in queries using ordered results (as changes to a single value participating in the ordering would affect the global ordering of results for that query).

The issue that "Big Data" presents is really the need to run -global- data analysis on extremely large datasets, utilizing data parallelism to extract performance from a cluster of machines.

What you're suggesting (basically a functional reactive framework for querying volatile persistent data), would still involve a number of limitations over the SQL model: basically disallowing the usage of any truly global algorithm across large datasets. Tools like Hadoop get around these limitations by taking the focus away from the data model (which is what SQL excels in dealing with), and putting it on providing an expressive framework for describing distributable computations (which SQL is not so great at dealing with).

-Laxitive

Re:LiveSQL (2, Informative)

Rakishi (759894) | about 4 years ago | (#33579618)

Take something like "select stddev(column) from table" - there's no way to get an incremental update on that expression given the original data state and a point mutation to one of the entries for the column. Any change cascades globally, and is hard to recompute on the fly without scanning all the values again.

Stddev is trivial to recompute on the fly, and I'd be surprised if any decent SQL engine didn't compute it one row at a time. Store mean(column) and mean(column^2). SD = sqrt(mean(c^2) - mean(c)^2), not considering the unbiasing stuff. Add each new row's deltas to both, do some simple math, and you're done.
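A quick sketch of that running computation (added for illustration; the sample values are arbitrary):

    import math

    class RunningStddev:
        """O(1) update per new row: keep n, sum(c), and sum(c^2)."""

        def __init__(self):
            self.n = 0
            self.sum = 0.0
            self.sum_sq = 0.0

        def add(self, value):
            self.n += 1
            self.sum += value
            self.sum_sq += value * value

        def stddev(self):
            # Population form: sqrt(mean(c^2) - mean(c)^2), ignoring the
            # unbiasing correction, as in the comment above.
            mean = self.sum / self.n
            return math.sqrt(max(self.sum_sq / self.n - mean * mean, 0.0))

    s = RunningStddev()
    for v in [2, 4, 4, 4, 5, 5, 7, 9]:
        s.add(v)
    print(s.stddev())  # 2.0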

Now medians and quantiles are a bitch.

Frankly, complex data mining of large data is a pain in the ass on Hadoop as much as anywhere else. You can't do anything too global with Hadoop, because then you'd need to send all your data to one box anyway. You need specialized, complex algorithms, since you can only keep a fraction of your data in memory at a time. Simple regression? Have fun.

That said, if you're already using Hadoop, it's quite possible you're using some sort of online learning algorithm anyway for just that reason, so converting it to real-time updating would be easy.

Re:LiveSQL (1)

u-235-sentinel (594077) | about 4 years ago | (#33580470)

Frankly, complex data mining of large data is a pain in the ass on Hadoop as much as anywhere else. You can't do anything too global with Hadoop, because then you'd need to send all your data to one box anyway. You need specialized, complex algorithms, since you can only keep a fraction of your data in memory at a time. Simple regression? Have fun.

That said, if you're already using Hadoop, it's quite possible you're using some sort of online learning algorithm anyway for just that reason, so converting it to real-time updating would be easy.

Maybe. But I'm looking at doing it by pushing data to two boxes and two Hadoop clusters (mainly for DR purposes). As for complex algorithms, Hive solves many of those problems by allowing people with SQL backgrounds to mine their information without learning a new language.

It's new and it's not perfect, but then again, it is an improvement over what's already in the market. Over the next couple of years I plan on becoming a Hadoop expert (I'm even looking at learning Java now).

Gotta start somewhere, right?

Re:LiveSQL (1)

Laxitive (10360) | about 4 years ago | (#33580618)

I should have thought things out a bit better with the stddev example - and realized that it does indeed have a reasonable closed form. Good catch.

Complex data mining is hard everywhere, that's true. The problem is that even straightforward data mining is hard once datasets reach into the hundreds of millions, billions, or trillions of records (implying absolute dataset sizes of terabytes or more). For Google it's web pages; for biology labs it's sequences.

The big killer is the cost of transferring data, which is what traditional data systems are built around: a remote host has some software set up, you send it some data, and it processes the data and returns the results to you. The distinction with Hadoop is that you keep the data on distributed hosts and send the code to it (which is typically a lot smaller).

The point stands that incremental update of queries on mutation is not a generally solvable problem: it'll still require the addition of new constructs and the limitation of existing constructs in SQL (e.g. ordering). Hadoop approaches the issue from the other end of the spectrum: focusing on a framework that models distributable algorithms directly using a small set of primitive operators (specifically, "map" and "reduce").

-Laxitive
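For readers who haven't met those primitives, here is a bare-bones, single-process imitation of the map/reduce model (an editor's sketch, not Hadoop's actual API; a real job distributes the shuffle and reduce across many machines):

    from collections import defaultdict

    def map_fn(line):
        # Mapper: one input record in, zero or more (key, value) pairs out.
        for word in line.split():
            yield word.lower(), 1

    def reduce_fn(key, values):
        # Reducer: a key plus every value emitted for it, folded into one result.
        return key, sum(values)

    def run_job(records):
        # What the framework does: shuffle mapper output by key, then reduce each key.
        # Hadoop performs the same steps across a cluster, shipping the code to
        # wherever the data blocks already live.
        shuffled = defaultdict(list)
        for record in records:
            for key, value in map_fn(record):
                shuffled[key].append(value)
        return dict(reduce_fn(k, vs) for k, vs in shuffled.items())

    print(run_job(["big data big promise", "big clusters of small machines"]))
    # {'big': 3, 'data': 1, 'promise': 1, 'clusters': 1, 'of': 1, 'small': 1, 'machines': 1}

Hadoop Streaming lets you supply the two functions as separate scripts that read stdin and write stdout, which is how many non-Java shops use it.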

Re:LiveSQL (1)

ka9dgx (72702) | about 4 years ago | (#33580030)

It turns out there is already a protocol [waveprotocol.org] in place for doing a lot of the grunt work. It allows for the federation of changes to a given object across organizations. While I wouldn't want to try to build an ACID database on it in my free time, I suppose it could eventually be done with a larger team of programmers.

A query can be distributed across machines, which is what the map-reduce meme is all about. The next stage is to eliminate redundant calculations across time. LiveSQL will do that.

Re:LiveSQL (0)

Anonymous Coward | about 4 years ago | (#33579426)

I'm not sure but what you describe kind of sounds like Microsoft SQL Server 2008 R2 – StreamInsight. You might want to look at it as a reference.

Re:LiveSQL (1)

hackerjoe (159094) | about 4 years ago | (#33579584)

Aren't you basically talking about a materialized view [oracle.com] ? (This FAQ item [orafaq.com] has a simpler explanation than you'll get digging through the documentation above)

I haven't worked with materialized views, but if you want notification when the data changes, usually you can set up a trigger...

Re:LiveSQL (1)

kumanopuusan (698669) | about 4 years ago | (#33580188)

I have worked with materialized views and, yes, you are entirely correct. It takes one "CREATE MATERIALIZED VIEW" statement to have exactly what GP is describing. Unfortunately, in my experience, Oracle often requires so much tuning that a roll-your-own solution can be favorable (though less uniform and thus not suit-friendly).

Re:LiveSQL (4, Informative)

BitZtream (692029) | about 4 years ago | (#33579754)

I think that the real innovation will be a variation of SQL that allows for the persistence of queries

That's been done for years: materialized views, using triggers on INSERT/UPDATE/DELETE to update the views on the fly.
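A toy version of that trigger technique, using SQLite via Python so it runs anywhere (an editor's sketch; the orders/order_totals schema is invented for the example, and only the INSERT and DELETE triggers are shown):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL);

        -- The "materialized" summary, kept current by triggers instead of re-scanning orders.
        CREATE TABLE order_totals (customer TEXT PRIMARY KEY, total REAL NOT NULL DEFAULT 0);

        CREATE TRIGGER orders_ins AFTER INSERT ON orders BEGIN
            INSERT OR IGNORE INTO order_totals (customer, total) VALUES (NEW.customer, 0);
            UPDATE order_totals SET total = total + NEW.amount WHERE customer = NEW.customer;
        END;

        CREATE TRIGGER orders_del AFTER DELETE ON orders BEGIN
            UPDATE order_totals SET total = total - OLD.amount WHERE customer = OLD.customer;
        END;
    """)

    conn.executemany("INSERT INTO orders (customer, amount) VALUES (?, ?)",
                     [("acme", 10.0), ("acme", 5.0), ("globex", 2.5)])
    print(conn.execute("SELECT customer, total FROM order_totals ORDER BY customer").fetchall())
    # [('acme', 15.0), ('globex', 2.5)]

The same pattern extends to UPDATE with a third trigger; real databases wrap all of this up in CREATE MATERIALIZED VIEW, as the posts above note.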

Streaming results as needed is done with cursors.

I know you think you're probably talking about something that 'materialized views and cursors don't do'. Fortunately, you're wrong and just don't understand how to use them.

It really bothers me how people who talk about problems with SQL really have no fucking clue what they are talking about or how to work with the data in the first place.

Re:LiveSQL (0)

Anonymous Coward | about 4 years ago | (#33579846)

Apologies, but this exists already [wikipedia.org] .

Re:LiveSQL (1)

trb (8509) | about 4 years ago | (#33579978)

Re:LiveSQL (1)

St.Creed (853824) | about 4 years ago | (#33579994)

So you describe a set of conditions, let's be blatant and call them "rules" or even "filters", and when they match something you act upon them?

Stunning :)

Re:LiveSQL (1)

Prune (557140) | about 4 years ago | (#33580360)

Can't this be recast as a form of the eventual consistency paradigm?

Big Data Need (0)

Anonymous Coward | about 4 years ago | (#33578486)

Big machines [ibm.com] , not toys [dell.com] .

Yours In Moscow,
Kilgore T.

Re:Big Data Need (1)

Sarten-X (1102295) | about 4 years ago | (#33578722)

Dells are cheaper than IBMs for the amount of performance they provide. Big Data is best handled in a large cluster, which provides the reliability and parallelism to process a highly scalable amount of data in a very short time. Beyond a certain point, high-end machines just raise the cost of failure, while low-end machines provide only marginally lower performance.

Re:Big Data Need (1)

lgw (121541) | about 4 years ago | (#33579282)

A mainframe is a large cluster these days, together with the most robust physical hardware. There's a reason that some mainframes have had an uptime of more than a decade, with almost every part having been upgraded along the way.

Hadoop and similar products are finally making mainframe-style data processing cheaper on commodity hardware - or, rather, making such solutions available to the general public. Google's been doing it forever. Still, if your mainframe has an uptime longer than Google has been around, there's not much incentive to change.

Re:Big Data Need (1)

Sarten-X (1102295) | about 4 years ago | (#33579388)

Mainframes don't offer the data capacity that a large cluster does. The topic here is handling big datasets (measured in the billions of records), and single machines just can't do that as efficiently as a cluster, no matter how beefy they are. The future is parallel.

Re:Big Data Need (2, Informative)

Primitive Pete (1703346) | about 4 years ago | (#33580172)

Mainframes and large multiprocessor machines have been handling multi-billion row data sets on RDBMS systems for a very long time. Data warehouses are commonly into the billions of rows. What commodity clusters provide is not efficiency--they often make poorer use of available cycles and repeat work to achieve goals.

What large commodity clusters provide is a price per cycle low enough that the owner doesn't have to worry about efficiency. For example, Google's Dean and Ghemawat ("MapReduce: Simplified Data Processing on Large Clusters") managed to successfully sort 10^10 100-byte records over 891 seconds, or about 6MB sorted per processor per second. Very fast overall, but hardly efficient use of modern hardware. There's an important place for the new big dataset system, but the argument is cost, not efficiency.

Re:Big Data Need (1)

Sarten-X (1102295) | about 4 years ago | (#33580528)

Money and hardware are interchangeable. Price per cycle is still a measure of efficiency.

Re:Big Data Need (1)

lgw (121541) | about 4 years ago | (#33580224)

A mainframe stopped being a "single machine" in any sense but the physical chassis a long time ago. They are clusters, from a performance perspective. The z196 will put 80 cores (each quite beefy compared to an Intel core) in the main chassis, but allow quite a few more blades to be added for DB query processing and similar tasks (112 blades per node, not sure how many nodes are supported, but I suspect the max config will be thousands of cores).

A modern mainframe is really just a large cluster with centralized management and a focus on hot-swappable components. As Google-style "loose clusters" mature, so that you just don't care if an individual machine fails, the premium for mainframe hot-swap reliability (where you don't have to care if an individual component fails) becomes questionable.

That's what's interesting about Hadoop - it's not that it scales in ways never seen before, but that it scales at prices never seen before (for a non-proprietary solution)

Re:Big Data Need (2, Insightful)

Sarten-X (1102295) | about 4 years ago | (#33580768)

Assuming the maximum configuration is thousands of cores, how does it compare in other aspects to Facebook's 23,000 cores and 36 petabytes of data [developer.com] , with unlimited scalability to come?

For all intents and purposes, mainframes are still mainframes. They're parallel, and they grow, but they still have those limits that clusters just don't have.

(I consider price to be a limit as well)

Go w/ MPP like Vertica (1)

geoffrobinson (109879) | about 4 years ago | (#33582968)

Vertica can handle lots of data in a very fast manner (at least for data warehousing). They use an MPP architecture: commodity hardware in a cluster running Linux.

No need for big machines. You can use lots of little ones.

Re:Go w/ MPP like Vertica (1)

mark0978 (1052438) | about 4 years ago | (#33635226)

How exactly did war end communism?

Or, for that matter slavery? It may have ended it in the US, but at a staggering cost. But slavery still exists in the world today.

The NAZI's could have been prevented before war, had we not had our head stuck in the sand, our fingers in our ears, all the while saying nah nah nah nah nah

misinformation/deception requires duplication.. (-1, Offtopic)

Anonymous Coward | about 4 years ago | (#33578588)

repetition, mass coverage, & even more 'work' when history is 're-written' (as is now the case) etc... the truth however, only comes in one version & does not require an attached phony continual song&dance routine, which saves both time & space.

as for some of the undressed facts (not that it's 'stuff that matters', unless you have kids/a conscience/heart/stuff like that);

http://search.yahoo.com/search?p=bush+blair+rumsfeld+cheney+wolfowitz+obama&fr=ush-news&ygmasrchbtn=Web+Search

& one of our personal 'favorites';
http://search.yahoo.com/search?p=manufactured+weather&fr=ush-news&ygmasrchbtn=Web+Search

What's the promise? (0)

Anonymous Coward | about 4 years ago | (#33578702)

It can't promise to make your site the next Facebook. That only happens when you have sufficient tech, UI design, and luck. Once the network effect plays into your favor you just snowball along. After that you can even have slightly inferior tech and UI design, and you will continue to win. The inconvenience must not outweigh the switching costs. You figure out a way to make your product sticky, you increase the switching costs...

If "Big Data" is a prerequisite for all of that, then use it. Just don't get the idea that it's a silver bullet.

Re:What's the promise? (4, Insightful)

Sarten-X (1102295) | about 4 years ago | (#33578982)

It isn't about Facebook so much as it's a shift in what problems are practically solvable.

First, realize that traditional approaches like SQL are limited mostly by the single box (or the few mirrors) the platform runs on. Querying a large (a billion rows) table can take minutes on a very fast machine, hours if there's significant disk access needed, and months if the query's complex enough. Clusters can process those same billion records far faster, bringing that time down from months to hours, or even seconds for a simple scan. Advances in cluster computing over the last few years have made this parallel processing much easier.

The promise is that problems that were previously too big to even think about are now easy. If your solved problem is something people want, like showing what their friends are up to, your product will do well.

Re:What's the promise? (1)

BitZtream (692029) | about 4 years ago | (#33579964)

You do realize that 'a cluster' is really just 'a bunch of mirrors' that you're distributing the query across ... right? So yes, two is more than one, and two can be faster than one. That's true regardless of what you call it. Two mirrored machines ARE a cluster. Hell, the summary makes it sound like they want to run virtual machines as part of the cluster ... probably running on the same VM server ...

Clusters REALLY AREN'T DOING ANYTHING WE HAVEN'T BEEN DOING FOR 50 YEARS.

STOP PRETENDING THIS IS NEW.

Just because you just started to become aware of the exact same things that have been done since the '60s and '70s doesn't mean it's actually NEW.

Re:What's the promise? (1)

Sarten-X (1102295) | about 4 years ago | (#33580194)

Think about the term "mirror" for a while. In most mirrored setups, all data goes everywhere. Yes, we've had mirrors for quite a while, and they've provided linear speedup while increasing the cost of upgrades. A cluster doesn't just copy the data everywhere. Hadoop, for example, replicates each block only three times by default. On a large cluster, it's very likely that two randomly-selected nodes won't share any common data. The idea of having parallel data access isn't new, but the technology to do it efficiently while maintaining a low cost is. That's why I included the word "practically" in my first sentence.

It's a new and more efficient use of an old concept. Bicycles have been around for about two hundred years, and yet the Tour de France still makes the news.

Re:What's the promise? (1)

turbidostato (878842) | about 4 years ago | (#33582218)

"You do realize that 'a cluster' is really just 'a bunch of mirrors' that you're distributing the query across"

You do realize that the kind of cluster we are talking about here is not "just 'a bunch of mirrors'" by a big margin: you don't copy the whole data set to every node, and you don't "copy" the computing load to every node.

Distributing is quite different from mirroring.

Re:What's the promise? (1)

Unequivocal (155957) | about 4 years ago | (#33588894)

I think some approaches call this "sharding?"

Re:What's the promise? (1)

Sarten-X (1102295) | about 4 years ago | (#33621692)

I'm no DBA, but my limited understanding is that sharding requires advance knowledge of the data being stored and the application thereof. My understanding also is that there are some designs that can't be effectively sharded without a huge cost penalty.

That's great if your programmers are sitting right next to the DBAs, but it's pretty bad for systems where others have to access the database as well. A project I'm working on now involves a painfully slow SQL query, because we need to query by date and the table is partitioned by an ID number. Politics and other issues mean we won't be getting an index just to help our very limited use case.

Hadoop (and more directly equal to SQL databases, HBase) distributes data with no prior knowledge. The distribution is done automatically, with no concern for what the data actually is or how it might be used. It keeps generality up and costs down, going back to my original comment about practical solutions.

Re:What's the promise? (1)

lewiscr (3314) | about 4 years ago | (#33580146)

It isn't about what's practically solvable, it's about what's cheaply solvable. These problems have been practical for anybody with money for a while. Hadoop lowers the barrier to entry.

If you've got the cash, IBM will set you up with a monster SQL cluster that will take that massive complex SQL query (the one that takes a month to run on your desktop), and return results in 2 seconds. If you have to ask how much it costs, you can't afford it.

This setup is still not cheap, it's just much cheaper than it used to be. You still have to build and maintain a large cluster of machines, but you can buy commodity servers instead of IBM mainframes. Now anybody with a couple hundred thousand dollars can play, instead of only Fortune 500 companies.

Re:What's the promise? (1)

turbidostato (878842) | about 4 years ago | (#33582288)

"It isn't about what's practically solvable, it's about what's cheaply solvable."

Aren't they the same? Isn't cost a practical constraint?

"These problems have been practical for anybody with money for a while."

No matter how rich you are, if you need six dollars to get five, it's not practical.

"Hadoop lowers the barrier to entry."

By means of lowering production costs. And that means that now you need four dollars instead of six for your five-dollar opportunity, which is exactly what turns an impractical business into a practical one.

"If you've got the cash, IBM will set you up with a monster SQL cluster that will take that massive complex SQL query "(the one that takes a month to run on your desktop), and return results in 2 seconds. If you have to ask how much it costs, you can't afford it."

Your last sentence is *only* valid for luxury goods. For anything else, everybody, no matter how rich they are, asks about the costs, and rightly so. Do you think rich people get rich by going into negative-balance businesses?

Re:What's the promise? (1)

lewiscr (3314) | about 4 years ago | (#33583036)

"Isn't cost a practical constraint?"

Yes, but the degree varies. In a low margin business, I might be willing to invest a couple million dollars, if it will lower my costs by 1%. Who's in a better position to make that statement, the Fortune 500 company, or the guy in the garage?

"No matter how rich you are, if you need six dollars to get five, it's not practical."

Right. But if you need $10M to get $20M, most people can't play. Even if you need $10M to get $100M, you have to convince somebody to give you $10M. I can't do that. But, if I can come up with a business that needs $50k to get $500k, I can talk somebody into that.

Re:What's the promise? (1)

lewiscr (3314) | about 4 years ago | (#33583060)

BTW, thanks for calling me out.

I was replying to the "First, realize that traditional approaches like SQL are limited mostly by the single box" statement in the GP. But I did gloss over a few details...

Re:What's the promise? (0)

Anonymous Coward | about 4 years ago | (#33585964)

First, realize that traditional approaches like SQL are limited mostly by the single box (or the few mirrors) the platform runs on.

So you can't run a SQL query on a highly distributed column store? Oh, what's this [blogspot.com] ?

SQL is an API. SQL is not a storage engine. Please learn the difference. Thank you.

Am I the only one who finds Hadoop unusable? (0)

Anonymous Coward | about 4 years ago | (#33578706)

We have a ~100-node Hadoop cluster and it barely works. The software is utterly awful, and every time we use it, the cluster falls apart in a few hours and has to be restarted. To read data, you have to copy it into HDFS, which means duplicating all your existing data. The design is awful, the implementation is awful, and it doesn't work.

Am I missing something? How is everyone else using it and making it work?

Re:Am I the only one who finds Hadoop unusable? (0)

Anonymous Coward | about 4 years ago | (#33578816)

I have a beowolf cluster of hadoop and it works grate

Re:Am I the only one who finds Hadoop unusable? (1)

Sarten-X (1102295) | about 4 years ago | (#33578820)

First, I'm not really sure what you mean by "copy it into HDFS", since that's usually where data is stored in the first place. Copying in lots of data without giving the cluster time to stabilize can cause it to go into safe mode, where it won't make changes until everything's properly distributed. It's version 0.20. Don't expect perfection just yet.

There are also experts [cloudera.com] out there who will be quite happy to help get things running better. My company has been using Hadoop quite successfully with their help.

Re:Am I the only one who finds Hadoop unusable? (1)

Laxitive (10360) | about 4 years ago | (#33579336)

In situations where you are using Hadoop, your "primary" data store should BE the HDFS store you are using to analyze it. That's a big part of the actual efficiency proposition of Hadoop.

The big trick with the "big data" approaches is to recognize that you keep _everything_ distributed, _all the time_. Your input dataset is not "copied into the system" for some particular analysis task, it _exists in the system_ from the time you acquire it, and the analysis results from it are kept distributed. It's only at specific points in time (exporting data to send to someone external, importing data into your infrastructure) that you should be messing around with copying stuff in and out of HDFS.

-Laxitive

HAADOOOPPP!!!! (1)

Temujin_12 (832986) | about 4 years ago | (#33578814)

...and 'Big Data' tools like Hadoop are enabling IT organizations...

...these methods are widely available through Hadoop and other tools...

Oh... also... did I mention HADOOP!!??

Nice press release (1)

InlawBiker (1124825) | about 4 years ago | (#33578842)

Looks like somebody got their PR spin piece relayed as a news story again. Bravo!

End of Science (2, Informative)

mysterons (1472839) | about 4 years ago | (#33579024)

Related to using Big Data in Business is Big Data in Science. Wired ran a nice series of articles looking at this (http://www.wired.com/wired/issue/16-07). This raises all sorts of problems (for example, how can results be reproduced? What if the model of the data is as complex as the data? Are all results obtained with Small Data simply artefacts of sparse counts?).

Re:End of Science (1)

MozeeToby (1163751) | about 4 years ago | (#33579302)

How can results be reproduced?

I don't follow how near infinite storage affects the ability of researchers to re-perform an experiment to gather data a second time.

What if the model of the data is as complex as the data?

Then it is, by definition, not a model. A model is a system that describes another system more complex than itself; a model that is as complex as the system it is trying to describe is just a different way of looking at that system. It can still be useful, but it doesn't simplify the problem the way a real model does.

Are all results obtained with Small Data simply artefacts of sparse counts?

Science has had a way to handle that question for centuries; it's called statistics: confidence intervals, standard deviations, etc. Any experimental result could be the result of a freak occurrence, which is why there are official and unofficial confidence-interval cutoffs for publishing your results. Even if you had a sample size of one trillion, your confidence in the results could still be quite low.

Re:End of Science (1)

mysterons (1472839) | about 4 years ago | (#33579728)

Experiments being reproduced can be hard if no one else has the data (this can happen, for example, if you are Google and publish results using large fractions of the Web as data) or even if something as trivial as moving it from one site to another requires a lot of effort. This is not really a question of storage costs; it is a question of having the data in the first place and the mechanics of moving it around. Models are used in science as idealisations, but if you really, really want to model the long tail of effects, then your model becomes the data. And this relates to summary statistics: all they do is capture aspects of the data (it is, after all, a summary). If you want the whole truth, then you can't summarise. Fernando Pereira and Peter Norvig have a nice paper on this: http://googleresearch.blogspot.com/2009/03/unreasonable-effectiveness-of-data.html [blogspot.com] [The Unreasonable Effectiveness of Data]

Why it works for Google/Yahoo/Facebook (5, Interesting)

BitZtream (692029) | about 4 years ago | (#33579894)

Consider that Google, Yahoo, and Facebook were all once small companies that leveraged their data and understanding of the relationships in that data to grow significantly.

Because their business is based entirely on how that data correlates.

99.999999999% of the rest of the world do other things as their primary business model. Small businesses aren't going to do this because it requires a staff that KNOWS how to work with this software and get the data out.

Walmart might care, but they aren't a small business.

The local auto mechanic, or plumber, or even small companies like lawn services or maid services simply aren't big enough to justify having a staff of nerds to make the data useful to them, and they really don't have enough data to matter. It simply is too expensive on the small scale.

Companies that can REALLY benefit from the ability to comb vast quantities of data have been doing it for well over a hundred years. Insurance companies are a prime example. You know what? They aren't small in general, so they have the staff to do the data correlation and find out useful information because it works on that scale.

Anyone who cares about churning through massive amounts of data already has ways to do it. Computing will make it faster, but it's not going to change the business model.

I'm kind of puzzled why virtualization has anything to do with this, unless someone is implying that a smart thing to do is to set up a VM server and then run a bunch of VMs on it to get a 'cluster' to run distributed apps on ... if that's the point being made, then I think someone needs to turn in their life card (they clearly never had a geek card).

So now that I've written all that, I went and read the article.

Now I realize that the article is written by someone who has absolutely no idea what they are talking about and simply read a Wikipedia page or two and threw in a bunch of names and buzzwords.

Hadoop doesn't help the IT department do anything with the data at all.

It's the teams of analysts and developers that write the code to make Hadoop (which is only mentioned here because of its OSS nature) and a bunch of other technologies and code all work together and produce useful output.

This article is basically written like the invention of the hammer made it so everyone would want to build their own homes because they could. That's a stupid assumption and statement.

Slashdot should be fucking ashamed that this is posted anywhere, let alone front page.

Re:Why it works for Google/Yahoo/Facebook (1)

Sarten-X (1102295) | about 4 years ago | (#33580408)

You're missing the point.

Sure, insurance companies have kept claims data around for many years. They make some pretty good observations about obvious correlations. People who speed too much tend to hit more things. People with chronic diseases tend to die.

What about the data they couldn't handle, though? What about the effects of someone's purchases? Did they buy quality brake pads? What about the circuit breakers installed in their house the last time it was remodeled? What contractor did they hire? How many of that contractor's other remodels had later electrical problems?

Such questions used to be far outside the scope of what was feasible to handle. That's Big Data, and it comes from all sorts of sources, like the mechanic who installed the brake pads. An insurance company can purchase data from the mechanic, and get a better idea of their customers' overall risk. That (ideally) leads to more accurate pricing. A little data here, a little there, and a lot of connections.

That's a lot of data, though. Just think how many car repairs are being done right now. Hadoop helps make the data manageable by keeping all the maintenance behind a nice layer of abstraction. Virtualization helps make Hadoop manageable, since small businesses can purchase server time based on use, rather than invest in a huge cluster right off the bat. Having a huge dataset is now easy, and that qualifies as "stuff that matters".

Re:Why it works for Google/Yahoo/Facebook (1)

Prune (557140) | about 4 years ago | (#33580422)

What's missing is a killer app for most businesses, and it's not the data gathering and management side that's lacking, but the analytics side. I think that advanced analytics is not nearly as user-friendly and accessible as it could be, and hopefully will be in the future. Visualization/analytics tools like Tableau are a good start, but we need more and better (as in smarter, in AI/machine-learning terms) automation. Eventually I see analytics being useful not only to businesses but even to individuals, as a way of making the best of the flood of data and information overload we're bombarded with.

Re:Why it works for Google/Yahoo/Facebook (1)

akirapill (1137883) | about 4 years ago | (#33582064)

Actually, while I was also irked by the buzzword-compliance of TFA, I think the point about linking virtualization and the cloud with giving small businesses access to data tools is actually quite valid. Storage and processing are commodities now thanks to these technologies, which significantly reduces the staff and overhead required for a startup or small company to utilize large data sets. I work for a small web design and hosting company and we certainly wouldn't be considering scaling up our data management solutions for our clients if we had to carry the whole infrastructure on our backs. And just because you haven't thought of a novel way to leverage a lot of data doesn't mean that another company won't (and they will).

You really think the housing market (or the 'business model' of building homes) didn't change with the invention of the hammer? I suppose the business model of IT didn't change when people stopped coding in assembly - after all, coding in C is the same thing only faster, and what's all the hype around high-level languages since they don't do anything by themselves without a team of software analysts and programmers? I'm actually surprised your post got modded so high, since the first half basically amounts to "If it's worth doing it would have been done by now" and the second half is just a bizarre, directionless and inappropriate outpouring of nerd rage. I guess it just feels good to rally around someone declaring the popular technology du jour irrelevant (in this case the cloud - a popular target around here). I'm actually finding it difficult to simultaneously respond to your uninformed opinions and your disrespectful attitude without feeling some nerd rage myself. We should be fucking ashamed? Really?

Re:Why it works for Google/Yahoo/Facebook (1)

turbidostato (878842) | about 4 years ago | (#33582348)

"99.999999999% of the rest of the world do other things as their primary business model. Small businesses aren't going to do this because it requires a staff that KNOWS how to work with this software and get the data out."

Of course, 99.999999999% of the world doesn't have electricity as their primary business model. Does this mean that small businesses are going to stay with candles and bonfires? Because, you know, they won't have the staff needed to produce and distribute their own electricity.

These new data-mining environments are just being born. First, only companies with data mining as their core business invent and use the new technology. Then come big companies with big money to deploy their own versions. After that, if there's in fact a use case for it, utility companies will arise that bring it to everybody.

"Anyone who cares about churning through massive amounts of data already has ways to do it."

But the associated costs can limit the kind of business you can build on top of heavy data crunching. It might be the case (as it has been with other technologies) that the cost drop will allow new businesses to arise that were previously impractical.

"This article is basically written like the invention of the hammer made it so everyone would want to build their own homes because they could."

Rhetorically speaking, the invention of the hammer allowed people to get out of caves, since they could now build their own huts first, then towns, and finally cities.

nerd porn (1)

thePowerOfGrayskull (905905) | about 4 years ago | (#33579908)

Rummaging in the Bitlocker

Starring everybody's favorite...

Peta Bites

and costarring...

Bare Bones

and making his professional debut:

Big Data!

You can have all of the data in the world but... (1)

divisionbyzero (300681) | about 4 years ago | (#33579948)

you need to know what to look for. In order to know what to look for, you need to know what's meaningful, and that requires some sort of useful model. Accumulating data in itself isn't that interesting.

Have the authors actually SEEN any NASA code? (1)

wagadog (545179) | about 4 years ago | (#33580014)

EVAR?

The so-called computational scientists I used to work for through an entire alphabet soup of FFRDCs were barely able to program in FORTRAN, much less something as sophisticated as Hadoop.

Notice that the article skirts this issue -- yes, they work with "Big Data" but they don't use any dev tools developed post-1963 to do it in, believe me.

Incidentally, Google moved on to Caffeine. (1)

darkmeridian (119044) | about 4 years ago | (#33581972)

For the most part, Google has moved on to Caffeine and GFS2 for their support. Apparently, Bigtable was taking too long to regenerate the entire index, forcing Google to refresh only part of their index frequently. The new Caffeine framework supposedly lets Google get closer-to-real-time search results, because newly indexed/crawled data can be continuously tossed into the search database without requiring an entire batch process. Perhaps that's why quotes from Slashdot comments show up in Google so quickly. This technology allows Google to chase news, blogs, and Twitter feeds while they're still relevant, which is pretty freaking cool.

The guys who were complaining about Google Instant and how Google should make better search results didn't mention Caffeine. Hopefully, Google can figure out how to use this technology to weed out the spam links and SEO crap that dominates some searches.

Suggestion: Read the source cited in the piece (1)

AlanMorrison (1902514) | about 4 years ago | (#33601012)

@CmdrTaco, et al.: You might go to the 'in-depth guide' Ohlhorst mentions [http://www.pwc.com/us/en/technology-forecast/2010/issue3/index.jhtml] and assess that separately. We did a lot of research with the CIO and the rest of the C-suite in mind as a target audience. Of course Google has moved on beyond Bigtable, etc. According to @royans [http://www.royans.net/arch/pregel-googles-other-data-processing-infrastructure/], Google uses Pregel to mine graphs. Allegedly they mine 20% of their data with Pregel and the other 80% with MapReduce. Two of the Google engineers presented on Pregel at SIGMOD in June. In other words, these companies are developing and using different methods to mine different kinds of data. Much of the tool innovation happens at the companies doing the mining.

@BitZtream, et al.: Try to step back a bit and think about the frogs in the ponds next to yours. There is life beyond SQL and relational data. IT departments at large enterprises, particularly those with a significant Web presence or large collections of less-structured data, are using Hadoop, and we cite some of them in the issue. Others we have spoken to since we published in the spring. Hadoop is a true ecosystem with lots of developers who've been plugged in for years, and they work at Web scale. Yes, the challenge of operating at Web scale is not a challenge everyone has, but it's a challenge more will face.

@AlanMorrison