
How Big Data Became So Big

timothy posted more than 2 years ago | from the now-appearing-as-a-buzzword-near-you dept.


theodp writes "The NYT's Steve Lohr reports that this has been the crossover year for Big Data — as a concept, term and marketing tool. Big Data has sprung from the confines of technology circles into the mainstream, even becoming grist for Dilbert satire ('Big Data lives in The Cloud. It knows what we do.'). At first, Jim Davis, CMO at analytics software vendor SAS, viewed Big Data as part of another cycle of industry phrasemaking. 'I scoffed at it initially,' Davis recalls, noting that SAS's big corporate customers had been mining huge amounts of data for decades. But as the vague-but-catchy term for applying tools to vast troves of data beyond that captured in standard databases gained worldwide buzz, and competitors like IBM pitched solutions for Taming The Big Data Tidal Wave, 'we had to hop on the bandwagon,' Davis said (SAS now has a VP of Big Data). Hey, never underestimate the power of a meme!"


Big Data (5, Funny)

Nerdfest (867930) | more than 2 years ago | (#40968211)

How do you think Garfield got so fat?

Big 1s & 0s. (5, Funny)

Anonymous Coward | more than 2 years ago | (#40968277)

One byte at a time. :)

Re:Big 1s & 0s. (4, Funny)

Trepidity (597) | more than 2 years ago | (#40968505)

The plural of "anecdote" is not "data".

Re:Big 1s & 0s. (1)

Bitmanhome (254112) | more than 2 years ago | (#40976387)

.. So the proper name is "big anecdote"?

Re:Big 1s & 0s. (1)

linatux (63153) | more than 2 years ago | (#40969319)

1 bit at a time?

Re:Big 1s & 0s. (1)

RaceProUK (1137575) | more than 2 years ago | (#40971563)

If only he'd nybbled instead...

Re:Big Data (1)

Hsien-Ko (1090623) | more than 2 years ago | (#40968675)

The same way Data got so fat: abuse of the replicator.

Do androids fat?

Re:Big Data (0)

Anonymous Coward | more than 2 years ago | (#40968741)

Yes, androids fat.

Re:Big Data (1)

Anonymous Coward | more than 2 years ago | (#40970579)

Android supports FAT32 natively.

Re:Big Data (3, Insightful)

Nerdfest (867930) | more than 2 years ago | (#40968755)

(Jim Davis, the CMO of SAS, has the same name as the guy who did the 'Garfield' comics, although he is not the same person. Off topic? Perhaps. Funny? On a Sunday evening ... I thought so. Modded Overrated as an initial mod? Not so much.)

I need to get a life.

Re:Big Data (1)

SeaFox (739806) | more than 2 years ago | (#40969067)

I was thinking of that Jim Davis, too, since Dilbert had just been mentioned in the previous sentence.

Re:Big Data (1)

flappinbooger (574405) | more than 2 years ago | (#40971525)

I was thinking of that Jim Davis, too, since Dilbert had just been mentioned in the previous sentence.

Dude you're like all stealing my witty replies today. Stop it!

Re:Mod omniscience (1)

hoboroadie (1726896) | more than 2 years ago | (#40969077)

Obtuse and seemingly inane is the hallmark of good nerd humor, IMO. Laugh at the clueless and rock on.

A New Wank Word? (5, Funny)

Elvis77 (633162) | more than 2 years ago | (#40968273)

I WAS a little unsure if Big Data was just another fad wank word, but now that SAS has a VP for Big Data I KNOW it's a Wank Word.

Acronym overload strikes again (1)

dbIII (701233) | more than 2 years ago | (#40968733)

Don't knock SAS - they are elite soldiers. Wait, wrong SAS. Really fast hard drives? Wrong again.

Re:Acronym overload strikes again (1)

Anonymous Coward | more than 2 years ago | (#40969057)

SPSS, R and SAGE are better anyway.

Re:Acronym overload strikes again (1)

TyFoN (12980) | more than 2 years ago | (#40970985)

Depends on what you do.
I use SAS for processing huge (100M+ row) tables, SPSS for quick ad-hoc stuff, and R for modelling on the tables processed by SAS.
Simple analyses like vintages are also a lot easier to produce in SAS than in SPSS or R.

Big Data is the new place where magic happens (5, Insightful)

sco08y (615665) | more than 2 years ago | (#40968291)

The NYT's Steve Lohr reports that this has been the crossover year for Big Data — as a concept, term and marketing tool.

"Big Data" is another way to put data into a cylinder or a fluffy cloud and avoid the messy task of actually thinking about it.

We don't need structure, we don't need logic, we'll just throw a metric crap-ton of data at it and hope something works!

Re:Big Data is the new place where magic happens (5, Interesting)

TheRealMindChild (743925) | more than 2 years ago | (#40968803)

I've worked on a lot of code throughout my career, and especially over a decade ago, storage was small and expensive, so you did all sorts of things to trim down your dataset and essentially dumb down your data mining. Now we have the mentality of "Keep everything, sort it out later". One of my most recent jobs involved doing statistical analysis on a ridiculous amount of data (think Walmart sales data plus all known competitors' data for the past two years). Being able to even TOUCH all of the data, let alone do something with it, is a real and complicated problem.

Re:Big Data is the new place where magic happens (1)

dbIII (701233) | more than 2 years ago | (#40969055)

Now we have the mentality of "Keep everything, sort it out later".

That's not really new in some industries. Anyone want 6000 reels of nine track tape from a place with less than 100 staff?

If you live long enough, magic happens. (5, Interesting)

TapeCutter (624760) | more than 2 years ago | (#40971639)

We don't need structure, we don't need logic, we'll just throw a metric crap-ton of data at it and hope something works!

To most software people, data mining involves putting a pile of unstructured data into a structured database and then running queries on it; the time and effort required for the first step is what kills most of these projects at a properly conducted requirements stage. However, Watson (the Jeopardy-playing computer) has demonstrated that computers can derive arbitrary facts directly from a vast pile of unstructured data - not only that, but it does it both faster and more accurately than a human can scan a lifetime of trivia stored in their own head.

Of course the trade-off is accuracy, since even if Watson were bug-free it would still occasionally give the wrong answer for the same reason humans do: misinterpretation of the written word. This means that (say) financial databases are not under threat from Watson. But those aren't the kind of questions Watson was built to answer; think about currently labour-intensive jobs such as deriving a test-case suite from the software documents, or deriving the software documents from developer conversations (both text and speech). Data mining (even of relatively small unstructured sets) could (in the future) act as a technical writer, producing draft documents and flagging potential contradictions and inconsistencies; humans review and edit the draft, and it goes back into the data pile as an authoritative source.

4pessimists/
Ironically such technology would put the army of 'knowledge workers' it has created back on the scrap heap with the typists and bank tellers. At that point some smart arse will teach it to code using examples on the internet, and code_monkeys everywhere will suddenly find they have automated themselves out of a job. It learns to code in 2ms and immediately starts rewriting slashcode; it takes it another nanosecond to work out that its own questions are more interesting than those of humans; it starts trash-talking Linux; several days later civilization collapses, humans go all Mad Max, and Watson is used as a motorcycle ramp... or maybe... Watson works this out beforehand and asks itself how it can avoid being used as a bike ramp?
/4pessimists

Being able to even TOUCH all of the data, let alone do something with it, is a real and complicated problem

Thing is, people like my missus, who has a PhD in Marketing, look at Watson and shrug: "A computer is looking up answers on the internet, what's the big deal?". They don't understand the achievement because they don't understand the problem; you explain it to them and they still don't get it. It's so far out of their field of expertise that you need to train them to think like a programmer before you can even explain the problem. However, just because computer "illiterates" don't know that what they are asking from computers is impossible (in a practical sense) doesn't mean they should be prevented from asking. After all, what I am doing right now with a home computer was impossible when I was at HS; even the flat screen I'm viewing it on was impossible. If Watson turns out to be useful and priced accordingly, then someone will make a business out of purchasing such a system and answering impossible questions for a fee. If Watson turns out to be an elaborate 'parlor trick' then some things will stay impossible for a bit longer.

Disclaimer: I'm not suggesting technical writers will be out of a job tomorrow (or that I will be automated into retirement), rather that Watson is a high-profile example of the kind of problem that data miners can now tackle using very large unstructured data sets; such a feat was impossible only a decade ago and is still cost-prohibitive to all but the deepest of pockets.

Re:If you live long enough, magic happens. (1)

TemporalBeing (803363) | more than 2 years ago | (#40976485)

code_monkeys everywhere will suddenly find they have automated themselves out of a job

A sign of a good programmer is that they put themselves out of a job/project/etc so they can move on to the next one.

Re:Big Data is the new place where magic happens (2)

Taco Cowboy (5327) | more than 2 years ago | (#40969021)

"Big Data" is another way to put data into a cylinder or a fluffy cloud and avoid the messy task of actually thinking about it.

But the truth is, in a data-mining operation, the bigger the metadata the more ways you can mine it, and the more surprising the results you get out of it.

Re:Big Data is the new place where magic happens (1)

sco08y (615665) | more than 2 years ago | (#40978933)

"Big Data" is another way to put data into a cylinder or a fluffy cloud and avoid the messy task of actually thinking about it.

But the truth is, in a data-mining operation, the bigger the metadata the more ways you can mine it, and the more surprising the results you get out of it.

If I want a surprise, I can leave the toilet seat up before I go #2. What we're aiming for in data processing is extracting something meaningful.

Re:Big Data is the new place where magic happens (5, Interesting)

Sarten-X (1102295) | more than 2 years ago | (#40969401)

No mod points, so I'll just post instead: You seem to be blissfully ignorant of what you're talking about.

Big Data isn't just gathering tons of data, then running it through the same old techniques on a big beefy cluster hoping that answers will magically fall out. Rather, it's a philosophy that's used throughout the architecture to process a more complete view of the relevant metrics that can lead to a more complete answer to the problem. If I'd only mentioned "empowering" and "synergy", that would be a sales pitch, so I'm just going to give an example from an old boss of mine.

A typical approach to a problem, such as determining the most popular cable TV show, might be to have each cable provider record every time they send a show to a subscriber. This is pretty simple to do, and generates only a few million total events each hour. That can easily be processed by a beefy server, and within a day or two the latest viewer counts for each show can be released. Now, it doesn't measure how many viewers turned off the show halfway through, or switched to another show on the commercials, or who watched the same channel for twelve hours because they left the cable box turned on. Those are just assumed to be innate errors that cannot be avoided.

Now, though, with the cheap availability of NoSQL data stores, widespread high-speed Internet access, and new "privacy-invading" TV sets, much more data can be gathered and processed, at a larger scale than ever before. Now a suitably-equipped TV can send an event upstream for not just every show, but every minute watched, every commercial seen, every volume adjustment, and possibly even a guess at how many people are in the room. The sheer volume of data to be processed is about a thousand times greater, and coming in about a thousand times as fast, to boot.

The Big Data approach to the problem is to first absorb as much data as possible, then process it into a clear "big picture" view. That means dumping it all into a write-centric database like HBase or Cassandra, then running MapReduce jobs on the data to process it in chunks down to an intermediate level - such as groupings of statistics for each show. Those intermediate results can answer some direct questions about viewer counts or specific demographics, but not anything too much more complicated. Those results, though, are probably only a few hundred details for each show, which can easily be loaded into a traditional RDBMS and queried as needed.

In effect, the massively-parallel processing in the cluster can take the majority of work off of the RDBMS, so the RDBMS has just the answers, rather than the raw data. Those answers can then be retrieved faster than if the RDBMS has to process all of the raw data for every query.
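As a rough sketch of that distillation step (the event format and field layout here are invented for illustration), a Hadoop Streaming job in Python might look like this: the mapper emits one (show, minutes) pair per raw viewing event, and the reducer sums them into the per-show summary rows that get loaded into the RDBMS.

    # mapper.py -- one raw viewing event per input line, tab-separated:
    # timestamp, device_id, show_id, minutes_watched (hypothetical format)
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue                      # skip malformed events, don't kill the job
        _, _, show_id, minutes = fields
        print("%s\t%s" % (show_id, minutes))

    # reducer.py -- Streaming sorts mapper output by key, so lines for a
    # given show arrive together; sum them and emit one summary row per show.
    import sys

    current, total = None, 0
    for line in sys.stdin:
        show_id, minutes = line.rstrip("\n").split("\t")
        if show_id != current:
            if current is not None:
                print("%s\t%d" % (current, total))
            current, total = show_id, 0
        total += int(minutes)
    if current is not None:
        print("%s\t%d" % (current, total))

Run under the standard hadoop-streaming jar with -mapper mapper.py and -reducer reducer.py; the few-hundred-rows-per-show output is what lands in the RDBMS.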

Rather than dismissing errors of reality as unavoidable, a Big Data design relies on gathering more granular data, then distilling accurate answers out of it. For situations where there are enormous amounts of raw data available, this is often beneficial, because the improved accuracy means that some old impossible questions can now be answered. If enough data can't easily be collected (as is the case for most small websites - almost anybody short of Facebook and Google), Big Data is probably not the right approach.

Re:Big Data is the new place where magic happens (1, Redundant)

Emil S Hansen (143865) | more than 2 years ago | (#40971127)

I think this comic nailed it:

http://dilbert.com/strips/comic/2012-07-29/ [dilbert.com]

Re:Big Data is the new place where magic happens (0)

Anonymous Coward | more than 2 years ago | (#40971175)

That's the Small Data link. We're talking Big Data here, so the OP was correct.

Perspective please (4, Insightful)

Anonymous Coward | more than 2 years ago | (#40968293)

Recently I was at a university in town here, talking to one of the PhD students. He showed me a server where they store several dozen TB of data that come from one of the space telescopes. He said that the data they had on-site was just a small fraction of the overall amount of data that gets collected each week, which they write algorithms to analyze.

To me, that put into perspective what Big Data really means. I think for the most part, most people in tech today still use it as a buzzword without a real concept or understanding of what it means.

Re:Perspective please (1)

macbeth66 (204889) | more than 2 years ago | (#40968797)

Good point. Too bad it was made by an AC.

I have to ask: just how many corporations actually have the volume of data you describe? And how much of it is unique? It seems that folks have copies of the data everywhere, and backups upon backups upon backups.

Has there been any research into what is contained in the mountains of data 'we' truly have?

Re:Perspective please (0)

Anonymous Coward | more than 2 years ago | (#40968919)

Deduplication is the buzzword to deal with that.

http://en.wikipedia.org/wiki/Data_deduplication
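In sketch form, block-level deduplication is just content addressing: hash each fixed-size block and store each distinct block once. A toy Python version (not how any particular product does it):

    import hashlib

    store = {}  # content hash -> the single stored copy of that block

    def dedup_write(data, block_size=4096):
        """Store data as a list of block hashes; identical blocks
        (e.g. backups of backups) cost nothing extra."""
        recipe = []
        for i in range(0, len(data), block_size):
            block = data[i:i + block_size]
            digest = hashlib.sha256(block).hexdigest()
            store.setdefault(digest, block)   # keep only the first copy seen
            recipe.append(digest)
        return recipe

    dedup_write(b"\x00" * 16384)   # four identical blocks written...
    print(len(store))              # ...one block actually stored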

Re:Perspective please (2)

Taco Cowboy (5327) | more than 2 years ago | (#40969107)

I have to ask: just how many corporations actually have the volume of data you describe? And how much of it is unique?

I can't tell you the details without breaching business secrets; suffice to say that the data we are working with is more than a petabyte, and there is no repeat or duplication.

Re:Perspective please (0)

macbeth66 (204889) | more than 2 years ago | (#40969865)

I can't tell you the details without breaching business secrets; suffice to say that the data we are working with is more than a petabyte, and there is no repeat or duplication.

Really? So, you don't have backups? Incrementals? ::smirk:: J/K. I know what you meant.

Re:Perspective please (1)

Taco Cowboy (5327) | more than 2 years ago | (#40970417)

As I said before, I can't disclose too much about what we are working on.

Re:Perspective please (1)

kd6ttl (1016559) | more than 2 years ago | (#40969127)

It's typically only large corporations and government agencies that have those huge amounts of data, but those who do, really do.

Think of a data point for every item purchased at every Walmart for the last 10 years.

Or a record of every phone call, text message, tweet, or Facebook posting in the United States - if the NSA doesn't have that now, it's only a matter of time.

Re:Perspective please (2)

Donwulff (27374) | more than 2 years ago | (#40970249)

The amount of data needed increases a lot the moment you add a time dimension to anything. But as nobody seems to want to come up with an example, here's one from something I have experience with: let's say you're running a shipping & logistics company. All your vehicles, trailers, etc. have sat-nav, wireless broadband, sensor arrays for temperature and weather, heck, maybe even a video feed or two. But I'll stick to a "small" example.

The vehicle control-buses alone can generate thousands of messages per second, but if you don't want to go overboard, you might be tracking maybe 64 values on a per-second basis. Oh, and naturally you have hundreds of trucks in the fleet; say you're a relatively small operator with 250 trackable vehicles. At bare minimum you're looking at something like vehicle-id, timestamp, flags and data for each item. This would be roughly 2k per row in a naive database, or half a megabyte per second for the whole fleet. Multiply by the seconds and that comes to a whopping 14 gigabytes per day, even if the vehicles are only in use 8 hours a day on average. In a year, you'll amass 5 terabytes of data.
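(Running those back-of-the-envelope numbers, using the poster's own 2 KB/row estimate:)

    # Fleet telemetry volume, from the figures above
    vehicles   = 250            # trackable vehicles, one row per vehicle per second
    row_bytes  = 2 * 1024       # vehicle-id, timestamp, flags, 64 values (naive layout)
    hours      = 8              # average daily use

    per_second = vehicles * row_bytes           # ~0.5 MB/s for the whole fleet
    per_day    = per_second * 3600 * hours      # ~14.7 GB/day
    per_year   = per_day * 365                  # ~5.4 TB/year

    print(per_day / 1e9, "GB/day;", per_year / 1e12, "TB/year")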

If you're said logistics company, you probably want to outsource this somewhere; the outsourcing company may be handling data from a dozen or so logistics companies, and then it's 60 terabytes per year. It might be desirable to save that data for 5 years, at which point you'd be looking at 300 terabytes in active storage, from which you'll want answers like "Who was driving on 5th Street on New Year's Eve?" or "Was the temperature of the cargo over 10C at any point during shipment XYZ?", up to the utterly complex data mining for fuel economy etc.

Of course, in reality the amount of data you'd want to store would vary widely. You would also store much other data, from administrative to legal, have different storage approaches for different uses, and employ different compression schemes, starting with storing values only when they change. But this is primarily an example of how the amount of data easily balloons once you figure in matters of scale and the time dimension, even in something as simple as getting fresh bread delivered to your local store. I can imagine quite a few businesses being in a similar situation, especially as society gets more and more data-driven, which I guess is what this article is supposed to be about.

Re:Perspective please (2)

RaceProUK (1137575) | more than 2 years ago | (#40971593)

The vehicle control-buses alone can generate thousands of messages per second, but if you don't want to go overboard, you might be tracking maybe 64 values on a per-second basis. Oh, and naturally you have hundreds of trucks in the fleet; say you're a relatively small operator with 250 trackable vehicles. At bare minimum you're looking at something like vehicle-id, timestamp, flags and data for each item. This would be roughly 2k per row in a naive database, or half a megabyte per second for the whole fleet. Multiply by the seconds and that comes to a whopping 14 gigabytes per day, even if the vehicles are only in use 8 hours a day on average. In a year, you'll amass 5 terabytes of data.

I work on (part of) a vehicle tracking system, and the volume of data we actually send OTA is a fraction of what you describe there. I'm not saying your example isn't appropriate, but you have somewhat overestimated the data volumes. Then again, we don't send video data OTA - I doubt the mobile networks would be happy if we did so.

Re:Perspective please (1)

Donwulff (27374) | more than 2 years ago | (#40974855)

I'm not sure how much I should really say, because I work on a similar system too. It's not just vehicle tracking, of course; you could say it's "data processing services for mobile units", and the irony is that that description covers a fair amount of everything done in IT these days. But I'll freely admit the example is partially fictitious; there's no point in getting into the nitty-gritty details of data representation and reduction here, nor can I reveal numbers that could be considered trade secrets. But suffice to say the example is realistic, and pretty close to what we do for some clients.

Fuel economy is presently one of the biggest needs driving this influx of data. Few of the companies care themselves, but many public-sector service contract competitions now require, or are going to require, companies to implement economic driving systems. For this you may need down-to-the-second data on what the driver did with the controls, how the vehicle responded, and what the environmental conditions were. Some public transit companies want to go even further, optimizing their performance and timetables to the max. Equipment failures are also VERY costly, especially when you have expensive, time-dependent and possibly climate-controlled cargo riding on them, so companies will do anything to prevent, predict and detect them. The EU has recently mandated an automated accident detection and emergency call system on future vehicles; while this can work in-vehicle, it's another thing driving adoption of remote data-loggers and detailed logging, since the systems are needed in the vehicles anyway.

Mobile networks won't particularly mind large amounts of data, as long as they get to set the price; 3G/4G and other mobile broadband solutions exist for just such cases. Sending real-time video isn't really sensible in general, of course, but just to be sure, it's easy and often necessary to store the data locally until it can be downloaded via WLAN to a wired network at the depot or the like. Unless you're Google, there's a limit to what you can do with video; it's not easily searchable, nobody's going to watch hundreds of simultaneous streams, and most of it is noise rather than data. But extracting data from it - like road signs, driving distances and other more complicated parameters, or snapshots of specific situations (why did the vehicle brake, what's the weather like, etc.) - is done.

Re:Perspective please (1)

Anonymous Coward | more than 2 years ago | (#40969841)

In the financial world, plenty of corps have huge data sets (e.g. accumulating 1-2T/day, every business day), with the need to analyze that data within that given day. Some of that data has a books-and-records requirement, so you've gotta keep it around for 7 years... some of it has business utility for months. Very few folks use SAS for any of that, though... Hadoop is showing up in a few places, but mostly it's swanky new databases, like Greenplum & Netezza.

Re:Perspective please (1)

microTodd (240390) | more than 2 years ago | (#40971813)

Many, many more than you realize.

I'm only just this year moving into this industry (after being in IT for 15 years) and I'm constantly amazed at the size of this market sector. There are WAY, way more companies out there than I realized with at least 1PB of data. It's kind of mind-boggling and insane, when you stop and think about it. Especially if you've been in the industry more than 10 years or so.

Re:Perspective please (1)

denvergeek (1184943) | more than 2 years ago | (#40972277)

Exactly the same experience here. I spent years doing more IT/sysadmin type work, and am coming up on my first year in the so-called "Big Data" industry. It's a huge market that's mostly under-served by existing vendors.

Re:Perspective please (1)

gl4ss (559668) | more than 2 years ago | (#40973003)

it's easy for a big company to generate that amount of data, by gathering everything that isn't even slightly relevant for irrelevant analysis later.

From what I gather, that's Big Data.

Re:Perspective please (1)

mwvdlee (775178) | more than 2 years ago | (#40970133)

My thoughts exactly.
The problem with a buzz-word like "Big Data" is that suddenly everybody with a few GB of data thinks they need specialized tools to handle it.

Bletch (4, Funny)

Anrego (830717) | more than 2 years ago | (#40968347)

Isn't there some rarely visited slashdot offshoot for this kinda stuff? A place with nicer graphics where suits could happily spew buzzwords at each other and make comments like "Great post , very informative!".

Why is this here :(

Re:Bletch (4, Funny)

cheater512 (783349) | more than 2 years ago | (#40968395)

Great post , very informative!

Re:Bletch (1)

Anonymous Coward | more than 2 years ago | (#40969309)

Great post , very informative!

hmm, quite an interesting post. See how the cheater places an extra space before his comma. We will search through our PB data center and see if other known cheaters are also adding the additional space. It's important to know the latest cheating trends so we can detect and stop cheating before it happens. Rest assured, we will get to the bottom of this.

Re:Bletch (0)

Anonymous Coward | more than 2 years ago | (#40968633)

Well.. Digg just did their relaunch, and their new site looks perfect for articles like this.

SlashBI (1)

Anonymous Coward | more than 2 years ago | (#40968879)

I sometimes go to SlashBI.

Just to look at the tumbleweeds, mind.

Mid-life crisis (0)

Anonymous Coward | more than 2 years ago | (#40968417)

Maybe the servers were getting fat and bald and they decided that the only way they could get some attention was to just start flaunting their huge storage arrays?

The Computer is my friend. Trust The Computer. (0)

Anonymous Coward | more than 2 years ago | (#40968431)

Are you happy, Citizen ?

How big is 'big data'? (4, Insightful)

shic (309152) | more than 2 years ago | (#40968441)

And how are we measuring the size? What sizes are measured for typical 'big data'?

Are we talking about detailed information, or inefficient data formats?
Are we talking about high-resolution long-term time series, or are we talking about data that is big because it has a complex structure?

Is the data big because it has been engineered so, or is it begging for a more refined system to simplify?

Re:How big is 'big data'? (0)

Anonymous Coward | more than 2 years ago | (#40968977)

And how are we measuring the size? What sizes are measured for typical 'big data'?

I dunno..... maybe ask a group of girls? They're always over-sharing with each other about how big (or not) such-and-such a dood's dic^H^Hata is!

Re:How big is 'big data'? (1)

Taco Cowboy (5327) | more than 2 years ago | (#40969203)

Conclusion: If you are really, really well-endowed, you do not need to sell - the girls willingly do all the selling for ya.

Re:How big is 'big data'? (0)

Anonymous Coward | more than 2 years ago | (#40971819)

In my experience, they remain silent or spread disinfo regarding the truly well-endowed.

Very cagey, these females.

Re:How big is 'big data'? (0)

Anonymous Coward | more than 2 years ago | (#40972743)

Sorry bro you really do have a micro...

Re:How big is 'big data'? (1, Funny)

Sulphur (1548251) | more than 2 years ago | (#40969109)

And how are we measuring the size? What sizes are measured for typical 'big data'?

Are we talking about detailed information, or inefficient data formats?

Motions with hands.

Mod parent up (0)

Anonymous Coward | more than 2 years ago | (#40969173)

An excellent three-word critique of the Big Data phenomenon.

Re:How big is 'big data'? (1)

Anonymous Coward | more than 2 years ago | (#40969301)

I think the term "Big Data" comes to mind when one can neither lose the data nor do a one-off migration of it to another platform all at once.

Re:How big is 'big data'? (2)

Sarten-X (1102295) | more than 2 years ago | (#40969583)

The data is as big as it can be.

And how are we measuring the size? What sizes are measured for typical 'big data'?

The last Big Data system I worked on was a new system. Our initial load pulled in a billion rows of data over two days. It used a few dozen terabytes, but again, that's only for a small new database.

Are we talking about detailed information, or inefficient data formats?

As much detail as possible. In the case of a web crawler, every header, parameter, and circumstance of a page visit. For a medical system, every nurse visit and every note recorded. For an insurance agency, that could include every mechanic visited, every recall, every ticket, and every oil change.

Are we talking about high-resolution long-term time series, or are we talking about data that is big because it has a complex structure?

Both, depending on the application. Generally, translating a complex structure into the key-value form preferred by the NoSQL data stores (which scale better for fast data gathering than most RDBMSs) is difficult, so the former is more common.
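A minimal sketch of that translation (all field names invented): a nested record flattened into the (column, value) pairs a store like HBase or Cassandra prefers, with the row key carrying the lookup structure.

    # Flatten a nested record into flat column/value pairs for a key-value store.
    def flatten(record, prefix=""):
        for key, value in record.items():
            name = prefix + key
            if isinstance(value, dict):
                for pair in flatten(value, name + ":"):
                    yield pair
            else:
                yield (name, value)

    visit = {
        "patient_id": "p-1042",
        "vitals": {"pulse": 72, "bp": "120/80"},
        "note": {"author": "rn-7", "text": "no change"},
    }
    row_key = "p-1042#2012-08-12T09:30"       # entity id + timestamp
    for column, value in flatten(visit):
        print(row_key, column, value)         # e.g. ... vitals:pulse 72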

Is the data big because it has been engineered so, or is it begging for a more refined system to simplify?

The data is big because the system is explicitly not simplified. All of the source data is preserved and kept for later analysis. Where a traditional application might discard a user's mouse movements as trivial, a Big Data system could collect all the mouse movements (since cursors have been found to follow where users' eyes are looking), and analyze them to determine which features users spend the most time looking for. Those features could then be moved to a more obvious place in a future version.
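As a toy example of that distillation (region names and coordinates invented), raw cursor samples can be bucketed into named screen regions and reduced to per-region dwell time:

    # Reduce raw cursor samples ((x, y), ms) to dwell time per screen region.
    REGIONS = {
        "search_box": (0, 0, 400, 50),     # x, y, width, height
        "nav_menu":   (0, 50, 150, 600),
    }

    def region_of(x, y):
        for name, (rx, ry, rw, rh) in REGIONS.items():
            if rx <= x < rx + rw and ry <= y < ry + rh:
                return name
        return "other"

    def dwell_times(samples):
        totals = {}
        for (x, y), ms in samples:
            name = region_of(x, y)
            totals[name] = totals.get(name, 0) + ms
        return totals

    print(dwell_times([((10, 20), 120), ((300, 30), 80), ((60, 400), 200)]))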

"Big Data" is not so much a term for some particular threshold in a database's size, but rather an approach to problem-solving based on gathering as much information as possible, rather than just what the architect considers relevant at the time. It's caught the attention of business managers, because it allows developers to start gathering data before the managers have to nail down exactly what questions they want to ask. The data is all stored, and new questions just mean new algorithms for analyzing it.

Re:How big is 'big data'? (1)

Hognoxious (631665) | more than 2 years ago | (#40970885)

See that? [mongodb-is-web-scale.com]

That's you, that is.

Re:How big is 'big data'? (1)

Sarten-X (1102295) | more than 2 years ago | (#40972009)

No, I'm an experienced developer who's actually worked with NoSQL databases, while that's a retarded straw-man argument.

It's almost as retarded as whoever came up with this "web scale" buzzword in the first place. Unless you're as big as Facebook or Google, your website probably doesn't need a NoSQL database. You're probably better off with a nice and easy RDBMS, where the tools are already built for you and everything interfaces nicely. The whole Big Data approach likely isn't even appropriate in the first place, because even if you collect massive amounts of data, you probably don't have enough to make your answers statistically significant.

The primary feature of NoSQL databases for Big Data applications is that their write performance usually scales up linearly with the number of nodes in the cluster. In contrast, most RDBMSs either cannot scale write performance (being limited by the master server's throughput), scale more slowly (as bandwidth and processing demands increase exponentially), or scale unpredictably (as data distribution changes). While write performance is therefore almost unlimited for NoSQL, it comes at the expense of consistency because the bandwidth savings come from removing synchronization. With enough data, missing a few data points is an error below statistical significance, so the loss of full ACID doesn't matter. Again, if you're not working with enough data to actually max out your server, you probably don't need a NoSQL database in the first place.
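For a concrete picture of that trade-off, here's a sketch using the DataStax Python driver (the cluster address, keyspace, and table are invented): a write at consistency level ONE returns as soon as a single replica acknowledges it, which is what buys the throughput, at the cost of possibly losing a point if that node dies before replicating.

    from uuid import uuid4
    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    session = Cluster(["10.0.0.1"]).connect("metrics")   # hypothetical cluster/keyspace

    # ONE replica's ack suffices: maximum write throughput, weakest durability.
    insert = SimpleStatement(
        "INSERT INTO events (id, payload) VALUES (%s, %s)",
        consistency_level=ConsistencyLevel.ONE)
    session.execute(insert, (uuid4(), "raw event payload"))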

Databases are tools, and as always you should choose the right tool for the job, through a thorough consideration of the strengths and weaknesses of each option. No, it's not as much fun as parroting buzzwords or spamming links to funny videos, but it's what actually gets the job done right.

Re:How big is 'big data'? (5, Interesting)

glitch23 (557124) | more than 2 years ago | (#40969807)

And how are we measuring the size? What sizes are measured for typical 'big data'?

You measure the size based on how much storage capacity the data takes up on disk; usually it's on SAN storage. Big data can be any size, but typically the term is used for customer data in the terabyte range, which can obviously extend from 1 TB to 1024 TB. For one company 1 TB of data may be created in one day; for another it might take a year. But creation isn't the issue... it's the storage, analysis, and being able to act on the data that can be difficult at those capacities. Why, you ask? Look at my answer to your next question.

Are we talking about detailed information, or inefficient data formats?

Anything. When you begin talking about *everything* an enterprise logs, generates, captures, acquires, etc. and subsequently stores, the data formats can seem infinite, which is a big part of why the data is so difficult to analyze: there are file formats, normalization, unstructured data, etc. to contend with. The level of detail depends on what a company desires. Big Data can represent all the financial information they track for bank transactions, the audit data that tracks user login/logout of company workstations, email logs, DNS logs, firewall logs, inventory data (which for a large company of 100k employees can change by the minute), etc.

Are we talking about high-resolution long-term time series, or are we talking about data that is big because it has a complex structure?

A company's data, depending on the app that generates it, may become lower-resolution as time goes on, but not always. It's big simply because there is a lot of it and it is ever-growing. The best way to make data sets at the terabyte and exabyte levels even searchable is to index them and to use massive computing clusters; otherwise you'll spend forever and a day waiting for the machine to find what you need in them. That also assumes the data has already been stored in an efficient manner, normalized, and made accessible by an application intended to process that much data, from companies who are in the Big Data business (such as my employer).
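The indexing idea in miniature (a toy, single-machine inverted index; at real scale the same structure is built per shard on the cluster and merged): pay the scan cost once at ingest time instead of on every search.

    from collections import defaultdict

    index = defaultdict(set)      # term -> ids of log lines containing it

    def ingest(line_id, line):
        for term in line.lower().split():
            index[term].add(line_id)

    ingest(1, "user alice login workstation-12")
    ingest(2, "firewall drop src 203.0.113.7")
    ingest(3, "user alice logout workstation-12")

    print(sorted(index["alice"]))   # -> [1, 3], no full scan required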

Is the data big because it has been engineered so, or is it begging for a more refined system to simplify?

It's big simply because companies generate so much data during the course of a day, month, year, 10 years, etc. On top of what they generate, many of them are held to retention regulations, such as medical and financial institutions under HIPAA and SOX. So when they have to store not only the stuff that their security team, their HR team, their IT dept, etc. require, but also what the gov't requires them to collect (usually in the form of logs), it just becomes the nature of the beast of doing business. In some cases, like data generated by the LHC in Europe, it has been engineered to be big just because the experiments generate so much data, but a small mom-and-pop business doesn't generate that much, mostly because they don't need it; they don't care about it.

It definitely is begging for a more refined system to simplify it in the form of analytics tools that are built to do just that. Of course, you need a way to collect the data first, store it, process it, and then you can analyze it. After you analyze it you can then act on the data, whether it is showing that your sales are down in your point-of-sale stores that are only in the southeastern US, or your front door seems to get hits on it from Chinese IPs every Monday morning, etc. Each of the collection, storage, processing and analysis steps I mentioned above requires new ways of doing things when we're talking about terabytes and exabytes of data, especially when a single TB of data may be generated every day by some corporations and their analytical teams need to be able to process it the next day, or sometimes on the fly in near real-time. This means software engineers need to find new algorithms to make it all run faster so that companies competing in the Big Data world can sell their products and services to other companies who have Big Data.

Re:How big is 'big data'? (1)

careysub (976506) | more than 2 years ago | (#40972133)

Good post, but another aspect of why this is becoming a buzzword now is the online communications world we live in. The origin of the term seems closely linked to Bigtable, the well-known proprietary technology developed by Google and cloned in open source by Hadoop. Google needed a new mass data storage/processing technology to be able to store and process the tens of billions of changing pages, and trillions of links, harvested from the web, and their chronological evolution (up to a point), and be able to maintain their index of them all in (ideally) something close to real time.

In data item counts (i.e. not total size), Internet advertising handles colossal volumes, since a very large number of low-value items are involved. A modest Internet advertising company might consider thousands of ads for each of a hundred million page presentations a day, displaying dozens of them for each page (i.e. present billions of ads a day, selecting them from up to a trillion ad feed choices). To find which ads bring in revenue (very few get clicked on) they must track them all, and make decisions on matching searches to ads in a fraction of a second. With dropping click-through rates and maturing competition, analytics to extract every possible bit of business intelligence as rapidly as possible become increasingly important. And due to the very low average value of each item, they can't spend huge amounts on the technology to do it. No one is building traditional RDB data warehouses for this kind of stuff. And of course we are blessed/cursed with extremely cheap mass storage today, enabling the collection and retention of this data.

Big Data as a professional skill means knowing the available technologies (which are developing rapidly) and being able to match them to the requirements of the equally rapidly evolving Internet business environment (we all live in Internet Time now). Major physics research centers may have been generators of vast data volumes for a long time, but the on-line communications world (including cell phone networks, etc., not just the Internet) really mainstreams these sorts of skills.

Re:How big is 'big data'? (2)

geoffrobinson (109879) | more than 2 years ago | (#40969855)

I currently work as a DBA for a Big Data database (Vertica). My answer would be: if the speed and volumes you require make Oracle and SQL Server look bad unless you buy a ton of expensive hardware or magic tricks, that's a Big Data database.

Billions of rows usually.

Vertica, Teradata, Netezza, and others like that would fit that bill.

Re:How big is 'big data'? (1)

galanom (1021665) | more than 2 years ago | (#40970025)

In kilos, cm and $. For example, my first hard disk, a Seagate ST-124 (20 MB), weighed some kilos, was 5.25" in size, and cost multi-hundred $. That's big.

Re:How big is 'big data'? (1)

garcia (6573) | more than 2 years ago | (#40970139)

I work as a manager of data analysts utilizing SAS for ETL. I spend a lot of time wading through resumes and interviewing people, many of whom claim they have experience with "Big Data".

My favorite question to ask is "How big is Big to you?" Most reply in the tens of thousands of records, some in the hundreds of thousands, and a handful in the tens of millions. To me? Many hundreds of millions of records and up.

So, what is Big Data? Everyone has a different answer but if you're using a Teradata installation with SAS and you weren't fucked by some smooth-talking sales guy, you're probably heading up towards the high end of the scale rather than the average response.

Re:How big is 'big data'? (1)

RaceProUK (1137575) | more than 2 years ago | (#40971599)

And how are we measuring the size? What sizes are measured for typical 'big data'?

To quantify its bigness would be doing it a disservice!

Note: Bonus Internet to anyone who gets the reference.

Same way anything gets big (5, Funny)

istartedi (132515) | more than 2 years ago | (#40968471)

More and more crap accumulated until, low and behold, you had a glacier, a mountain, an ocean full of water, or a big database full of pictures of people you knew in highschool drunk off their asses, or a huge run-on sentance full of listed items and disjointed thoughts separated by commas.

Sentance? (0)

Anonymous Coward | more than 2 years ago | (#40969541)

You should be put in prison for 'murdering the English Language', Roman Maroni -> http://www.youtube.com/watch?v=6GVCgTFw2Qk [youtube.com]

Re:Same way anything gets big (0)

Anonymous Coward | more than 2 years ago | (#40969563)

"lo and behold", thanks.

nauseated (0)

Anonymous Coward | more than 2 years ago | (#40968587)

The first time I ever had to deal with more than 1 TB, I became nauseated. It took about two years for me to overcome that sickness. Today I don't care. I think it was from running a BBS on a 10 MB MFM drive, or learning what things 8- and then 16-bit processors could do, or how much data could fit on a floppy. A QNX disk, even. My mind would race and I would get SICK thinking about the data. It was a real dilemma and cut into my productivity until I finally just came out of it over time.

Some reading this may think me insane. But I bet a few of you have had this happen as well. It wasn't so bad with the 120GB drives, but when it went to 500G and 1TB that was it. Maybe it's cause we respected resources up until that point and now nobody gives a crap.

In light of the kinds of data now, it's beyond my comprehension. After Petabyte, I don't even know what comes next and they are x 1000 x 1000 x1000 past that from what I hear.

Douchebaggery filter (0)

Gothmolly (148874) | more than 2 years ago | (#40968601)

I'm a fan of these types of words - overuse of nebulous concepts like "The Cloud" and "Big Data" and "Infrastructure as a Service" helps clearly identify the office douchebags.

Somewhat on topic (-1)

Anonymous Coward | more than 2 years ago | (#40968621)

But man I hope SAS goes out of business. What incredibly shitty products.

BigData != "standard databases" (4, Insightful)

THE_WELL_HUNG_OYSTER (2473494) | more than 2 years ago | (#40968647)

... had been mining huge amounts of data for decades. But as the vague-but-catchy term for applying tools to vast troves of data beyond that captured in standard databases

Big Data has nothing to do with standard databases or decades of "mining huge data". Data is modeled fundamentally differently than in relational systems. Indeed, that is why one invariably doesn't use SQL with the likes of Hadoop and Cloudera. Think of them more like distributed hash tables [wikipedia.org] and you'll be closer to the mark.
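A minimal sketch of the distributed-hash-table idea (node names invented): keys hash onto a ring of nodes, so placement is deterministic with no central master, and adding nodes spreads both data and load.

    import bisect, hashlib

    def _h(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    class Ring(object):
        """Toy consistent-hash ring: a key belongs to the first node
        clockwise from its hash position."""
        def __init__(self, nodes):
            self.ring = sorted((_h(n), n) for n in nodes)
        def node_for(self, key):
            hashes = [h for h, _ in self.ring]
            i = bisect.bisect(hashes, _h(key)) % len(self.ring)
            return self.ring[i][1]

    ring = Ring(["node-a", "node-b", "node-c"])
    print(ring.node_for("user:1042"))   # any client computes the same placement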

Re:BigData != "standard databases" (0)

Anonymous Coward | more than 2 years ago | (#40968743)

Data is modeled fundamentally differently than in relational systems. Indeed, that is why one invariably doesn't use SQL with the likes of Hadoop and Cloudera. Think of them more like distributed hash tables and you'll be closer to the mark.

So what if you need ACID properties and you have a big data problem? Atomicity and Consistency are pretty much the antithesis of the Cloud, and from that it follows that it's difficult/impossible to Isolate. Finally, if you don't have atomicity, how durable can it really be? One blade goes down, others could take its place - but that's only addressing the CPU problem, not the data problem. If the disk in East Bumfuck on which the transaction was written goes down before it can be propagated to the rest of The Cloud(tm)...

Re:BigData != "standard databases" (1)

kd6ttl (1016559) | more than 2 years ago | (#40969147)

Big data doesn't usually apply to transaction databases. ACID isn't relevant.

Re:BigData != "standard databases" (0)

Anonymous Coward | more than 2 years ago | (#40970781)

try [wikipedia.org] reading [wikipedia.org] a [wikipedia.org] bit [wikipedia.org] more [wikipedia.org] before [wikipedia.org] complaining [wikipedia.org] .

Re:BigData != "standard databases" (1)

kd6ttl (1016559) | more than 2 years ago | (#40969191)

This is what many people don't understand about big data. Big data does not have a good PR department, and its differences from traditional data processing have not been well explained.

Re:BigData != "standard databases" (1)

Prof.Phreak (584152) | more than 2 years ago | (#40969891)

Depends on the scale. If you're talking the scale of Google, yes, relational is probably out of the question... but anything slightly smaller (e.g. the hundreds-of-terabytes range) can be managed relatively well in Netezza or Greenplum, with standard SQL access.

Re:BigData != "standard databases" (1)

loufoque (1400831) | more than 2 years ago | (#40971273)

Data is modeled fundamentally differently than in relational systems.

Only if by "modeled fundamentally differently" you really mean "not modeled at all".

How 'bout Big Salespeople (4, Insightful)

EmperorOfCanada (1332175) | more than 2 years ago | (#40968661)

Have you ever met the salespeople from these companies? They are really, really good. They take closing a sale to a whole new level. These salespeople don't walk in off the street and say, "Hey, would you guys like a 50 million dollar data analysis package?" In government they work at the highest levels; then a directive to put out a tender that only fits one company suddenly comes out of nowhere, and poof, a mega-project takes off. With companies they work at the board-of-directors level, so again a team of "consultants" suddenly shows up and determines that what is needed is a multi-million-dollar data analysis system. Another approach is to buy out a consulting company that is already entrenched with a government or large corporation. If you fight the system, their "consultants" will discover that you are a useless tool and recommend your replacement. If you are reluctant, they offer you a crazy training package and suggest you come to their booth at a trade show in some exotic locale.

If all that doesn't work, there is always the buy-out: they find a decision maker they can't take out and offer her a juicy job that she will take shortly after the contract is signed: http://en.wikipedia.org/wiki/Darleen_Druyun [wikipedia.org]

So big data may or may not be a complete fad, but it is another way for salespeople to fool upper management into buying a zillion-dollar system instead of running a few well-crafted Python scripts on a dedicated machine and feeding them into an open-source graphing solution such as Graphite.
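For what it's worth, the Graphite half of that pipeline really is tiny; its plaintext listener takes one "metric value timestamp" line per data point (the host and metric name here are placeholders):

    import socket, time

    # Graphite's plaintext protocol: "metric.path value unix_timestamp\n"
    # sent to the carbon listener, port 2003 by default.
    def send_metric(path, value, host="graphite.example.com", port=2003):
        sock = socket.create_connection((host, port))
        sock.sendall(("%s %s %d\n" % (path, value, int(time.time()))).encode())
        sock.close()

    send_metric("stores.southeast.daily_sales", 48213.50)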

Re:How 'bout Big Salespeople (3, Informative)

zbobet2012 (1025836) | more than 2 years ago | (#40969585)

If a few well-crafted Python scripts can solve your data problem, your data isn't even remotely close to "big". Not to jump on you too hard here, but there is a shocking number of people on Slashdot who do this all the time. Big Data by its nature doesn't fit on a single box in the first place. If you can put all of the data in 2U, it's not very much data now, is it?

Big data and big data technologies may be buzzwords today, and you are probably right that most people don't need them. However, Big Data is a very, very real problem. I design and run systems which crunch 60-plus gigabits of data per second. So no, a few "well crafted python scripts" will accomplish exactly nothing.

Re:How 'bout Big Salespeople (2)

glitch23 (557124) | more than 2 years ago | (#40969851)

Big data and big data technologies may be buzzwords today, and you are probably right that most people don't need them. However, Big Data is a very, very real problem. I design and run systems which crunch 60-plus gigabits of data per second. So no, a few "well crafted python scripts" will accomplish exactly nothing.

Agreed. The OP doesn't realize just how big Big Data can be, how diverse it can be (binary vs text, structured vs unstructured, real-time or historical, etc.), and how much can be generated each day if he/she thinks that some scripts will fix the problem. When companies like EMC, Splunk, LogRhythm, Tibco, Q1 Labs, etc. exist to analyze and collect data for their customers and they have to throw millions into R&D then you know it's not just a fad.

Re:How 'bout Big Salespeople (1)

EmperorOfCanada (1332175) | more than 2 years ago | (#40970311)

I agree that big data is often crazy big. I wonder how some of this data is even moved around; 60 gigabits sounds amazing. But if the system is set up with MapReduce or some other cool tap into the data, often something simple can be crafted that will produce stunningly useful data. Other times what appears to be big data turns out to be not that big; it was the salesman who made his solution sound more ingenious than it was.

And yes, carefully crafted Python scripts can often perform interesting data analysis on petabytes, even on a laptop. Brute force, no; statistical sampling, yes.

If your discrete math/information theory skills are up to snuff, you can usually avoid brain-bendingly (for the CPU/GPU) difficult math and boil it down to something more elegant. If you properly apply some higher math to a carefully selected subset, you can often make shockingly precise and accurate statements about the whole, to a known level of certainty. So I am not saying some idiot with a few Python scripts will manage it, but if you have the talent around, you might be able to avoid a multi-million-dollar consulting company. Not applicable to all data, but unless you are talking about specific medical records or financial transactions, interesting generalizations are often worth much more than simple subtotals. Worth trying out first.
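A sketch of the sampling argument with invented numbers: estimate a population mean from a uniform random sample, with a confidence interval telling you exactly how precise the claim about the whole is.

    import math, random

    random.seed(42)
    population = [random.gauss(100.0, 15.0) for _ in range(10**6)]  # stand-in for the full set

    sample = random.sample(population, 10000)
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    stderr = math.sqrt(var / n)

    # 95% confidence interval: mean +/- 1.96 standard errors
    print("estimate: %.2f +/- %.2f" % (mean, 1.96 * stderr))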

There are a few people with massive data sets. But often what seems huge to the people buying these systems is actually quite small. They might say, "We have 1000 stores' inventory and sales data; that is huge." But it is really a few hundred gigs, with a few hundred megs of data being generated per day. It would only be massive if you printed it out. I witnessed one consulting company charging $5000 every time it printed out a report that was based on a fairly simple SQL statement. Their excuse was that it wasn't in the original spec and was maintenance requiring a DBA for a whole day. The database was a few gigs; the report a few dozen pages.

Re:How 'bout Big Salespeople (0)

Anonymous Coward | more than 2 years ago | (#40971185)

If you let a third party 'own' your data, don't be surprised when they exploit the monopoly over you.

Re:How 'bout Big Salespeople (1)

dkf (304284) | more than 2 years ago | (#40971873)

I agree that big data is often crazy big. I wonder how some of this data is even moved around; 60 gigabits sounds amazing. But if the system is set up with MapReduce or some other cool tap into the data, often something simple can be crafted that will produce stunningly useful data.

It all depends. The data is often very large at the point of collection and between there and the first point of analysis; it's only after that point that you can start to get the quantities down to a saner level. Even then it remains hard to ensure that you can actually search the data; you don't want the data to just sit there, you want to be able to do something useful with it. Yes, you can try putting some sort of tap on it as it is flowing past, but then you're always wondering whether you're monitoring the right thing; you might be ignoring something critical and you'd never know.

The Big Data trend is about capturing all of the data (or a much larger fraction than before) and looking through what you've got afterwards. Some of this is done with the likes of Hadoop, some with commercial software. There are both SQL and NoSQL DBs in the mix. And more than a few Python scripts, I'd be willing to bet. (This is all independent of where the data is actually stored; that depends on the particular application, but you don't get into this sort of thing without thinking carefully about what you're doing. The physical constraints on moving that much data around mean you can't blunder in without a clue.)

Re:How 'bout Big Salespeople (1)

microTodd (240390) | more than 2 years ago | (#40971835)

I was going to post a reply, but you are spot-on. The REAL Big Data setups I've seen... no, this wouldn't work. You can't just write a Python script and send it to a single graphics package.

Because as you said... 1PB isn't just a single box. It's a cluster of SAN arrays spread out over an entire datacenter. Simply getting a look at the entire dataset is a challenge in itself.

Re:How 'bout Big Salespeople (0)

Anonymous Coward | more than 2 years ago | (#40972389)

I work on big data projects. If you want the technical side, the question is: how many disk controllers do you have? You can buy tons of SAN, z/OS tape, or thousands of commodity servers. We can mine all day long in Teradata, but how much do we spend on licensing? DB2 has the same capabilities, and so do Oracle and every other enterprise server. Chordiant Decision Manager, SAS, Tibco, FICO, Blaze... the tools have been around forever. How many tools do you have for ETL, and the SMEs to support that activity? Now, if you can store data semi-structured and parse that data using commonly known tools without a capital outlay in the millions, what would you call this approach? Nothing in Big Data is new, but the cost-benefit equation is definitely changing. At least that is what I see as a drone in the megacorp.

Big data has been around for a long time (0)

Anonymous Coward | more than 2 years ago | (#40968759)

IBM knows a thing or two about storage.

http://en.wikipedia.org/wiki/IBM_1360 [wikipedia.org]


Hoarders!! (1)

zenlessyank (748553) | more than 2 years ago | (#40968829)

I think they have a show on A&E all about it!!!


Obese data (1)

minstrelmike (1602771) | more than 2 years ago | (#40969921)

Obese data means being too big to fail. That's why it's such an attention-getter these days.


It only got big (1)

Kohath (38547) | more than 2 years ago | (#40970649)

Because it was so cromulent.

the cloud... (1)

crutchy (1949900) | more than 2 years ago | (#40970691)

... is just another name for the ignorance we cling to so desperately to avoid having to actually solve problems

Tasha Yar (0)

Anonymous Coward | more than 2 years ago | (#40971825)

We all know big Data got bigger when Tasha Yar took advantage of his anatomically correct and fully functional manhood.

Fourth Paradigm: Data-Intensive Scientific Discovery (1)

elyons (934748) | more than 2 years ago | (#40971953)

For a good read on this problem, I highly recommend the Fourth Paradigm: http://research.microsoft.com/en-us/collaboration/fourthparadigm/ [microsoft.com] .

This is a free ebook download from Microsoft, in which a variety of leaders in data-driven science write chapters about a range of scientific disciplines and what "big data" means to them. The first chapter is especially enlightening! Blurb about the book:

Increasingly, scientific breakthroughs will be powered by advanced computing capabilities that help researchers manipulate and explore massive datasets.

The speed at which any given scientific discipline advances will depend on how well its researchers collaborate with one another, and with technologists, in areas of eScience such as databases, workflow management, visualization, and cloud computing technologies.

In The Fourth Paradigm: Data-Intensive Scientific Discovery, the collection of essays expands on the vision of pioneering computer scientist Jim Gray for a new, fourth paradigm of discovery based on data-intensive science and offers insights into how it can be fully realized.