
Too Much Data? Then 'Good Enough' Is Good Enough

timothy posted more than 2 years ago | from the ready-when-it-ships dept.

Data Storage

ChelleChelle writes "While classic systems could offer crisp answers due to the relatively small amount of data they contained, today's systems hold humongous amounts of data content — thus, the data quality and meaning is often fuzzy. In this article, Microsoft's Pat Helland examines the ways in which today's answers differ from what we used to expect, before moving on to state the criteria for a new theory and taxonomy of data."


56 comments


MS was right after all (0, Offtopic)

slashdotdottter (2226092) | more than 2 years ago | (#36326320)

640K is the best memory size after all, and more than that is too much...
Really, can't Slashdot select better articles? Like the declaration that SpaceX will be creating a moon rocket [tinyurl.com] that will land on the Moon in 2020.
I visited their site today and that hit me as a surprise...

GOATSE ALERT (1, Informative)

commisaro (1007549) | more than 2 years ago | (#36326376)

GOATSE ALERT

Re:GOATSE ALERT (1, Insightful)

drb226 (1938360) | more than 2 years ago | (#36326416)

A tinyurl almost always means goatse. Honestly, trolls, you can do better (pick a more obscure, or even homemade, URL shortener).

Re:GOATSE ALERT (2, Informative)

Anonymous Coward | more than 2 years ago | (#36326558)

I do.
I have, for example, the http://www.thoughts.com/geekatwork/at-work.
I am putting goatse onto various wikis, blogs, etc. Works like a charm.
Lately I'm somewhat busy, so I use that tinyurl link
(but that link isn't that simple: it contains a 'data:' blob which decodes to a tiny HTML page that embeds goatse,
so that should fool extensions that resolve shortener links).
Anyway, today's trolling got me 279 victims. I can do better than that sometimes.
(In total I've got around 10,000 victims on goatse.ru so far, and a handful of hundreds on tubgirl, some random gay porn link, and hai2you, but I prefer to post goatse.)

I also keep a collection of responses (both positive and negative) - aka troll food.

Favorites:
"Ugh. Goatse. NSFW. Asshole (poster and picture, both)."
"Seriously ... new account to post that ... what a douche!"
"You're a fucking douchbag." - "That is the most accurate comment yet"
"I hope you die in a fire before you are old enough to contaminate the gene pool."
"Death to all assholes - Let's put you first into the guillotine"
"Asshole... Ginormous asshole, in fact."
"Ugh. Goatse. You asshole."
"Better than you, you arse bandit."

Hate:
"I hate your guts."
"WTF you fucking asshole."
"Damn! Mod this fucker to hell"
"Fucking troll, do not click there"
"It would be more interesting if I had a piece of pipe and your face, in close proximity so I could smash your face beyond recognition,"
"You fucker" - "I had the same thought as you. What a fucking asshole. The link is nsfw."
"Bravo teeny bopper. You're a really mature mother fucker (or do you prefer father fucking? Damn you homo erotic shittter)."
"Wait! I think I hear your mommy calling to give your tongue a good soap washing. And maybe she'll execute you too"
"You fucking piece of shit!" , "You sorry piece of shit." , "You cunt.", "Fuck you."
"It's because of Assholes like you that I can no longer trust URL shorteners"
"I did not even bother to look, but this same idiot has been doing this for weeks now. Fuck off asshole."
"What a retard..... enough said...."

Funny:
"Didn't click it, but the magic 8-ball says goatse."
"Thanks, I'm reading slashdot in class like a good student and just got tubgirl'd."
"not gonna click it to find out, but I'd be surprised if parent's link wasn't goatse... It appears you would be correct sir. Why oh why do I always forget.."
"Watching second monitor, there was something wrong with the other screen. Control + w. Phew..."
"Doh! One has to also recognize data urls. *sigh*"
"That's somewhat clever, but some of us do know what base-64 encoding is."
"Can you not afford normal entertainment?"
"Hey family! Come look! They're opening the Google Talk client! Now, click here...... (sees goatse)"
"I tried to post warnings about the goaste loving jerk yesterday but was modded into oblivion as a karma whore"
"Turn on TinyUrl previews. It saves lives."
"Posting your picture online again?"
"Really? Are you not tired of this yet?"
"High likelyhood of being a Goatse link. Proceed with caution"

Emotion:
"i WAS eating lunch you ass!"
"Oh dear god my eyes. Haven't seen THAT awful image in a while."
"My eyes are burning... argh! Damn you!"
"MY EYES... dude i am at work here "S "
"WARNING: Don't click on the parent's link! Damn goatse! The first I experienced, no less.
"Oh goddammit. I didn't need that right before bed."
"goatse warning! I'm still recovering."
"Please friends, I beg of you, do not click that link! Do not look at that image, whatever you do! It is a bad image! It is a goatse image."

Frustration:
"Can someone make a fucking goatse blocker firefox plugin please? This is pissing me off now."
"I am sick and tired of that crap on /. "

Philosophy:
"Goatse trolls are getting better these days..."
"Why the sudden coordinated campaign for Goatse? Is someone making money off this?"
"You're right, this is the most coordinated troll campaign in a long time. Multiple accounts, multiple pages."
"Urgh...dammit, am I the only one thinking the goatse trolls are getting worse lately than they have been in the past five years?"
"Who found a way to monetize goatse at this late date? If we got half the effort of that campaign on real stuff we'd all have better software by now."
"Boy Goatsex is out in force today... - Every topic is littered with them..."
"You can't actually expect the Slashdot users to actually know enough not to respond to a goatse troll, right ?"
"Can we start banning people who post that hiding it behind a url shortening link like goo.gl?"

Admiration:
"You are one dedicated troll."
"Well played, sir. Well played."
"A link that redirects to a page containing goatse? How clever of you!"
"Congrats. It's been a long time since I saw goatse."
"Thank you for that informational link"
"Interesting use of Data URLs for Goatse linking."
"Link is self portrait of ME"
"Goatse URL - Haven't seen that guy in a while"
"Well played, sir. It's been a while since I've been Goatse'd"

Misc:
"The fuck is a goatse? it's some dude pulling his arse open."
"Could not someone at slashdot write a small script to blacklist url's that have been flagged troll? I'll do it if you pay me a slave wage..."
"Parent should be modded down. Link is NSFW and mentally scarring."
"Just post the damn url, i'm not going to click on a tinyurl link and get goatse'd or something.."
"Someone please mod this guy down... Don't click his link."
"Mod to -1, please. this guy is an 'asshole'.... (yes, you guessed it)"
"Don't click the link! Goatse wannabe."
"Danger, goatse"

Re:GOATSE ALERT (0)

Anonymous Coward | more than 3 years ago | (#36327184)

Here's another one for your collection: why do you even bother? Are you 12 years old?

Re:GOATSE ALERT (0)

Anonymous Coward | more than 3 years ago | (#36333608)

None of that. It's proof that Russians are essentially worthless -- they are spoilers that have nothing to actually create, they can only play the role of spoiler -- they are sociopaths that make great drunken criminals.... You're well on your way, pal. Or, it could just be Russians making up for small peckers; either way, I'm sure we'll see you in New York soon selling broken electronics and scamming people (if you're any good, that is).

Re:GOATSE ALERT (1)

jpate (1356395) | more than 2 years ago | (#36326418)

I more or less assume that that's what all "shortened" links are ;)

Re:MS was right after all (1)

GameboyRMH (1153867) | more than 3 years ago | (#36330694)

Your trolling skills are weak. You chose to use a URL shortener with a preview service.

http://preview.tinyurl.com/5szfvml [tinyurl.com]

Learn to troll or face repeated humiliation.

Re:MS was right after all (1)

hellop2 (1271166) | more than 3 years ago | (#36335972)

The parent made an account, and spent the time to come up with a half-way plausible post just to goatse people.

What's the psychological term for that? Psychopath?

Obligatory (2)

Lifyre (960576) | more than 2 years ago | (#36326350)

Obviously 640k was "Good Enough"

Seriously though, he makes a good point: if you have so much information that it isn't stored consistently, is held to varying standards, or sits in open fields populated by individuals' perceptions, the answers you get out of it turn fuzzy. The different names for green (green, emerald, asparagus, chartreuse, olive, pear, shamrock) are a great example of how one piece of information can be expressed in multiple ways. You can define the color by its hex code, but that isn't exactly an elegant or user-friendly method of input or output.

He talks about various ways to handle these types of information, from limiting input options to finding patterns and using those to "correct" the data.

Re:Obligatory (3, Insightful)

Fluffeh (1273756) | more than 2 years ago | (#36326466)

It's not that there is too much data. That's not a problem at all. From my own experience (I work as a senior analyst for a multinational retailer employing around 200,000 people) it is rather that there isn't a single plan to utilize all the data we have available. Every time we introduce a new system or change the way we do something, the project inevitably drops a new table into our data warehouse. Now, this may seem like an acceptable way to do things, but after this has happened twenty times, it is nigh impossible to run a query that will return data from all these tables in any sort of reasonable time.

Would it cost more time, effort and money to properly introduce the new data to proper fact tables each time? Of course. However, the benefits would be that we could stop pretending that "we have too much data these days..." - because we don't. We just have too much mess with our data and it becomes unusable.

In the example above (different descriptions for green) the base system may need those particular terms, but if the data needs to be aggregated or used in another system, then the jobs that pass it to your data repository need to adapt it to work with the rest of your data warehouse. Having said that, if the new system is being developed in-house, then during development the question should be asked: "Can we store the color information in RGB right off the bat and adapt our own system to mask these values behind pretty descriptions?" rather than having to do it later via an ETL.
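A minimal sketch of that suggestion (the color list, hex values and row shape here are my own assumptions, not from the post): store the canonical RGB hex value, and treat the pretty descriptions as a presentation-time lookup.

```python
# Sketch only: canonical color storage, with display names kept out of the stored data.
# The color list and row layout are invented for illustration.

CANONICAL_HEX = {
    "green": "#008000",
    "emerald": "#50C878",
    "chartreuse": "#7FFF00",
    "olive": "#808000",
}

DISPLAY_NAME = {hex_code: name for name, hex_code in CANONICAL_HEX.items()}

def to_canonical(free_text_color: str) -> str:
    """Map whatever the source system captured to the hex code we actually store."""
    return CANONICAL_HEX[free_text_color.strip().lower()]

def to_display(hex_code: str) -> str:
    """Mask the stored value behind a pretty description at presentation time."""
    return DISPLAY_NAME.get(hex_code, hex_code)

row = {"sku": 12345, "color": to_canonical("Emerald")}   # stored as "#50C878"
print(to_display(row["color"]))                          # shown as "emerald"
```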

Re:Obligatory (4, Insightful)

icebike (68054) | more than 3 years ago | (#36326698)

It's not that there is too much data. That's not a problem at all.

Often (more often than not, I contend) there is indeed just too much data.

Just because we have all these marvelous computerized data capture systems doesn't mean the data is necessary, useful, or worth keeping. However, someone always comes along at the project design stage and insists that the millisecond-by-millisecond weight of a bag of popcorn, recorded in real time as it is being filled, is going to provide a wealth of data for the design of future bagging systems and materials handling in general.

The scale was only there to assure that 10 pounds were in the sack and to shut the hopper. Then some fool found out it measured every few milliseconds and recorded the data.

So the project manager gets browbeaten into recording this trash, which invariably never gets used by anyone for any purpose at any time, while those who lobbied for it wander off to sabotage other projects and never revisit the cesspool they created.

This happens way way more than you might imagine in the real world these days.

It used to be that projects had to fight for every byte of data collected; there were useful sinks identified for every field. But with falling storage costs the tendency is to simply keep shoveling it in, because it's easier than dealing with the demands of those "researchers" looking for another horse to ride.

Re:Obligatory (4, Insightful)

StuartHankins (1020819) | more than 3 years ago | (#36326850)

+1 Insightful. I would argue that -- just like you have a lifecycle for software development -- you have a lifecycle for nontrivial amounts of data. Some data is useful in detail for a short term, but wherever possible it should be more coarsely aggregated as time progresses, and you should get sign-off from executives that it can be dumped after a period of time.

Where I work, I estimated the cost to upgrade our SAN to continue to store a set of large tables which helped everyone understand the cost in real terms. People tend to think once the data is imported or created that it's a small incremental cost to house it from that point forward, but backup times and storage along with execution plan costs increase with size. There is a performance benefit to this trimming; partitioning and check constraints will only get you so far.

What is difficult to gauge in advance sometimes is how the data will be used -- some things are obvious in the short-term, but as the company looks to different metrics or to shine some light on an aberration, you really need to be able to determine how quickly you can dump the detail. Get signoff then add some padding so you are conservative when you destroy. Make a backup "just in case" and delete it after a few months. The good news in my work is that changing your mind later to adapt to the new requirements means expectations are already set to change the way it works "from this point forward". There are many fields of work that do not have that luxury, because of the time or cost to gather detail again.
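A rough sketch of that lifecycle idea, with an invented table and a hypothetical 90-day retention window: detail older than the cutoff is rolled up into monthly totals and only then dropped.

```python
# Sketch: age out detail rows into a coarser monthly aggregate.
# Dates, amounts and the 90-day retention window are invented for illustration.
from collections import defaultdict
from datetime import date, timedelta

detail = [
    {"day": date(2011, 1, 3), "store": "A", "sales": 120.0},
    {"day": date(2011, 1, 4), "store": "A", "sales": 95.5},
    {"day": date(2011, 5, 2), "store": "A", "sales": 80.0},
]

def roll_up(rows, keep_days=90, today=date(2011, 6, 1)):
    cutoff = today - timedelta(days=keep_days)
    monthly = defaultdict(float)
    kept = []
    for r in rows:
        if r["day"] < cutoff:                  # old enough: aggregate, then drop the detail
            monthly[(r["day"].year, r["day"].month, r["store"])] += r["sales"]
        else:                                  # recent: keep the detail row
            kept.append(r)
    return kept, dict(monthly)

kept, monthly = roll_up(detail)
print(monthly)   # {(2011, 1, 'A'): 215.5}
print(kept)      # only the May row survives in detail form
```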

Re:Obligatory (1)

postbigbang (761081) | more than 3 years ago | (#36327332)

I think there's a Monty Python episode like that, called "Fade to Black". Data ought to have a half-life. Otherwise, it will conquer the world. No one throws stuff out. I wonder how many SAN "accidents" are just frustrated DBAs.

off topic - sig (1)

TaoPhoenix (980487) | more than 3 years ago | (#36328564)

"---- Teach Peace. It's Cheaper Than War."

Nice sig. Now the Republicans who spent too much are trying to blame it on Obama.

Re:off topic - sig (1)

Magic5Ball (188725) | more than 3 years ago | (#36328670)

Peace is easy as long as everyone lives under my rules.

Re:off topic - sig (1)

zach_the_lizard (1317619) | more than 3 years ago | (#36331498)

Both major parties are complicit in the spending and war disasters. Obama may not have started the wars in Afghanistan and Iraq, but he's certainly done little to end them. He has added his own excursion in Libya with vague threats to Syria and others, so who knows.

Re:Obligatory (1)

sbjornda (199447) | more than 3 years ago | (#36327072)

isn't a single plan to utilize all the data we have available.

Fluffeh, if you haven't done so already, I think you would enjoy taking John Zachman's course in Enterprise Architecture. If one were building an addition on one's house, would one just start hammering things in place or would one look at the existing plans first? Too many IT projects just look at their own plans and don't look at the larger plans they should fit into. And in most IT shops those larger plans don't exist anyway. So we just hammer things into place and wonder why the data doesn't work out.

--
.nosig

Re:Obligatory (1)

guruevi (827432) | more than 3 years ago | (#36329748)

With good design, or by dumping all your data daily to a platform designed for such data analysis, your problems would be solved (given enough money, time and brain resources, of course).

The problem, imho, is that we collect too much cruft and we're simply unsure what to do with it. The systems for good data queries have been designed, be they SQL, NoSQL or some specific BI solution. The problem is that most DBAs don't know how to use the collection of them correctly, there are in many cases no "data architects" on either the dev or database teams, and the consultants (90% of whom these days you should actually be calling sales consultants or sales engineers working for another company), who are rarely hired for these jobs, of course push to sell whatever they get the highest commission for.

So we collect data because it's useful in one way, but if we want to get it out another way people start scratching their heads and saying they have too much data, while actually no, you have too little experience or knowledge to handle this data. If they hire a consultant or a company to 'help' them, they get their data out in that other way, but usually strictly designed and limited to the job they asked for, so if somebody comes along and wants another type of reporting they have to go back to the drawing board and design yet another layer to get stuff out.

I have seen well-designed data warehouses and they are not at all scary; none of them has too much data, and they pull in data from several different sources in different formats and can offer it back in a consistent format. The key is not to push in all the data at once (a big mistake a lot of companies make: "we have the solution, dump everything in it overnight the way we have it now") but to go data source by data source, defining consistent data types and tags for each row/table/XML you pull in, while making the system available and listening to people about how they get and want data out. It took one of the companies 10 years to get about 100 of their major data feeds (each feed containing hundreds if not thousands of tables) defined and into the system, and they're still working on it as new data sources are added. In the meantime it was usable and people have been using it since the beginning; the data just gets richer as it goes along, and the architects get feedback on what the users want and what they're missing.
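A toy version of that feed-by-feed discipline, with invented feed names, fields and types: each source gets an explicit mapping to canonical column names and types, and rows that don't fit are rejected rather than dumped in as-is.

```python
# Sketch: per-feed schema definitions that coerce incoming rows to canonical columns.
# Feed names, column mappings and types are hypothetical.

FEEDS = {
    "legacy_orders_csv": {
        "ord_no":   ("order_id", int),
        "cust":     ("customer_id", int),
        "amt":      ("amount", float),
    },
    "webshop_xml": {
        "OrderId":  ("order_id", int),
        "Customer": ("customer_id", int),
        "Total":    ("amount", float),
    },
}

def load_row(feed_name, raw_row):
    mapping = FEEDS[feed_name]
    out = {}
    for src_field, (canonical, cast) in mapping.items():
        if src_field not in raw_row:
            raise ValueError(f"{feed_name}: missing field {src_field!r}")
        out[canonical] = cast(raw_row[src_field])
    return out

# Two differently shaped feeds land in the same canonical form:
print(load_row("legacy_orders_csv", {"ord_no": "17", "cust": "4", "amt": "9.99"}))
print(load_row("webshop_xml", {"OrderId": "18", "Customer": "4", "Total": "12.50"}))
```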

Re:Obligatory (2)

smallfries (601545) | more than 3 years ago | (#36328972)

It's not a new idea, it's been explored before and it only works in certain cases. Take a look at Ontologies are overrated [shirky.com] . From the section called "Mind Reading":

You can't do it. You can't collapse these categorizations without some signal loss. The problem is, because the cataloguers assume their classification should have force on the world, they underestimate the difficulty of understanding what users are thinking, and they overestimate the amount to which users will agree, either with one another or with the catalogers, about the best way to categorize. They also underestimate the loss from erasing difference of expression, and they overestimate loss from the lack of a thesaurus.

Not just large sets of data (1, Insightful)

Daetrin (576516) | more than 2 years ago | (#36326392)

The data quality and meaning of this summary is rather fuzzy. I have no clue what exactly they're talking about. No, I haven't RTFA yet, but the summary isn't making it very clear whether TFA is something I'd be interested in or not.

Re:Not just large sets of data (2)

QRDeNameland (873957) | more than 2 years ago | (#36326438)

From a quick skim, it seemed to be yet another "SQL/RDBMS is dying because we have too much heterogeneous data to handle", and a rather ambling and long-winded one at that.

Re:Not just large sets of data (1)

Anonymous Coward | more than 3 years ago | (#36326654)

Can you recommend other articles that have a better take on it? I'm fascinated by this stuff, but I find it difficult to explain to people why this is useful, and why it's a growing field.

This is one from O'Reilly [oreilly.com] that is decent.

Re:Not just large sets of data (1)

tqk (413719) | more than 3 years ago | (#36328100)

Can you recommend other articles that have a better take on it?

Or, for the darker side, grep /. for "NSA whistleblower".

Re:Not just large sets of data (1)

buswolley (591500) | more than 3 years ago | (#36327024)

Really. This seemed to be one of the better articles I've seen featured on /. in the last year.

Re:Not just large sets of data (1)

JumpDrive (1437895) | more than 3 years ago | (#36327834)

I agree with your assessment.
This was a long, ambling, caffeinated ramble about too much data to handle. To some degree I think it does a disservice when the author doesn't state up front what context he prescribes for this data agglomeration. RDBMS is not going away anytime soon. Too many people are jumping on this as a hot topic and start speaking/thinking metaphorically about data when they have very little information about the actual data and understanding of its relationships.
It's kind of like learning mathematical integration without knowing how to add and multiply.
There are many, many applications where you need a distinct and verifiable/structured data source.

It would be nice if he had gone into maybe doing away with the prescriptive schema that is MS Office.

Re:Not just large sets of data (2)

interkin3tic (1469267) | more than 3 years ago | (#36328594)

I skimmed the article, and I can say this much: they mean a fairly specific type and use of data.

Too much data from scientific results? Only the researcher him or herself would ever say there's "too much data." Everyone else says "not enough data." Everyone. At all times. Especially his committee and reviewers. Even when I've worked so hard for so long for so little money. After all, THEY'RE not the ones who are sacrificing their happiness, time, effort, hairline, and relationships to...

Uh, I mean, yeah, the article is vaguely worded and doesn't apply universally, and my life sucks.

Here's the one line summary of TFA: (5, Informative)

billrp (1530055) | more than 2 years ago | (#36326410)

SQL DBs are not appropriate for storing, processing, querying, and browsing unstructured documents.

Re:Here's the one line summary of TFA: (1)

Magic5Ball (188725) | more than 3 years ago | (#36328684)

I think the important insight is that Parkinson's Law applies not just to the quantity of data, but also to the varieties of data.

Duh! Of course! (0)

Anonymous Coward | more than 2 years ago | (#36326414)

Of course he complains; he probably doesn't have ADHD. Today's data sets and information-overload age are adjusted, guessed, for people with ADHD. Such people can actually make sense of this humongous data. Because of their lack of attention they probably created some sort of native algorithms to "innovate" the necessary lost data, and with their attention deficit they lose a lot of information on intake => reasonably sized relevant data!

And there was much rejoicing... "yay." (4, Interesting)

VortexCortex (1117377) | more than 2 years ago | (#36326440)

A bunch of rambling self-evident or speculative statements, followed by conclusion:

Conclusion

NoSQL systems are emerging because the world of data is changing. The size and heterogeneity of data means that the old guarantees simply cannot be met. Fortunately, we are learning how to meet the needs of business in ways outside of the old and classic database.

Which was apparent to everyone, and missed the real point: We have lots of data, and we're too impatient to wait for it to be aggregated, synchronized and processed. There goes 10 minutes of my life I'll never get back.

Here's a hint: People working on the solutions to this problem work in the financial sector and in quantum physics.

Re:And there was much rejoicing... "yay." (1)

Anonymous Coward | more than 3 years ago | (#36326890)

Here's a hint: People working on the solutions to this problem work in the financial sector and in quantum physics.

Or journalists [wikipedia.org] . Or intelligence agencies [wikipedia.org] . Or any business that's large enough to have information silos [wikipedia.org] . Or transportation departments [dot.gov] . Or internet startups [nytimes.com] .

Re:And there was much rejoicing... "yay." (1)

Anonymous Coward | more than 3 years ago | (#36327322)

Which was apparent to everyone, and missed the real point: We have lots of data, and we're too impatient to wait for it to be aggregated, synchronized and processed.

And that management doesn't do its job, the business process engineering consultant is too expensive, the company is no longer a monolithic, singular entity with a clear hierarchical organization, and the number of collaborations between small organizations and organizational units in a merged company increases as the most efficient way of performing a task is sought. The article tries to push a problem of human organization onto the field of technology.

Re:And there was much rejoicing... "yay." (0)

Anonymous Coward | more than 3 years ago | (#36330988)

There goes 10 minutes of my life I'll never get back.

I made it about 5 minutes in and couldn't take it any more.

CAPTCHA: endured, ehhh, not so much.

Too Long; Do not Read (5, Interesting)

Comrade Ogilvy (1719488) | more than 2 years ago | (#36326456)

The researcher is just throwing together a bunch of problems that have existed, in some fashion, for a very long time, and concludes with open questions rather than even vague proposals for solutions. So I would say this article is both too detailed, and not detailed enough.

Transaction geek? (1)

MojoRilla (591502) | more than 2 years ago | (#36326510)

From TFA:

As a transaction geek, I've spent decades working on systems that provide a guaranteed form of consistency.

Uh... so you spent decades working on systems which are not needed for many problems (many problems don't need transactions, especially mostly-read web publishing problems, which is a strength of NoSQL), and now you are upset that people are not using your systems?

Re:Transaction geek? (1)

Anonymous Coward | more than 3 years ago | (#36327172)

You are talking about removing the C of ACID. Remove any one of those 4 and you get speed. The guys who came up with ACID knew it.

http://en.wikipedia.org/wiki/ACID [wikipedia.org]

However, some problems lend themselves to not having ACID involved. Such things include data warehousing and web views of transient data.

For other problems you *want* it to work perfectly, every time, such as when you are a business printing out bills. Some of the data going missing: not so good...

Now take, say, a discussion forum. You want the top-most comments to show up, but middle ones could be missing. Otherwise you end up with conversations that do not make as much sense. What would you use there?

Use the right tools for the right jobs and all...
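To make the "right tool for the right job" point concrete, here is a small sketch (the data shapes and functions are invented, not from the comment): a bill should fail loudly if anything is missing, while a comment view can simply render whatever arrived.

```python
# Sketch: strict vs. tolerant reads for two different jobs.
# Data shapes are invented for illustration.

def print_bill(line_items):
    """Billing: every line must be present and complete, or we refuse to print."""
    if any(item is None for item in line_items):
        raise RuntimeError("refusing to print an incomplete bill")
    total = sum(item["amount"] for item in line_items)
    return f"TOTAL DUE: {total:.2f}"

def render_thread(comments):
    """Forum view: show what we have; a missing middle comment is tolerable."""
    return [c["text"] for c in comments if c is not None]

print(print_bill([{"amount": 10.0}, {"amount": 2.5}]))        # "TOTAL DUE: 12.50"
print(render_thread([{"text": "first!"}, None, {"text": "late reply"}]))
# print_bill([{"amount": 10.0}, None]) would raise instead of guessing.
```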

Illiterate title (1)

Aardpig (622459) | more than 2 years ago | (#36326528)

Should be 'Too Many Data'. Morans.

Correct answer: fuck you and your data! (1)

Alex Belits (437) | more than 2 years ago | (#36326588)

The article makes an assumption that all data in the world consists of marketing surveys and transcripts of phone wiretaps.

Confused and incomplete (3, Interesting)

lucm (889690) | more than 3 years ago | (#36326838)

This article is confusing because most of the verbiage is made up by the author (such as "inside" or "locked" data). It is also misleading because it seems to indicate that structured and unstructured data usage are the same. Well, they're not: a very large proportion of unstructured data is blog posts and emails, but the amount of search and aggregation performed on this type of information outside of a few major companies (such as Google) is very low, which makes this usage a niche and not a trend maker.

The reality is that there are three categories of data that are relevant for databases: numbers, text and spatial. Everything else, which falls under the umbrella of "binary", is very unlikely to benefit from a database engine; only the metadata can be manipulated, and this metadata falls under one of the other categories and is a very good target for ETL. And so far nobody has come up with a reliable way to search binary, such as video or audio, without relying on heavy indexing, metadata or some kind of transformation that takes binary and makes it text data.

If a piece of data cannot be searched or aggregated, it does not belong in a database; it belongs on a filesystem. Anything can be done with blob columns, but performance is usually not very good because the database engine cache is not designed for large objects, NoSQL or not.

Also, there is so much happening with storage infrastructure, such as sub-volume tiering or block-level replication, that any analysis of data that does not take a look at storage is flawed.
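A minimal sketch of the "blob on the filesystem, metadata in the database" split described above; the file layout, columns and the SQLite choice are assumptions for illustration.

```python
# Sketch: keep the binary on disk, keep only searchable metadata in the database.
# File layout and columns are hypothetical.
import hashlib
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE media (
    id INTEGER PRIMARY KEY,
    path TEXT,            -- where the blob actually lives on the filesystem
    sha256 TEXT,          -- integrity / dedup check
    mime TEXT,
    bytes INTEGER,
    title TEXT            -- the kind of text metadata you can actually search
)""")

def register(path, mime, title):
    """Record a file's metadata; the file itself never enters the database."""
    with open(path, "rb") as f:
        data = f.read()
    db.execute("INSERT INTO media (path, sha256, mime, bytes, title) VALUES (?,?,?,?,?)",
               (path, hashlib.sha256(data).hexdigest(), mime, len(data), title))

# register("/srv/media/talk.mp4", "video/mp4", "Pat Helland on consistency")
# Queries then hit the metadata, never the blob:
# db.execute("SELECT path FROM media WHERE title LIKE ?", ("%consistency%",))
```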

Re:Confused and incomplete (1)

tqk (413719) | more than 3 years ago | (#36328278)

The reality is that there are three categories of data that are relevant for databases: numbers, text and spatial. Everything else, which falls under the umbrella of "binary", is very unlikely to benefit from a database engine; only the metadata can be manipulated ...

Ya know, my email client, via its ~/.mailcap assignments, manages to handle blobs fairly well. What's wrong with your tech?

Never blame the technology. Blame the bum who's misusing it. Not saying that's you. But if mutt can do it, why can't Larry's Oracle, et al?

Re:Confused and incomplete (1)

WarwickRyan (780794) | more than 3 years ago | (#36328558)

You can search inside video files and pictures with your email client? Where do I sign up?

Re:Confused and incomplete (1)

tqk (413719) | more than 3 years ago | (#36330718)

You can search inside video files and pictures with your email client?

You've video and picture files your db can't open up correctly? Why?

Re:Confused and incomplete (1)

dkf (304284) | more than 3 years ago | (#36329570)

The reality is that there are three categories of data that are relevant for databases: numbers, text and spatial. Everything else, which falls under the umbrella of "binary", is very unlikely to benefit from a database engine; only the metadata can be manipulated, and this metadata falls under one of the other categories and is a very good target for ETL.

Actually, it depends on whether you can define relations over the data. The set of relevant relations will vary with the data type. For example, I can imagine it being possible to do searches over images, sounds or movies; there is fundamental structure there, relations are definable. That's not to say it is easy, or that we know the right set of relations, or that implementations are good yet, but to dismiss it as impossible? You jump too far.

Re:Confused and incomplete (1)

lucm (889690) | more than 3 years ago | (#36334456)

> For example, I can imagine it being possible to do searches over images, sounds or movies; there is fundamental structure there, relations are definable.

What you're talking about is metadata. Defining an index of sorts to store patterns and checksums does allow one to establish relations between images, but the search is then performed on said metadata, not on the binaries. A rule of thumb: if you must index data before you can search it, then you cannot search the data, you can only search the metadata.

Just think of a Google search. When you search for a keyword, you do not actually scan websites; you simply query a database where the keyword is associated with the URL. In order for this to work, an indexing process must already have been performed.

Now, it is possible to write a utility to extract patterns from an image; the police use one of these to find kiddie pr0n on a suspect's computer. However, I am aware of no database engine that can perform this kind of thing.
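That keyword-to-URL lookup is essentially an inverted index. A toy version (documents and tokenization invented here) shows why the search only ever touches what the indexer extracted, never the underlying content:

```python
# Sketch: a toy inverted index -- searches hit the index (metadata), never the pages themselves.
from collections import defaultdict

pages = {
    "http://example.com/a": "too much data is good enough",
    "http://example.com/b": "good enough answers from fuzzy data",
}

index = defaultdict(set)
for url, text in pages.items():          # the offline "indexing process"
    for word in text.lower().split():
        index[word].add(url)

def search(keyword):
    return sorted(index.get(keyword.lower(), set()))

print(search("fuzzy"))   # ['http://example.com/b']
print(search("data"))    # both URLs
```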

> dismiss it as impossible? You jump too far.

I don't think it is impossible; actually, I am looking forward to a product that will perform search and aggregation on images and videos. It would make the YouTube experience much more interesting, because relying on metadata is always the weakest link. That said, I won't hold my breath, because it is a very big step.

Any slashdotter coulda told him that. (5, Funny)

Anonymous Coward | more than 3 years ago | (#36326908)

We don't read articles, just skim the headline, maybe the submittal, and then a few top ranked posts.

That's Good Enough! (tm)

Statistics with Comp Sci a kick-ass combo. (2)

PerlPunk (548551) | more than 3 years ago | (#36328202)

This is why Statistics will become more and more important over time -- it allows you to make inferences about populations that you couldn't possibly count. If you already know Comp Sci or learned how to program on your own, go for a couple of Stats degrees. Along with your programming skills, Stats will serve you very well as the information age unfolds.

Another idiotic story (1)

Whuffo (1043790) | more than 3 years ago | (#36328462)

Using impure data to make real-world decisions is just plain wrong. This is how 5-year-olds end up on the "do not fly" list, how credit scores get incorrectly reported -- add your own examples of how idiots read more from data than it contains.

So-called scientists saying it's OK to just take a guess only shows what scientists have become in this modern world. Once you get to that point, you may as well throw out the data and base your guess on whatever floats your boat. It wouldn't be any less valid - and no less "scientific" according to this bozo.

Boolean logic doesn't adequately describe our perception of reality, and trying to force reality into a true-false description is simplistic and doomed to fail. There's another valid state - "I don't know". And if the dataset is impure or inconsistent, then that's the only valid conclusion.

Re:only valid conclusion (1)

TaoPhoenix (980487) | more than 3 years ago | (#36328612)

It's the problem of Significant Figures for verbal data sets.

Last I recalled, you can only keep the number of significant figures of the fuzziest of the inputs. So 45.236 + 12.877 + "one million" means your answer can only keep the one significant figure of "a little over a million".

So for these non-verbal data sets, you get too many data fields: people forget to put the stuff in ref1, someone puts a date instead of an invoice number in ref2, the vendor code is wrong in ID1, someone puts an employee instead of the lumber yard in ID2, etc. Then when the boss wants "gimme the total set of cases I need to go manage", you get bad searches.

Re:Another idiotic story (1)

N1AK (864906) | more than 3 years ago | (#36329200)

Using impure data to make real world decisions is just plain wrong.

To call someone an idiot because you're too blinkered to comprehend that the importance of accuracy can vary depending on the decision makes you appear like an ignorant idiot yourself.
There is nothing wrong with a search engine making an educated guess about the colour of shoes based on an expert system or similar methodology. It might be annoying when it gets it wrong, but most users would rather have the option than not. His point, and it was well made, was that with large and/or diverse data sources it is impossible to answer some important questions with 100.000000% reliability.
Good luck trying to make a decision on whether you should respond. I expect the answer will be "I don't know", unless you know how to interpret information like my post, your reaction to my post, your background knowledge and current circumstances in a way you define as 'pure data' -- because if you can't, then making that decision is "just plain wrong".

Nothing new (3, Insightful)

Whuffo (1043790) | more than 3 years ago | (#36328706)

If the people that write these stories would familiarize themselves with Information Theory (Claude Shannon, in the 1940's) then they'd understand that you still can't make silk purses from sow's ears.

Yes, it's a lot of records. Yes, the data entry people made mistakes. All this really means is that there's more noise in the data. As the signal to noise ratio declines, the value of the results also declines. Making decisions based on noisy data isn't science, it's only guesswork. That's fine for weather forecasting (a similar problem) but expecting the results from the described data to be more accurate than weather forecasts is foolish. Remember: garbage in, garbage out.

LMFAO (1)

Anonymous Coward | more than 3 years ago | (#36328724)

Since when has MS ever had any OTHER opinion on ANYTHING!

Ambiguity Management (2)

AtomicSnarl (549626) | more than 3 years ago | (#36329884)

The problem being encountered is one I've faced often in 30 years of weather forecasting: Ambiguity Management.

The weather business deals with reams of data from thousands of sources and all the complexity of trying to follow a single swirl within a flowing river to figure out where it will be tomorrow. Decades of research and modeling have evolved into dozens of primary rule-based tools available to forecasters which are applicable to most situations. Objectively, you should be able to follow the rules, weed out the conflicting or contradictory ones, and get a reliable answer. Realistically, you don't. Why? Two reasons:

1. The dataset is incomplete.
2. The tools are imperfect.

You simply can't have perfect knowledge of all the relevant details in the atmosphere to feed a completely objective tool (computerized model or whatever) to get your perfect prediction. Like Roseanne Roseannadanna's mother said, "It's always something!"

The trick to being a good (aka reliable) weather forecaster, then, is how you manage the ambiguity of incomplete data filtered through inherently biased tools. Some weather stations run hot or cold, have local effects enhancing or reducing pressure or winds, etc., etc. Good models account for this, but that's a static adjustment, not a dynamic one. Models run hot or cold, fast or slow, depending on their structure and assumptions, and they reveal their strengths and weaknesses over time compared to other models and to reality at verification time.

The basic forecasting questions are: Where is it? Where is it going? And what will happen when it gets there? Because the models are perfect (100% replication of output from identical starting states) but always wrong (inherent model and data limitations), you make your money examining the consistency. The model(s) are running slow and cold recently due to whatever event is going on? OK -- warm it up a few degrees and expect things a few hours earlier than it forecasts tomorrow. Some models handle well in winter but get klutzy with large thunderstorm events. One model I worked with covered the world in clouds if you waited long enough. Solution? Don't trust it past X number of hours. And so on for the family of models through the decades and to today. Some models have high skill up to a certain point, then it drops off quickly. Others show less skill but are decent for the long haul. You get the idea. You can make a forecast using only one tool, but you can make a better one using several and sorting out their differences with ambiguity management.
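A crude sketch of that kind of static bias correction; the offsets below are placeholders, not real verification statistics. The idea is just that a model known to run cold and slow gets nudged before its output is used.

```python
# Sketch: apply a static bias correction to raw model output.
# The bias values are made-up placeholders, not real verification numbers.

MODEL_BIAS = {
    # model name: (temperature offset in degrees, timing offset in hours)
    "model_x": (+2.0, -3),   # has been running ~2 degrees cold and ~3 hours slow
    "model_y": (-0.5, 0),
}

def correct(model, forecast_temp, forecast_hour):
    """Nudge a raw forecast by the model's known bias before using it."""
    dt, dh = MODEL_BIAS.get(model, (0.0, 0))
    return forecast_temp + dt, forecast_hour + dh

print(correct("model_x", 18.0, 36))   # (20.0, 33): warmer and earlier than the raw output
```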

Needless to say, you needed a solid understanding of the physics and dynamics of the atmosphere to help make good decisions to do all this effectively. The modelers and users now data mining these huge collections of information likewise need a solid understanding of Statistics and the event mechanics they're examining to make any good sense of it all. At the very minimum, a large poster announcing "Coincidence is not Causation" needs to be in every office, otherwise you start getting breathless announcements about how underarm deodorant "causes" cancer because people eating hamburgers had a lower incidence rate by comparison.

Your Mileage May Vary -- a lot. That's the point.

Depends on the data (1)

moorhens (564268) | more than 3 years ago | (#36329978)

Doesn't "how much is too much" depend more on what sort of data you are talking about than the systems used to record and analyse it? Aircraft risk analysts would surely argue that they need all the data they can get to help prevent every instance of catastrophic failure. Biologists on the other hand are used to working with extraordinarily fuzzy data and still drawing valid conclusions

Too much data?? (0)

Anonymous Coward | more than 3 years ago | (#36330624)

Too much data?? Give me a break. Certainly, the less you have, the more refined or distilled the information should be. But that amount of data is only a subset. The problem with most current databases is twofold: improper storage, and not understanding how to get the information back out. Current and former software systems were and are great data vacuum cleaners, sucking up every bit of information that came near. Data was stored and stored and stored. Most never saw the light of day again. Why? Many systems do not store the data correctly or in a manner that allows it to be retrieved. Storage for the sake of storage does a business little to no good. Also, the talent to retrieve stored data in a manner that is usable and understandable is not there either. Working with small databases with small, well-defined tables with few fields is easy. Working with larger amounts of data, some of it seemingly unrelated, is another matter altogether.
