The Importance — and Limits — of Very Large Data Sets

Soulskill posted more than 2 years ago | from the libraries-of-congress-per-fortnight dept.

Social Networks 17

New submitter kodiaktau writes "A recently presented paper discusses how large data sets can improve learning algorithms, but points out that researchers still need to account for bias and incompleteness before drawing conclusions. The paper also goes into the need for responsible business practices to manage these data sets. 'There's been the emergence of a philosophy that big data is all you need. We would suggest that, actually, numbers don't speak for themselves.' The full paper is available through SSRN. Of particular importance is their assertion that even huge data sets can and will be affected by filters or the analyst who is interpreting it. '[Study co-author Kate Crawford] notes that many big data sets — particularly social data — come from companies that have no obligation to support scientific inquiry. Getting access to the data might mean paying for it, or keeping the company happy by not performing certain types of studies.'"

17 comments

Well no sh*t. (-1)

Anonymous Coward | more than 2 years ago | (#37638176)

You don't say?

It's honest about a "main tenet" of stats (0)

Anonymous Coward | more than 2 years ago | (#37638182)

"Big data sets are never complete," Crawford says (from source article -> http://www.technologyreview.com/computing/38775/page1/ [technologyreview.com] ).

APK

P.S.=> Thus, you can NEVER, EVER have a perfect dataset, because you're never going to have every possible sampleable item, period...

... apk

duh? no, needs more than that... (0)

zMaile (1421715) | more than 2 years ago | (#37638264)

I was originally going to just say 'duh', but I guess it's something that a significant proportion of people may overlook. The analyst needs to have as much of an impartial view as possible to give data that is as unbiased as possible.

This is the most obv article ever on /. (0)

Anonymous Coward | more than 2 years ago | (#37638288)

Ever.

Re:This is the most obv article ever on /. (2, Funny)

Anonymous Coward | more than 2 years ago | (#37638486)

How sure are you your data-set is adequate to make that determination?

Re:This is the most obv article ever on /. (1)

jsnipy (913480) | more than 2 years ago | (#37640412)

Did you know when you google 'google', you get google.com?

There's lots of data (2)

MadKeithV (102058) | more than 2 years ago | (#37638362)

There's lots of data to support this article.

oh come on... (0)

Anonymous Coward | more than 2 years ago | (#37638430)

oh come on... an article on data availability not being available with a simile url. Meh!

This is a problem with most data! (3, Insightful)

garcia (6573) | more than 2 years ago | (#37638462)

From the blurb:

"Getting access to the data might mean paying for it, or keeping the company happy by not performing certain types of studies."

Even if you're using data from public institutions you still may have to pay for it (to cover staff time to procure the data--especially if you're asking for something they don't normally provide, which is quite often). While there won't be any limitations on what you can do with the data once you have it, because of lack of knowledge of their own data/bases the provider may simply provide you with incomplete or likely inaccurate data anyway.

So yeah, welcome to the world of using data. Move along, nothing to see here.

Re:This is a problem with most data! (1)

oneiros27 (46144) | more than 2 years ago | (#37638746)

And even if you collect it yourself, if you're at an educational institution, you likely have to comply with IRB (institutional review board) rules if it involves people.

They often don't like you looking for certain types of patterns, or using the data in a way that might harm the people you're studying.

There are medical privacy rules, general privacy rules, etc. And even when not dealing with people, there are lots of moral issues in how you use the data. (And there are moral issues in sharing data: some groups don't want to reveal the locations of endangered species in too much detail, as it helps poachers. But if you have a dataset that cost lives to collect (maybe not intentionally), sharing it means people don't have to repeat the process to collect it.)

Of course we're still coping with the issues of providing proper credit & attribution for data, and standards for publishing data so that it can be re-used. I've been to lots of meetings in the last year that covered those sorts of issues -- DCC, BRDI, RDAP, DataCite, etc.

At least there IS very large social data sets (2)

G3ckoG33k (647276) | more than 2 years ago | (#37638592)

At least there IS very large social data sets.

Most sociologists today tend to describe the world using 'deep' interviews of 36 people in the surroundings of the campus, because that way they will get the result they wish to get.

A cynic description, yes, but not too far the truth. So, it is good to see there IS large data sets, somewhere.

Re:At least there IS very large social data sets (1)

Anonymous Coward | more than 2 years ago | (#37638760)

IS a set, ARE sets... Doesn't saying what you wrote out loud trigger any warning sirens? Also, are you trying for "a cynic's description" or "a cynical description" ?

Also, it would be nice if you had ended "A cynic description, yes, but not too far the truth" with a rationalization. Like, "based on my experience as a graduate assistant working in the sociology dept" or "based on my own exhaustive research" or even "based on what the voices in my head are telling me." Just how far from the truth is "too far" ?

Re:At least there IS very large social data sets (0)

Anonymous Coward | more than 2 years ago | (#37639612)

Based on experiences reading sociological papers.

Try http://scholar.google.se/scholar?hl=sv&q=sociology+deep+interview&as_ylo=2000&as_vis=0

Re:At least there IS very large social data sets (0)

Anonymous Coward | more than 2 years ago | (#37642894)

[needs citation]

Nothing new for me. (0)

Anonymous Coward | more than 2 years ago | (#37638670)

I work as a biostatistician and often analyse data from so-called next-generation sequencing technologies. The amount of data per biological sample from these fantastic machines is absurd - which makes the analysis a fun (computational) challenge in itself on an (almost) regular PC. A great example of this issue is - in my experience - that the biologists are hell-bent on getting more "data" per sample rather than less data per sample and more samples. This, in reality, lowers the signal-to-noise ratio instead of raising it. The money simply goes into using the newest (and most expensive) machine instead of using older and cheaper technologies and getting more samples. This is just another illustration that "more data" does not always mean more confident conclusions - it sometimes has to be the right kind of "more data".
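To put rough numbers on the trade-off described above, here is a minimal sketch. It assumes a simple two-level variance model (biological variation between samples plus technical noise within each sample), which is an illustration and not something the commenter specified. The function name `standard_error`, the parameters `bio_sd` and `tech_sd`, and all the values are hypothetical.

```python
import math

def standard_error(n_samples, reads_per_sample, bio_sd=1.0, tech_sd=1.0):
    """Standard error of an estimated group mean under an assumed
    two-level model: between-sample (biological) variance is averaged
    over n_samples, while within-sample (technical) variance is averaged
    over every individual measurement."""
    variance = (bio_sd ** 2) / n_samples \
        + (tech_sd ** 2) / (n_samples * reads_per_sample)
    return math.sqrt(variance)

# Same total number of measurements (5000), split two ways (hypothetical).
deep = standard_error(n_samples=5, reads_per_sample=1000)   # few samples, lots of data each
wide = standard_error(n_samples=50, reads_per_sample=100)   # more samples, less data each

print(f"few deep samples:     {deep:.3f}")
print(f"many shallow samples: {wide:.3f}")
```

Under this assumed model, extra data per sample only shrinks the technical term, so once that term is small the biological term (which only more samples can reduce) dominates the uncertainty.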

Forget about bias and incompleteness for a moment (1)

tinkerton (199273) | more than 2 years ago | (#37640148)

This statement
'There's been the emergence of a philosophy that big data is all you need. We would suggest that, actually, numbers don't speak for themselves.'

is not about bias and incompleteness. The person who is looking at the data needs to have the necessary concepts, and it's a bad idea to call that bias. The data won't do the thinking for him or her. They've just found 3 new exoplanets in old Hubble data. The data hasn't changed, but the people who are looking at it have.

Not a surprise (1)

gweihir (88907) | more than 2 years ago | (#37643124)

Those that claim a large dataset is all you need are typically bad scientists that happen to have access to such a dataset. Large datasets eliminate one thing, namely noise (random variations). Large datasets can be just as biased, incomplete and contaminated with data you do not suspect of being in there as small datasets. They are not in any way a better approximation of "the truth" than smaller datasets.

But every good scientist knew that anyways.
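The noise-versus-bias point above can be shown with a small simulation. This is only a sketch under stated assumptions: the true value is taken to be 0, each observation is the true value plus Gaussian noise, and a "biased" data source shifts every observation by a fixed offset. The names `sample_mean`, `TRUE_VALUE` and `BIAS`, and all the numbers, are hypothetical.

```python
import random

TRUE_VALUE = 0.0   # assumed quantity being estimated
BIAS = 0.5         # assumed systematic offset in the biased data source

def sample_mean(n, bias=0.0, noise_sd=1.0, seed=0):
    """Mean of n noisy observations of TRUE_VALUE, each shifted by `bias`."""
    rng = random.Random(seed)
    total = sum(rng.gauss(TRUE_VALUE + bias, noise_sd) for _ in range(n))
    return total / n

# Error of the estimate for a small and a large dataset, with and without bias.
errors = {}
for n in (100, 100_000):
    errors[("unbiased", n)] = abs(sample_mean(n, bias=0.0, seed=1) - TRUE_VALUE)
    errors[("biased", n)] = abs(sample_mean(n, bias=BIAS, seed=2) - TRUE_VALUE)

for (kind, n), err in errors.items():
    print(f"{kind:>8} n={n:>7}: error = {err:.3f}")
```

In this toy setup the error of the unbiased estimate shrinks toward zero as n grows, while the error of the biased estimate settles near the offset itself: more rows average out the random variation but leave the systematic shift untouched.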
