Slashdot: News for Nerds

Researcher's Wikipedia Big Data Project Shows Globalization Rate

Soulskill posted more than 2 years ago | from the abstracted-webs-of-connectedness dept.

Databases 16

Nerval's Lobster writes "Wikipedia, which features nearly 4 million articles in English alone, is widely considered a godsend for high school students on a tight paper deadline. But for University of Illinois researcher Kalev Leetaru, Wikipedia's volumes of crowd-sourced articles are also an enormous dataset, one he mined for insights into the history of globalization. He made use of Wikipedia's 37GB of English-language data — in particular, the evolving connections between various locations across the globe over a period of years. 'I put every coordinate on a map with a date stamp,' Leetaru told The New York Times. 'It gave me a map of how the world is connected.' You can view the time lapse/data visualization on YouTube."

16 comments

Not "big data" (3, Insightful)

epiphani (254981) | more than 2 years ago | (#40339799)

Come on, 37G isn't big data. You'd have a hard time arguing 37TB is big data.

Cool stuff though.

Re:Not "big data" (0)

Anonymous Coward | more than 2 years ago | (#40339867)

It looks like it was 37GB of raw text, but the meta data would have expanded that.

Re:Not "big data" (0)

Anonymous Coward | more than 2 years ago | (#40341777)

Where do researchers get these big datasets from Facebook, Wikipedia, etc.? Surely they'd be IP-banned if they were using just a wget.

Re:Not "big data" (0)

Anonymous Coward | more than 2 years ago | (#40342799)

Presumably http://dumps.wikimedia.org/ through mirrors.

Re:Not "big data" (1)

DigiShaman (671371) | more than 2 years ago | (#40347911)

To me, "big data" is defined by the storage to transfer I/O bottleneck ratio.

the ending of that movie (1)

Janek Kozicki (722688) | more than 2 years ago | (#40339887)

looks exponential :)

Re:the ending of that movie (2, Interesting)

Anonymous Coward | more than 2 years ago | (#40340033)

Just like stars. If you consult a starmap, it's much denser near Earth than further away. So, looking at a star catalogue, we'd be correct to surmise we're the center of the universe, since all stars cluster around us, right? Wrong.

Sampling bias. Starmaps cluster stars around us because the stars in our vicinity are better sampled than those further away.

The movie looks exponential because the density of articles dealing with the present is higher than the density of articles dealing with events long past. It's not surprising, since in 1900 nobody was editing Wikipedia, and all entries that did make it there came from secondary sources rather than being edited in from primary sources.
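To make the sampling-bias argument concrete, here is a minimal sketch. It assumes, purely for illustration, that the true number of events per year is constant and that the chance of an event being documented decays exponentially with its age. The function name `observed_counts` and the parameters `true_events_per_year` and `coverage_scale` are hypothetical and are not taken from the study.

```python
import math

def observed_counts(years, true_events_per_year=100.0, coverage_scale=40.0, now=2012):
    """Expected documented events per year under an assumed coverage model.

    The true rate is flat; only the fraction that gets written up
    decays with the age of the event (a made-up exponential model).
    """
    return {
        y: true_events_per_year * math.exp(-(now - y) / coverage_scale)
        for y in years
    }

years = range(1812, 2013, 50)
counts = observed_counts(years)
for y in years:
    # Documented counts climb steeply toward the present even though
    # the underlying rate never changes.
    print(y, round(counts[y], 1))
```

Under these assumptions the curve looks exponential at the end purely because of coverage, which is the point: the shape alone cannot separate real growth from better sampling of the recent past.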

Re:the ending of that movie (3, Insightful)

Anonymous Coward | more than 2 years ago | (#40341203)

"looks exponential :)"

As much as I'd like to think that meant the world is rapidly connecting, much more likely this is due to the fact that Wikipedia has only been around for a decade or so and people are more inclined to write about things that are happening now (or have happened recently) than about things that happened many years ago.

If Wikipedia had been available for the entirety of those 200 years, and had been consistently popular through that time and uniformly across the globe with no language bias, then the resulting movie would say a lot about globalisation.

Quantity over quality (-1)

Anonymous Coward | more than 2 years ago | (#40340103)

Wikipedia, is widely considered a godsend for high school students on a tight paper deadline

At least they're young enough to not know any better.

To paraphrase Slashdot... (2, Insightful)

willoughby (1367773) | more than 2 years ago | (#40340939)

If you're using Wikipedia as a metric to measure anything, you're insane.

Re:To paraphrase Slashdot... (1)

fijiaaron (2647031) | more than 2 years ago | (#40362009)

He's using Wikipedia to measure who's editing Wikipedia. Considering it's one of the top collaborative sites, it's a pretty good source to determine how global inputs are spreading -- and since he's studying English language entries, he'd expect data to cluster around the USA. What he's trying to find out is how that diverges over time.

Study about perspectives not history (1)

k(wi)r(kipedia) (2648849) | more than 2 years ago | (#40341217)

From reading the NYT article, I understand this is a study of the English version of Wikipedia. That alone should raise a red flag about the significance of the study beyond being a survey of the interests or obsessions of Wikipedia editors.

It's useful only as a survey of a clearly unrepresentative sample of the world population. It's clearly biased against those who can't write English, and those who can write it are themselves a much smaller subset of those who can claim some fluency in English.

It tells us less about history and more about present attitudes toward history. It's pretty much like compiling a list of the 100 greatest sci-fi movies by surveying Facebook users. Movies produced within the last decade or so will outrank the "classic" movies of the 70's and 80's. Avatar will likely be a more "significant" movie than Bladerunner or Aliens.

Re:Study about perspectives not history (0)

Anonymous Coward | more than 2 years ago | (#40343135)

Are you seriously trying to imply that it is physically impossible for a newer movie to be more "significant" than Bladerunner or Alien (not plural)?

"Sentiment" Analysis? (1)

Dr Herbert West (1357769) | more than 2 years ago | (#40345603)

I'd be interested in how this guy parses "positive" statements vs "negative" statements. English nuance is a tricky wicket, and unlike trying to analyze text from Twitter or Facebook ("Eeewwww, the Civil War is teh Suxorrzz") Wikipedia articles tend to maintain a neutral tone.

After reading the article (yeah, I know) and viewing the video, it seems like "negative" entries appear most often around periods of time when there's a lot of war. Interesting and obvious... but I'd like to know if periods of religious persecution or large scale social upheaval/conflict are accounted for, or show up on whatever "sentiment meter" he's using.

As a side note, I've been thinking about rolling my own-- the "free online" sentiment tools like sentiment140.com tend to miss out on a lot of nuance. Anyone here have any recommendations?
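
For anyone wondering what "rolling my own" might look like at its simplest, here is a toy lexicon-based scorer. It is only a sketch under stated assumptions: the word lists are made up for illustration, the one-word negation rule is deliberately crude, and nothing here reflects the tool or method used in the study.

```python
import re

# Hypothetical mini-lexicons; a real tool would use a much larger, curated list.
POSITIVE = {"peace", "prosperity", "victory", "growth"}
NEGATIVE = {"war", "famine", "persecution", "massacre"}
NEGATORS = {"not", "no", "never"}

def sentiment_score(text):
    """Return (#positive words - #negative words), flipping a word's
    polarity when it is immediately preceded by a negator."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, tok in enumerate(tokens):
        polarity = (tok in POSITIVE) - (tok in NEGATIVE)
        if polarity and i > 0 and tokens[i - 1] in NEGATORS:
            polarity = -polarity
        score += polarity
    return score

print(sentiment_score("The war brought famine and persecution."))
print(sentiment_score("A decade of peace and growth."))
```

A counter like this shows why neutral encyclopedic prose is hard to score: it reacts only to the topic words (war, famine) and not to the author's stance, so "negative" periods would mostly track how much war is being described.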