
Common Crawl Foundation Providing Data For Search Researchers

Unknown Lamer posted more than 2 years ago | from the doesn't-archive-dot-org-do-that dept.

Open Source 61

mikejuk writes with an excerpt from an article in I Programmer: "If you have ever thought that you could do a better job than Google but were intimidated by the hardware needed to build a web index, then the Common Crawl Foundation has a solution for you. It has indexed 5 billion web pages, placed the results on Amazon EC2/S3, and invites you to make use of them for free. All you have to do is set up your own Amazon EC2 Hadoop cluster and pay for the time you use it — accessing the data itself is free. The idea is to open up the whole area of web search to experiment and innovation. So if you want to challenge Google, you can no longer use the excuse that you can't afford it." Their weblog promises source code for everything eventually. One thing I've always wondered is why no distributed crawlers or search engines have ever come about.
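For readers who want to poke at the data directly, here is a minimal sketch (my own, not from the article) of listing a few of the crawl's compressed ARC files with the boto library. The bucket name commoncrawl-002 is the one the Common Crawl team mentions in the comments below; the bucket layout and access policy may have changed, and running this outside EC2 can incur transfer charges.

<ecode>
# Minimal sketch: peek at the Common Crawl ARC files on S3 using boto.
# Assumes AWS credentials are available in your environment and that the
# bucket name "commoncrawl-002" (quoted in the comments below) is current.
import boto

conn = boto.connect_s3()
bucket = conn.get_bucket('commoncrawl-002')

for i, key in enumerate(bucket.list()):
    print('%s\t%d bytes' % (key.name, key.size))
    if i >= 9:                      # just look at the first ten objects
        break
</ecode>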

61 comments

First (-1, Offtopic)

Ethanol-fueled (1125189) | more than 2 years ago | (#38054896)

Hayah.

Let's Remake First (-1, Offtopic)

TaoPhoenix (980487) | more than 2 years ago | (#38055028)

Cut this $hit out.

After 7-10 years, the single greatest waste of time is First Post.

Let a legit First Post gain Double-Mod and Metamod Points, and then the next 20 comments won't suck.

Enough of this "old meme throwaway post" shit.
Oh, look, I lost a point. Sorry. I did a mini study; quality First Posts are worth between 17 and 30 real points.

Re:Let's Remake First (0)

Anonymous Coward | more than 2 years ago | (#38055056)

That's what she said!


Saves you on bandwidth (3, Informative)

CmdrPony (2505686) | more than 2 years ago | (#38054962)

But there's still a long way to go. They seem to have an archive of what they have crawled, and that's it. Processing all those pages on EC2 is still going to be extremely costly and time-consuming.

Re:Saves you on bandwidth (1)

CmdrPony (2505686) | more than 2 years ago | (#38054984)

Oh, and that is obviously only for simple stuff like what links to what. Google, Bing, and other search engines are much, much more complicated than that now. And you don't have access to the usage and keyword data that Google and Bing get from their enormous number of users.

Re:Saves you on bandwidth (0)

Anonymous Coward | more than 2 years ago | (#38055208)

I also question how much you could do /efficiently/ with MapReduce. Anything link-analysis based is going to require data sharing after each iteration (unless you use an approximation for convergence). 50 iterations of PageRank needs 50 (serial) MapReduce passes.

This is much more a task for MPI.
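To make that concrete, here is a toy sketch of PageRank written as a map phase and a reduce phase in pure Python (the three-page graph, damping factor, and in-memory dictionaries are illustrative assumptions, not anything from Common Crawl). Because iteration k+1 needs the ranks produced by iteration k, each pass has to wait for the previous one, which on Hadoop means one full MapReduce job per iteration.

<ecode>
# Toy PageRank as repeated map/reduce passes (illustrative only).
# Each outer-loop iteration corresponds to one full MapReduce job: the
# mapper emits rank contributions along out-links, the reducer sums
# them, and the next pass cannot start until this one finishes.

DAMPING = 0.85

graph = {               # hypothetical link graph: page -> out-links
    'a': ['b', 'c'],    # (no dangling pages, to keep the toy simple)
    'b': ['c'],
    'c': ['a'],
}
ranks = dict((page, 1.0 / len(graph)) for page in graph)

def map_phase(graph, ranks):
    """Emit (destination, contribution) pairs, like a Hadoop mapper."""
    for page, out_links in graph.items():
        share = ranks[page] / len(out_links)
        for dest in out_links:
            yield dest, share

def reduce_phase(pairs, pages):
    """Sum contributions per destination, like a Hadoop reducer."""
    totals = dict((page, 0.0) for page in pages)
    for dest, share in pairs:
        totals[dest] += share
    n = len(pages)
    return dict((page, (1 - DAMPING) / n + DAMPING * total)
                for page, total in totals.items())

for _ in range(50):     # 50 iterations -> 50 strictly serial passes
    ranks = reduce_phase(map_phase(graph, ranks), list(graph))

print(ranks)
</ecode>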

Re:Saves you on bandwidth (5, Insightful)

Gumber (17306) | more than 2 years ago | (#38055206)

Bitch moan, bitch moan. If I had a need for such a dataset, I think I'd be damn grateful that I didn't have to collect it myself. As for the cost of processing the pages, the article suggests that running a hadoop job on the whole dataset on EC2 might be in the neighborhood of $100. That's not that costly.

Re:Saves you on bandwidth (3, Interesting)

CmdrPony (2505686) | more than 2 years ago | (#38055406)

To be honest, if I wanted to work on such data and didn't have lots of money, I would actually prefer collecting it myself. Sure, with EC2 I can easily add more processing power and process it quickly, but I can get a dedicated 100 Mbit server with unlimited bandwidth for around 60-70 dollars a month. It also has more space and processing power than EC2 at that price, and I can process the pages as I download them. That way I would build my database structure as I go, and I'm guaranteed a fixed cost per month.

Sure, if you're a researcher and want quick results, then you can run a job for $100-200 against this dataset. One job. And it had better not be anything complex, or you're paying more. In the end, if you're short on money, it would probably be better to do the crawling part yourself too. That isn't costly, it's just time-consuming.

Won't perform for that sort of money (2)

dutchwhizzman (817898) | more than 2 years ago | (#38056354)

Seriously, the EC2 cluster is already there; setting it up will cost you far less than building one from the ground up. Time costs money too on this planet. Also, and most importantly, your 80 dollar box is not going to be able to store metadata on 5 billion web pages and process it at any reasonable I/O speed at all.

Go build your own processing cluster and see how long it takes you to do that for less than what EC2 would charge. Once you're finished, you could make a business out of it and compete with Amazon. The last, obligatory step: Profit!

Re:Won't perform for that sort of money (2)

martin-boundary (547041) | more than 2 years ago | (#38056984)

Or not.

If you're an academic, running a single hadoop job like that is not as useful as it sounds. In research, you never know what you want until you do something and realize that's not it. To write a paper you'd want to run at least 10-20 full jobs, all slightly different.

Luckily, lots of unis have their own clusters (aka beowulfs - I can't believe I have to point that out on Slashdot...). It would really be great if the data could be duplicated so people could run the jobs on their own local setups. Furthermore, that's how you ensure that research is reproducible. If there's only one source of data in the world, then all your published results are at the mercy of the org that owns the data. Bad science.

Re:Saves you on bandwidth (2)

phantomcircuit (938963) | more than 2 years ago | (#38056394)

I can get dedicated 100mbit server with unlimited bandwidth for around 60-70 dollars a month

No, actually, you can't.


Is this an Amazon sponsor thingy? (1)

Taco Cowboy (5327) | more than 2 years ago | (#38054968)

I mean, hosting the stuff on Amazon's servers is one thing - it has to be hosted somewhere - but the thing that makes me uncomfortable is that if anyone wants to do any research on the data, they end up having to pay Amazon.

Hmm ....

Re:Is this an Amazon sponsor thingy? (2)

CastrTroy (595695) | more than 2 years ago | (#38055032)

Must be a conspiracy set up by Amazon to get people to pay for vast amounts of compute time. Why not allow people to purchase copies of the data on hard disk or tape? 5 billion pages at 100K each (a high estimate, perhaps) is 500 TB. With a good compression algorithm you could probably get it down to 10 TB. Not "that much" if this is the kind of research you are interested in.

Re:Is this an Amazon sponsor thingy? (1)

CmdrPony (2505686) | more than 2 years ago | (#38055144)

I just did a calculation on the Amazon EC2 site, and it seems you can run a micro instance practically for free for the first year. With 15 GB of storage and 10 GB of outbound bandwidth it costs something like $0.40 a month, and nothing if you take just 10 GB of storage. Guess you could do some simple stuff with that.

Re:Is this an Amazon sponsor thingy? (2)

Gumber (17306) | more than 2 years ago | (#38055162)

A conspiracy? You're going to have to pay someone for the compute time. It's not like a lot of people have big clusters lying around, so a lot of people are going to opt to pay Amazon anyway.

As for selling access to the data on physical media, it doesn't look like there is anything to stop you from taking advantage of Amazon's Export Service to get the data set on physical media.

Re:Is this an Amazon sponsor thingy? (1)

cryfreedomlove (929828) | more than 2 years ago | (#38055640)

Must be a conspiracy set up by Amazon to get people to pay for vast amounts of compute time. Why not allow people to purchase copies of the data on hard disk or tape? 5 billion pages at 100K each (a high estimate, perhaps) is 500 TB. With a good compression algorithm you could probably get it down to 10 TB. Not "that much" if this is the kind of research you are interested in.

How much would that tape and tape drive or hard disk cost you to get started? How would that cost compare with the initial 750 hours of free compute time on EC2?

Re:Is this an Amazon sponsor thingy? (1)

Gumber (17306) | more than 2 years ago | (#38055096)

I don't get it. You are going to have to pay someone if you want to do any research on it. If you don't want to pay Amazon you could either crawl the data yourself, or pay the cost of transferring the data out of Amazon's cloud.

Re:Is this an Amazon sponsor thingy? (-1)

Anonymous Coward | more than 2 years ago | (#38055312)

He's just a crying fuck who wants to bitch about Amazon. Fucking Slashdot. I swear to fuck, if someone cured cancer and made the patients pay for the raw resources there would be some cunt around here who would cry that it's a rip off and poor people were being fucked in the name of greed.
 
It gets fucking old. Fast. If anything it makes me see these fucking leeches as common cunt bitches who should eat shit. Fuck them, they deserve a boot to the ass.

Re:Is this an Amazon sponsor thingy? (0)

Anonymous Coward | more than 2 years ago | (#38055354)

Cure cancer because you want to help ppl regardless of their money. Like Berners-Lee created this thing we're on now but didn't charge no one nuttin.

Re:Is this an Amazon sponsor thingy? (1)

CmdrPony (2505686) | more than 2 years ago | (#38055356)

Are you surprised? This site is established on the idea that you should receive everything for free and nobody should be paid for the work they do.

Re:Is this an Amazon sponsor thingy? (2)

zill (1690130) | more than 2 years ago | (#38055432)

I mean, hosting the stuff on Amazon's servers is one thing - it has to be hosted somewhere - but the thing that makes me uncomfortable is that if anyone wants to do any research on the data, they end up having to pay Amazon.

Hmm ....

So you expect the researchers to FedEx 100,000 2 TB hard drives to you upon request? We're talking about 200 petabytes of data here. It's going to take forever to transfer no matter how wide your intertubes are. A shipping container of hard drives is literally the only way to move this much data in a timely manner.

Since there's no easy way to move the data, it only makes sense to run your code on the cluster where the data currently resides.

Interesting, however (3, Interesting)

CastrTroy (595695) | more than 2 years ago | (#38054976)

Interesting. However, wouldn't one need to index the data in whatever format they need in order to actually search it and get useful results from it? You'd need to pay a fortune in compute time just to analyze that much data. It says they've indexed it, but I don't see how that helps researchers who will want to run their own indexing and analysis against the dataset. Sure, it means you don't have to download and spider all that data, but that's only a very small part of the problem.

Re:Interesting, however (4, Insightful)

Gumber (17306) | more than 2 years ago | (#38055192)

It may or may not be a small part of the problem, but it isn't a small problem to crawl that many web pages. This likely lets people save a lot of time and effort which they can then devote to their unique research.

Maybe it will cost a fortune to analyze that much data, but there isn't really any way of getting around that cost if you need that much data. Besides, for what it's worth, the linked article suggests that a Hadoop run against the data costs about $100. I'm sure the real cost depends on the extent and efficiency of your analysis, but that is hardly "a fortune."

Re:Interesting, however (2)

AliasMarlowe (1042386) | more than 2 years ago | (#38056376)

It may or may not be a small part of the problem, but it isn't a small problem to crawl that many web pages.

Indeed, and there are more crawlers on the net than might be commonly supposed. Our home site is regularly visited by bots from Google, Bing, and Yandex, and occasionally by several others. The entire site (10s of GB) was slurped in a single visit by an unknown bot at an EC2 IP address recently. That bot's [botsvsbrowsers.com] user-agent string was not the same as the string used by the Common Crawl Foundation's bot.

Re:Interesting, however (2)

HipsterMcGee (2507766) | more than 2 years ago | (#38055508)

You're absolutely correct - although if they do have it indexed, it's certainly much easier on the researchers. Actually, I worked on this project: http://lemurproject.org/clueweb09.php/ [lemurproject.org] ... and I can tell you first hand that not only is it hard to crawl that much data, but indexing it takes not only time but computing muscle and lots and lots of disks. It took us roughly a month and a half to collect the data using a 100-node Hadoop cluster, and then roughly 2 months of compute time (spread over 24 nodes, so roughly a couple of days of wall-clock time) to index it. And then factor in the resources you need to experiment with that amount of data. The hardware and IT maintenance costs alone for such a setup are probably going to be more than the cost of running your experiments on EC2.
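For a sense of what the indexing step looks like in miniature, here is a Hadoop-Streaming-style sketch of building an inverted index (term -> list of document IDs). The input format, tokenizer, and invocation are hypothetical assumptions; a real indexer over ClueWeb09- or Common-Crawl-scale data also needs HTML parsing, deduplication, and postings compression that this toy skips.

<ecode>
#!/usr/bin/env python
# Toy inverted-index mapper and reducer in the Hadoop Streaming style:
# both read records on stdin and write key<TAB>value lines on stdout.
# Hypothetical input format: doc_id<TAB>document text, one doc per line.
import sys

def mapper(stdin=sys.stdin, stdout=sys.stdout):
    """Emit term<TAB>doc_id for every distinct term in each document."""
    for line in stdin:
        doc_id, _, text = line.rstrip('\n').partition('\t')
        for term in set(text.lower().split()):      # naive tokenizer
            stdout.write('%s\t%s\n' % (term, doc_id))

def reducer(stdin=sys.stdin, stdout=sys.stdout):
    """Input arrives sorted by term (Hadoop's shuffle guarantees this);
    emit term<TAB>comma-separated posting list."""
    current, postings = None, []
    for line in stdin:
        term, _, doc_id = line.rstrip('\n').partition('\t')
        if current is not None and term != current:
            stdout.write('%s\t%s\n' % (current, ','.join(sorted(set(postings)))))
            postings = []
        current = term
        postings.append(doc_id)
    if current is not None:
        stdout.write('%s\t%s\n' % (current, ','.join(sorted(set(postings)))))

if __name__ == '__main__':
    # e.g. hadoop ... -mapper 'index.py map' -reducer 'index.py reduce'
    mapper() if sys.argv[1:] == ['map'] else reducer()
</ecode>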

Not interested in more cloud BS (0)

Anonymous Coward | more than 2 years ago | (#38054982)

Do not want. Google is already established and can do this much better. Do you want to try to compete against Google? Didn't think so.

Re:Not interested in more cloud BS (1)

Anonymous Coward | more than 2 years ago | (#38055256)

Which is why Google sucks. Nobody willing to compete except Microsoft. And well... that really isn't going to bring competition to the market.

Scam. (-1)

Anonymous Coward | more than 2 years ago | (#38055008)

This is a complete scam, and anyone who falls for it is a Peruvian moron.

It should be obvious (4, Interesting)

DerekLyons (302214) | more than 2 years ago | (#38055154)

One thing I've always wondered is why no distributed crawlers or search engines have ever come about.

Because being 'distributed' is not a magic wand. (Nor are 'crowdsourcing', 'open source', or half a dozen other terms often used as buzzwords in defiance of their actual technical meanings.) You still need substantial bandwidth and processing power to handle the index; being distributed just makes the problem worse, because now you also need bandwidth and processing power to coordinate the nodes.

Re:It should be obvious (3, Informative)

icebraining (1313345) | more than 2 years ago | (#38055412)

Except the editor is wrong, since distributed search engines do exist [wikimedia.org].

Re:It should be obvious (2)

HipsterMcGee (2507766) | more than 2 years ago | (#38055558)

Distributed web crawlers exist as well: http://nutch.apache.org/

Re:It should be obvious (1)

inzy (1095415) | more than 2 years ago | (#38063474)

Also, check out YaCy - yacy.de

Very powerful, decentralised, open source. Excellent. I ran a node for a while on my VPS; the results were good, although it struggled on 384 MB of RAM.

Re:It should be obvious (0)

Anonymous Coward | more than 2 years ago | (#38056026)

There are a million servers, which come and go randomly and have limited bandwidth.

A faulty asynchronous process continually adds edges to a graph that is redundantly mapped across the servers.

The servers collectively execute one or more graph algorithms, running idempotently over a period of time.

The only problems are timeliness and trust.
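As a tiny illustration of the "idempotently" part (my own sketch, not something the poster specified): if each server keeps its shard of the link graph as a set of edges, then replayed or duplicate add operations are no-ops, and replicas that drifted apart while a server was offline can be reconciled by a plain set union.

<ecode>
# Minimal sketch of an idempotent, mergeable edge store (a grow-only set).
# Duplicate or replayed "add" messages change nothing, and two replicas
# reconcile by union, in any order, any number of times.

class EdgeShard(object):
    def __init__(self):
        self.edges = set()                    # {(src_url, dst_url), ...}

    def add(self, src, dst):
        self.edges.add((src, dst))            # safe to repeat

    def merge(self, other):
        self.edges |= other.edges             # union is idempotent/commutative

a, b = EdgeShard(), EdgeShard()
a.add('http://x.example/', 'http://y.example/')
a.add('http://x.example/', 'http://y.example/')   # duplicate: no effect
b.add('http://y.example/', 'http://z.example/')
a.merge(b)
print(len(a.edges))                           # 2
</ecode>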

Fix GOOG's braindead pageranking system (3, Interesting)

quixote9 (999874) | more than 2 years ago | (#38055198)

Google's way of coming up with pageranks is fundamentally flawed. It's a popularity test, not an information content test. It leads to link farming. Even worse, it leads everyone, even otherwise well-meaning people, not to cite their sources so they won't lose pagerank by having more outgoing links than incoming ones. That is bad, bad, bad, bad, and bad. Citing sources is a foundation of any real information system, so Google's method will ultimately end in a web full of unsubstantiated blather going in circles. It's happening already, but we've barely begun to sink into the morass.

An essential improvement is coming up with a way to identify and rank by actual information content. No, I have no idea how to do that. I'm just a biologist, struggling with plain old "I." AI is beyond me.

Re:Fix GOOG's braindead pageranking system (1)

quixote9 (999874) | more than 2 years ago | (#38055220)

I should add: one ought to be actively rewarded for citing sources. Definitely not penalized.

Re:Fix GOOG's braindead pageranking system (1)

CmdrPony (2505686) | more than 2 years ago | (#38055244)

So, how exactly would you fix it? How do you determine what is good information, or relevant results? How do you rank them? Please describe your algorithm.

Re:Fix GOOG's braindead pageranking system (0)

Anonymous Coward | more than 2 years ago | (#38055314)

Sounds like you want to change your home page to Wikipedia but are too scared your wife will freak out when she can't find Google. Maybe you should start using DuckDuckGo; it's basically just a Wikipedia-backed search engine anyway.

As for the rest of us, the internet will continue to be one big popularity contest for the time being, because that is how we find information. We don't want every little detail tailored to our specific needs. Sure, it would be nice to get something related to C++ when I Google the letter "C", but if I stop being able to tell people to Google for things because they get totally different results from me, I'm going to be pissed off just as much.

Sure, click history should be more important than outgoing vs. incoming links, but I don't think all the parameters should be written off completely just because you think it's "bad, bad, bad, bad, bad and bad."

Re:Fix GOOG's braindead pageranking system (0)

Anonymous Coward | more than 2 years ago | (#38057170)

That was a nice example of a fact free statement without facts. Makes me wonder why you beat your wife. :-)

Re:Fix GOOG's braindead pageranking system (1)

Twinbee (767046) | more than 2 years ago | (#38055302)

Surely it would be possible to tweak the algorithm so outbound links don't detract from the site, and keep things mathematically sound?

Re:Fix GOOG's braindead pageranking system (0)

Anonymous Coward | more than 2 years ago | (#38055384)

> not to cite their sources so they won't lose pagerank by having more outgoing links than incoming ones.
I don't think PageRank works the way you think it does. It's the stationary distribution of a random walk -- it's about the flux, not the end.
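For reference, the standard textbook PageRank formula (nothing specific to this thread) makes the parent's point explicit: a page's rank is determined by the pages linking to it, and its own outbound links only decide how the rank it passes on is divided; they are not subtracted from it directly.

PR(p) = \frac{1-d}{N} + d \sum_{q \to p} \frac{PR(q)}{L(q)}

Here d is the damping factor, N the number of pages, the sum runs over pages q that link to p, and L(q) is the number of outbound links on q. This is exactly the stationary distribution of the random-surfer walk the parent describes.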

Re:Fix GOOG's braindead pageranking system (0)

Anonymous Coward | more than 2 years ago | (#38057446)

If Google decides your well-researched publication links to too many sources, your whole website vanishes from the index.

Re:Fix GOOG's braindead pageranking system (1)

GuB-42 (2483988) | more than 2 years ago | (#38061736)

PageRank is just part of the picture. Google uses many other metrics to rank websites, but those metrics are kept secret. Also, I don't see why having more outgoing links than incoming ones would harm your PageRank. Your links will likely be worth less to the referenced sites, but it shouldn't change anything for you. And if you don't want Google to follow your links, just use "nofollow". Manipulating search engine results is not that easy.

It's got a long way to go. (1)

idbeholda (2405958) | more than 2 years ago | (#38055342)

I'm not trying to be mean, just stating the facts. Out of the "billions" of crawled web pages, even common search phrases come up with results that are only a fraction of what can be pulled from a standard search with Google, Yahoo, Bing, etc. That's not to say the project is without its merits. It's a good idea, but I believe its developers are starting in the wrong place. The real money to be made from this kind of undertaking is NOT in creating a better search engine. This kind of project would be a financial boon if, instead of search results, it searched for and indexed raw data per user request.

Just my two cents.

Wonderful news (0)

Anonymous Coward | more than 2 years ago | (#38055392)

Insightful and informative, and wonderful news. Any challenge to the Seven Deadly Sinners is quite welcome. I refer to IBM, Oracle, Microsoft, Apple, Google, Adobe, and Facebook.

Wait, what? (5, Interesting)

zill (1690130) | more than 2 years ago | (#38055404)

From the article:

It currently consists of an index of 5 billion web pages, their page rank, their link graphs and other metadata, all hosted on Amazon EC2.

The crawl is collated using a MapReduce process, compressed into 100Mbyte ARC files which are then uploaded to S3 storage buckets for you to access. Currently there are between 40,000 and 50,000 filled buckets waiting for you to search.

Each S3 storage bucket is 5TB. [amazon.com]

5TB * 40,000 / 5 billion = 42MB/web page

Either they made a typo, my math is wrong, or they started crawling the HD porn sites first. I really hope it's not the latter because 200 petabytes of porn will be the death of so many geeks that the year of Linux on the desktop might never come.

Re:Wait, what? (2)

serialband (447336) | more than 2 years ago | (#38055680)

42 MB is not really that big for a "modern" webpage. People put a lot of images on their web pages these days. Add flash apps or forums to that, and many sites get quite big. Text only pages exist mainly in the realm of geeks. When you include sites like IBM, Apple, HP, Dell, etc... you're getting GBs of data.

Re:Wait, what? (3, Informative)

Amouth (879122) | more than 2 years ago | (#38056130)

For a modern "website" 42 MB isn't large, but for any single "webpage" it is quite large and not common - even with tons of images.

Re:Wait, what? (1)

Anonymous Coward | more than 2 years ago | (#38055684)

200 petabytes of porn

We need a mascot for such an invaluable resource. I vote we call it Petabear

Re:Wait, what? (0)

CmdrPony (2505686) | more than 2 years ago | (#38055840)

Not filled buckets, filled 100MB files. So their data takes about 4-5TB storage space.

Re:Wait, what? (1)

Prof.Phreak (584152) | more than 2 years ago | (#38056104)

...so how much would it cost (in dollars) to run a single MapReduce word count against that?

Also, why not do the torrent thing? E.g. 100 GB torrent dumps, with more added on a regular basis.

Re:Wait, what? (2)

Aloisius (1294796) | more than 2 years ago | (#38056126)

Because:

a) They'd have to pay to seed it

b) The data changes frequently (it is a web crawler after all)

c) Not everyone has servers necessary to process that much data, while anyone can use hadoop on amazon

Re:Wait, what? (1)

Anonymous Coward | more than 2 years ago | (#38056078)

Hi, sorry, there is a typo on the CC website. There are currently 323694 items in the current commoncrawl bucket (commoncrawl-002), and each file is very close to 100MB in size (the total bucket size is 32.3 TB). There are another 132133 items in our older bucket, which we will be moving over to the current bucket shortly.

Re:Wait, what? (0)

commoncrawl (2507836) | more than 2 years ago | (#38056326)

"Write, read, and delete objects containing from 1 byte to 5 terabytes of data each. The number of objects you can store is unlimited". As we mentioned in a separate comment below, there are actually 323694 files in our current bucket, which will grow to 455827 once we consolidate data from an older bucket. Each archive file is 100MB in size (compressed), and the average doc size is around 10K (compressed). As we move towards a more focused and sustained crawl in 2012, the counts in the bucket will continue to see significant growth. We also hope to augment the raw crawl with additional metadata that should make it possible to avoid a complete bulk scan if so desired.

Re:Wait, what? (1)

zill (1690130) | more than 2 years ago | (#38056474)

Thanks for the clarification. To be fair nothing was unclear on commoncrawl.org, it was the i-programmer.info blog that mistakenly wrote "between 40,000 and 50,000 filled buckets".

Re:Wait, what? (1)

russianmafia (575213) | more than 2 years ago | (#38057122)

There are 40,000-50,000 buckets that contain one compressed 100 Mbyte ARC file each, which works out to 4-5 TB of total data. So 5 TB / 5 billion pages = 1 KB of compressed data per page.

Re:Wait, what? (1)

happyscientist (2508556) | more than 2 years ago | (#38063270)

If there were "40,000 - 50,000 filled buckets" at 5TB per bucket that would mean: 5TB * 40,000 = 200,000TB or 200PB. That doesn't sound reasonable.
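A quick sanity check using only the corrected figures the Common Crawl folks posted above (this is just arithmetic on numbers quoted in this thread, nothing new):

<ecode>
# Back-of-the-envelope check using figures quoted elsewhere in this thread.
files_now    = 323694        # items currently in the commoncrawl-002 bucket
files_later  = 455827        # after the older bucket is folded in
file_size_mb = 100.0         # each ARC file is ~100 MB, compressed
pages        = 5e9           # "5 billion web pages"

total_tb_now   = files_now * file_size_mb / 1e6     # ~32.4 TB (matches 32.3 TB)
total_tb_later = files_later * file_size_mb / 1e6   # ~45.6 TB after the merge
per_page_kb    = total_tb_now * 1e9 / pages         # ~6.5 KB compressed/page
# (The official comment quotes ~10K average doc size, so presumably not all
# 5 billion pages are in the current bucket yet.)

print(total_tb_now, total_tb_later, per_page_kb)
</ecode>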

crawl quality? (0)

Anonymous Coward | more than 2 years ago | (#38058652)

5 billion pages sounds like a lot, but if it contains massive volumes of spam and duplicated posts then it is of less value.

Does anyone have a feel for the actual content density (if such a thing exists)?
