Huge Site Ranking Dataset Donated to the Common Crawl Foundation

Unknown Lamer posted about 2 years ago | from the fuzzy-feelings dept.

The Internet 23

Greg Lindahl writes "blekko is donating search engine ranking data for 140 million domains and 22 billion URLs to the Common Crawl Foundation. Common Crawl is a non-profit dedicated to making the greatest (yet messiest) dataset of our time, the web, available to everyone, including tinkerers, hackers, activists, and new companies. blekko's ranking data will initially be used to improve the quality of Common Crawl's 8-billion-webpage public crawl of the web, and eventually will be directly available to the public."

Can I Get A Copy ... (0)

Anonymous Coward | about 2 years ago | (#42336365)

Can I get a copy of the 'Leather Bound 1st Edition'? I'd prefer free shipping.

I didn't know (2)

plover (150551) | about 2 years ago | (#42336455)

I didn't realize the web wasn't available to everyone, including tinkerers, hackers, activists, and new companies. Thank $(DEITY) the Common Crawlers are here to make sure that my port 80 hasn't yet been pried from my cold, dead fingers.

Re:I didn't know (2)

Nyder (754090) | about 2 years ago | (#42336623)

I didn't realize the web wasn't available to everyone, including tinkerers, hackers, activists, and new companies. Thank $(DEITY) the Common Crawlers are here to make sure that my port 80 hasn't yet been pried from my cold, dead fingers.

I have no idea what is going on here. I am stoned (I live in Washington State, it's like law or something) so I'll admit that maybe I'm not in the right state for thinking. (I did a pun there, sorry)

I did actually go to the web site, saw they were hiring, and read the FAQ.

Still have no idea why they are doing what they are doing. Hoping someone will explain the purpose of Common Crawlers in terms I can understand and maybe a car analogy.

Re:I didn't know (1)

TheRealMindChild (743925) | about 2 years ago | (#42336899)

It is a data set about the internet. This saves you from having to crawl the web, analyze it, and build your own database.

Re:I didn't know (5, Informative)

L1s4 (2798519) | about 2 years ago | (#42336957)

The idea is to give everyone access to crawl data. If you work at a large search company, you have access to crawl data. You can also set up crawlers to get the data yourself, but that is expensive, and having countless crawlers doing duplicative work is not ideal. Our idea is that there should be one common repository for crawl data that anyone can use. Researchers are using it for NLP, IR, sentiment analysis, and many other things, like measuring the adoption of metadata formats: http://www.webdatacommons.org/ [webdatacommons.org]. Educators are using it as a real-world dataset to teach big data techniques in the classroom. Developers and entrepreneurs are using it for startups. Sorry I don't have a car analogy :) Feel free to email me if you have any other questions: lisa at commoncrawl dot org

Re:I didn't know (1)

Nyder (754090) | about 2 years ago | (#42337729)

The idea is to give everyone access to crawl data. If you work at a large search company, you have access to crawl data. You can also set up crawlers to get the data yourself, but that is expensive, and having countless crawlers doing duplicative work is not ideal. Our idea is that there should be one common repository for crawl data that anyone can use. Researchers are using it for NLP, IR, sentiment analysis, and many other things, like measuring the adoption of metadata formats: http://www.webdatacommons.org/ [webdatacommons.org]. Educators are using it as a real-world dataset to teach big data techniques in the classroom. Developers and entrepreneurs are using it for startups.

Sorry I don't have a car analogy :) Feel free to email me if you have any other questions: lisa at commoncrawl dot org

Thanks, that explains it better.

Re:I didn't know (0)

Anonymous Coward | about 2 years ago | (#42338511)

Car analogy: it is like how $250k spent on one bus is more efficient for more people than 120 cars at $20k each would be.

Re:I didn't know (1)

plover (150551) | about 2 years ago | (#42337393)

I was actually mocking the Slashdot story editor for claiming they were providing a copy of the web instead of providing a copy of web metadata obtained by crawling.

Here's your car analogy: Web crawling is like a guy driving down every street in town, taking a picture of every vehicle he sees, and analyzing the pictures to figure out make, model, year, license plate number, etc. The guy can either choose to sell the information, or he can make it freely available under a Creative Commons license and publish it on the web.

So if you want to open a gas station and are wondering if you should install a diesel fuel pump, you have four choices: drive all the streets yourself to count all the diesel cars, pay the one guy to find out how many diesel cars he counted, download a copy of the freely available car information, or make a guess. Only one of these choices costs nothing yet leads to making an informed decision.

Re:I didn't know (0)

Anonymous Coward | about 2 years ago | (#42336643)

Yeah, having to use an Amazon map-reduce node to access it makes it harder to access than just going to port 80. Make a torrent of it. A really big torrent.

Re:I didn't know (0)

Anonymous Coward | about 2 years ago | (#42337019)

I don't think you understand how big the data is.

Re:I didn't know (0)

Anonymous Coward | about 2 years ago | (#42338123)

I'm curious whether that website will include itself and its dataset in the dataset...

Kudos to the blekko team (0)

Anonymous Coward | about 2 years ago | (#42336599)

I've met some of the guys from blekko, and they do solid work, for the web as well as for their site. QED.

How does Common Crawl compare w/ Internet Archive? (0)

Anonymous Coward | about 2 years ago | (#42336689)

How does this project compare with the Internet Archive [archive.org]?

Re:How does Common Crawl compare w/ Internet Archi (3, Funny)

Anonymous Coward | about 2 years ago | (#42336767)

How does this project compare with the Internet Archive [archive.org]?

commoncrawl.org will be available on archive.org a lot longer than it will be available on commoncrawl.org

Re:How does Common Crawl compare w/ Internet Archi (3, Informative)

L1s4 (2798519) | about 2 years ago | (#42336881)

Hi, I work at Common Crawl. Internet Archive is awesome and does really important work. The main difference between us and Internet Archive is that you can analyze our data. Internet Archive is a vault and is not available on a platform where you can run jobs against it. Because we put it on Amazon and other compute platforms, anyone can access our data and run jobs against it. If you wanted to do that with Internet Archive's crawl, you would have to ask permission, get permission, and download it to your personal data center in order to analyze it. I don't know too many people with a personal data center :) Lisa

Re:How does Common Crawl compare w/ Internet Archi (0)

Anonymous Coward | about 2 years ago | (#42336933)

CommonCrawl is awesome! We have already used it for multiple projects at my work. Very cool!

Re:How does Common Crawl compare w/ Internet Archi (0)

Anonymous Coward | about 2 years ago | (#42337281)

Because we put it on Amazon and other compute platforms, anyone can access our data and run jobs against it. I don't know too many people with a personal data center :)
Lisa

Thanks. Hopefully you do know a few people with access to a data centre...like AWS.

Re:How does Common Crawl compare w/ Internet Archi (1)

colin_faber (1083673) | about 2 years ago | (#42338073)

Hi, since you work there, do you have any idea how much data we're actually talking about here?

I always wanted to be ranked... (1)

3seas (184403) | about 2 years ago | (#42336753)

... it's better than being judged.

Blekko has horrible search results (0)

Anonymous Coward | about 2 years ago | (#42336839)

Blekko has pretty horrible search results compared to Google, Bing or even Duck Duck Go. So while the raw data may be useful, the actual ranking data should not be trusted as being anywhere close to good.

Wikipedia already has a ranking of huge sites (0)

Anonymous Coward | about 2 years ago | (#42336965)

Here. [wikipedia.org]

Re:Wikipedia already has a ranking of huge sites (-1)

Anonymous Coward | about 2 years ago | (#42337709)

Alexa has a better list: Top 1 Million websites [alexa.com], and you can download the CSV file too.

Yacy (0)

Anonymous Coward | about 2 years ago | (#42338709)

What implications (if any) does this have for Yacy?

http://yacy.net/en/ [yacy.net] (the distributed search engine)
