MapReduce For the Masses With Common Crawl Data - Slashdot

MapReduce For the Masses With Common Crawl Data

Posted by timothy on Sunday December 18, 2011 @10:54PM from the gotta-be-here-some-place dept.

New submitter happyscientist writes "This is a nice 'Hello World' for using Hadoop MapReduce on Common Crawl data. I was interested when Common Crawl announced themselves a few weeks ago, but I was hesitant to dive in. This is a good video/example that makes it clear how easy it is to start playing with the crawl data."
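For readers who have not seen the pattern before, here is a minimal sketch of the map, shuffle, and reduce steps behind a "Hello World" word count. It is plain Python that runs locally on in-memory strings, not Hadoop code and not the code from the linked video; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every whitespace-separated word."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework would between map and reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Collapse the list of counts for one word into a single total."""
    return key, sum(values)


def word_count(documents: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on a small local dataset."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["hello world", "Hello crawl"]))
```

Trying the logic on a handful of local strings like this costs nothing, which is in the spirit of the advice further down the thread about starting small.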

This discussion has been archived. No new comments can be posted.

- - Re: (Score:2)
    
    by geminidomino ( 614729 ) writes:
    
    Not sure if trolling (if not, well played), but it is.
    Citation [youtube.com]
Thanks for posting this.. (Score:2)

by kvvbassboy ( 2010962 ) writes:

This will be my first (and hopefully not last) headfirst dive into MapReduce.
- Re:Thanks for posting this.. (Score:5, Informative)
  
  by InsightIn140Bytes ( 2522112 ) writes: on Sunday December 18, 2011 @11:16PM (#38421056)
  
  Then you probably want to use it with some local data so you don't rack up a huge bill. One Hadoop job on the whole dataset costs at least around $200, and that's for simple stuff.
  
  - Re: (Score:2)
    
    by kvvbassboy ( 2010962 ) writes:
    
    Warning heeded, but I saw this on a blog post at commoncrawl.org. [commoncrawl.org]
    This bucket is marked with the Amazon Requester-Pays flag, which means all access to the bucket contents requires an HTTP request that is signed with your Amazon Customer Id. The bucket contents are accessible to everyone, but the Requester-Pays restriction ensures that if you access the contents of the bucket from outside the EC2 network, you are responsible for the resulting access charges. You don’t pay any access charges if you access the bucket from EC2, for example via a map-reduce job, but you still have to sign your access request. Details of the Requester-Pays API can be found here: http://docs.amazonwebservices.com/AmazonS3/latest/dev/index.html?RequesterPaysBuckets.html [amazonwebservices.com]
    If I understood that right, at least getting started with the tutorial will not result in me coughing up $200. Correct me if I am mistaken.
    - Re: (Score:2)
      
      by InsightIn140Bytes ( 2522112 ) writes:
      
      You don't need to pay for accessing it, but you still need to pay for the processing power, storage and RAM in your EC2. Of course you can start by only accessing a specific day like in the video, so you don't need as much processing power and hence pay less. But then you also won't be able to process 99.9% of the crawl data.
      - How Does One Profile a MapReduce Job? (Score:3)
        
        by MichaelCrawford ( 610140 ) writes:
        
        I think any total newbie that tried to process all the crawl data would soon find that his first attempt would not terminate until after The Heat Death of the Universe.
        Surely there must be some doc on how to make such jobs run faster and use less memory and less storage?
      - Actually, entry Level EC2 is free for 1 year (Score:2)
        
        by tlambert ( 566799 ) writes:
        
        Actually, entry-level EC2 is free for 1 year, and has been since Nov. 2010.
        You don't need to pay for accessing it, but you still need to pay for the processing power, storage and RAM in your EC2
        See here:
        http://www.infoworld.com/d/cloud-computing/amazon-web-services-offers-ec2-access-no-charge-531 [infoworld.com]
        -- Terry
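To put rough numbers on the cost discussion above: the sketch below takes the "at least $200" figure quoted in this thread for a simple job over the whole dataset and scales it by the fraction of data processed. The linear scaling is an assumption made here for illustration, not something the thread or Amazon states, and `estimate_cost` is a hypothetical helper. Real bills depend on instance type, job runtime, and storage.

```python
def estimate_cost(full_job_cost_usd: float, fraction_of_data: float) -> float:
    """Scale a whole-dataset cost estimate down to a subset.

    Assumes cost grows linearly with the amount of data processed,
    which is a simplification made for this sketch.
    """
    if not 0.0 <= fraction_of_data <= 1.0:
        raise ValueError("fraction_of_data must be between 0 and 1")
    return full_job_cost_usd * fraction_of_data


if __name__ == "__main__":
    # $200 is the commenter's rough figure for a simple full-dataset job;
    # 0.001 matches the remark that one day leaves 99.9% of the crawl unprocessed.
    print(f"${estimate_cost(200.0, 0.001):.2f}")
```

Under those assumptions a single-day trial run would cost well under a dollar, before counting any free-tier allowance.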
- Re: (Score:2)
  
  by symbolset ( 646467 ) * writes:
  
  MapReduce is an implementation of an algorithm first presented in a 1970s issue of the ACM. I would commend to startups membership and ownership of the patent-expired content composed therein. There's a lot of untapped potential in there yet - and much dross. If we will stand on the shoulders of giants though it's good to know where the giants were and what they did. Brin was a good scholar here, and Page gave something new. It was the fusion of old ideas and new that made Google. If you want to be t
- Re: (Score:2)
  
  by wmbetts ( 1306001 ) writes:
  
  So you're hoping to find child porn, ghost fetish stuff, or both?
- No, Really I Am Absolutely Serious (Score:1)
  
  by MichaelCrawford ( 610140 ) writes:
  
  The problem I've got is that searches with Google and the like turn up a lot of junk that I'm not looking for, with the file search engines like FilesTube simply ignoring the numeric years specified in my search queries.
  What I want to do is find PDF files of specific issues (Month and Year combinations) of certain magazine titles. But when I try these searches, the results contain a lot of years that I had not specified, with the year I did specify not falling anywhere in the resulting pages.
  There are all
  - Re: (Score:1)
    
    by larry bagina ( 561269 ) writes:
    
    "index of" and "parent directory" are good terms for finding virtual directory listings.
    - Heh. Now That's Really Funny. Thanks for the Tip! (Score:2)
      
      by MichaelCrawford ( 610140 ) writes:
      
      WGet is chugging away even as I speak. I'm gonna have to cough up for more storage.
      Here is an SEO tip for y'all. I didn't discover it, but I stumbled across it just now:
      placing the terms "index of", "parent directory", "name", "last modified", "size" and "description" on your web pages is a real good way to attract visitors.
      I wasn't able to turn up any actual Apache directory listings for Penthouse Pet of the Year Corinne Alphen. They were all your typical pr0n site that not only weren't presenting di
Regarding crawling (Score:3, Interesting)

by gajop ( 1285284 ) writes: on Monday December 19, 2011 @06:16AM (#38422562)

Hmm, similar article, so I'll ask a question of a personal nature.
I've recently created a crawler to collect certain information from a website, to help me gather data sets for a small machine learning project.
While I've followed robots.txt and nofollow links, the site's TOU was against it. After confirming with the admin, I was told that gathering information is not allowed, as the site owns it (as written in the TOU).
The data, however, is publicly available, so you wouldn't actually have to agree to a TOU to collect it. Since it's data I wanted, I still concluded I should get at least a small sample (less than 1% of the total data, around 200MB) to see whether anything can even be done with it.
What are your thoughts, /.? Should I have abandoned the attempt, did I do the right thing, or should I even disregard their plea and simply get as much as I please (over a long period of time, so as not to hammer its bandwidth)?
- Re: (Score:2)
  
  by SmurfButcher Bob ( 313810 ) writes:
  
  Felony.
- Re: (Score:2)
  
  by AllyGreen ( 1727388 ) writes:
  
  If it's publicly available, surely you can get the data elsewhere?
I have no idea (Score:1)

by Roachie ( 2180772 ) writes:

what this is.
