
MapReduce For the Masses With Common Crawl Data

timothy posted more than 2 years ago | from the gotta-be-here-some-place dept.

Programming 29

New submitter happyscientist writes "This is a nice 'Hello World' for using Hadoop MapReduce on Common Crawl data. I was interested when Common Crawl announced themselves a few weeks ago, but I was hesitant to dive in. This is a good video/example that makes it clear how easy it is to start playing with the crawl data."
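For a sense of the shape of such a job before watching the video, here is a minimal word-count-style sketch against the stock Hadoop API. Note this is an illustration only, not the tutorial's code: the real Common Crawl examples read the crawl's ARC files through a custom input format, while this sketch assumes plain text input so that the mapper/reducer/driver structure is visible.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CrawlWordCount {

      // Mapper: emits (word, 1) for every token in a line of input text.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reducer: sums the counts emitted for each word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "crawl word count");
        job.setJarByClass(CrawlWordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // combiner cuts shuffle traffic
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input and output locations are passed on the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Packaged into a jar, it runs as "hadoop jar yourjob.jar CrawlWordCount <input> <output>".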


29 comments


Happy Data Mining from the Golden Girls!!11! (-1, Offtopic)

Anonymous Coward | more than 2 years ago | (#38420980)

Thank you for being a friend
Traveled down the road and back again
your heart is true you're a pal and a cosmonaut

And if you threw a party
Invited everyone you knew
You would see, the biggest gift would be from me
and the card attached would say,
Thank you for being a friend

Re:Happy Data Mining from the Golden Girls!!11! (0)

Anonymous Coward | more than 2 years ago | (#38422802)

I'm pretty sure the word "Traveled" was not in the original song.

Re:Happy Data Mining from the Golden Girls!!11! (1)

geminidomino (614729) | more than 2 years ago | (#38423354)

Not sure if trolling (if not, well played), but it is.

Citation [youtube.com]

Thanks for posting this.. (1)

kvvbassboy (2010962) | more than 2 years ago | (#38421024)

This will be my first (and hopefully not last) headfirst dive into MapReduce.

Re:Thanks for posting this.. (5, Informative)

InsightIn140Bytes (2522112) | more than 2 years ago | (#38421056)

Then you probably want to use it with some local data so you don't rack up a huge bill. One Hadoop job on the whole dataset costs at least around $200, and that's for simple stuff.

Re:Thanks for posting this.. (1)

kvvbassboy (2010962) | more than 2 years ago | (#38421160)

Warning heeded, but I saw this on a blog post at commoncrawl.org. [commoncrawl.org]

This bucket is marked with the Amazon Requester-Pays flag, which means all access to the bucket contents requires an HTTP request that is signed with your Amazon Customer Id. The bucket contents are accessible to everyone, but the Requester-Pays restriction ensures that if you access the contents of the bucket from outside the EC2 network, you are responsible for the resulting access charges. You don't pay any access charges if you access the bucket from EC2, for example via a map-reduce job, but you still have to sign your access request. Details of the Requester-Pays API can be found here: http://docs.amazonwebservices.com/AmazonS3/latest/dev/index.html?RequesterPaysBuckets.html [amazonwebservices.com]

If I understood that right, at least getting started with the tutorial will not result in me coughing up $200. Correct me if I am mistaken.
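From the docs, signing the request looks roughly like this with the AWS SDK for Java, assuming an SDK version whose GetObjectRequest supports the Requester-Pays flag; the bucket and key names are placeholders, not the actual Common Crawl layout:

    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.s3.AmazonS3Client;
    import com.amazonaws.services.s3.model.GetObjectRequest;
    import com.amazonaws.services.s3.model.S3Object;

    public class RequesterPaysFetch {
      public static void main(String[] args) throws Exception {
        // The signature on the request identifies who pays for the transfer;
        // transfer is free from inside EC2, but the request must still be signed.
        AmazonS3Client s3 = new AmazonS3Client(
            new BasicAWSCredentials("YOUR_ACCESS_KEY", "YOUR_SECRET_KEY"));

        // Placeholder bucket/key -- check the Common Crawl docs for the real names.
        GetObjectRequest req = new GetObjectRequest("example-crawl-bucket", "segment/file.arc.gz");
        req.setRequesterPays(true);  // mark the request as Requester-Pays

        S3Object obj = s3.getObject(req);
        System.out.println("Got " + obj.getObjectMetadata().getContentLength() + " bytes");
        obj.getObjectContent().close();
      }
    }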

Re:Thanks for posting this.. (2)

InsightIn140Bytes (2522112) | more than 2 years ago | (#38421218)

You don't need to pay for accessing it, but you still need to pay for the processing power, storage, and RAM on your EC2 instances. Of course, you can start by only accessing a specific day like in the video, so you don't need as much processing power and hence pay less. But then you also won't be processing 99.9% of the crawl data.
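The easy way to keep the bill under control is simply what you point the job at: a single segment or day instead of the whole bucket. A hypothetical sketch (the s3n:// paths below are made up; the real layout is in the Common Crawl docs):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class InputSelection {
      // Cheap: point the job at one crawl segment/day only.
      static void useOneSegment(Job job) throws Exception {
        FileInputFormat.addInputPath(job, new Path("s3n://example-crawl-bucket/2010/09/25/"));
      }

      // Expensive: everything in the bucket -- this is where the big bills come from.
      static void useWholeCrawl(Job job) throws Exception {
        FileInputFormat.addInputPath(job, new Path("s3n://example-crawl-bucket/"));
      }
    }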

How Does One Profile a MapReduce Job? (2)

MichaelCrawford (610140) | more than 2 years ago | (#38421252)

I think any total newbie who tried to process all the crawl data would soon find that his first attempt would not terminate until after The Heat Death of the Universe.

Surely there must be some doc on how to make such jobs run faster, use less memory, and use less storage?
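From what I can tell, the usual first knobs are a combiner (partial aggregation on the map side), compressed intermediate output, and Hadoop's built-in task profiler. A rough sketch, using the 0.20/1.x-era property names (newer releases have renamed some of these):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class TuningSketch {
      static Job makeJob() throws Exception {
        Configuration conf = new Configuration();

        // Compress intermediate map output to cut shuffle I/O.
        conf.setBoolean("mapred.compress.map.output", true);

        // Ask Hadoop to run a couple of map and reduce tasks under the
        // built-in JVM profiler and collect the results for you.
        conf.setBoolean("mapred.task.profile", true);
        conf.set("mapred.task.profile.maps", "0-1");
        conf.set("mapred.task.profile.reduces", "0-1");

        Job job = new Job(conf, "tuned job");
        // A combiner does partial aggregation before the shuffle; when the
        // reduce is associative (sums, counts) it's usually the biggest win.
        // job.setCombinerClass(MyReducer.class);  // MyReducer is hypothetical
        return job;
      }
    }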


Actually, entry-level EC2 is free for 1 year (1)

tlambert (566799) | more than 2 years ago | (#38430684)

Actually, entry-level EC2 is free for 1 year, and has been since Nov. 2010.

You don't need to pay for accessing it, but you still need to pay for the processing power, storage and RAM in your EC2

See here:
http://www.infoworld.com/d/cloud-computing/amazon-web-services-offers-ec2-access-no-charge-531 [infoworld.com]

-- Terry

Re:Thanks for posting this.. (0)

Anonymous Coward | more than 2 years ago | (#38421636)

Definitely not $200. I did the tutorial and it cost less than a dollar.

Re:Thanks for posting this.. (1)

symbolset (646467) | more than 2 years ago | (#38421346)

MapReduce is an implementation of an algorithm first presented in a 1970s ACM journal. I would commend to startups membership, and ownership of the patent-expired content composed therein. There's a lot of untapped potential in there yet - and much dross. If we are to stand on the shoulders of giants, though, it's good to know where the giants were and what they did. Brin was a good scholar here, and Page gave something new. It was the fusion of old ideas and new that made Google. If you want to be the new Google, the ACM journals are a good start. Just remember to add something new too.

Re:Thanks for posting this.. (0)

Anonymous Coward | more than 2 years ago | (#38422714)

I'm not sure MapReduce is an algorithm so much as a class of algorithms, or a design pattern. The algorithm implemented depends on the Map and Reduce operators used. MapReduce falls out very easily from considering which parts of an algorithm can be parallelized. Even embarrassingly parallel algorithms like a raytracer eventually have to combine their results to form a single image. This is all MapReduce is really expressing. I'm sure you can find it before 1970.
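To make that concrete: the skeleton stays fixed and the plugged-in operators are the algorithm. A toy, sequential illustration in plain Java (the real thing additionally distributes the work and groups map output by key before reducing, which this skips):

    import java.util.Arrays;
    import java.util.List;
    import java.util.function.BinaryOperator;
    import java.util.function.Function;

    public class MapReducePattern {
      // The whole "framework": apply map to every input, then fold with reduce.
      static <A, B> B mapReduce(List<A> input, Function<A, B> map,
                                BinaryOperator<B> reduce, B identity) {
        B acc = identity;
        for (A a : input) {
          acc = reduce.apply(acc, map.apply(a));
        }
        return acc;
      }

      public static void main(String[] args) {
        List<String> lines = Arrays.asList("the quick brown fox", "jumps over", "the lazy dog");

        // Same skeleton, different operators => different "algorithm".
        int totalWords = mapReduce(lines, s -> s.split("\\s+").length, Integer::sum, 0);
        int longestLine = mapReduce(lines, String::length, Math::max, 0);

        System.out.println(totalWords + " words, longest line " + longestLine + " chars");
      }
    }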

LIL KIM KILLS SELF !! (-1, Offtopic)

Anonymous Coward | more than 2 years ago | (#38421064)

And N Korea doesn't notice !!

Put that in your owndahl heater and smoke it !!

Long live the king !!

Hot Damn! Now I Can Find All The Pr0n Of My Misspe (-1, Offtopic)

MichaelCrawford (610140) | more than 2 years ago | (#38421104)

-nt Youth!

It's not hard at all to find PDFs of recently published Magazines of Ill Repute, but because what for little boys and little girls is just an implicit fascination became an explicit one for me in 1976, the magazines that taught me about what awaited me at The Promised Land are very hard to find.

One can generally find videos as what are called The Classics are now available on DVD and sometimes even on Blu-ray, but Classic Magazines require manual scanning.

If any of you guys happens to own a copy of the February 1976 Club Magazine USA, I would really like to someday finish reading a short story in which some guy in full SCUBA gear makes love to a rapturously beautiful semitransparent female ghost. I was once a diver myself, and still have my NAUI Open Water Card, but never in my life have I found a wetsuit that was equipped for quite that same kind of diving.

Re:Hot Damn! Now I Can Find All The Pr0n Of My Mis (1)

wmbetts (1306001) | more than 2 years ago | (#38421114)

So you're hoping to find child porn, ghost fetish stuff, or both?

The models in that mag were all over 18 (-1, Troll)

MichaelCrawford (610140) | more than 2 years ago | (#38421178)

Nothing to do with necrophilia, but that was the first time that I ever learned that erotica could be presented in written form. I was heavily into that stuff throughout my teenage years:

"Dear Pentouse, I Never Thought It Could Happen To Me. I was hanging out at /. when one of the rare lady Slashbots moderated me up to +5, Erect."

No, Really I Am Absolutely Serious (0)

MichaelCrawford (610140) | more than 2 years ago | (#38421234)

The problem I've got is that searches with Google and the like turn up a lot of junk that I'm not looking for, while file search engines like FilesTube simply ignore the numeric years specified in my search queries.

What I want to do is find PDF files of specific issues (Month and Year combinations) of certain magazine titles. But when I try these searches, the results contain a lot of years that I had not specified, with the year I did specify not falling anywhere in the resulting pages.

There are all kinds of ways I could use a regular expression to turn up the download sites for every magazine I want to find, but to the best of my knowledge none of the currently available search engines accept regular expressions. For example, a copy of an old magazine that is available for download will typically be labeled with the file size, so I could include "[0-9]* *[Mm][Bb]" in my query. That would distinguish magazines available for download from those that are merely discussed online.
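Since no search engine will take the regex directly, the fallback I can see is to over-fetch results and filter them locally. A tiny, hypothetical filter around that size-label idea (tightened to require at least one digit):

    import java.util.regex.Pattern;

    public class SizeLabelFilter {
      // Matches size labels like "25 MB" or "142mb" that download listings
      // typically attach to a file; pages that merely mention a title won't have one.
      private static final Pattern SIZE_LABEL = Pattern.compile("[0-9]+ *[Mm][Bb]");

      static boolean looksDownloadable(String pageText) {
        return SIZE_LABEL.matcher(pageText).find();
      }

      public static void main(String[] args) {
        System.out.println(looksDownloadable("Issue discussed in a forum thread"));  // false
        System.out.println(looksDownloadable("somefile.pdf  142 MB  added 2011"));   // true
      }
    }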

Just last night I sat up all night long looking for old magazines, only to turn up two from the era of my interest. There is no end to the pr0n that is available online, but I don't find most of it at all interesting anymore. What I do find interesting is my goal of completely recovering the magazine collection that my ex demanded that I pitch before she would agree to come visit me for the first time.

Re:No, Really I Am Absolutely Serious (1)

larry bagina (561269) | more than 2 years ago | (#38421342)

"index of" and "parent directory" are good terms for finding virtual directory listings.

Heh. Now That's Really Funny. Thanks for the Tip! (1)

MichaelCrawford (610140) | more than 2 years ago | (#38422442)

wget is chugging away even as I speak. I'm gonna have to cough up for more storage.

Here is an SEO tip for y'all. I didn't discover it, but I stumbled across it just now:

Placing the terms "index of", "parent directory", "name", "last modified", "size", and "description" on your web pages is a real good way to attract visitors.

I wasn't able to turn up any actual Apache directory listings for Penthouse Pet of the Year Corinne Alphen. They were all your typical pr0n sites that weren't presenting directory listings, and none of the sites I looked at had any photos of her, scantily clad or otherwise.

Directory listings for well-known models though, turned right up.

Re:Heh. Now That's Really Funny. Thanks for the Ti (0)

Anonymous Coward | more than 2 years ago | (#38423708)

add -.htm? and -.php to your searches

Greetings for South Korea! (-1)

Anonymous Coward | more than 2 years ago | (#38421196)

We are dancing in the street! What a wonderful day!

America, Fuck Yeah! (1)

Luke727 (547923) | more than 2 years ago | (#38421340)

Gary Johnston: We're dicks! We're reckless, arrogant, stupid dicks. And the Film Actors Guild are pussies. And Kim Jong Il is an asshole. Pussies don't like dicks, because pussies get fucked by dicks. But dicks also fuck assholes: assholes that just want to shit on everything. Pussies may think they can deal with assholes their way. But the only thing that can fuck an asshole is a dick, with some balls. The problem with dicks is: they fuck too much or fuck when it isn't appropriate - and it takes a pussy to show them that. But sometimes, pussies can be so full of shit that they become assholes themselves... because pussies are an inch and half away from ass holes. I don't know much about this crazy, crazy world, but I do know this: If you don't let us fuck this asshole, we're going to have our dicks and pussies all covered in shit!

You know you're old when (0, Interesting)

Anonymous Coward | more than 2 years ago | (#38421662)

more than 50% of any given sentence sounds like gibberish. And yet you know someone somewhere is as excited as you were when you got your first floppy drive...

Regarding crawling (3, Interesting)

gajop (1285284) | more than 2 years ago | (#38422562)

Hmm, this is a similar topic, so I'll ask a question of a personal nature.

I recently created a crawler to collect certain information from a website, to help me gather data sets for a small machine learning project.
While I respected robots.txt and nofollow links, the site's TOU was against it. After checking with the admin, I was told that gathering the information is not allowed, as the site owns it (as written in the TOU).

The data, however, is publicly available, so you wouldn't actually have to agree to the TOU to collect it. Since it's data I wanted, I concluded I should at least get a small sample (less than 1% of the total, around 200MB) to see whether anything can even be done with it.

What are your thoughts, /.? Should I have abandoned the attempt, did I do the right thing, or should I even disregard their plea and simply take as much as I please (over a long period of time, so as not to hammer their bandwidth)?
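For context on the "respected robots.txt" part, this is roughly all my crawler does before touching a URL; a minimal sketch that only handles User-agent: * and Disallow prefixes (a real parser handles much more), with example.com as a stand-in host:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.List;

    public class PoliteFetcher {
      // Disallow prefixes collected from the site's robots.txt for User-agent: *.
      private final List<String> disallowed = new ArrayList<String>();

      PoliteFetcher(String siteRoot) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(
            new URL(siteRoot + "/robots.txt").openStream()));
        boolean inStarGroup = false;
        String line;
        while ((line = in.readLine()) != null) {
          line = line.trim();
          if (line.toLowerCase().startsWith("user-agent:")) {
            inStarGroup = line.substring("user-agent:".length()).trim().equals("*");
          } else if (inStarGroup && line.toLowerCase().startsWith("disallow:")) {
            String path = line.substring("disallow:".length()).trim();
            if (!path.isEmpty()) {
              disallowed.add(path);
            }
          }
        }
        in.close();
      }

      // True if the path is not covered by any Disallow prefix.
      boolean allowed(String path) {
        for (String prefix : disallowed) {
          if (path.startsWith(prefix)) {
            return false;
          }
        }
        return true;
      }

      public static void main(String[] args) throws Exception {
        PoliteFetcher fetcher = new PoliteFetcher("http://example.com");
        System.out.println(fetcher.allowed("/some/page"));
        Thread.sleep(2000);  // crawl slowly between requests so you don't hammer the site
      }
    }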

Re:Regarding crawling (1)

SmurfButcher Bob (313810) | more than 2 years ago | (#38426216)

Felony.

Re:Regarding crawling (1)

AllyGreen (1727388) | more than 2 years ago | (#38426442)

If it's publicly available, surely you can get the data elsewhere?

I have no idea (1)

Roachie (2180772) | more than 2 years ago | (#38427240)

what this is.
