Yahoo Releases Open Source Hadoop Distribution

timothy posted more than 5 years ago | from the spread-it-out-in-little-chunks dept.

Programming 49

ruphus13 writes "Yahoo has been a vociferous Apache Hadoop user and supporter for several years now, and uses it extensively within its Search technologies. Hadoop has been gaining popularity in the Cloud Computing space, with companies like the NYTimes converting 4TB and 11 million articles to PDFs in under 24 hours using Hadoop and EC2 in late 2007. Hadoop has been made available in Amazon's cloud and Yahoo has now released its own Hadoop version. From the article: 'At today's Hadoop Summit in Silicon Valley, Yahoo! announced the availability of the Yahoo! Distribution of Hadoop, a source-only version of Apache Hadoop that Yahoo! uses within its own search engine. [Hadoop] is an open source software framework that helps process very large data sets, and is widely used in large-scale data mining applications as well as in search tools at sites like Facebook and many others. For developers and users interested in Hadoop, it's worth noting that the Yahoo! Distribution of Hadoop has been widely tested and developed at Yahoo! for years now.'"

No Frist Prost (-1, Troll)

Anonymous Coward | more than 5 years ago | (#28286081)

This news is so boring nobody is even bothering to FRIST PSOT!

g0 fuk y0ur$e1f (-1, Troll)

Anonymous Coward | more than 5 years ago | (#28286185)

suck my niggerdick mac fag

Goatse releases giant turd (-1, Troll)

Anonymous Coward | more than 5 years ago | (#28286097)

More useful [goatse.fr]

open source software (-1, Troll)

Anonymous Coward | more than 5 years ago | (#28286171)

is used to convert 1 million documents from what was likely an open standard to a closed proprietary standard. lol.

Re:open source software (2, Informative)

blueskies (525815) | more than 5 years ago | (#28295381)

Fail!

ISO 32000-1:2008: PDF [wikipedia.org]

Because stitching together numerous TIFF files on your own is so much better!

Timely article (3, Informative)

C18H27NO3 (1282172) | more than 5 years ago | (#28286291)

Perhaps the Ask Slashdot inquirer in this [slashdot.org] thread will find this news useful.

Hadoop? (5, Insightful)

ickleberry (864871) | more than 5 years ago | (#28286345)

Can we bring back the ordinary, sensible pre-Web 2.0 names please?

Re:Hadoop? (2, Informative)

fancellu (712538) | more than 5 years ago | (#28286423)

It's the name of the main developer's kid's toy elephant.

Re:Hadoop? (3, Funny)

ickleberry (864871) | more than 5 years ago | (#28286473)

Wow. I'd never have thunk it. Thought it was short for "I had a poop"

Re:Hadoop? (2, Funny)

whereiswaldo (459052) | more than 5 years ago | (#28289277)

... and the Slashdot mob's credibility spirals downward...

Re:Hadoop? (0)

Anonymous Coward | more than 5 years ago | (#28286485)

Here is what I do:

refer to it by its name, backwards: Poo'd,Ha!

Re:Hadoop? (4, Insightful)

Just Some Guy (3352) | more than 5 years ago | (#28286915)

Like Yahoo!?

Re:Hadoop? (0)

Anonymous Coward | more than 5 years ago | (#28288417)

I don't take no lip from yahoos with such an attitude like yours, boy.

Re:Hadoop? (2, Funny)

phantomfive (622387) | more than 5 years ago | (#28286969)

Only if you are vociferous about it.

Hadoop is awesome (5, Informative)

fancellu (712538) | more than 5 years ago | (#28286363)

Not only is it used by Yahoo, but also by Facebook, who get 15TB of new data a day to handle. Check out the very useful free vids from Cloudera. http://www.cloudera.com/hadoop-training-thinking-at-scale [cloudera.com] You can download a canned VM preloaded with Hadoop/Pig/Hive goodness, even a copy of Eclipse preconfigured. http://www.cloudera.com/hadoop-training-virtual-machine [cloudera.com]

Re:Hadoop is awesome (0)

Anonymous Coward | more than 5 years ago | (#28286975)

15TB a day? bullshit.

Re:Hadoop is awesome (2, Insightful)

Anonymous Coward | more than 5 years ago | (#28287439)

It's actually 25TB of photos per week [facebook.com] . He was only out by a factor of 4. Of course, this data isn't indexed, so he had no point.
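The "factor of 4" can be checked with quick arithmetic. A minimal sketch, taking the two figures quoted in this thread at face value (15 TB per day versus 25 TB per week); the variable names are made up for illustration:

```python
# Figures as quoted in the thread; neither is independently verified here.
claimed_tb_per_day = 15.0   # figure from the parent comment
photos_tb_per_week = 25.0   # figure cited in this reply

# Convert the weekly figure to a daily rate, then compare the two.
photos_tb_per_day = photos_tb_per_week / 7
ratio = claimed_tb_per_day / photos_tb_per_day

print(f"{photos_tb_per_day:.2f} TB/day, ratio {ratio:.1f}x")
```

Note that the two figures may not measure the same thing (photo uploads versus data loaded into a cluster), so the ratio only compares the numbers as stated.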

Re:Hadoop is awesome (2, Informative)

DamnStupidElf (649844) | more than 5 years ago | (#28287733)

"Hive/Hadoop cluster at Facebook stores more than 2PB of uncompressed data and routinely loads 15 TB [facebook.com] of data daily."

Re:Hadoop is awesome (2, Interesting)

zerocool^ (112121) | more than 5 years ago | (#28296385)

We also use it extensively at Rackspace Email division. We generate about 200GB/day of logs from postfix and dovecot installs, and hadoop with mapreduce allows us to pull all sorts of metrics and diagnostic information in very short timeframes. It helps our customer facing support reps, as well as allows us to give more demanding customers the statistics and metrics that they want, plus it helps us with capacity planning and a bunch of other stuff.

And it's designed to run on commodity hardware.

http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data [highscalability.com]

~Wx
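For readers wondering what pulling metrics out of mail logs with MapReduce looks like, here is a minimal in-memory sketch of the map, shuffle, and reduce phases. It is plain Python rather than Hadoop itself, and the log line format, the sample data, and every function name are assumptions made up for illustration; real postfix and dovecot logs carry many more fields, and this is not a description of the Rackspace setup.

```python
from collections import defaultdict

# Hypothetical, heavily simplified log lines (assumed format, not real postfix output).
SAMPLE_LOG = [
    "postfix/smtp status=sent",
    "postfix/smtp status=bounced",
    "dovecot/imap status=login",
    "postfix/smtp status=sent",
]

def map_line(line):
    """Map step: emit one ((service, status), 1) pair per log line."""
    service, status = line.split(" status=")
    yield (service, status), 1

def shuffle(pairs):
    """Shuffle step: group values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_counts(key, values):
    """Reduce step: sum the counts collected for one key."""
    return key, sum(values)

def run_job(lines):
    """Run map, shuffle, and reduce in sequence on a list of lines."""
    mapped = (pair for line in lines for pair in map_line(line))
    return dict(reduce_counts(k, v) for k, v in shuffle(mapped).items())

counts = run_job(SAMPLE_LOG)
print(counts)
```

On a real cluster the map and reduce functions would run in parallel on many machines over chunks of the log data, with the framework handling the shuffle; the per-line and per-key logic stays roughly this small.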

Re:Hadoop is awesome (0)

Anonymous Coward | more than 5 years ago | (#28298707)

lemme check... yahoo... amazon... nope. not a single mention of google.

that hadopi thing must be worthless.

HBase 0.20 (1)

Stile 65 (722451) | more than 5 years ago | (#28286425)

I think HBase 0.20 is being released today as well, with a new and much faster file format, better memory management and better availability.

Obligatory Java is Slow Comment (1, Funny)

Anonymous Coward | more than 5 years ago | (#28286483)

Java is slow. How could it possibly be used to process so much data?

Re:Obligatory Java is Slow Comment (-1, Troll)

Anonymous Coward | more than 5 years ago | (#28286553)

Slowly. Slow, like how you penetrate an anus with your penis slow.

Re:Obligatory Java is Slow Comment (2, Informative)

dintech (998802) | more than 5 years ago | (#28290575)

Java has its faults but being slow is no longer one of them. You should do some googling.

Re:Obligatory Java is Slow Comment (1)

mini me (132455) | more than 5 years ago | (#28299535)

Hence the need for a 10,000-machine Hadoop cluster to do the work of a single machine running a C++ application. Or something like that.

Yahoo! and OSS (5, Insightful)

Alethes (533985) | more than 5 years ago | (#28286561)

Yahoo! really does get a lot of flak around here, but I have to say, they have contributed quite a bit of free and open-source software for developers to use. The list of APIs and web services that are available is quite impressive and many of them are better than Google's similar offerings (BOSS vs Google's AJAX search, for example). For anybody who's interested, I really recommend checking out the Yahoo! Developer Network [yahoo.com] site.

Re:Yahoo! and OSS (2, Interesting)

linguizic (806996) | more than 5 years ago | (#28286687)

THANK YOU!!!! I have found YDN enormously useful.

It's also worth noting that Yahoo has made major contributions to PHP as Rasmus is a Yahoo himself.

Re:Yahoo! and OSS (5, Interesting)

hairyfeet (841228) | more than 5 years ago | (#28287737)

And folks like to make fun of Yahoo search, but after switching from Google I just can't ever even think about going back. The more/concept tab (that is the blue button below the search box) is just too nice to give up.

Example: I just picked up "Blacksite:Area 51" for $5. I type in "blacksi" and there it is. From "Blacksite:Area 51" in the search box under more/related I have cheats, patch, system reqs, PS3, Xbox360, Midway games west, multiplayer modes, squad based shooters, release date by region, etc. Just from typing "blacksi" and picking area 51 from the drop down I have all those different avenues related to my search right there at the top where they are easy to get at. It really lets me hone in on an area, and in some cases like movies it finds me interviews with the director, when I often don't even know who directed a particular flick.

So those that haven't tried their search in a few years really ought to give it a whirl. The more/related concept tab at the top makes search so easy to drill down. Plus Yahoo has an opt out [yahoo.com] for ad matching if you are concerned about privacy. I looked and I don't think Google even has an "opt out" short of using ABP. So give it a go, it's free and you might find the more/concepts button as useful as I do. And competition is always a good thing, right?

Re:Yahoo! and OSS (1)

sznupi (719324) | more than 5 years ago | (#28288001)

From your description and from trying it out I seem to have an impression that it's not really different from Google Search features. Have you tried it semi-recently? It also has autocompletion/suggestion and related searches.

Re:Yahoo! and OSS (1)

hairyfeet (841228) | more than 5 years ago | (#28289423)

I just did [google.com] as a matter of fact, and IMHO it just isn't as good. The related searches area is at the bottom, I had to type the entire word "blacksite" before anything came up, and the more/related concepts combo tab at the top of Yahoo Search [yahoo.com] gives me 26 different results compared to 8 for related searches for Google.

For me it all comes down to speed and ease of use. With Yahoo Search it is simply easier to "springboard" into other avenues related to my search than it is with Google. When I am searching for a game/movie I want more than just the wiki and a couple of concepts. For example when I looked up "The Dark Knight" in Yahoo under the "more/related concepts" tab I found really good interviews with the cast, the director, and a nice overview of Heath Ledger's film career. While I might have found the Gary Oldman interview or the story about Heath Ledger, I wouldn't have thought to look up Christopher Nolan or Christian Bale, as I didn't know either by name.

So for me Yahoo makes it easier to get to the information and to find related information quickly. And in the end, isn't that what we all want from our searches?

Re:Yahoo! and OSS (1)

micheas (231635) | more than 5 years ago | (#28290589)

I would assume that yahoo tracks your search history so they can give you semi personalized results, just like google does.

This would result in the search engine that you frequently use normally giving you better results.

It also explains why sometimes, when you cannot find something with google or yahoo, changing to the search engine that you infrequently use gets the result more easily, as you are getting a more generic, less personalized search.

Re:Yahoo! and OSS (1)

hairyfeet (841228) | more than 5 years ago | (#28293215)

Actually Yahoo has an Opt Out [yahoo.com] which is both cookie based and login based, and since I have opted out I don't think that could have any effect. If you think that is true clear your browser's cache and then run an identical search with two tabs, one for Google and one for Yahoo. I think you'll find the more/concepts tab at the top simply gives better results than Google. It really seems to me that Yahoo has been stepping up to the plate in the last couple of years when it comes to their search engine.

But of course we see this all the time in business. When you are #1 there isn't as big a drive to push it, as where are you gonna go? And I'm sure there are plenty in Google that think "if it ain't broke don't fix it" which considering they are top of the heap would be a legitimate argument to make. But for me the more/concepts tab at the top simply makes Yahoo Search a better springboard for drilling down quickly. I can find not only the data I want, but related items I didn't even know about, like the Christopher Nolan and Christian Bale interviews. So while I appreciate both and think Gmail makes for a better spam dump than Yahoo, for search I simply think Yahoo is better than Google and much better than Bing. It gets me where I need to go and it gets me there fast. And in the end isn't that what we really want our search engines to do?

Re:Yahoo! and OSS (1)

micheas (231635) | more than 5 years ago | (#28295521)

Google tracks the search terms by IP address, ostensibly so that they can customize search results.

It would make sense for yahoo to do the same thing.

The opt out is only for ad tracking, and you need a cookie to make it stick, which implies that they do the same ip address tracking of search queries as google.

For example, assume that everyone at the San Jose Earthquakes (a professional soccer team in the USA) offices uses google. It would make sense for the search engine to learn that the term football, when coming from that IP address, even if it does not have a cookie, is probably referring to what Americans call soccer. So whenever google gets a query from that IP address it skews football to mean football as the rest of the world means it, as opposed to US football.

So if you were searching for something about American Football from the San Jose Earthquakes offices, using the less used search engine might produce better results in this case.

Yahoo's search assist is interesting, but I don't think I would consistently use it unless I used Yahoo long enough to train myself to click on the little down button.

Google's Web history is also sort of interesting, but also a pretty blatant reminder of how little privacy there is in the world of map reduce clusters with petabytes of semi personally identifiable information. (it just lets you see what google already knows about you.)

Re:Yahoo! and OSS (1)

DragonWriter (970822) | more than 5 years ago | (#28302121)

I just did as a matter of fact, and IMHO it just isn't as good. The related searches area is at the bottom

That depends on how you use Google Search. If you have the Search Options on, and have "Related searches" selected, the related searches area is at the bottom.

For example when I looked up "The Dark Knight" in Yahoo under the "more/related concepts" tab I found really good interviews with the cast, the director, and a nice overview of Heath Ledger's film career. While I might have found the Gary Oldman interview or the story about Heath Ledger, I wouldn't have thought to look up Christopher Nolan or Christian Bale, as I didn't know either by name.

Heath Ledger and Christian Bale come up as part of the related searches on Google, no matter whether or not I use quotes for "The Dark Knight". Christopher Nolan doesn't (though if you do "crew of the dark knight", he does), though he's mentioned in the short preview section on several of the search results, so one way or the other the Google Search points you in that direction. So I'm not sure why this would be a point in favor of Yahoo! over Google.

Re:Yahoo! and OSS (1, Funny)

Anonymous Coward | more than 5 years ago | (#28286709)

Well I! love Yahoo! because I! believe that all proper nouns, as well as first person pronouns, should be followed by an exclamation mark. Imagine if we! all followed suit -- there would be Google! and Microsoft! and Linux! And Slashdot! The world would be a much more exciting place!

Re:Yahoo! and OSS (0)

Anonymous Coward | more than 5 years ago | (#28302133)

Open APIs != Open Source

Why is this a big deal? (2, Interesting)

Eric Smith (4379) | more than 5 years ago | (#28287795)

Does the world need another Hadoop distribution? In a case like this, isn't a "distribution" just a fork going by a different name that has a more positive connotation? Is there some good reason they did it this way rather than just pushing their changes upstream to Apache? Did Apache not want them?

I'll admit to knowing basically nothing about Hadoop, but if I saw the same article with "Hadoop" replaced by "GCC", "Postfix", or "OpenOffice", I wouldn't see it as being a good thing.

Re:Why is this a big deal? (0)

Anonymous Coward | more than 5 years ago | (#28288689)

I'll admit to knowing basically nothing about Hadoop

Then why are you posting a reply to this story? Do you feel the need to voice your opinion on everything you're clueless about? Do you see me going onto a chemistry message board and say "I know nothing about chemistry, but regardless, I don't know why we need another way to do this 'Fischer-Tropsch' process you keep talking about" ?

Re:Why is this a big deal? (0)

Anonymous Coward | more than 5 years ago | (#28288871)

Shut up, dick.

Re:Why is this a big deal? (0)

Anonymous Coward | more than 5 years ago | (#28289287)

Are you saying it wasn't you who went to the chemistry message board and said that about the Fischer-Tropsch process? Then who was it? It's really been bugging me.

Re:Why is this a big deal? (3, Informative)

shadow42 (996367) | more than 5 years ago | (#28288869)

As far as I can tell, the distribution Yahoo is offering is just the vanilla Hadoop, but with Yahoo's patches on top of it. Yahoo is very involved in Hadoop's development (the project's founder is now employed by them), so a lot of their patches get incorporated back into Hadoop's source tree. Most of the changes Yahoo made are just performance/stability patches that haven't been incorporated into an official release yet. You could probably get the same distribution just by grabbing SVN trunk.

Re:Why is this a big deal? (2, Interesting)

linguizic (806996) | more than 5 years ago | (#28288903)

Does the world need another Linux distribution? The folks at Ubuntu thought so, and they've made an indelible mark on Linux. Just like Yahoo! is doing with Hadoop [slashdot.org] .

Hai gais (1)

papasui (567265) | more than 5 years ago | (#28288441)

Usually I only need to google new technology terms that I haven't heard before. Today I had to google vociferous. I was thinking it sounded like a condition that you need to take Levitra for. It didn't really make sense in the sentence but thinking about Yahoo suffering from erectile dysfunction has its own childish humor when you're on your 5th beer.

Doesn't make much sense for most tasks (1)

melted (227442) | more than 5 years ago | (#28289063)

I've evaluated Hadoop (and Cloudbase, HBase and a few other things) for transaction log mining purposes and found it to be VERY inefficient. Basically, if your machine has a decent RAID array (by "decent" I mean 500-700MB/sec linear read throughput, and 300-500MB/sec write throughput), you will need 12-15 8 core Hadoop boxes to even come close to a single machine's performance. This, IMO, is fucked up. I expected it would be much more efficient than it is.

Therefore, my conclusion was that Hadoop only makes sense when you can't solve a problem any other way and are prepared to pay through the nose for hundreds or thousands of machines to alleviate its performance shortcomings.

Caveat lector - my biggest Hadoop cluster consisted of 20 8-core nodes, with 32GB RAM per node and GigE interconnect.

Re:Doesn't make much sense for most tasks (1)

romcabrera (699616) | more than 5 years ago | (#28289257)

You should try it on the cloud. Amazon's EC2 crunching data from S3.
There are many case studies available. Yup, it is no silver bullet, but it has its uses.

Re:Doesn't make much sense for most tasks (1)

hesaigo999ca (786966) | more than 5 years ago | (#28291851)

So which do you prefer using performance wise then...of all the ones you tested...?
Just asking

Re:Doesn't make much sense for most tasks (1)

melted (227442) | more than 5 years ago | (#28314357)

They aren't comparable. Cloudbase is a simple SQL layer on top of Hadoop that operates on flat files. HBase is an open source BigTable (i.e. you can't really do SQL with it). Hive is kind of like Cloudbase. In the end all of these systems have a common strength/weakness - Hadoop. The strength is that they scale, if you're willing to pay, the weakness is that their scalability seems to be piss poor on smaller clusters.

In the end we went with bare Hadoop operating directly on LZO compressed log chunks. Log chunks are written to HDFS by a distributed grid of daemons. For ETL and preliminary cleanup, this works fine on 1TB of log data per day. If we had less than that, we'd import directly into Aster DB, which is what we use for DW and ad-hoc data analysis.

Re:Doesn't make much sense for most tasks (1)

hesaigo999ca (786966) | more than 5 years ago | (#28340335)

: )

Yahoo! APIs Rock like the BOSS at BuildaSearch! (-1, Offtopic)

Anonymous Coward | more than 5 years ago | (#28289331)

Being a core developer at BuildaSearch, I have messed with tons of APIs and, believe it or not, Yahoo! APIs are for the most part superior to Google APIs. Taking it a step further, Bing APIs are also better than any Google Search API. It is a sad day for Google.
