
Intel Launches Its Own Apache Hadoop Distribution

Soulskill posted about a year ago | from the if-you-want-something-done-right-do-it-yourself dept.

Databases

Nerval's Lobster writes "The Apache Hadoop open-source framework specializes in running data applications on large hardware clusters, making it a particular favorite among firms such as Facebook and IBM with a lot of backend infrastructure (and a whole ton of data) to manage. So it'd be hard to blame Intel for jumping into this particular arena. The chipmaker has produced its own distribution for Apache Hadoop, apparently built 'from the silicon up' to efficiently access and crunch massive datasets. The distribution takes advantage of Intel's work in hardware, backed by the Intel Advanced Encryption Standard New Instructions (Intel AES-NI) in the Intel Xeon processor. Intel also claims that a specialized Hadoop distribution riding on its hardware can analyze data at superior speeds: one terabyte of data can be processed in seven minutes, versus hours for some other systems. The company faces a lot of competition in an arena crowded with other Hadoop players, but that won't stop it from trying to throw its muscle around."
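
The AES-NI claim is easy to sanity-check on the JVM that Hadoop runs on, since modern HotSpot JVMs compile the javax.crypto AES path down to AES-NI instructions when the CPU supports them. Below is a hypothetical micro-benchmark of that effect in Scala, not anything from Intel's distribution; the object name, buffer size, and iteration count are made up for illustration.

<ecode>
import javax.crypto.Cipher
import javax.crypto.spec.{IvParameterSpec, SecretKeySpec}

object AesNiSanityCheck {
  def main(args: Array[String]): Unit = {
    // AES-128 in CTR mode; an all-zero key/IV is fine for a throughput test
    val cipher = Cipher.getInstance("AES/CTR/NoPadding")
    cipher.init(Cipher.ENCRYPT_MODE,
      new SecretKeySpec(new Array[Byte](16), "AES"),
      new IvParameterSpec(new Array[Byte](16)))

    val buf = new Array[Byte](1 << 20)      // 1 MiB buffer
    val t0  = System.nanoTime()
    for (_ <- 1 to 1024) cipher.update(buf) // encrypt 1 GiB in total
    val secs = (System.nanoTime() - t0) / 1e9
    println(f"AES-CTR throughput: ${1024 / secs}%.0f MiB/s")
  }
}
</ecode>

On a Xeon with AES-NI, this typically runs several times faster than on the same JVM with its AES intrinsics switched off (-XX:-UseAES -XX:-UseAESIntrinsics), which is the kind of gap Intel is counting on for encrypted HDFS workloads.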


Big Data (0)

Anonymous Coward | about a year ago | (#43018763)

It's Big

Big Data != big (1)

oneiros27 (46144) | about a year ago | (#43019305)

Not always. It's become such a buzzword that it now gets used any time the amount or complexity of data becomes a limit on what you're trying to do.

So in the case of NRT (near real time), it might be a relatively small amount of data. Or there might be enough different formats of data, or other complexity, that it's a problem.

And it's also discipline-specific ... I've heard of groups complaining about 50GB being a lot of data ... because they're dealing with tens of thousands of Excel spreadsheets. For those in astronomy, 50GB is nothing; you have to get up into the multi-TB range before you have to worry ... and that's still small for some disciplines that deal in PB of data.

Speed (2)

stewsters (1406737) | about a year ago | (#43018843)

How does that compare to something like Spark [spark-project.org]?

Re:Speed (5, Informative)

Anonymous Coward | about a year ago | (#43019227)

The performance claim in the summary seems to come from page 15 of this presentation [intel.com], where the speedup for a 1TB sort (presumably distributed) is 4 hours -> 7 minutes. I can't find the details for that test, but most of the speedup comes from using better hardware - faster CPU and network adapter, and SSDs instead of HDDs - while they get a 40% speedup from using their Hadoop distribution over some other Hadoop distribution, which is a fairly modest gain.

The biggest performance benefit of Spark comes from avoiding disk and network access, so improving those bottlenecks will presumably reduce Spark's lead over Hadoop somewhat. But it's hard to say how well Spark would do with this particular hardware and test setup. I would guess it's still much faster than their Hadoop distribution. (Note: I'm a Spark power user but not an expert in its performance.)
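
For the curious, the in-memory advantage described above is visible in a few lines of Spark (Scala). This is only a sketch of the caching pattern; the master string, HDFS path, and filter terms are placeholders, not from any benchmark.

<ecode>
import org.apache.spark.SparkContext

val sc = new SparkContext("local[4]", "cache-demo")
// cache() pins the RDD in memory once it has been computed
val logs = sc.textFile("hdfs:///data/events.log").cache()

logs.filter(_.contains("ERROR")).count() // first action reads from disk
logs.filter(_.contains("WARN")).count()  // later actions are served from RAM
</ecode>

Faster SSDs and 10G Ethernet narrow the gap on that first pass, but every pass after it touches neither disk nor network, which is why hardware upgrades tend to help Hadoop more than they help Spark.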

Re:Speed (1)

Anonymous Coward | about a year ago | (#43020033)

Yeah, the details in that presentation describe something far less impressive than the top-line "4 hours -> 7 minutes" claim. You are absolutely correct that only a very modest amount of the ~35x speedup claimed is attributable to the Intel Hadoop distribution itself, with the bulk of the speedup coming from significant hardware upgrades across the cluster. Spark wouldn't benefit from the hardware changes in exactly the same way, but it would still see significant gains from upgrading the cluster hardware. At the same time, Spark can achieve the same 35x speedup or better across a wide range of jobs even while making no upgrades to cluster hardware (and the most obvious and cost-effective way to upgrade hardware in a Spark cluster is the comparatively simple option of increasing the available RAM in each node).

I have no expectation that Intel Hadoop will generate raw performance anywhere close to that of Spark.

Re:Speed (1)

Anonymous Coward | about a year ago | (#43022467)

Approximated results from the presentation:
- Hadoop 1.0.3, old Xeon, HDD, 1G Ethernet -> 240 minutes
- Hadoop 1.0.3, new Xeon, HDD, 1G Ethernet -> 120 minutes
- Hadoop 1.0.3, new Xeon, SSD, 1G Ethernet -> 24 minutes
- Hadoop 1.0.3, new Xeon, SSD, 10G Ethernet -> 12 minutes
- Hadoop 2.1.1, new Xeon, SSD, 10G Ethernet -> 7 minutes

The only useful conclusion is that changing the Hadoop version from 1.0.3 to 2.1.1 can give you roughly a 40% reduction in duration. I wonder how it would work for other hardware configurations.
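
Breaking those numbers down step by step (the attribution of each jump to a single change follows the list above; the code is just the arithmetic):

<ecode>
// Per-step speedups implied by the numbers above
val runs = Seq(
  ("baseline: old Xeon, HDD, 1GbE, Hadoop 1.0.3", 240.0),
  ("new Xeon",                                    120.0),
  ("SSDs",                                         24.0),
  ("10G Ethernet",                                 12.0),
  ("Hadoop 2.1.1",                                  7.0))

runs.sliding(2).foreach { case Seq((_, before), (change, after)) =>
  println(f"$change%-14s ${before / after}%.1fx")
}
println(f"cumulative: ${runs.head._2 / runs.last._2}%.1fx") // ~34x
</ecode>

So hardware accounts for roughly 20x of the total (240 -> 12 minutes), while the Hadoop 1.0.3 -> 2.1.1 swap contributes about 1.7x (12 -> 7 minutes, i.e. the ~40% reduction).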

Re:Speed (3, Informative)

Anonymous Coward | about a year ago | (#43019265)

It's impossible to say without the details of apples-to-apples comparisons, but superficially, none of the announcements of "improved Hadoop" from Intel, Greenplum, Hortonworks, etc. is all that impressive in comparison to Spark, even if you assume that none of their improvements can or will be integrated into Spark. Take, for example, a couple of the claims that Intel is making for its new Hadoop distribution. First, the "four hour job reduced to seven minutes" claim is in the same ballpark as the 30-40x claims made for some of the other "improved Hadoop" offerings. For each of these, I'd be surprised if a 30-40x speedup could be expected in the general case, rather than just for some quite specific use cases. In contrast, Spark achieves 30-40x speedups across a wide range of jobs, and often does significantly better. Second, Intel claims an 8.5x speedup for Hive queries. That is much less than the speedups routinely achieved with Shark (Hive on Spark), and the best-case scenarios for Shark speedups are more than a full order of magnitude greater than Intel's claim.

In short, the "improved Hadoop" distributions do make significant advances over current Hadoop, but they don't really do anything to change my mind that in-memory data clusters are the way forward, away from many of the limitations of Hadoop/MapReduce, or that Spark is the leading implementation of such an in-memory cluster computing framework. At this point, the main advantages the various Hadoops have over Spark are the maturity of the technology and the coverage and usefulness of the management and integration layers on top of the basic cluster computing framework. As long as it remains disk-oriented and doesn't retain the working dataset in memory, I wouldn't expect Hadoop to close the raw performance gap with Spark.

neat, but (2)

masternerdguy (2468142) | about a year ago | (#43018961)

So they've migrated an open solution to a vendor-locked-in one? Sweet.

Re:neat, but (1)

wlj (204164) | about a year ago | (#43018997)

The (stated) speed-up could be nice, but:

(1) how locked-in is it (just some tuning, serious modification, what?)
(2) have they actually released it?

Re:neat, but (2)

networkBoy (774728) | about a year ago | (#43019189)

Even if it's completely locked in, your data isn't.

Simple really: if you have Intel hardware, use this distro to take advantage of it; otherwise, use the Apache one. No reason AMD or nVidia can't do the same...
-nb

Re:neat, but (0)

Anonymous Coward | about a year ago | (#43023043)

Well, they considered putting the sort into the kernel, but they didn't want to suffer a blistering attack by the maniac "in charge". : )
Besides, a smart guy like you will have it ported over to BSD in a week. Right?

Because AES is the true bottleneck in hadoop (3, Insightful)

citizenr (871508) | about a year ago | (#43019065)

...

Re:Because AES is the true bottleneck in hadoop (1)

Anonymous Coward | about a year ago | (#43019453)

AES-NI is not just AES processing; it arrived alongside PCLMULQDQ (carry-less multiplication), which speeds up other workloads such as CRCs and GCM authentication.

Re:Because AES is the true bottleneck in hadoop (0)

Anonymous Coward | about a year ago | (#43020057)

Still doesn't rescue Intel Hadoop from Amdahl's Law.
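
For anyone who wants the arithmetic: Amdahl's Law caps the overall gain from accelerating one component of a job. The 10% and 10x figures below are assumptions for illustration, not measurements of any Hadoop workload.

<ecode>
// Overall speedup when a fraction p of the job is accelerated by a factor s
def amdahl(p: Double, s: Double): Double = 1.0 / ((1 - p) + p / s)

// e.g. if crypto is 10% of job time and AES-NI makes it 10x faster:
println(f"${amdahl(0.10, 10.0)}%.2fx") // ~1.10x overall
</ecode>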

Re:Because AES is the true bottleneck in hadoop (0)

Anonymous Coward | about a year ago | (#43023131)

It doesn't solve FTL travel either. But at least I know what Amdahl's law is now.
It also makes me wonder why engineers who are trying to actually do something are constantly arguing over laws with people who don't.
One could argue that engineers should pay attention in physics class, or conversely, theorists should get their hands dirty once in a while.

The real dilemma for the Slashdot crowd is, when it comes time for your surgically implanted chip, are you going to choose Intel or AMD?

Re:Because AES is the true bottleneck in hadoop (1)

dkf (304284) | about a year ago | (#43032275)

One could argue that engineers should pay attention in physics class, or conversely, theorists should get their hands dirty once in a while.

Could we have both? A bit of realism on both sides would be nice...

Lock-down as a defense? (0)

Anonymous Coward | about a year ago | (#43020015)

Abstracted architecture must really irritate them, since the equivalent baseline functionality of AMD and ARM processors isn't even noticeable to most end-users. It looks like high-end corporate customers are all they have left.

After reading the articles about ISPs clamping down on and monitoring customer Internet traffic, the move to the "cloud" for many, and large hardware companies ditching standards to create proprietary systems, it feels like the mainframe days are on their way back. What's next, back to being charged for online minutes? A minute-based tracking system would open the door for the government to step in and add an Internet tax to each minute spent online. They'll probably use the FCC as a proxy so that it doesn't look so much like a tax.
