Book Review: Hadoop Beginner's Guide

samzenpus posted about a year ago | from the read-all-about-it dept.

Books 57

First time accepted submitter sagecreek writes "Hadoop is an open-source, Java-based framework for large-scale data processing. Typically, it runs on big clusters of computers working together to crunch large chunks of data. You also can run Hadoop in "single-cluster mode" on a Linux machine, Windows PC or Mac, to learn the technology or do testing and debugging. The Hadoop framework, however, is not quickly mastered. Apache's Hadoop wiki cautions: "If you do not know about classpaths, how to compile and debug Java code, step back from Hadoop and learn a bit more about Java before proceeding." But if you are reasonably comfortable with Java, the well-written Hadoop Beginner's Guide by Garry Turkington can help you start mastering this rising star in the Big Data constellation." Read below for the rest of Si's review.

Dr. Turkington is vice president of data engineering and lead architect for London-based Improve Digital. He holds a doctorate in computer science from Queen's University Belfast in Northern Ireland. His Hadoop Beginner's Guide provides an effective overview of Hadoop and hands-on guidance in how to use it locally, in distributed hardware clusters, and out in the cloud.

Packt Publishing provided a review copy of the book. I have reviewed one other Packt book previously.

Much of the first chapter is devoted to "exploring the trends that led to Hadoop's creation and its enormous success." This includes brief discussions of Big Data, cloud computing, Amazon Web Services, and the differences between "scale-up" (using increasingly larger computers as data needs grow) and "scale-out" (spreading the data processing onto more and more machines as demand expands).

Dr. Turkington writes, "One of the most confusing aspects of Hadoop to a newcomer is its various components, projects, sub-projects, and their interrelationships."

His 374-page book emphasizes three major aspects of Hadoop: (1) its common projects; (2) the Hadoop Distributed File System (HDFS); and (3) MapReduce.

He explains, "Common projects comprise a set of libraries and tools that help the Hadoop product work in the real world."

The HDFS, meanwhile, "is a filesystem unlike most you may have encountered before." As a distributed filesystem, it can spread data storage across many nodes. "[I]t stores files in blocks typically at least 64 MB in size, much larger than the 4-32 KB seen in most filesystems." The book briefly describes several features, strengths, weaknesses, and other aspects of HDFS.
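To get a feel for what that block-size difference means, here is a rough sketch in Python. It is not taken from the book: the 1 GB file size and the helper name `blocks_needed` are assumptions made purely for illustration, and real HDFS block sizes are configurable.

```python
MB = 1024 * 1024
KB = 1024


def blocks_needed(file_size_bytes, block_size_bytes):
    """Return how many fixed-size blocks are needed to hold a file (ceiling division)."""
    return -(-file_size_bytes // block_size_bytes)


file_size = 1024 * MB  # hypothetical 1 GB file, chosen only for illustration

hdfs_blocks = blocks_needed(file_size, 64 * MB)  # 64 MB blocks, as quoted above
small_blocks = blocks_needed(file_size, 4 * KB)  # 4 KB blocks, the low end of the quoted range

print(f"64 MB blocks: {hdfs_blocks}")
print(f"4 KB blocks:  {small_blocks}")
```

With far fewer blocks per file, there is far less per-block bookkeeping to track across a cluster, which is one plausible reason large blocks suit very large files.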

Finally, MapReduce is a well-known programming model for processing large data sets. Typically, MapReduce is used with clusters of computers that perform distributed computing. In the "Map" portion of the process, a single problem is split into many subtasks that are then assigned by a master computer to individual computers known as nodes (and there can be sub-nodes). During the "Reduce" part of the task, the master computer gathers up the processed data from the nodes, combines it and outputs a response to the problem that was posed to be solved. (MapReduce implementations are now available for many different programming languages and platforms; Hadoop is one of them.)
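The model is easier to see in a toy example. The following is a minimal, single-process sketch in Python of a word count, the customary first MapReduce exercise. It is not Hadoop code and does not come from the book; the function names `map_phase`, `shuffle`, and `reduce_phase` are hypothetical, and everything a real cluster would distribute across nodes happens here in one loop.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values seen for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data big clusters", "data crunching"])
print(counts)
```

The programmer writes only the map and reduce functions; in a real framework such as Hadoop, splitting the input, grouping by key, and running the work in parallel are handled by the platform.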

"The developer focuses on expressing the transformation between source and result data sets, and the Hadoop framework manages all aspects of job execution, parallelization, and coordination," Dr. Turkington notes. He calls this "possibly the most important aspect of Hadoop. The platform takes responsibility for every aspect of executing the processing across the data. After the user defines the key criteria for the job, everything else becomes the responsibility of the system."

In this 11-chapter book, the first two chapters introduce Hadoop and explain how to install and run the software.

Three chapters are devoted to learning to work with MapReduce, from beginner to advanced levels. And the author stresses: "In the book, we will be learning how to write MapReduce programs to do some serious data crunching and how to run them on both locally managed and AWS-hosted Hadoop clusters." ["AWS" is "Amazon Web Services."]

Chapter 6, titled "When Things Break," zeroes in on Hadoop's "resilience to failure and an ability to survive failures when they do happen. ... [M]uch of the architecture and design of Hadoop is predicated on executing in an environment where failures are both frequent and expected." But node failures and numerous other problems still can arise, so the reader is given an overview of potential difficulties and how to handle them.

The next chapter, "Keeping Things Running," lays out what must be done to properly maintain a Hadoop cluster and keep it tuned and ready to crunch data.

Three of the remaining chapters show how Hadoop can be used elsewhere within an organization's systems and infrastructure, by personnel who are not trained to write MapReduce programs.

Chapter 8, for example, provides "A Relational View on Data with Hive." What Hive provides is "a data warehouse that uses MapReduce to analyze data stored on HDFS," Dr. Turkington notes. "In particular, it provides a query language called HiveQL that closely resembles the common Structured Query Language (SQL) standard."

Using Hive as an interface to Hadoop "not only accelerates the time required to produce results from data analysis, it significantly broadens who can use Hadoop and MapReduce. Instead of requiring software development skills, anyone with a familiarity with SQL can use Hive," the author states.

But, as Chapter 9 makes clear, Hive is not a relational database, and it doesn't fully implement SQL. So the text and code examples in Chapter 9 illustrate (1) how to set up MySQL to work with Hadoop and (2) how to use Sqoop to transfer bulk data between Hadoop and MySQL.

Chapter 10 shows how to set up and run Flume NG. This is a distributed service that collects, aggregates, and moves large amounts of log data from applications to Hadoop's HDFS.

The book's final chapter, "Where to Go Next," helps the newcomer see what else is available beyond the Hadoop core product. "There are," Dr. Turkington emphasizes, "a plethora of related projects and tools that build upon Hadoop and provide specific functionality or alternative approaches to existing ideas." He provides a quick tour of several of the projects and tools.

A key strength of this beginner's guide is in how its contents are structured and delivered. Four important headings appear repeatedly in most chapters. The "Time for action" heading singles out step-by-step instructions for performing a particular action. The "What just happened?" heading highlights explanations of "the working of tasks or instructions that you have just completed." The "Pop quiz" heading, meanwhile, is followed by short, multiple-choice questions that help you gauge your understanding. And the "Have a go hero" heading introduces paragraphs that "set practical challenges and give you ideas for experimenting with what you have learned."

Hadoop can be downloaded free from the Apache Software Foundation's Hadoop website.

Dr. Turkington's book does a good job of describing how to get Hadoop running on Ubuntu and other Linux distributions. But while he assures that "Hadoop does run well on other systems," he notes in his text: "Windows is supported only as a development platform, and Mac OS X is not formally supported at all." He refers users to Apache's Hadoop FAQ wiki for more information. Unfortunately, few details are offered there. So web searches become the best option for finding how-to instructions for Windows and Macs.

Running Hadoop on a Windows PC typically involves installing Cygwin and OpenSSH, so you can simulate using a Linux PC. But other choices can be found via sites such as Hadoop Wizard and "Hadoop on Windows with Eclipse".

To install Hadoop on a Mac running OS X Mountain Lion, you will need to search for websites that offer how-to tips. Here is one example.

There are other ways to get access to Hadoop on a single computer, using other operating systems or virtual machines. Again, web searches are necessary. The Cloudera Enterprise Free product is one virtual-machine option to consider.

Once you get past the hurdle of installing and running Hadoop, Garry Turkington's well-written, well-structured Hadoop Beginner's Guide can start you moving down the lengthy path to becoming an expert user.

You will have the opportunity, the book's tagline states, to "[l]earn how to crunch big data to extract meaning from the data avalanche."

Si Dunn is an author, screenwriter, and technology book reviewer.

You can purchase Hadoop Beginner's Guide from amazon.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page.

Confused! (0)

war4peace (1628283) | about a year ago | (#43162025)

Wasn't Java being frowned upon for being insecure and so on? But at the same time, Hadoop (Java-based) is praised?
What's the right path in this "Don't use Java!"/"DO use Java!" debacle?

Re:Confused! (5, Insightful)

jandrese (485) | about a year ago | (#43162085)

Hadoop is not a browser plugin.

Re:Confused! (1)

war4peace (1628283) | about a year ago | (#43164449)

I never worked with Java, so honestly I didn't know. All I hear left and right is "Java sux because vulnerabilities". So I was wondering.
But judging from the smug answers below, I made the unpardonable mistake of not knowing EVERYTHING. Oh well.

Re:Confused! (0)

Anonymous Coward | about a year ago | (#43167415)

News for "nerds." You should expect many things here you may not understand.

Re:Confused! (1)

war4peace (1628283) | about a year ago | (#43169731)

...which is a good reason for asking questions. It is NOT a good reason for others to be dicks about it.

Re:Confused! (0)

Anonymous Coward | 1 year,22 days | (#43287749)

You must be new here.

Re:Confused! (1)

mesterha (110796) | about a year ago | (#43168439)

Why is this insightful? Shouldn't this comment have a low moderation value? A high moderation leads people into wasting their time looking at this post and at a parent post from someone who needed some help. Why should anyone else care?

Of course this is not the fault of the parent or the poster. Is moderation really this crappy? I guess moderators want to moderate something in this article and there just aren't any good posts.

Re:Confused! (1)

admdrew (782761) | about a year ago | (#43162139)

Wasn't Java being frowned upon for being insecure and so on?

Could you elaborate, or provide sources?

Hadoop provides an answer to very specific questions involving large amounts of data, and isn't intended to be a database or other storage mechanism.

Re:Confused! (0)

Anonymous Coward | about a year ago | (#43162541)

Indeed, it's intended for lightweight ad-hoc analysis, so their decision to standardize on a heavyweight software engineering language continues to surprise me.

Re:Confused! (1)

E IS mC(Square) (721736) | about a year ago | (#43163539)

Not really. Hadoop as a platform can be used for very heavy-duty ETL as well as analytics needs. What you are confusing it with is some of the tools within the framework (like Hive, Impala etc) that _also_ help with lightweight ad-hoc analysis.

Also, HBase specifically addresses a need of a proper database - though not of a relational type.

Re:Confused! (0)

Anonymous Coward | about a year ago | (#43162141)

You're thinking of the Java Browser Plug-in.

You can use Python and C++ for Hadoop programs (2)

mjwalshe (1680392) | about a year ago | (#43162461)

There are alternatives to Java for writing your MR programs if you have better things to do than worry about class paths etc :-) http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/ [michael-noll.com]

And back in the 80's BT used PL/1 (plus a bit of Fortran 77) to do MapReduce as the core of a billing system for the Dialcom systems - even managed to sell it to other telcos as a replacement for the Dialcom product.

Re:Confused! (2)

roman_mir (125474) | about a year ago | (#43162709)

Well, I think [slashdot.org] that great amount of confusion results from the way Java is marketed. For a website that has a larger than normal number of technical people, /. as an aggregate still displays fundamental misunderstanding of what Java is and what it is not.

For example Java is not Javascript. A browser sandbox that runs Java plugin has nothing to do with server side Java applications, that for example can run in servlet containers and servers like Tomcat Apache, Jetty, iPlanet, Resin (by Caucho), Enhydra, and such.

Also Java is a platform, but often people complain about how many different paradigms it supports, how many various libraries and servers and different ideas and packages exist around it, as if anybody forces any developer or a company to work with all or any of that stuff. Again, complaints like: J2EE is too heavy, so Java sucks, well, you don't need J2EE for your dinky application and actually you don't need it if you develop a non-dinky application but still support various tech that a serious application requires (transactions, multi-threading and multi-user support, etc.)

For people who apparently pride themselves being 'above marketing', developers somehow fell victims to various marketing around Java and didn't bother to check for themselves what the hell all this is about.

Re:Confused! (1)

QRDeNameland (873957) | about a year ago | (#43163495)

For people who apparently pride themselves being 'above marketing', developers somehow fell victims to various marketing around Java and didn't bother to check for themselves what the hell all this is about.

As a developer who never had many reasons to do very much with (actual) Java, I have to say I lost a lot of respect for the platform when Oracle started bundling the runtime with the Ask toolbar on Windows. Because nothing says "serious computing platform" like having to make sure it doesn't install a toolbar on every update. /rant

Re:Confused! (1)

roman_mir (125474) | about a year ago | (#43163547)

As a developer who never had many reasons to do very much with (actual) Java, I have to say I lost a lot of respect for the platform when Oracle started bundling the runtime with the Ask toolbar on Windows.

- I don't know, as a thinking human I have to say that what Oracle does with Java browser plugin reflects poorly on Oracle and has nothing to do with Java. As I said, the browser plugin is irrelevant for majority of what Java is actually used for.

Personally I don't even have Java browser plugin on any of my machines because I don't have a use for it. At the same time I have developed plenty Java applications (as in Java language running on a JVM that resides on a server and has some form of application container on top of the JVM where the application is actually running).

So you are exactly what I am talking about - a victim of a marketing campaign, whichever form it takes.

Re:Confused! (1)

QRDeNameland (873957) | about a year ago | (#43173983)

As an FYI, it's not the browser plugin that has the opt-out toolbar installation, it's the JVM itself. Do I think it's truly a sign of Java's strength/weakness as a platform? No, but as you say, it's horrible marketing.

Re:Confused! (1)

roman_mir (125474) | about a year ago | (#43175753)

You mean the JRE that is part of the plugin installation? It cannot be the JRE itself, it must be the installer of the plugin. I don't install JVM this way or Java plugin (at all), I just download the necessary installation package for the JVM/SDK as needed. If I want Oracle SDK I get it from here [oracle.com] and when it installs, it does not install any 'ask bar' or anything like that, so I am not even clear as to how people get these things. I download the SDK or JVM for a Linux distro, which is either a binary that will decompress or just a tarball.

What you are saying is that when you are prompted for a browser plugin download, as part of the plugin you get the JRE (which is the Runtime Environment) and you also get some Ask bar or whatever.

But this scheme by Oracle then results in people getting angry with Java of all things and AFAIC this is similar to being angry at (for example) the C / C++ language and the x86 architecture because Microsoft ActiveX platform is sometimes used as a virus vector.

That's why I say there is so much confusion around Java. Java IS a language AND a platform, it's like C and x86 hardware architecture that can run a binary compiled from C. If somebody sells a computer full of Ask bars etc., the average user will not be inclined to believe that the C language and the CPU architecture are horrible (they may or may not be, but that would not be on the mind of a computer user).

But the marketing of Java is so poor (for such a useful platform, AFAIC), that it creates this type of ridiculous notion around it.

Re:Confused! (1)

QRDeNameland (873957) | about a year ago | (#43176389)

No, I mean the JRE itself. (Remember, I'm talking about Windows here.) Granted, this is not the SDK I'm talking about, just the standard JRE installed in Windows, not the browser, but I just updated it on this machine, and I had to uncheck [tenthcave.com] the option to install the Ask toolbar. (link is not my blog, btw.)

I'm fully aware of the distinctions between Java, the browser plugins, Javascript, etc., and I realize that the toolbar is not part of the install on 'real' OSs. But the Windows installer is what most users see of Java.

Re:Confused! (1)

roman_mir (125474) | about a year ago | (#43176745)

Well, I went looking at this and it's not the JRE that does this, it's the Oracle installer for Java. JRE is not the installer, it's the run time environment.

Here is one of these stories about this issue [java.net].

But yes, on Linux or Unix I don't get any of this nonsense, it's just a tarball that I untar into a directory, set the path and run the JVM for example. The binary installers for Linux or Unix are simple shell scripts with the same tarball basically as part of the script, it doesn't do any of this stuff.

As I said, this is a terrible marketing move, does huge amount of disservice to the entire concept of Java by bundling things with installers that clearly shouldn't be there. It's not enough that people are confused about what Java is or is not (because of such a huge number of things that people just call 'Java'), but adding this nonsense to installers is just evil. Personally I have various issues with Oracle, would have really preferred if IBM bought Sun's assets rather than Oracle.

Re:Confused! (1)

QRDeNameland (873957) | about a year ago | (#43177007)

All due respect...but even as a developer, to say it's not the JRE but the JRE's installer that bundles the toolbar, is pretty pedantic. I really just wanted to point out this comes with standard install of Java itself, not the browser plugin.

Re:Confused! (1)

TimHunter (174406) | about a year ago | (#43164261)

I can guarantee you that developers who use Java for serious development work never worry about the Ask toolbar.

Re:Confused! (1)

QRDeNameland (873957) | about a year ago | (#43174061)

Agreed, but the OP's point was about Java suffering from bad marketing, and IMHO, the opt-out toolbar is a prime example.

Re:Confused! (0)

Anonymous Coward | about a year ago | (#43174291)

Why? Do none of them have java installed on a windows machine? My development work is pretty serious (not realtime or anything, but at least 25% of you care if I get it right), and I have to uncheck that toolbar install every time java updates. So what the hell are you talking about?

Small numbers for Big Data? (2, Interesting)

istartedi (132515) | about a year ago | (#43162037)

How many people on the planet actually manage "Big Data"? Isn't that the kind of thing that happens as a happy accident when your humble web site becomes the next FaceBubbleSpace? You can't plan for that.

Sure, there are other places where it happens--large corporations, governments, maybe some academic studies.

Really though, I have a hard time imagining that there are really a lot of people who deal with BD. Does anybody have numbers on it? What's the definition, anyway? Is Slashdot's archive BD?

Re:Small numbers for Big Data? (1)

jandrese (485) | about a year ago | (#43162107)

Anybody who is talking about data mining is already thinking about Big Data problems. These are everywhere, from correlating shopping habits based on receipts and customer loyalty cards to looking for terrorists by their travel patterns.

Re:Small numbers for Big Data? (0)

Anonymous Coward | about a year ago | (#43162165)

The wiki article on big data does a good job answering your questions.

Re:Small numbers for Big Data? (1)

istartedi (132515) | about a year ago | (#43164127)

My takeaway from the article is that the definition of BD is a moving target as the capability of hardware grows.

If "traditional" approaches to data fail to scale, why not just start with BD methods in the first place? In that case, BD is a meaningless term as it simply becomes "a better way of handling data". OTOH, if there's a high technical hurdle between "traditional" and BD methods, then you have an incentive to stay traditional until you're confronted with the problem. Therein lies the crux of my concern, namely, "how much demand is there in the market for these skills?".

If BD methods are simply going to replace "traditional" and the learning curve isn't too steep then the answer is obvious: you should study BD methods if you want to have anything to do with data.

OTOH, if the learning curve is steep and only a handful of organizations have these problems, then the answer is: "Only a few thousand people on the planet need these skills. Wait until you're in a situation where you need to ramp up on it".

If you don't need it now, maybe the answer is to learn just enough so that you won't be lost in the future--kind of like learning a few phrases of a foreign language. OTOH, it could be like when I was graduating high school and these guys asked me if I wanted to take a PL/1 course with them. I reasoned (correctly) that I would never have a need for it.

Now at this point I can hear the "you prefer ignorance?" rebuttals. I've dealt with this before. We all have to triage technologies. Selective ignorance isn't a fault--it's a skill.

Re:Small numbers for Big Data? (1)

MillerHighLife21 (876240) | about a year ago | (#43162177)

There's an entire field dedicated to Data Warehousing whose entire focus is Big Data. Large companies with auditing requirements have to keep mountains of historical data. Business Intelligence is largely based on analyzing huge segments of data.

As storage gets cheaper and options for going through large amounts of data become more widely available, companies invariably store more data. The biggest difference is that while you previously would have simply chosen not to track certain types of data in your database...now you might.

Re:Small numbers for Big Data? (3, Insightful)

Sarten-X (1102295) | about a year ago | (#43162223)

Big Data is however big you need it to be. It's not a certain size, or speed, or software, but rather a philosophy.

Simply put, Big Data methodologies are to gather all the data that can be gathered, and store it on a nice cheap database, without concern for storage efficiency. When a question arises, analyse the relevant data for an answer. This is in contrast to more traditional methods, where data is gathered only to solve expected questions.

Slashdot's archives were not generated from a Big Data approach. They store only comments and a few sparse details (to my knowledge). However, they can still be used in a Big Data system to some effect, if they happen to store the information that's needed (such as IP address, timestamps, and keywords, if the question is to track political opinions by geography over time).

It's not what you store or how you store it, but how you decide what to store. What makes Big Data approaches useful is that they store everything from the beginning, so as business needs change, the data from the past is likely just as useful as new data. What makes Big Data difficult is that the databases must be properly capable of storing all the gathered data as fast as it arrives, and must do so cheaply.

Re:Small numbers for Big Data? (0)

Anonymous Coward | about a year ago | (#43162225)

At my work we use it but joke that we're 'sort of large data' and even that's a stretch. We put about 100GB a day into an 8-node cluster, pulled from SQL server, and use the cluster to do data transform operations on data sets that our SQL server can't handle gracefully while also acting as a production OLTP server (usually aggregating a months worth of data or something along those lines). Typically we'll put the data back into another SQL data warehouse once crunched. I've seen and worked on SQL systems that can handle the data sets we work with, but those were LARGE servers that cost a lot and required very deep DBA knowledge to keep tuned right. Hadoop lets us just throw the data into the cluster and decide later how we want to crunch it.

Re:Small numbers for Big Data? (0)

Anonymous Coward | about a year ago | (#43162257)

BD: banking transaction data, digital object metadata, real time data such as inflight data, satellite data, traffic data, billing systems

Do I need to go on?

Re:Small numbers for Big Data? (1)

admdrew (782761) | about a year ago | (#43162277)

There's actually more of a need for it than you'd think at first glance. Any business that handles upwards of a few million records of some sort of data, and then needs to *do* something with that data, could potentially benefit from this.

I've worked at small (10,000) businesses (all tech-related), and only the smallest places didn't have the amount of data to warrant taking a look at something like this.

Also, what about those interested in *someday* working for the Googles and the Facebooks of the world? All of the developers I've met have had some sort of professional or personal interest in map/reduce problems, and Hadoop provides a (relatively) easy/accessible and free way to get hands on with an actual distributed computing platform.

Re:Small numbers for Big Data? (1)

admdrew (782761) | about a year ago | (#43162313)

Dang, didn't look closely at the preview, some formatting killed off part of my message. The 2nd paragraph should be:

I've worked at small (*less than* 100 employees), medium (*less than* 1000), and large (*greater than* 10,000) businesses (all tech-related), and only the smallest places didn't have the amount of data to warrant taking a look at something like this.

Re:Small numbers for Big Data? (1)

E IS mC(Square) (721736) | about a year ago | (#43163937)

Not everything big data is limited to the latest twitter or facebook.

Just to give a one example of many possible, think of any processing that involves millions of records daily, and you need to process them, aggregate, analyze, dice and slice across various attributes on daily/monthly/quarterly/yearly frequency. This could be a financial firm, a retail chain, or anything where a lot of transactions happen daily. And you might be surprised that it's not just large corporations or governments.

Also, when there are permutations and combinations, input size increases dramatically.

But the biggest benefit it offers is being able to do all this at very low cost - and that's the key factor in its rise and popularity. Sure, big corporations were able to do a lot of such heavy tasks because they could buy hardware worth millions of $$. But big data (Hadoop) suddenly equips small players with the same tools as big players, and that is game-changing.

I am not defending big data just because it's cool, but because for a small company like ours, we can now think of doing things that were beyond our reach because of the heavy cost of adding more storage and more processing power. It's very easy to understand the importance if you are in the data warehouse / data analytics industry. On the other hand, if you are just creating a few web apps, you may find it hard to understand the big deal.

Re:Small numbers for Big Data? (1)

istartedi (132515) | about a year ago | (#43164333)

I'm not sure why you got modded down, because your answer is somewhat thought provoking and not trollish at all.

You reminded me of a situation I saw involving monitoring networks. The solution to the data overload there was a "roll your own" database and AFAIK it could not be queried with a full set of SQL commands; but it was faster and able to handle our data better than off-the-shelf Open Source solutions. This was years before I heard the term "big data". We may have been moving in that direction without knowing it, and now that I think about it I can see how what is now called BD might be more common than I thought...

Re:Small numbers for Big Data? (0)

Anonymous Coward | about a year ago | (#43164573)

How many people on the planet actually manage "Big Data"? Isn't that the kind of thing that happens as a happy accident when your humble web site becomes the next FaceBubbleSpace? You can't plan for that.

Sure, there are other places where it happens--large corporations, governments, maybe some academic studies.

Really though, I have a hard time imagining that there are really a lot of people who deal with BD. Does anybody have numbers on it? What's the definition, anyway? Is Slashdot's archive BD?

Typically, when you have large amounts of shit data, and someone wants to get some bullshit results, you call it "big data", write some terrible code, and hope no one checks your work too closely.

You'll do things like "entity resolution", which means having someone who has no idea what they're looking at glom a bunch of crap data together based on some vaguely plausible rules. Or there's "unstructured text", which means you have however many million files, you might sit down with an analyst and work out some rules and grammars, and then people will never see those rules again and just take whatever you spit out as gospel.

Hadoop is an attempt to take the existing state of the art, huge awk scripts are common, that would fail relentlessly, and at least make this garbage run to completion. But what really makes it Big Data is if you can cook up some bullshit results because no one can possibly go back and check your work to prove that you're full of shit.

Re:Small numbers for Big Data? (1)

fatphil (181876) | about a year ago | (#43165581)

> Is Slashdot's archive BD?

43 million comments. Let's say a 1KB ballpark average size. Auxiliary data probably negligible compared to that, so let's double it and round up.


That is a _puny_ database by "Big Data" standards. Every table apart from the comments themselves could be cached in RAM on a modern server, and the majority of comments would never need to be fetched off disk - a single SSD at that - so almost everything important could be cached.

Of course, you'd never want an architecture like that, as you need large number of concurrent clients, but you asked about the data, and the data isn't big.

Yay Packt! (1)

turkeyfeathers (843622) | about a year ago | (#43162169)

Remember: when you need a book reviewed on Slashdot, make sure you publish it on Packt.

Re:Yay Packt! (0)

Anonymous Coward | about a year ago | (#43168911)

Where is your book review? Right. Don't be a dick.

LMAO... HFDS? (1)

Anonymous Coward | about a year ago | (#43162249)

It's HDFS ... apparently the book didn't teach the reviewer much. "Hadoop Distributed File System."

Re:LMAO... HFDS? (0)

Anonymous Coward | about a year ago | (#43165685)

> It's HDFS ... apparently the book didn't teach the reviewer much. "Hadoop Distributed File System."

Hey, at least they were consistent? They even made sure to come up with a name to match their acronym: Hadoop File Distribution System.

Re:LMAO... HFDS? (1)

sagecreek (2860541) | about a year ago | (#43166247)

Yup, that was my bad. It IS Hadoop Distributed File System (HDFS) and NOT Hadoop File Distribution System (HFDS). I had it right in front of me and still typed it wrong from some of my notes. I'll see if I can get it fixed. Thanks.

Re:LMAO... HFDS? (0)

Anonymous Coward | about a year ago | (#43167145)

Don't eat that stuff (HFDS), it makes you obese.

Re:LMAO... HFDS? (1)

sagecreek (2860541) | about a year ago | (#43166549)

It helps to have friends in high places. The HDFS correction has been made. So all who LMAO'ed because of my typo are now free to LYABO. Thanks for pointing out the mistake.

If it's a tool... (1)

Anonymous Coward | about a year ago | (#43162357)

If it's just a bleeping tool for data processing, why should the user need to be a Java developer to use it?

Re:If it's a tool... (1)

Anonymous Coward | about a year ago | (#43162563)

Excellent point...but that is something to ask the Hadoop developers, not the author. You can use Hadoop without writing Java. You cannot use Hadoop without knowing in depth about the complete mess that is Java build, linking, etc.

Re:If it's a tool... (1)

stewsters (1406737) | about a year ago | (#43163481)

I think the real problem with the Java build is that people like to use older technology. It can be as easy as:

1) $> gradle run

2) $> ... there is no step 2..

3) $> ... take that configure;make;./a.out

Link to Chapter 1 of this book (0)

Anonymous Coward | about a year ago | (#43163407)

This book is in Safari Books Online, my place of employment.

Here's a link which contains a lot of chapter 1 so you can get a sense of the writing style:


This link also includes a brief discussion of Big Data with some examples of what they mean by the term.

Stopped reading here: "Java-based" (0)

Anonymous Coward | about a year ago | (#43163733)

"Hadoop is an open-source, Java-based [...]"
Stopped reading here.

Java, .net, mono they are all pieces of shit.

Hadoop is not (0)

Anonymous Coward | about a year ago | (#43165291)

After reading the wiki page and finding I have the necessary skills to begin, why do I still feel intimidated?

"Do not taunt Hadoop. Hadoop is not your grandmother's DB. Hadoop ain't nothing to F*&K with. Hadoop is not for babies."

Review or outline? (0)

Anonymous Coward | about a year ago | (#43169487)

This review reads like an expanded outline of the book, complete with tag lines and so on. Why would I buy this book instead of the O'Reilly book I already have?

I like the line about "various components, projects, sub-projects, and their interrelationships" - pretty much sums up open source!
