
Pentaho 3.2 Data Integration

samzenpus posted more than 3 years ago | from the read-all-about-it dept.

Book Reviews 103

diddy81 writes "A book about the open source ETL (Extract, Transform, Load) tool Kettle (Pentaho Data Integration, or PDI) is finally available. Pentaho 3.2 Data Integration: Beginner's Guide by María Carina Roldán is for everybody who is new to Kettle. In a nutshell, this book will give you all the information you need to get started with Kettle quickly and efficiently, even if you have never used it before. The book offers loads of illustrations and easy-to-follow examples. The code can be downloaded from the publisher's website, and Kettle itself is available for free from SourceForge. In sum, the book is the best way to get to know the power of Kettle, which is part of the Pentaho BI (Business Intelligence) suite. Read on for the rest of diddy81's review.

The first chapter describes the purpose of PDI, its components and the UI, walks you through installing it, and has you build a very simple transformation. The last part tells you step by step how to install MySQL on Windows and Ubuntu.

It's just what you want to know when you touch PDI for the first time. The instructions are easy to follow and understand and should help you get started in no time. I honestly quite like the structure of the book: whenever you learn something new, it is followed by a section that recaps everything, which helps you remember it much more easily.

Maria focuses on using PDI with files instead of the repository, but she describes how to work with the repository in the appendix of the book.

In Chapter 2 you learn how to read data from a text file and how to handle header and footer lines. Next up is a description of the "Select values ..." step, which allows you to apply special formatting to the input fields and select the fields you want to keep or remove. You will create a transformation that reads multiple text files at once by using regular expressions in the text input step. This is followed by a troubleshooting section that describes all kinds of problems that might come up during setup and how to solve them. The last step of the sample transformation is the text file output step.
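The header/footer handling described above boils down to dropping a fixed number of lines at each end of a file before parsing the rest. A minimal Python sketch of that idea (the sample data, delimiter and field layout are invented for illustration; this is not Kettle itself):

```python
# Hypothetical file contents: one header line, two data rows, one footer line.
lines = [
    "name;amount",   # header
    "alice;10",
    "bob;25",
    "TOTAL;35",      # footer
]

# The text-file-input step asks for exactly these two counts.
header_lines, footer_lines = 1, 1

# Keep only the data rows between header and footer, then split on the delimiter.
data = lines[header_lines:len(lines) - footer_lines]
rows = [line.split(";") for line in data]
print(rows)  # [['alice', '10'], ['bob', '25']]
```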

Then you improve this transformation by adding the "Get system info" step, which allows you to pass parameters to the transformation on execution. This is followed by a detailed description of the data types (I wish I had had all this formatting info so easily at hand when I started). And then it gets even more exciting: Maria talks you through the setup of a batch process (scheduling a Kettle transformation).

The last part of this chapter describes how to read XML files with the XML file input step. A short description of XPath should help you get going with this particular step easily.
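To give a feel for what an XPath expression does in this step, here is a rough Python sketch using the standard library's limited XPath support. The document structure and field names are made up for illustration; Kettle's XML input step works the same way conceptually, asking for an XPath that selects one node per output row:

```python
import xml.etree.ElementTree as ET

# Hypothetical order file (structure invented for this example).
doc = ET.fromstring(
    "<orders>"
    "<order id='1'><customer>alice</customer><total>10.5</total></order>"
    "<order id='2'><customer>bob</customer><total>7.25</total></order>"
    "</orders>"
)

# "./order" plays the role of the row-selecting XPath: one match = one row.
rows = [
    {
        "id": o.get("id"),
        "customer": o.findtext("customer"),
        "total": float(o.findtext("total")),
    }
    for o in doc.findall("./order")
]
print(rows)
```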

Chapter 3 walks you through the basic data manipulation steps. You set up a transformation that makes use of the calculator step (loads of fancy calculation examples here). For more complicated formulas Maria also introduces the formula step. Next in line are the Sort rows and Group by steps, used to create some summaries. In the next transformation you import a text file and use the "Split field to rows" step. You then apply the filter step to the output to get a subset of the data. Maria demonstrates various examples of how to use the filter step effectively. At the end of the chapter you learn how to look up data using the "Stream Lookup" step. Maria describes very well how this step works (even visualizing the concept), so it should be really easy for everybody to understand.
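The "Stream Lookup" step essentially reads the lookup stream into an in-memory index once, then enriches each main-stream row from it, with a default value for misses. A hedged Python sketch of that concept (all data and field names are invented):

```python
# Main stream: sales rows carrying only a product code.
sales = [
    {"code": "A", "qty": 2},
    {"code": "B", "qty": 1},
    {"code": "X", "qty": 5},  # no match in the lookup data
]

# Lookup stream: reference data.
products = [
    {"code": "A", "name": "apple"},
    {"code": "B", "name": "banana"},
]

# Built once up front, like the step's in-memory cache of the lookup stream.
index = {p["code"]: p["name"] for p in products}

# Each main-stream row gets the looked-up field, or a configured default.
for row in sales:
    row["name"] = index.get(row["code"], "unknown")
print(sales)
```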

Chapter 4 is all about controlling the flow of data: you learn how to split the data stream by distributing or copying the data to two or more steps (this is based on a good example: you start with a task list that contains records for various people, then distribute the tasks to a separate output file for each of these people). Maria explains clearly how "distribute" and "copy" work; the concept is very easy to understand by following her examples. In another example Maria demonstrates how you can use the filter step to send data to different steps based on a condition. In some cases the filter step will not be enough, so Maria also introduces the "Switch/Case" step, which you can use to create more complex conditions for your data flow. Finally, Maria tells you all about merging streams and which approach/step is best to use in which scenario.
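The filter and "Switch/Case" routing described above amounts to sending each row to one of several target streams based on a condition, with a default target for rows that match nothing. A small illustrative Python sketch (field names, values and thresholds are all invented):

```python
# Three target "streams", as if three downstream steps were connected.
streams = {"high": [], "low": [], "other": []}

rows = [
    {"task": "pay invoice", "priority": 9},
    {"task": "water plants", "priority": 2},
    {"task": "???", "priority": None},
]

for row in rows:
    p = row["priority"]
    if p is None:
        streams["other"].append(row)  # the switch step's "default" target
    elif p >= 5:
        streams["high"].append(row)
    else:
        streams["low"].append(row)

print({k: len(v) for k, v in streams.items()})
```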

In Chapter 5 it gets really interesting: Maria walks you through the JavaScript step. In the first example you use the JavaScript step for complex calculations. Maria provides an overview of the available functions (String, Numeric, Date, Logic and Special functions) that you can use to quickly create your scripts by dragging and dropping them onto the canvas. In the following example you use the JavaScript step to modify existing data and add new fields. You also learn how to test your code from within this step. Next up (and very interesting), Maria tells you how to create special start and end scripts, which are executed only once, as opposed to the normal script, which is executed for every input row. We then learn how to use the transformation constants (SKIP_TRANSFORMATION, CONTINUE_TRANSFORMATION, etc.) to control what happens to the rows (very impressive!). In the last example of the chapter you use the JavaScript step to transform an unstructured text file. This chapter offers quite a lot of in-depth information, and I have to say that there were actually some things here that I didn't know.
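The JavaScript step itself runs real JavaScript, but the per-row execution model Maria describes (one script invocation per incoming row, producing new fields plus a keep-or-skip decision) can be sketched in Python for illustration. Every name here is invented; the keep flag only loosely mirrors what the SKIP_TRANSFORMATION-style constants control:

```python
def per_row_script(row):
    """Roughly what a per-row script does: read incoming fields,
    compute new ones, and decide the row's fate."""
    full = f'{row["first"]} {row["last"]}'
    keep = row.get("active", True)  # analogous to skipping a row via a constant
    return ({**row, "full_name": full}, keep)

rows = [
    {"first": "maria", "last": "roldan", "active": True},
    {"first": "john", "last": "doe", "active": False},
]

# The engine calls the script once per row; skipped rows never leave the step.
out = [new_row for new_row, keep in map(per_row_script, rows) if keep]
print(out)
```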

In the real world you will not always get the data in the structure you need for processing. Hence, chapter 6 tells you how to normalize and denormalize data sets. I have to say that Maria put a really huge effort into visualizing how these processes work, which helps a lot in understanding the theory behind them. Maria also provides two good examples to work through. In the last example of this chapter you create a date dimension (very useful, as every one of us will have to create one at some point).
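A date dimension is, at its core, one row per calendar day with useful attributes precomputed. A minimal Python sketch of the idea (the column names follow a common data-warehousing convention, not necessarily the book's):

```python
from datetime import date, timedelta

def date_dimension(start, end):
    """Yield one row per calendar day between start and end, inclusive."""
    d = start
    while d <= end:
        yield {
            "date_key": int(d.strftime("%Y%m%d")),  # common surrogate-key style
            "year": d.year,
            "month": d.month,
            "weekday": d.strftime("%A"),
        }
        d += timedelta(days=1)

rows = list(date_dimension(date(2010, 1, 1), date(2010, 1, 3)))
print(rows)
```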

Validating data and handling errors is the focus of chapter 7. This is quite an important topic: when you automate transformations, you have to find a way to deal with errors so that they don't crash the transformation. Writing errors to the log, aborting a transformation, fixing captured errors and validating data are some of the steps you go through.
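The pattern behind that error handling (divert bad rows to an error stream with a description, and keep the run going instead of aborting) looks roughly like this in plain Python; the sample data is invented:

```python
good, errors = [], []

for raw in ["10", "25", "oops", "7"]:
    try:
        good.append(int(raw))
    except ValueError as exc:
        # The moral equivalent of a Kettle error hop: the bad row is
        # captured with an error description and processing continues.
        errors.append({"value": raw, "error": str(exc)})

print(good, errors)
```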

Chapter 8 focuses on importing data from databases. Readers with no SQL experience will find a section covering the basics of SQL. You work with both the Hypersonic database and MySQL. Moreover, Maria introduces you to the Pentaho sample database called "Steel Wheels", which you use for the first example. You learn how to set up a connection to the database and how to explore it. You use the "Table Input" step to read from the database as well as the "Table Output" step to write data to a database. Maria also describes how to parameterize SQL queries, which you will definitely need at some point in real-world scenarios. In the next tutorials you use the Insert/Update step as well as the Delete step to work with tables in the database.
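Parameterized queries keep values out of the SQL text, letting the same statement run with different inputs. The same idea can be sketched with Python's sqlite3 module and an invented orders table (Kettle's own mechanism differs in details, so treat this purely as an illustration of the concept):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "alice", 10.5), (2, "bob", 7.25), (3, "alice", 3.0)],
)

# Positional ? placeholders: the value is bound at execution time,
# never spliced into the SQL string.
customer = "alice"
rows = conn.execute(
    "SELECT id, total FROM orders WHERE customer = ? ORDER BY id",
    (customer,),
).fetchall()
print(rows)
```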

In chapter 9 you learn about more advanced database topics: Maria gives an introduction to data modelling, so you will soon know what fact tables, dimensions and star schemas are. You use various steps to look up data from the database (i.e. the Database lookup step, Combination lookup/update, etc.). You learn how to load slowly changing dimensions of Types 1, 2 and 3. All these topics are excellently illustrated, so it's really easy to follow, even for a person who has never heard of them before.
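A Type 2 slowly changing dimension keeps history: instead of overwriting a changed attribute, the current row is retired and a new version is inserted. A hedged Python sketch of that core logic (the table, column names and dates are all illustrative, not the book's):

```python
from datetime import date

# Current state of a tiny customer dimension; valid_to=None marks the live row.
dim = [
    {"key": 1, "cust": "alice", "city": "rome",
     "valid_from": date(2009, 1, 1), "valid_to": None},
]

def scd2_update(dim, cust, new_city, today):
    """Apply a Type 2 change: retire the current row, append a new version."""
    for row in dim:
        if row["cust"] == cust and row["valid_to"] is None:
            if row["city"] == new_city:
                return  # nothing changed, nothing to do
            row["valid_to"] = today  # retire the old version
    dim.append({
        "key": max(r["key"] for r in dim) + 1,
        "cust": cust, "city": new_city,
        "valid_from": today, "valid_to": None,
    })

scd2_update(dim, "alice", "madrid", date(2010, 6, 1))
print(dim)
```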

Chapter 10 is all about creating jobs. You start off by creating a simple job and later learn how to use parameters and arguments in a job, run jobs from the terminal window, and run job entries conditionally.

In chapter 11 you learn how to improve your processes by using variables, subtransformations (a very interesting topic!), transferring data between transformations, nesting jobs and creating a loop process. These are all more complex topics, which Maria manages to illustrate excellently.

Chapter 12 is the last practical chapter: you develop and load a datamart. I would consider this a very essential chapter if you want to learn about data warehousing. Chapter 13, the last one, gives you some ideas on how to take Kettle/PDI even further (plugins, Carte, PDI as a process action, etc.).

In the appendix you also find a section that tells you all about working with repositories, Pan and Kitchen, a quick reference guide to steps and job entries, and the new features in Kettle 4.

This book certainly fills a gap: it is the first book on the market that focuses solely on PDI. From my point of view, Maria's book is excellent for anyone who wants to start working with Kettle, and even for those at an intermediate level. The book takes a very practical approach: it is full of interesting tutorials/examples (you can download the data/code from the Packt website), which is probably the best way to learn something new. Maria also made a huge effort in illustrating the more complex topics, which helps the reader understand each step/process easily.

All in all, I can only recommend this book. It is the easiest way to start with PDI/Kettle, and you will be able to create complex transformations/jobs in no time!

You can purchase Pentaho 3.2 Data Integration: Beginner's Guide from amazon.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page.


103 comments

Enough acronyms? (4, Insightful)

pyite (140350) | more than 3 years ago | (#32592464)

My goodness, would it kill you to state what an acronym stands for the first time you use it?

Re:Enough acronyms? (0, Troll)

Chas (5144) | more than 3 years ago | (#32592514)

You must be new here.

Do you need a de-acronymization of SQL?

Do you need a de-acronymization of XML?

Since it was established fairly early on that PDI is Pentaho Data Integration (damn, first time I saw it I coulda sworn it was "pendejo"), I'm not really sure what exactly your bitch is.

Re:Enough acronyms? (1)

xouumalperxe (815707) | more than 3 years ago | (#32592552)

ETL (Extract, Transform and Load) possibly.

Re:Enough acronyms? (3, Insightful)

blackest_k (761565) | more than 3 years ago | (#32592702)

I'd settle for: what's it for? And why would I want to spend time learning how to use it?

Apparently it's for beginners, but beginners who already have a foundation to build on.

Re:Enough acronyms? (2, Insightful)

Anonymous Coward | more than 3 years ago | (#32593426)

I'd settle for: what's it for? And why would I want to spend time learning how to use it?

Apparently it's for beginners, but beginners who already have a foundation to build on.

I like my acronyms explained early and often - beginners need to know and experts need to be reminded, lest they drift off course. Still, it was in the title...

ETL is one of those things that you do a lot of in an enterprise where there are many different systems from many different vendors (not to mention development groups who hate each other). When stuff has to get transferred from one database to another, that's ETL. When stuff has to get transferred and the data isn't in quite the right format, that, too, is ETL. Pulling stuff from spreadsheets, FTP sites, web services, mixing and matching, validating and converting. Tools like Pentaho DI are how you can hand the process off to less-technical staff, since it does a lot of work that would otherwise require custom programming. And even some that does, since not only does Kettle support JavaScript transformations and user-developed Java plugins, it's an open-source project.

Incidentally, Maria Carina Roldan was a guest author at the JavaRanch Big Moose Saloon (http://www.javaranch.com) several weeks ago. Those of us who hang out there had the opportunity to converse with her.

Re:Enough acronyms? (2, Informative)

Timothy Brownawell (627747) | more than 3 years ago | (#32593442)

I'd settle for: what's it for? And why would I want to spend time learning how to use it?

ETL in general tends to mean moving stuff between databases in large companies. Such as when you have lots of things that each run off of their own database, and then have a big "data warehouse" database that everything goes into (usually in a different format than the individual databases, designed for running reports out of instead of day-to-day use). Or when you replace one system with another, or buy another company and want to shut down whatever they were using in favor of your corporate standard system, etc,... and need to move stuff that's in one database in one format, into another database in a different format.

Re:Enough acronyms? (1)

Hognoxious (631665) | more than 3 years ago | (#32603088)

So it's what used to be called data conversion in the old days (which is last week, apparently)?

Re:Enough acronyms? (2, Informative)

Timothy Brownawell (627747) | more than 3 years ago | (#32603574)

So it's what used to be called data conversion in the old days (which is last week, apparently)?

Maybe. I'm actually in the Data Conversion department, where we use ETL tools to load one-off data dumps for new customers. I think it's that ETL describes the tools, and then data conversion is a subset of what you do with them (other subsets being things like for example EDI / Electronic Data Interchange).

Re:Enough book reviews? (2, Informative)

b4dc0d3r (1268512) | more than 3 years ago | (#32593590)

I'm going to expand on this one a bit. When it said data integration, I immediately found out that ETL might be Extract, transform, load. The only reason I know this is because I work for a TLA type company. Kettle seems to be the name of something that already has a name, "Pentaho Data Integration". I'm not sure why it has two names. It is also part of the Pentaho BI suite.

A good review would give us a link to this tool, so we can figure out if the book is even relevant. Otherwise the assumption is that everyone knows what it is and everyone is using it. http://kettle.pentaho.org/ [pentaho.org] There's a FAQ which deals with usage, not what it's about, and no overview. So despite finding the website myself I still have no idea what this thing does. Does it solve the problem of exporting data from MS SQL Server and re-loading it somewhere else? Cos that's what I need.

A good review would also indicate if it's a free and/or open source tool, so we can decide if we're even interested in the tool, let alone the book. The source is available and hosted on sourceforge, so that answers that. But there is a separate link under Products for PDI, with links to Buy. Is this a poor attempt at a slashvertisement? Why would I use kettle instead of PDI? Is there a difference? http://www.pentaho.com/products/data_integration/ [pentaho.com]

A good review would also identify the audience of the book, letting people know who might use it. It's a database tool - if I'm a Microsoft shop, would I have any interest in reading about this?

Re:Enough book reviews? (2, Informative)

mattcasters (67972) | more than 3 years ago | (#32593888)

The simple answer is that Kettle is a generic name that is very hard to copyright. Pentaho Data Integration and Kettle are synonyms, although Kettle is used a bit more often to identify the open source project.

As for the pentaho.com website... you would think that the webcasts, papers, etc would be hard to miss but hey I guess if you don't need a data integration tool you probably don't know what it's for.

After I did a Kettle lightning talk at FOSDEM a few years ago I met a student who was working on a thesis. He had been gathering data in a database, originating from some electron microscope (or something like that) for the past 6 months. He said if he had known about Kettle he could have done it in a few weeks at most. The problem is that reaching certain non-technical audiences is a very tough call. Heck, it's even hard to convince those people that claim it's faster to code it all in Java/C/C++/Perl/Ruby or even bf. (see other threads below)

Re:Enough book reviews? (0)

Anonymous Coward | more than 3 years ago | (#32598522)

...and yet, you still fail to define exactly wth kettle is.

I'm not interested in webcasts and whitepapers yet. I'm still looking for a two-paragraph summary that tells me what this thing makes easier and HOW it makes it easier. Said summary should be followed by a feature highlight list.

Does it use a special syntax to define a formatting schema to translate data from one thing to another?

The information on the page itself reads like marketing speak.

"but hey I guess if you don't need a data integration tool you probably don't know what it's for."

I wouldn't know if this tool is potentially useful even if I did need a data integration tool.

Re:Enough book reviews? (1)

jkauzlar (596349) | more than 3 years ago | (#32608604)

As for the pentaho.com website... you would think that the webcasts, papers, etc would be hard to miss but hey I guess if you don't need a data integration tool you probably don't know what it's for.

"My Humble Blog"? You sound like an arrogant prick. Not to mention your third paragraph explains why your second paragraph is wrong: if the student knew what Kettle was in the first place, he could've saved a lot of time.

And finally, it is standard practice to tailor any piece of communication to the audience to which it is being communicated. It's likely we /. readers know what SQL is, but ETL or whatever is less widely known, and asking for its definition is not out of line. If you have an interest in the success of Pentaho/Kettle, and it appears you do, then tell people what it can do for them and even help them to find ways to use it to make their lives easier. Even in your response to the comment below mine, you say 'if you have no need for data integration, you won't be looking for it'. The student you spoke of had a need for data integration, but didn't know what to look for.

Re:Enough acronyms? (0)

Anonymous Coward | more than 3 years ago | (#32592770)

Any summary on a BEGINNERS book on SQL or XML WOULD STILL DEFINE THE TERM.

That's the whole POINT of a BEGINNERS BOOK.

Did you buy your four digit id, or are you just cranky today?

Not to mention that PDI is far less ubiquitous than either XML or SQL, and even googling it does not help, since the definition as used here is on page fifteen of the results.

Re:Enough acronyms? (-1, Flamebait)

Chas (5144) | more than 3 years ago | (#32592802)

Any summary on a BEGINNERS book on SQL or XML WOULD STILL DEFINE THE TERM.

The term WAS defined. Hello? Reading comprehension?

Did you buy your four digit id, or are you just cranky today?

I have low tolerance for idiots and none for trollish anonymous coward bitch-boys like yourself.

Not to mention that PDI is far less ubiquitous than either XML or SQL, and even googling it does not help since the definition as used here is on page fifteen of the results.

The article reviewer defined PDI early on. Not my fault if people can't read plain English.

Re:Enough acronyms? (3, Insightful)

TopherC (412335) | more than 3 years ago | (#32593184)

Normal, face-to-face conversation:

"I might be interested in this book, but don't yet know what ETL, Kettle, Pentaho, or BI refer to. Could you help me out please?"

"Sure! An ETL is a ..."

Slashdot:

would it kill you to state what an acronym stands for the first time you use it?

You must be new here.

Do you need a de-acronymization of SQL?

Do you need a de-acronymization of XML?

Since it was established fairly early on that PDI is Pentaho Data Integration ...

(Note that PDI is the only TLA defined in the summary but it isn't actually used there.)

...
Did you buy your four digit id, or are you just cranky today? ...

I have low tolerance for idiots and none for trollish anonymous coward bitch-boys like yourself.

Re:Enough acronyms? (4, Insightful)

Wowlapalooza (1339989) | more than 3 years ago | (#32592864)

You must be new here.

Do you need a de-acronymization of SQL?

I'd wager most IT professionals know that one.

Do you need a de-acronymization of XML?

I'd wager most IT professionals know that one too.

Since it was established fairly on that PDI is Pentaho Data Integration (damn, first time I saw it I coulda sworn it was "pendejo"), I'm not really sure what exactly your bitch is.

Uh, because it's not obvious to non-data-warehousing weenies that semantically there's any intersection/equivalence between the set ("data", "integration") and some other (undefined) set consisting of terms beginning with the letters "e", "t" and "l"? Maybe?

Sure, any or all of this stuff can be Google'd/Wikipedia'ed/etc., but does one want to go through that for an article summary? Especially when it would have been soooo easy to just expand the acronym...

Expanding acronyms is standard writing practice (2, Insightful)

name_already_taken (540581) | more than 3 years ago | (#32592924)

Sure, any or all of this stuff can be Google'd/Wikipedia'ed/etc., but does one want to go through that for an article summary? Especially when it would have been soooo easy to just expand the acronym...

Especially when it's standard journalism (and general writing) practice to expand acronyms the first time they're used, particularly when they are obscure.

To expect every reader to either know the definition of the acronym, or to search Google for it is the height of arrogance. It's also a good way to turn off readers.

Re:Enough acronyms? (0)

Anonymous Coward | more than 3 years ago | (#32599718)

I'd wager most IT professionals know that one.

My goodness, would it kill you to state what an acronym stands for the first time you use it?

Re:Enough acronyms? (1)

DarrenBaker (322210) | more than 3 years ago | (#32593218)

Do you need a de-acronymization of SQL?

Do you need a de-acronymization of XML?

I might.... It all depends on what those things are.

Re:Enough acronyms? (3, Funny)

drooling-dog (189103) | more than 3 years ago | (#32593288)

Was looking for an expansion of "ETL", actually, but went away just glad that I don't work with you...

Time is valuable. Expand acronyms on first use (0)

Anonymous Coward | more than 3 years ago | (#32593402)

Yes, actually, I would like my memory refreshed on the exact expansion of SQL and XML, especially in a review of a tutorial product such as a book that is supposed to help people solidify their knowledge about a topic. I have had a university class on databases. I have written a PostgreSQL-based order tracking system. I have maintained builds of PostgreSQL and other database engines, but, going from memory, I don't recall offhand the meaning of the "S" in SQL and would have to guess that it stands for Simple Query Language. I have occasionally but rarely used XML tools, and I am also a bit vague on the meaning of the X in XML (Extended? Markup Language). My time is valuable, and I don't want to have to look up the acronym for casual reading. So, yes, I would love to see a policy that all acronyms be expanded the first time they're used in an article. That might also help encourage people to use single words when they are more appropriate, such as times when statements about a "Central Processing Unit" actually apply to all processors in a computer, not just the central ones (e.g., numerous micro-controllers), or to remind us of how much has changed since some acronyms were coined, such as PCMCIA cards ("Personal Computer *MEMORY* Card International Association").

Re:Enough acronyms? (0, Flamebait)

Anonymous Coward | more than 3 years ago | (#32592534)

Get the fuck out. This site doesn't need to be dumbed down even further than it already is for numb nuts like yourself.

Re:Enough acronyms? (2, Funny)

Dave Emami (237460) | more than 3 years ago | (#32592546)

"Seeing as how the VP is such a VIP, shouldn't we keep the PC on the QT? 'Cause if it leaks to the VC he could end up MIA, and then we'd all be put out in KP."

But seriously, to answer one of the tags: ETL = Extract Transform and Load. Basically it's how transactional or other data gets into a data warehouse.

Re:Enough acronyms? (0)

Anonymous Coward | more than 3 years ago | (#32592570)

ETL - I believe that's "English as a Third Language"

At least PDI is defined right off the top: Pentaho Data Integration

There you go.

Re:Enough acronyms? (1)

Hognoxious (631665) | more than 3 years ago | (#32603152)

At least PDI is defined right off the top: Pentaho Data Integration

I find definitions to be more useful when they consist entirely of real words.

Excuse me while I (4, Funny)

ClosedSource (238333) | more than 3 years ago | (#32592622)

add PDI and ETL to my Resume. I wonder what they mean?

Re:Excuse me while I (0)

Anonymous Coward | more than 3 years ago | (#32601386)

Extract Transform Load - Tools used primarily in data warehousing / data integration. Non-Open Source tools include Ab Initio and Informatica. Closely related to tools like DataStage TX (formerly Mercator). These are used for large enterprise integrations.

Ex. Parent company plus 5 subsidiaries all running different order management systems. Management wants a single data store to run analytics against & create dashboards. You would use an ETL tool / process to...
get orders & related items out of each system [Extract]
cleanse, fit check, morph into another data model [Transform]
dump it all into an enterprise data warehouse [Load]

Re:Enough acronyms? (1, Funny)

Anonymous Coward | more than 3 years ago | (#32592944)

You should join our new group: Citizens Rejecting Acronym Proliferation. ;)

Re:Enough acronyms? (0)

Anonymous Coward | more than 3 years ago | (#32595266)

I want in

Re:Enough acronyms? (0)

Anonymous Coward | more than 3 years ago | (#32593070)

Pedant mode: ON

PDI and ETL are initialisms but not acronyms. To be an acronym, an initialism has to be pronounced as one word rather than spelled out. NATO is an acronym because you say "nay-tow", but USA is an initialism because you say "yoo-ess-ay".

Re:Enough acronyms? (1)

Hognoxious (631665) | more than 3 years ago | (#32603206)

PDI and ETL are initialisms but not acronyms. To be an acronym, an initialism has to be pronounced as one word rather than spelled out.

Puhddy. Ettle.

Though I think pronouncing SQL the same as an inferior movie[1] that's a blatant attempt to cash in on a previous one's success is going too far.

[1] Except Aliens. It's better. And maybe Terminator II.

Re:Enough acronyms? (2, Insightful)

Pollardito (781263) | more than 3 years ago | (#32593598)

The worst part is that even if you google Kettle and get to their website, the front page for their product [pentaho.org] is essentially a changelog and roadmap. There are FAQ links, but even the "Beginners FAQ" (which should be "WTF is Kettle?"-style Q&A) is a product troubleshooting guide.

I suspect that the same secrecy-obsessed person that built the product website also wrote this review

Re:Enough acronyms? (0)

Anonymous Coward | more than 3 years ago | (#32596456)

What The F*ck is WTF?

Re:Enough acronyms? (3, Insightful)

SplashMyBandit (1543257) | more than 3 years ago | (#32593926)

I hope the reviewer is suitably chastened by this experience. Understanding your likely reader is a very important skill in (technical) writing. Realizing that people come from all sorts of backgrounds should not be a surprise. Each of those people may be very intelligent; they just have a specialty that is not in the same field as the writer. Therefore it is the mark of a competent writer that they'll at least try to expand an acronym the first time they use it. An even better writer might even find a single sentence that explains the concept well. Poor writers (e.g. many soft-science academics and marketers) often obfuscate simple concepts behind jargon and convoluted sentence construction. Their pronouncements can often be written in a much more straightforward way, although that would often reveal that the "Emperor has no clothes". The best writers write simply, use the least complicated word that fits the purpose, and consider possible conceptual pitfalls of readers, so they try to write unambiguously.

Re:Enough acronyms? (0)

Anonymous Coward | more than 3 years ago | (#32598774)

Who the hell is Moop?

Our last Pentaho experience.... (2, Insightful)

ducomputergeek (595742) | more than 3 years ago | (#32592602)

Was that it made things three times more complicated than they needed to be. We needed to integrate one of our products with another, and the other product's developer recommended Talend and Pentaho for the job. After two days of looking through the documentation it was complete overkill for what we needed. So we said screw it and directly mapped to their database using JDBC and Plan Ole XML as our transport layer. That only took a day to build.

Re:Our last Pentaho experience.... (0)

Anonymous Coward | more than 3 years ago | (#32592648)

Plan Ole XML

Oh, the same solution the British used on the Native Americans ...

Re:Our last Pentaho experience.... (3, Interesting)

Per Wigren (5315) | more than 3 years ago | (#32593194)

I totally agree.

At work I have built a large data warehouse pretty much from scratch with PostgreSQL and SQL-files, controlled by a set of Ruby scripts. It's simple, powerful, extremely flexible and plenty fast. It imports data from various sources (PostgreSQL, MySQL, MS SQL Server, CSV files on a remote SSH server, XML, custom logfiles, etc) with some HEAVY data cleaning and normalization. On top of that we have lots of autogenerated PDF-reports and a custom built report tool for all kinds of data.

Recently it was decided that we need a way for managers to generate "cubes" for quick generation of custom, one-off reports on all kinds of dimension of the data. After looking around a bit we settled with Mondrian, which is a part of the Pentaho suite.

O. M. F. G. What a mess.

It consists of a deep directory hierarchy with config files and duplicated jar files sprinkled all over. To do simple things like adding a database you have to edit a whole bunch of XML config files in various directories and I even had to copy a jar file from one directory to another. There is plenty of documentation but it's disorganized, overly verbose in the simple areas and overly terse (or nonexistent) in the moderately advanced areas.

After editing a config file you have to go through its web interface and press one "clean cache" button and one "reload config" button. Then you have to restart the app server and log in again to see your changes. They don't provide any command-line tools to do this. When starting out and building your new cubes there will be a lot of trial-and-error experimentation, as the XML schema is somewhat archaic and underdocumented. When asking them on IRC for a way to automate this, it took a lot of explaining WHY I wanted to do this so often before they even attempted to answer. The answer involved copying and modifying .properties files in WEB-INF directories and writing a script that runs curl on various URLs...

Seemingly they themselves set up their data warehouse and cubes a long time ago and have totally forgotten what the experience is like for NEW users who have to do all this from scratch...

Anyone know about a decent alternative to Mondrian?

Re:Our last Pentaho experience.... (1)

mattcasters (67972) | more than 3 years ago | (#32594086)

Per, writing a ROLAP server is a non-trivial task. Mondrian is the only open source option for you at the moment. There is a MOLAP server called PALO however.

If you don't mind me saying so: on the one hand you complain that a visual programming tool like Kettle is too hard to use, and on the other you choose to ignore the tools to configure Mondrian properly. I'm sure there is some kind of pattern here.

Most people can start using Kettle in a matter of a few minutes to a few hours. I would argue that this is hardly the case for your duct-tape solution. Your home-brew SQL/Ruby/scripting mess is a great opportunity for a data integration consultant to clean up once you're gone or when things are no longer maintainable or adaptable. Speaking as someone who has been in exactly this situation many times, both myself and on behalf of other consultants: keep up the good work :-)

Matt

Re:Our last Pentaho experience.... (2, Insightful)

Per Wigren (5315) | more than 3 years ago | (#32594474)

I will now argue that my home brew "mess" is very simple and clean and it will take any person with decent shell, Ruby and SQL knowledge a VERY short time to get a FULL understanding of. Even my bosses know and appreciate that.

So, we already have the import, data cleaning, normalization and lots of aggregated tables in place and it's working fine. We don't want to change that. What we need is only a web interface that is easy for the non-technical managers and marketers to use. I can provide special tables and views for Mondrian in any way that it wants. No problem, except that it's helluva messy to set up. Unless you restart from scratch using the whole Pentaho stack, maybe.

Re:Our last Pentaho experience.... (0)

Anonymous Coward | more than 3 years ago | (#32595426)

I will now argue that my home brew "mess" is very simple and clean

Everyone argues that their own spaghetti code is better than someone else's spaghetti code.

Re:Our last Pentaho experience.... (1)

Hognoxious (631665) | more than 3 years ago | (#32610980)

I've never met anyone who writes spaghetti code who admits that it's spaghetti code.

Re:Our last Pentaho experience.... (1)

Karem Lore (649920) | more than 3 years ago | (#32600434)

The Pentaho stack includes two mondrian viewers, jpivot and analyzer. Jpivot is open and free, and you can set this up yourself. Analyzer is an EE feature which allows business users to create analysis views quickly and easily with a drag-drop interface.

Mondrian produces XMLA, so any XMLA client will be able to access Mondrian's output.

Re:Our last Pentaho experience.... (1)

dintech (998802) | more than 3 years ago | (#32599708)

If you don't mind me saying so: on the one hand you complain that a visual programming tool like Kettle is too hard to use, and on the other you choose to ignore the tools to configure Mondrian properly. I'm sure there is some kind of pattern here.

Disclaimer: My experience was 2 years ago.

Yes there is a pattern. The commonality is that the documentation for almost all of the pentaho suite is verbose but devoid of content. It's impossible to find documentation on the things you actually need documentation for. I tried to set up a ROLAP server that we could use to replace Cognos with, however I gave up because it wasn't worth the effort required to work everything out. There aren't enough hours in the day.

Re:Our last Pentaho experience.... (1)

Karem Lore (649920) | more than 3 years ago | (#32600444)

With access to Pentaho's knowledgebase and support you would have had first class support and access to a wealth of documentation.

Some of this documentation is available on Pentaho's open wiki: http://wiki.pentaho.org/ [pentaho.org]

Re:Our last Pentaho experience.... (1)

mattcasters (67972) | more than 3 years ago | (#32600496)

2 years is indeed a long time for a startup company. In that period we released a host of new versions across the 5 product pillars, improved usability dramatically, and 2 Pentaho-related books came out to help you on your path (with a third on the way).

What was once merely possible is now fairly straightforward.

Re:Our last Pentaho experience.... (2, Informative)

Karem Lore (649920) | more than 3 years ago | (#32600424)

you have to edit a whole bunch of XML config files in various directories

If you use the open-source version, sure. If you use the Pentaho BI Suite then no.

You have a central configuration console, and Schema Workbench (available free) for schema design. You can clear the cache programmatically by way of a URL or using the API (which can be fine-grained down to the tuple level), or through the user or enterprise console.

Before spouting such drivel, you should look at what exactly you are using and where you have gone wrong in your assumptions. Then, if you are still confused, contact support should you have a subscription.

Re:Our last Pentaho experience.... (0)

Anonymous Coward | more than 3 years ago | (#32602510)

OK, the suite is a mess. I worked with this stuff for about 3 years, integrating the Kettle piece into a web app designed to upload data from spreadsheets and build cubes from it. And I've got to say, Kettle by itself as an ETL tool (which means Extract Transform Load, btw, for the acronym-challenged) is pretty good. It compares favorably with offerings from Cognos (DecisionStream) and is way better than SSIS (SQL Server Integration Services). The problem with Pentaho is everything else. Pentaho is a bundle of what could generously be described as stuff purporting to be a complete BI solution, that is, something that loads data, builds cubes from it and presents the cubes over a web-based reporting interface. I took a look at the latest version of the product a couple of months ago, and it may actually do that on a really good day.

However, Kettle by itself is a very useful ETL tool. It is easy to work with, entirely XML based, open-source, runs on both Windows and Unix servers and is as fast as products from the major players. The best thing to do with Pentaho is to download it, pull out the Kettle piece (which is a self-contained product) and forget about the rest of it. You will find that using an ETL tool, any ETL tool, is a much better way to load data than straight-up scripting for reasons far too numerous to mention.

Here's a thought: lay out a reporting data mart based on your main data warehouse and try using Kettle by itself to load it. And if you want to set up cube reporting, take a look at Mondrian. It's fairly easy to integrate the whole thing using JBoss.

Re:Our last Pentaho experience.... (1)

Hognoxious (631665) | more than 3 years ago | (#32603272)

Recently it was decided that we need a way for managers to generate "cubes" for quick generation of custom, one-off reports on all kinds of dimension of the data.

Just put everything (and I mean everything) into one cube, and let them slice and dice it as they choose.

I've heard that suggestion more than once, so I think you got off lightly.

Re:Our last Pentaho experience.... (1)

atomic777 (860023) | more than 3 years ago | (#32596306)

As with any complex tool, if you don't know why it's useful, or when it should be used, you're probably going to make a mess.

The visual nature of Kettle masks its complexity due to the "pictures == easy, code == 3l33t" bias. To simplify a bit, Kettle gives you the ability to create a multi-db, multi-data-format "query plan", much as a DB optimiser would do when given a multi-table SQL statement with joins, filters, etc. The problem is that in Kettle you have to understand how to optimise that "query" yourself to write an efficient transform. Developers who truly understand how a database executes a query, let alone which query plans are good, become a rarer breed with each day.

In short, never give kettle to a developer that thinks of a database purely in terms of "put" and "get"
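(To make the query-plan point above concrete, here is a toy sketch with entirely invented data: filtering before a join touches far fewer intermediate rows than joining first, which is exactly the kind of decision a transform author has to make for themselves.)

```python
# Toy illustration of optimising a transform yourself: applying a filter
# before a join processes far fewer intermediate rows than joining first
# and filtering afterwards. All data here is invented.
orders = [{"id": i, "cust": i % 100, "amount": i % 7} for i in range(10_000)]
customers = {c: f"name-{c}" for c in range(100)}

# Naive plan: join everything, then filter.
joined = [{**o, "name": customers[o["cust"]]} for o in orders]
slow_result = [r for r in joined if r["amount"] > 5]

# Better plan: filter first, join only the survivors.
survivors = [o for o in orders if o["amount"] > 5]  # roughly 1/7 of the rows
fast_result = [{**o, "name": customers[o["cust"]]} for o in survivors]

assert slow_result == fast_result
print(len(joined), len(survivors))  # intermediate row counts: 10000 vs 1428
```

Same answer either way; the difference is how much data flows through the middle of the transform.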

Re:Our last Pentaho experience.... (1)

swinginSwingler (161566) | more than 3 years ago | (#32599218)

I've actually used KETTLE (5 years ago, but still..) Buggy as hell. Would have taken me 1/10th the time to write the thing on my own. Unfortunately my manager at the time insisted that I use it.

Re:Our last Pentaho experience.... (1)

mattcasters (67972) | more than 3 years ago | (#32600538)

5 years ago, poor man! 5 years ago things were pretty wild. I open sourced Kettle in December 2005 so back then we weren't even with Pentaho yet.

Now we have over 40 developers and a dozen translators, a QA team, doc writers, continuous integration servers, a JIRA system, a wiki, product managers, a sales team, etc.

Thousands upon thousands of bugs have been fixed in the meantime and thousands of features have been implemented. Since then we have released 27 stable versions!

Re:Our last Pentaho experience.... (0)

Anonymous Coward | more than 3 years ago | (#32603710)

Kettle as an ETL tool is excellent and I highly recommend it. I've worked with both it and Informatica. Kettle wins. Great job by Matt and the team.

PDI? (1)

AbbyNormal (216235) | more than 3 years ago | (#32592612)

So is PDI something like a database agnostic version of MSSQL DTS packages?

Re:PDI? (3, Informative)

xouumalperxe (815707) | more than 3 years ago | (#32592668)

It's a bit more than "database agnostic" as it can input from a load of non-db sources and output into a load of non-db sinks. I work at a pentaho shop, and one of our biggest projects involves, on the ETL front, parsing several gigs of apache logs per day and stuffing the (filtered) results into a db. We do that using Kettle.

Just Look it up in Wikipedia (5, Funny)

CajunArson (465943) | more than 3 years ago | (#32592614)

Seriously, I can't imagine how dumb some people are... complaining about acronyms that can easily be looked up on Wikipedia!

I mean, a quick search obviously reveals that ETL stands for Express Toll Lanes [wikipedia.org]. Any slashdotter should know that these lanes are used by the many cars generated by the numerous analogies dotting slashdot "discussions".
    And as for Pentaho... let's just break this word down into parts shall we? Penta is the root word for the number 5... duh! Of course, Ho is an accurate description of the only type of woman who will talk to the average slashdotter... assuming the slashdotter has a sufficient Benjamin supply.

    So let's put all of this together shall we? This book is obviously about how you can pick up 5 hoes on a highway quickly and efficiently. This is a life skill that I'm sure many slashdotters are keenly interested in acquiring. How the hell anyone could possibly complain that the reviewer didn't expressly spell out these stupidly obvious terms is frankly beyond me.

Re:Just Look it up in Wikipedia (0)

Anonymous Coward | more than 3 years ago | (#32593078)

This book is obviously about how you can pick up 5 hoes on a highway quickly and efficiently.

I think you should get extra points for pluralizing ho to hoes correctly.

Re:Just Look it up in Wikipedia (0)

Anonymous Coward | more than 3 years ago | (#32594676)

This book is obviously about how you can pick up 5 hoes on a highway quickly and efficiently.

I think you should get extra points for pluralizing ho to hoes correctly.

Is that correct? I could swear he missed an "r"

You got it wrong (2, Funny)

mangu (126918) | more than 3 years ago | (#32594878)

Considering that the author, María Carina Roldán, is Argentinian, it's obvious that "pentaho" is a misspelling for "pendejo". This book is about a latino asshole who drives an old truck very slowly in the express lane, ignoring all the honking cars behind him. The truck is slow because the radiator is boiling, its nickname is the "Kettle".

Re:You got it wrong (0)

Anonymous Coward | more than 3 years ago | (#32599734)

Now all we need is Manuel from Fawlty Towers and we're ready for the best reboot of Duel you could ever see.

Let me explain... (0)

Anonymous Coward | more than 3 years ago | (#32592722)

Pent (house) with ho's.

Buzzwordy self-importance (0, Offtopic)

Gothmolly (148874) | more than 3 years ago | (#32592744)

In my experience, ETL guys are the most obnoxious, self-important douches ever to walk the corridors of the building. Everything is "datamart this" and "database that", when all I can see is a handful of SQL hackers with a big budget and a loud boss.

Re:Buzzwordy self-importance (0)

Anonymous Coward | more than 3 years ago | (#32592908)

Sure they are .. but they can be a source of entertainment too! Just introduce them to a bunch of J2EE/enterprise architects and watch hilarity ensue!

Enlightening! (4, Funny)

aquabat (724032) | more than 3 years ago | (#32592790)

Awesome review! Truly enlightening. Before I saw this article, I had absolutely no idea what Pentaho was, or why I would want it. Now, I know exactly what I'm getting both my friends for Christmas this year. I can't wait to discuss all 492 pages of this treasure with them in the new year.

Re:Enlightening! (1)

obender (546976) | more than 3 years ago | (#32592954)

I know exactly what I'm getting both my friends for Christmas this year

You could buy me one. I am more real than your imaginary friends: Ostap Bender [wikipedia.org]

Re:Enlightening! (1)

aquabat (724032) | more than 3 years ago | (#32593042)

I know exactly what I'm getting both my friends for Christmas this year

You could buy me one. I am more real than your imaginary friends: Ostap Bender [wikipedia.org]

Perhaps you'd also like the key to the apartment where the money is?

Rubbish (0, Flamebait)

Becausegodhasmademe (861067) | more than 3 years ago | (#32592854)

When it takes a good 10 minutes of trawling TFA and Wikipedia just to find out what ETL and PDI stand for and what a datamart is, you know that the product is hyped up just enough to be worthless.

Re:Rubbish (0)

Anonymous Coward | more than 3 years ago | (#32593026)

uhm .. the concepts are old. Maybe if you did real work out in the field rather then doing the same old shit over and over you might have learned a thing or two.

Re:Rubbish (0)

Anonymous Coward | more than 3 years ago | (#32593648)

Maybe the term 'ETL' isn't as popular as you think.

Ahh, I get it now... (1)

hAckz0r (989977) | more than 3 years ago | (#32593032)

Pentaho/Kettle is for "Market Intelligence", and that's why it took me 5 minutes of re-reading the article numerous times and an additional 10 minutes of Googling just to find out what this Slashdot story was about. Obviously I'm not smart enough to know anything "Intelligent", such as, say, stating what the product/book is actually for. Analysing 'the market', or analysing 'whom I am marketing to'?

Now I am just left with the thought: is this "Intelligence" effort trying to market to me? If so, they are doing a pretty lousy job of it, seeing that after reading the article and Googling I am still at a loss to explain what I just read. I regularly read about advanced mathematics in relativity and quantum physics for fun, but I am obviously too stupid to understand marketing.

Pentaho (1)

scorp1us (235526) | more than 3 years ago | (#32593346)

It is not nice to call Maria a ho, much less one of the penta variety. That's not just calling her a ho, but calling her a ho for 5 distinct reasons.

Re:Pentaho (1)

waambulance (1766146) | more than 3 years ago | (#32594126)

i pentaho once... her name was Steely Dan II. "Chewed to bits by a famished candiru in the Upper Baboonsasshole. And don't say 'wheeeeeeee!' this time."

Re:Pentaho (1)

mattcasters (67972) | more than 3 years ago | (#32594170)

Actually scorp1us, posters seem to think they are making some original joke.

However... Pentaho did indeed have 5 founders (penta) and (I say this with all the respect in the world for my esteemed colleagues) they have every intention of selling themselves out.

So the given definition of 5 hoes is very close to the true meaning of the word Pentaho or so I have been told one drunken evening at Pentaho's bar, the Orlando Ale House.

Maria is indeed not part of this group of 5 esteemed gentlemen.

Matt

Re:Pentaho (0)

Anonymous Coward | more than 3 years ago | (#32596766)

Maria is good peeps. She provides a lot of help to folks on the support forums...

What does this actually do? (1)

vlm (69642) | more than 3 years ago | (#32593654)

I'm a bit mystified about chapter 8, which sounds a whole heck of a lot like "apt-get install mysql-server" for those who can't apt-get.

From what little info I have, this software seems to boil down to a super-complicated way to push data in and out of databases. It's the kind of thing normal people would whip up as write-once-read-never-again Perl scripts full of obscene regexes and mysterious one-liners; but if you'd rather do it differently, here's this giant complicated system written in Java and XML with the verbosity of COBOL that'll do more or less the same thing, but more slowly and complicatedly, for people who don't know what SQL is or even how to install MySQL.

Somebody please P.R. me and explain what this thing is, or why I'd want it, or what in the world I'd do with it.

${NONSENSEWORD} ${VERSION} Data Integration (2, Informative)

Bob the Hamster (705714) | more than 3 years ago | (#32593688)

"A book about the open source ${ACRONYM} tool ${DICTIONARYWORD} (${NONSENSEWORD} Data Integration) is finally available. ${NONSENSEWORD} ${VERSION} Data Integration: Beginner's Guide by ${AUTHORNAME} is for everybody who is new to ${DICTIONARYWORD}. In a nutshell, this book will give you all the information that you need to get started with ${DICTIONARYWORD} quickly and efficiently, even if you have never used it before. The book offers loads of illustrations and easy-to-follow examples. The code can be downloaded from the publisher website and ${DICTIONARYWORD} is available for free from the SourceForge website. In sum, the book is the best way to get to know the power of the open source ${ACRONYM} tool ${DICTIONARYWORD}, which is part of the ${NONSENSEWORD} ${DIFFERENTACRONYM} suite."

Re:${NONSENSEWORD} ${VERSION} Data Integration (1)

aquabat (724032) | more than 3 years ago | (#32593894)

Last Monday's xkcd comic (xkcd.com/753/) conveys a similar concept, in the mouseover text.

About time (1)

OldCrasher (254629) | more than 3 years ago | (#32594006)

I am glad to see someone has got a book out about this package. If you need something like Pentaho, then writing simple translation scripts is probably not where you want to be. Kettle has a steep learning curve, but has proven to be reasonably reliable, and very flexible.

Extract Translate Load (1)

travisb828 (1002754) | more than 3 years ago | (#32594826)

ETL stands for Extract Translate Load. Basically you want to extract data out of your very normalized application database. Translate it into something that makes a little more sense for historical reporting and trending. Then load it into your data warehouse.
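(A toy sketch of those three steps, purely for illustration; the schema, tables and numbers below are invented, and sqlite3 stands in for both the application database and the warehouse.)

```python
# Minimal ETL sketch: extract from a normalized "application" schema,
# transform into a reporting-friendly shape, load into a warehouse table.
# sqlite3 and all names here are illustrative, not from any real system.
import sqlite3

app = sqlite3.connect(":memory:")
app.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sales (product_id INTEGER, qty INTEGER, day TEXT);
    INSERT INTO products VALUES (1, 'kettle'), (2, 'teapot');
    INSERT INTO sales VALUES (1, 3, '2010-06-01'), (1, 2, '2010-06-02'),
                             (2, 5, '2010-06-01');
""")

# Extract: pull the normalized rows out of the application database.
rows = app.execute("""
    SELECT p.name, s.qty FROM sales s JOIN products p ON p.id = s.product_id
""").fetchall()

# Transform: denormalize and aggregate for historical reporting.
totals = {}
for name, qty in rows:
    totals[name] = totals.get(name, 0) + qty

# Load: write the reporting table into the warehouse.
wh = sqlite3.connect(":memory:")
wh.execute("CREATE TABLE product_sales (product TEXT, total_qty INTEGER)")
wh.executemany("INSERT INTO product_sales VALUES (?, ?)", sorted(totals.items()))
wh.commit()

print(wh.execute("SELECT * FROM product_sales ORDER BY product").fetchall())
# [('kettle', 5), ('teapot', 5)]
```

Real ETL tools add scheduling, incremental loads and error handling on top, but the shape is the same.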

Re:Extract Translate Load (0)

Anonymous Coward | more than 3 years ago | (#32596238)

Correction: although "Translate" is almost synonymous with "Transform", Extract Transform Load is the more accurate and widely accepted expansion.

Kettle = Best part of Pentaho (1)

WarwickRyan (780794) | more than 3 years ago | (#32595248)

Seen lots of negative Pentaho experiences here, and I'd generally agree. It's one of those "Open Source" projects which forces you into buying their commercial version because they've made it way too complex.

Luckily the Pentaho project is an umbrella which contains a number of separate products, most of which were developed independently. This results in a big difference in the quality of each component.

From my experience, Kettle is a really nice tool for ETL. It is, IMHO, easier to use than Microsoft's Integration Services (its closest competitor). It's straightforward, performs well and, importantly, can be used without the rest of Pentaho.

Re:Kettle = Best part of Pentaho (1)

mattcasters (67972) | more than 3 years ago | (#32600568)

Thanks for the thumbs up. Just a note, though: everything that is possible with the commercial (Enterprise Edition) version of Pentaho software is possible with the community edition. Please don't confuse us with certain other "Open Source" BI suites.

To stay on topic, I would advise you to simply buy one of the Pentaho books before you get started!

Re:Kettle = Best part of Pentaho (1)

WarwickRyan (780794) | more than 3 years ago | (#32602850)

I understand that the capabilities are the same. I was getting at the fact that to get anywhere you need to hire Pentaho experts, and then the only real choice was directly from Pentaho USA. Compare that to Microsoft, where you could either learn it through books or hire a local consultant.

For me (as a BI consultant), supporting the commercial operation behind the OSS project needs to happen via licensing/support for customers, and via training. Same as it is with the Microsoft stack. The easier it is for people such as myself to get started, the easier it is for us to start selling the product :)

Back when I was using the full Pentaho suite (a year ago), it was very hard to get all but the simplest demo working. Mind you, my interest was mainly in dashboarding, and that's an area where Pentaho was (and maybe still is?) very weak. General configuration of the app server was also annoying, but that's more of a Java/app-server problem (XML hell) than anything inherent in the project.

The biggest single problem was outdated documentation, or should I say outdated discussions being returned from Google. That's quite a common problem, mind, and something which is being lessened by sites such as Stack Overflow.

What is nice is that the project moves along at a very fast pace. New features are constantly being added, and everything is constantly being improved. Will take a look at the latest version the next time I work on a BI project.

BTW thanks for your work; that goes for yourself and all the other contributors. Even given my negative comments (they're meant constructively), you're easily the best open source BI platform and within the top 5 when including the commercial platforms.

Re:Kettle = Best part of Pentaho (1)

mattcasters (67972) | more than 3 years ago | (#32607460)

It's unfortunate but experience tells us that unless you sweeten the deal with extras like documentation, configuration/monitoring/EE software, repositories and the like, very few companies would buy anything. That experience is contrary to what I once believed.

So you can complain that you can't get your hands on nice documentation, the dashboard designer or the console, all part of the enterprise edition. However, when you really compare it to closed source software it's still a lot cheaper. This analyst report shows the difference: http://www.pentaho.com/lower_bi_costs/ [pentaho.com] Heck, you can get all that for free for 30 days to test-drive things.

The lack of consultants *is* a problem. However, there's Pentaho related work to be found out there and with 2 Pentaho books out and a third coming out in September I'm sure the problem is short-lived.

After the jokes fade (1)

nurb432 (527695) | more than 3 years ago | (#32595296)

It's nice to have real, tangible documentation for this beast. It looked like it had a lot of promise and is powerful out of the box, without having to spend tons of $ on a commercial product, but the documentation was dismal (at least all that I have found).

ETL tooling is not cheap, and if you have a small project, it's pretty much unattainable.

Pendejo! (0)

Anonymous Coward | more than 3 years ago | (#32602686)

What did they just call me?
