Pentaho 3.2 Data Integration: Beginner's Guide

María Carina Roldán

Description

Pentaho Data Integration (a.k.a. Kettle) is a full-featured open source ETL (Extract, Transform, and Load) solution. Although PDI is a feature-rich tool, effectively capturing, manipulating, cleansing, transferring, and loading data can get complicated. This book is full of practical examples that will help you to take advantage of Pentaho Data Integration's graphical, drag-and-drop design environment. You will quickly get started with Pentaho Data Integration by following the step-by-step guidance in this book. The useful tips in this book will encourage you to exploit powerful features of Pentaho Data Integration and perform ETL operations with ease. Starting with the installation of the PDI software, this book will teach you all the key PDI concepts. Each chapter introduces new features, allowing you to gradually get involved with the tool. First, you will learn to work with plain files, and to do all kinds of data manipulation. Then, the book gives you a primer on databases and teaches you how to work with databases inside PDI. Not only that, you'll be given an introduction to data warehouse concepts and you will learn to load data in a data warehouse. After that, you will learn to implement simple and complex processes. Once you've learned all the basics, you will build a simple datamart that will serve to reinforce all the concepts learned through the book.



Table of Contents

Pentaho 3.2 Data Integration Beginner's Guide
Credits
Foreword
The Kettle Project
About the Author
About the Reviewers
Preface
How to read this book
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Errata
Piracy
Questions
1. Getting Started with Pentaho Data Integration
Pentaho Data Integration and Pentaho BI Suite
Exploring the Pentaho Demo
Pentaho Data Integration
Using PDI in real world scenarios
Loading datawarehouses or datamarts
Integrating data
Data cleansing
Migrating information
Exporting data
Integrating PDI using Pentaho BI
Pop quiz – PDI data sources
Installing PDI
Time for action – installing PDI
What just happened?
Pop quiz – PDI prerequisites
Launching the PDI graphical designer: Spoon
Time for action – starting and customizing Spoon
What just happened?
Spoon
Setting preferences in the Options window
Storing transformations and jobs in a repository
Creating your first transformation
Time for action – creating a hello world transformation
What just happened?
Directing the Kettle engine with transformations
Exploring the Spoon interface
Viewing the transformation structure
Running and previewing the transformation
Time for action – running and previewing the hello_world transformation
What just happened?
Previewing the results in the Execution Results window
Pop quiz – PDI basics
Installing MySQL
Time for action – installing MySQL on Windows
What just happened?
Time for action – installing MySQL on Ubuntu
What just happened?
Summary
2. Getting Started with Transformations
Reading data from files
Time for action – reading results of football matches from files
What just happened?
Input files
Input steps
Reading several files at once
Time for action – reading all your files at a time using a single Text file input step
What just happened?
Time for action – reading all your files at a time using a single Text file input step and regular expressions
What just happened?
Regular expressions
Troubleshooting reading files
Grids
Have a go hero – explore your own files
Sending data to files
Time for action – sending the results of matches to a plain file
What just happened?
Output files
Output steps
Some data definitions
Rowset
Streams
The Select values step
Have a go hero – extending your transformations by writing output files
Getting system information
Time for action – updating a file with news about examinations
What just happened?
Getting information by using Get System Info step
Data types
Date fields
Numeric fields
Running transformations from a terminal window
Time for action – running the examination transformation from a terminal window
What just happened?
Have a go hero – using different date formats
Have a go hero – formatting 99.55
Pop quiz – formatting data
XML files
Time for action – getting data from an XML file with information about countries
What just happened?
What is XML
PDI transformation files
Getting data from XML files
XPath
Configuring the Get data from XML step
Kettle variables
How and when you can use variables
Have a go hero – exploring XML files
Have a go hero – enhancing the output countries file
Have a go hero – documenting your work
Summary
3. Basic Data Manipulation
Basic calculations
Time for action – reviewing examinations by using the Calculator step
What just happened?
Adding or modifying fields by using different PDI steps
The Calculator step
The Formula step
Time for action – reviewing examinations by using the Formula step
What just happened?
Have a go hero – listing students and their examinations results
Pop quiz – concatenating strings
Calculations on groups of rows
Time for action – calculating World Cup statistics by grouping data
What just happened?
Group by step
Have a go hero – calculating statistics for the examinations
Have a go hero – listing the languages spoken by country
Filtering
Time for action – counting frequent words by filtering
What just happened?
Filtering rows using the Filter rows step
Have a go hero – playing with filters
Have a go hero – counting words and discarding those that are commonly used
Looking up data
Time for action – finding out which language people speak
What just happened?
The Stream lookup step
Have a go hero – counting words more precisely
Summary
4. Controlling the Flow of Data
Splitting streams
Time for action – browsing new PDI features by copying a dataset
What just happened?
Copying rows
Have a go hero – recalculating statistics
Distributing rows
Time for action – assigning tasks by distributing
What just happened?
Pop quiz – data movement (copying and distributing)
Splitting the stream based on conditions
Time for action – assigning tasks by filtering priorities with the Filter rows step
What just happened?
PDI steps for splitting the stream based on conditions
Time for action – assigning tasks by filtering priorities with the Switch/Case step
What just happened?
Have a go hero – listing languages and countries
Pop quiz – splitting a stream
Merging streams
Time for action – gathering progress and merging all together
What just happened?
PDI options for merging streams
Time for action – giving priority to Bouchard by using Append Stream
What just happened?
Have a go hero – sorting and merging all tasks
Have a go hero – trying to find missing countries
Summary
5. Transforming Your Data with JavaScript Code and the JavaScript Step
Doing simple tasks with the JavaScript step
Time for action – calculating scores with JavaScript
What just happened?
Using the JavaScript language in PDI
Inserting JavaScript code using the Modified Java Script Value step
Adding fields
Modifying fields
Turning on the compatibility switch
Have a go hero – adding and modifying fields to the contest data
Testing your code
Time for action – testing the calculation of averages
What just happened?
Testing the script using the Test script button
Have a go hero – testing the new calculation of the average
Enriching the code
Time for action – calculating flexible scores by using variables
What just happened?
Using named parameters
Using the special Start, Main, and End scripts
Using transformation predefined constants
Pop quiz – finding the 7 errors
Have a go hero – keeping the top 10 performances
Have a go hero – calculating scores with Java code
Reading and parsing unstructured files
Time for action – changing a list of house descriptions with JavaScript
What just happened?
Looking at previous rows
Have a go hero – enhancing the houses file
Have a go hero – fill gaps in the contest file
Avoiding coding by using purpose-built steps
Have a go hero – creating alternative solutions
Summary
6. Transforming the Row Set
Converting rows to columns
Time for action – enhancing a films file by converting rows to columns
What just happened?
Converting row data to column data by using the Row denormalizer step
Have a go hero – houses revisited
Aggregating data with a Row denormalizer step
Time for action – calculating total scores by performances by country
What just happened?
Using Row denormalizer for aggregating data
Have a go hero – calculating scores by skill by continent
Normalizing data
Time for action – enhancing the matches file by normalizing the dataset
What just happened?
Modifying the dataset with a Row Normalizer step
Summarizing the PDI steps that operate on sets of rows
Have a go hero – verifying the benefits of normalization
Have a go hero – normalizing the Films file
Have a go hero – calculating scores by judge
Generating a custom time dimension dataset by using Kettle variables
Time for action – creating the time dimension dataset
What just happened?
Getting variables
Time for action – getting variables for setting the default starting date
What just happened?
Using the Get Variables step
Have a go hero – enhancing the time dimension
Pop quiz – using Kettle variables inside transformations
Summary
7. Validating Data and Handling Errors
Capturing errors
Time for action – capturing errors while calculating the age of a film
What just happened?
Using PDI error handling functionality
Aborting a transformation
Time for action – aborting when there are too many errors
What just happened?
Aborting a transformation using the Abort step
Fixing captured errors
Time for action – treating errors that may appear
What just happened?
Treating rows coming to the error stream
Pop quiz – PDI error handling
Have a go hero – capturing errors while seeing who wins
Avoiding unexpected errors by validating data
Time for action – validating genres with a Regex Evaluation step
What just happened?
Validating data
Time for action – checking films file with the Data Validator
What just happened?
Defining simple validation rules using the Data Validator
Have a go hero – validating the football matches file
Cleansing data
Have a go hero – cleansing films data
Summary
8. Working with Databases
Introducing the Steel Wheels sample database
Connecting to the Steel Wheels database
Time for action – creating a connection with the Steel Wheels database
What just happened?
Connecting with Relational Database Management Systems
Pop quiz – defining database connections
Have a go hero – connecting to your own databases
Exploring the Steel Wheels database
Time for action – exploring the sample database
What just happened?
A brief word about SQL
Exploring any configured database with the PDI Database explorer
Have a go hero – exploring the sample data in depth
Have a go hero – exploring your own databases
Querying a database
Time for action – getting data about shipped orders
What just happened?
Getting data from the database with the Table input step
Using the SELECT statement for generating a new dataset
Making flexible queries by using parameters
Time for action – getting orders in a range of dates by using parameters
What just happened?
Adding parameters to your queries
Making flexible queries by using Kettle variables
Time for action – getting orders in a range of dates by using variables
What just happened?
Using Kettle variables in your queries
Pop quiz – database datatypes versus PDI datatypes
Have a go hero – querying the sample data
Sending data to a database
Time for action – loading a table with a list of manufacturers
What just happened?
Inserting new data into a database table with the Table output step
Inserting or updating data by using other PDI steps
Time for action – inserting new products or updating existent ones
What just happened?
Time for action – testing the update of existing products
What just happened?
Inserting or updating data with the Insert/Update step
Have a go hero – populating a films database
Have a go hero – creating the time dimension
Have a go hero – populating the products table
Pop quiz – Insert/Update step versus Table Output/Update steps
Pop quiz – filtering the first 10 rows
Eliminating data from a database
Time for action – deleting data about discontinued items
What just happened?
Deleting records of a database table with the Delete step
Have a go hero – deleting old orders
Summary
9. Performing Advanced Operations with Databases
Preparing the environment
Time for action – populating the Jigsaw database
What just happened?
Exploring the Jigsaw database model
Looking up data in a database
Doing simple lookups
Time for action – using a Database lookup step to create a list of products to buy
What just happened?
Looking up values in a database with the Database lookup step
Have a go hero – preparing the delivery of the products
Have a go hero – refining the transformation
Doing complex lookups
Time for action – using a Database join step to create a list of suggested products to buy
What just happened?
Joining data from the database to the stream data by using a Database join step
Have a go hero – rebuilding the list of customers
Introducing dimensional modeling
Loading dimensions with data
Time for action – loading a region dimension with a Combination lookup/update step
What just happened?
Time for action – testing the transformation that loads the region dimension
What just happened?
Describing data with dimensions
Loading Type I SCD with a Combination lookup/update step
Have a go hero – adding regions to the Region Dimension
Have a go hero – loading the manufacturers dimension
Have a go hero – loading a mini-dimension
Keeping a history of changes
Time for action – keeping a history of product changes with the Dimension lookup/update step
What just happened?
Time for action – testing the transformation that keeps a history of product changes
What just happened?
Keeping an entire history of data with a Type II slowly changing dimension
Loading Type II SCDs with the Dimension lookup/update step
Have a go hero – keeping a history just for the theme of a product
Have a go hero – loading a Type II SCD dimension
Pop quiz – loading slowly changing dimensions
Pop quiz – loading type III slowly changing dimensions
Summary
10. Creating Basic Task Flows
Introducing PDI jobs
Time for action – creating a simple hello world job
What just happened?
Executing processes with PDI jobs
Using Spoon to design and run jobs
Using the transformation job entry
Pop quiz – defining PDI jobs
Have a go hero – loading the dimension tables
Receiving arguments and parameters in a job
Time for action – customizing the hello world file with arguments and parameters
What just happened?
Using named parameters in jobs
Have a go hero – backing up your work
Running jobs from a terminal window
Time for action – executing the hello world job from a terminal window
What just happened?
Have a go hero – experiencing Kitchen
Using named parameters and command-line arguments in transformations
Time for action – calling the hello world transformation with fixed arguments and parameters
What just happened?
Have a go hero – saying hello again and again
Have a go hero – loading the time dimension from a job
Deciding between the use of a command-line argument and a named parameter
Have a go hero – analysing the use of arguments and named parameters
Running job entries under conditions
Time for action – sending a sales report and warning the administrator if something is wrong
What just happened?
Changing the flow of execution on the basis of conditions
Have a go hero – refining the sales report
Creating and using a file results list
Have a go hero – sharing your work
Summary
11. Creating Advanced Transformations and Jobs
Enhancing your processes with the use of variables
Time for action – updating a file with news about examinations by setting a variable with the name of the file
What just happened?
Setting variables inside a transformation
Have a go hero – enhancing the examination tutorial even more
Have a go hero – enhancing the jigsaw database update process
Have a go hero – executing the proper jigsaw database update process
Enhancing the design of your processes
Time for action – generating files with top scores
What just happened?
Pop quiz – using the Add Sequence step
Reusing part of your transformations
Time for action – calculating the top scores with a subtransformation
What just happened?
Creating and using subtransformations
Have a go hero – refining the subtransformation
Have a go hero – counting words more precisely (second version)
Creating a job as a process flow
Time for action – splitting the generation of top scores by copying and getting rows
What just happened?
Transferring data between transformations by using the copy/get rows mechanism
Have a go hero – modifying the flow
Nesting jobs
Time for action – generating the files with top scores by nesting jobs
What just happened?
Running a job inside another job with a job entry
Understanding the scope of variables
Pop quiz – deciding the scope of variables
Iterating jobs and transformations
Time for action – generating custom files by executing a transformation for every input row
What just happened?
Executing for each row
Have a go hero – processing several files at once
Have a go hero – building lists of products to buy
Have a go hero – e-mail students to let them know how they did
Summary
12. Developing and Implementing a Simple Datamart
Exploring the sales datamart
Deciding the level of granularity
Loading the dimensions
Time for action – loading dimensions for the sales datamart
What just happened?
Extending the sales datamart model
Have a go hero – loading the dimensions for the puzzles star model
Loading a fact table with aggregated data
Time for action – loading the sales fact table by looking up dimensions
What just happened?
Getting the information from the source with SQL queries
Translating the business keys into surrogate keys
Obtaining the surrogate key for a Type I SCD
Obtaining the surrogate key for a Type II SCD
Obtaining the surrogate key for the Junk dimension
Obtaining the surrogate key for the Time dimension
Pop quiz – modifying a star model and loading the star with PDI
Have a go hero – loading a puzzles fact table
Getting facts and dimensions together
Time for action – loading the fact table using a range of dates obtained from the command line
What just happened?
Time for action – loading the sales star
What just happened?
Have a go hero – enhancing the loading process of the sales fact table
Have a go hero – loading the puzzles sales star
Have a go hero – loading the facts once a month
Getting rid of administrative tasks
Time for action – automating the loading of the sales datamart
What just happened?
Have a go hero – creating a backup of your work automatically
Have a go hero – enhancing the automated process by sending an e-mail if an error occurs
Summary
13. Taking it Further
PDI best practices
Getting the most out of PDI
Extending Kettle with plugins
Have a go hero – listing the top 10 students by using the Head plugin step
Overcoming real world risks with some remote execution
Scaling out to overcome bigger risks
Pop quiz – remote execution and clustering
Integrating PDI and the Pentaho BI suite
PDI as a process action
PDI as a datasource
More about the Pentaho suite
PDI Enterprise Edition and Kettle Developer Support
Summary
A. Working with Repositories
Creating a repository
Time for action – creating a PDI repository
What just happened?
Creating repositories to store your transformations and jobs
Working with the repository storage system
Time for action – logging into a repository
What just happened?
Logging into a repository by using credentials
Defining repository user accounts
Creating transformations and jobs in repository folders
Creating database connections, partitions, servers, and clusters
Backing up and restoring a repository
Examining and modifying the contents of a repository with the Repository explorer
Migrating from a file-based system to a repository-based system and vice-versa
Summary
B. Pan and Kitchen: Launching Transformations and Jobs from the Command Line
Running transformations and jobs stored in files
Running transformations and jobs from a repository
Specifying command line options
Checking the exit code
Providing options when running Pan and Kitchen
Log details
Named parameters
Arguments
Variables
C. Quick Reference: Steps and Job Entries
Transformation steps
Job entries
D. Spoon Shortcuts
General shortcuts
Designing transformations and jobs
Grids
Repositories
E. Introducing PDI 4 Features
Agile BI
Visual improvements for designing transformations and jobs
Experiencing the mouse-over assistance
Time for action – creating a hop with the mouse-over assistance
What just happened?
Using the mouse-over assistance toolbar
Experiencing the sniff-testing feature
Experiencing the job drill-down feature
Experiencing even more visual changes
Enterprise features
Summary
F. Pop Quiz Answers
Chapter 1
PDI data sources
PDI prerequisites
PDI basics
Chapter 2
formatting data
Chapter 3
concatenating strings
Chapter 4
data movement (copying and distributing)
splitting a stream
Chapter 5
finding the seven errors
Chapter 6
using Kettle variables inside transformations
Chapter 7
PDI error handling
Chapter 8
defining database connections
database datatypes versus PDI datatypes
Insert/Update step versus Table Output/Update steps
filtering the first 10 rows
Chapter 9
loading slowly changing dimensions
loading type III slowly changing dimensions
Chapter 10
defining PDI jobs
Chapter 11
using the Add sequence step
deciding the scope of variables
Chapter 12
modifying a star model and loading the star with PDI
Chapter 13
remote execution and clustering
Index

Pentaho 3.2 Data Integration Beginner's Guide

Maria Carina Roldan

Pentaho 3.2 Data Integration Beginner's Guide

Copyright © 2010 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, Packt Publishing, nor its dealers or distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: April 2010

Production Reference: 1050410

Published by Packt Publishing Ltd.

32 Lincoln Road

Olton

Birmingham, B27 6PA, UK.

ISBN 978-1-847199-54-6

www.packtpub.com

Cover Image by Parag Kadam (<[email protected]>)

Credits

Author

María Carina Roldán

Reviewers

Jens Bleuel

Roland Bouman

Matt Casters

James Dixon

Will Gorman

Gretchen Moran

Acquisition Editor

Usha Iyer

Development Editor

Reshma Sundaresan

Technical Editors

Gaurav Datar

Rukhsana Khambatta

Copy Editor

Sanchari Mukherjee

Editorial Team Leader

Gagandeep Singh

Project Team Leader

Lata Basantani

Project Coordinator

Poorvi Nair

Proofreader

Sandra Hopper

Indexer

Rekha Nair

Graphics

Geetanjali Sawant

Production Coordinator

Shantanu Zagade

Cover Work

Shantanu Zagade

Foreword

If we look back at what has happened in the data integration market over the last 10 years, we can see a lot of change. In the first half of that decade there was an explosion in the number of data integration tools, and in the second half there was a big wave of consolidations. This consolidation wave put an ever growing amount of data integration power in the hands of only a few large, billion-dollar companies. For any person, company, or project in need of data integration, this meant either paying large amounts of money or hand-coding their solution.

During that exact same period, we saw web servers, programming languages, operating systems, and even relational databases turn into commodities in the ICT marketplace. This was driven, among other things, by the availability of open source software such as Apache, GNU, Linux, MySQL, and many others. For the ICT market, this meant that more services could be deployed at a lower cost. If you look closely at what has been going on in those last 10 years, you will notice that most companies increasingly deployed more ICT services to end-users. These services get more and more connected over an ever growing network. Pretty much anything, ranging from tiny mobile devices to huge cloud-based infrastructure, is being deployed, and all of those can contain data that is valuable to an organization.

The job of anyone who needs to integrate all this data is not easy. The complexity of information services technology usually increases exponentially with the number of systems involved. Because of this, integrating all these systems can be a daunting and scary task that is never complete. Any piece of code lives in what can be described as a software ecosystem that is always in a state of flux. As in nature, certain ecosystems evolve extremely fast while others change very slowly over time. However, as in nature, all ICT systems change. What is needed is another wave of commodification in the area of data integration and business intelligence in general. This is where Pentaho comes in.

Pentaho tries to provide answers to these problems by making the integration software available as open source, accessible, easy to use, and easy to maintain for users and developers alike. With every release of our software, we try to make things easier, better, and faster. However, even if things can be done through nice user interfaces, there is still a huge number of possibilities and options to choose from.

As the founder of the project I've always liked the fact that Kettle users had a lot of choice. Choice translates into creativity, and creativity often delivers good solutions that are comfortable to the person implementing them. However, this choice can be daunting to any beginning Kettle developer. With thousands of options to choose from, it can be very hard to get started.

This, above all, is the reason why I'm very happy to see this book come to life. It will be a great and indispensable help for everyone taking their first steps into the wonderful world of data integration with Kettle. As such, I hope you see this book as an open invitation to get started with Kettle in the wonderful world of data integration.

Matt Casters

Chief Data Integration at Pentaho

Kettle founder

The Kettle Project

Whether there is a migration to do, an ETL process to run, or a need for massively loading data into a database, you have several software tools, ranging from expensive and sophisticated to free open source and friendly ones, which help you accomplish the task.

Ten years ago, the scenario was clearly different. By 2000, Matt Casters, a Belgian business intelligence consultant, had been working for a while as a datawarehouse architect and administrator. As such, he was one of quite a number of people who, no matter whether the company they worked for was big or small, had to deal with the difficulties involved in bridging the gap between information technology and business needs. What made it even worse at that time was that ETL tools were prohibitively expensive and everything had to be crafted by hand. The last employer he worked for didn't think that writing a new ETL tool would be a good idea. This was one of the motivations for Matt to become an independent contractor and to start his own company. That was in June 2001.

At the end of that year, he told his wife that he was going to write a new piece of software for himself to do ETL tasks. It was going to take up some time here and there in the evenings and on weekends. Surprised, she asked how long it would take him to get it done. He replied that it would probably take five years, and that he would perhaps have something working in three.

Working on that started in early 2003. Matt's main goals for writing the software included learning about databases, ETL processes, and data warehousing. This would in turn improve his chances on a job market that was pretty volatile. Ultimately, it would allow him to work full time on the software.

Another important goal was to understand what the tool had to do. Matt wanted a scalable and parallel tool, and wanted to isolate rows of data as much as possible.

The last but not least goal was to pick the right technology to support the tool. The first idea was to build it on top of KDE, the popular Unix desktop environment. Trolltech, the people behind Qt, the core UI library of KDE, had released plans to create database drivers for popular databases. However, the lack of decent drivers for those databases drove Matt to change plans and use Java. He picked Java because he had some prior experience with it: he had written a Japanese Chess (Shogi) database program when Java 1.0 was released. To Sun's credit, this software still runs and is available at http://ibridge.be/shogi/.

After a year of development, the tool was capable of reading text files, reading from databases, and writing to databases, and it was very flexible. The experience with Java was not 100% positive though. The code had grown unstructured, crashes occurred all too often, and it was hard to get something going with the Java graphics library used at that time, the Abstract Window Toolkit (AWT); it looked bad and it was slow.

As for the library, Matt decided to start using the newly released Standard Widget Toolkit (SWT), which helped solve part of the problem. As for the rest, Kettle was a complete mess. It was time to ask for help. The help came in the form of Wim De Clercq, a senior enterprise Java architect, co-owner of Ixor (www.ixor.be), and also a friend of Matt's. At various intervals over the next few years, Wim involved himself in the project, giving Matt advice about good practices in Java programming. Listening to that advice meant performing massive amounts of code changes. As a consequence, it was not unusual to spend weekends doing nothing but refactoring code and fixing the thousands of errors that resulted. But, bit by bit, things kept going in the right direction.

At that same time, Matt also showed the results to his peers, colleagues, and other senior BI consultants to hear what they thought of Kettle. That was how he got in touch with the Flemish Traffic Centre (www.verkeerscentrum.be/verkeersinfo/kaart), where billions of rows of data had to be integrated from thousands of data sources all over Belgium. All of a sudden, he was being paid to deploy and improve Kettle to handle that job. The diversity of test cases at the traffic centre helped to improve Kettle dramatically. That was sometime in 2004, and Kettle was at version 1.2.

While working at the traffic centre, Matt also posted messages on Javaforge (www.javaforge.com) to let people know they could download a free copy of Kettle for their own use. He got a few reactions. Despite some of them being remarkably negative, most were positive. The most interesting response came from a nice guy called Jens Bleuel in Germany, who asked if it was possible to integrate third-party software into Kettle. In his specific case, he needed a connector to link Kettle with the German SAP software (www.sap.com). Kettle didn't have a plugin architecture, so Jens' question made Matt think about a plugin system, and that was the main motivation for developing version 2.0.

For various reasons, including the birth of Matt's son Sam and a lot of consultancy work, it took around a year to release Kettle version 2.0. It was a fairly complete release with advanced support for slowly changing dimensions and junk dimensions (Chapter 9 explains those concepts), the ability to connect to thirteen different databases, and, most importantly, support for plugins. Matt contacted Jens to let him know the news and Jens was really interested. It was a very memorable moment for Matt and Jens, as it took them only a few hours to get a new plugin going that read data from an SAP/R3 server. There was a lot of excitement, and they agreed to start promoting the sales of Kettle from the Kettle.be website and from Proratio (www.proratio.de), the company Jens worked for.

Those were days of improvements, requests, and people interested in the project. However, it became too much to handle. Doing development and sales all by themselves was no fun after a while. As such, Matt thought about open sourcing Kettle early in 2005, and by late summer he had made his decision. Jens and Proratio didn't mind, and the decision was final.

When they finally open sourced Kettle in December 2005, the response was massive. The downloadable package put up on Javaforge got downloaded around 35,000 times during the first week alone. The news spread all over the world pretty quickly.

What followed was a flood of messages, both private and on the forum. At its peak in March 2006, Matt got over 300 messages a day concerning Kettle.

In no time, he was answering questions like crazy, allowing people to join the development team, and working as a consultant at the same time. Added to this, the birth of his daughter Hannelore in February 2006 made it all too much to deal with.

Fortunately, good times came. While Matt was trying to handle all that, a discussion was taking place at the Pentaho forum (http://forums.pentaho.org/) concerning the ETL tool that Pentaho should support. They had selected Enhydra Octopus, a Java-based ETL software, but they didn't have a strong reliance on a specific tool.

While Jens was evaluating all sorts of open source BI packages, he came across that thread. Matt replied immediately, persuading the people at Pentaho to consider including Kettle. And he must have been convincing, because the answer came quickly and was positive. James Dixon, Pentaho founder and CTO, offered Kettle the possibility of becoming the premier and only ETL tool supported by Pentaho. Later on, Matt came in touch with one of the other Pentaho founders, Richard Daley, who offered him a job. That allowed Matt to focus full time on Kettle. Four years later, he's still happily working for Pentaho as chief architect for data integration, doing his best to deliver Kettle 4.0. Jens Bleuel, who has collaborated with Matt since the early versions, is now also part of the Pentaho team.

About the Author

María Carina was born in a small town in the Patagonia region of Argentina. She earned her Bachelor's degree in Computer Science at UNLP in La Plata and then moved to Buenos Aires, where she has lived since 1994, working in IT.

She has been working as a BI consultant for the last 10 years. At the beginning, she worked with the Cognos suite. However, over the last three years she has been dedicated full time to developing Pentaho BI solutions, both for local companies and for several Latin American companies, as well as, in recent months, for a French automotive company.

She is also an active contributor to the Pentaho community.

At present, she lives in Buenos Aires, Argentina, with her husband Adrián and children Camila and Nicolás.

Writing my first book in a foreign language while working a full-time job, not to mention bringing up two small kids, was definitely a big challenge. Now I can tell that it's not impossible.

I dedicate this book to my husband and kids; I'd like to thank them for all their support and tolerance over the last year. I'd also like to thank my colleagues and friends who gave me encouraging words throughout the writing process.

Special thanks to the people at Packt; working with them has been really pleasant.

I'd also like to thank the Pentaho community and developers for making Kettle the incredible tool it is. Thanks to the technical reviewers who, with their very critical eye, contributed to making this a book suited to its audience.

Finally, I'd like to thank Matt Casters who, despite his busy schedule, was willing to help me from the first moment he knew about this book.

About the Reviewers

Jens Bleuel is a Senior Consultant and Engineer at Pentaho. He is also working as a project leader, trainer, and product specialist in the services and support department. Before he joined Pentaho in mid 2007, he was a software developer and project leader, and his main business was data warehousing and architecture, along with designing and developing user-friendly tools. He studied business economics, attended a grammar school for electronics, and has been programming in a wide range of environments such as Assembler, C, Visual Basic, Delphi, .NET, and these days mainly Java. His customer focus is on the wholesale market and consumer goods industries. Jens is 40 years old and lives with his wife and two boys in Mainz, Germany (near the nice Rhine river). In his spare time, he practices Tai-Chi, Qigong, and photography.

Roland Bouman has been working in the IT industry since 1998, mostly as a database and web application developer. He has also worked for MySQL AB (later Sun Microsystems) as certification developer and as curriculum developer.

Roland mainly focuses on open source web technology, databases, and Business Intelligence. He's an active member of the MySQL and Pentaho communities and can often be found speaking at worldwide conferences and events such as the MySQL user conference, the O'Reilly Open Source conference (OSCON), and at Pentaho community events.

Roland is co-author of the MySQL 5.1 Cluster DBA Certification Study Guide (Vervante, ISBN: 595352502) and Pentaho Solutions: Business Intelligence and Data Warehousing with Pentaho and MySQL (Wiley, ISBN: 978-0-470-48432-6). He also writes on a regular basis for the Dutch Database Magazine (DBM).

Roland is @rolandbouman on Twitter and maintains a blog at http://rpbouman.blogspot.com/.

Matt Casters has been an independent senior BI consultant for almost two decades. In that period he led, designed, and implemented numerous data warehouses and BI solutions for large and small companies. In that capacity, he always had the need for ETL in some form or another. Almost out of pure necessity, he has been busy writing the ETL tool called Kettle (a.k.a. Pentaho Data Integration) for the past eight years. First, he developed the tool mostly on his own. Since the end of 2005 when Kettle was declared an open source technology, development took place with the help of a large community.

Since the Kettle project was acquired by Pentaho in early 2006, he has been Chief of Data Integration at Pentaho as the lead architect, head of development, and spokesperson for the Kettle community.

I would like to personally thank the complete community for their help in making Kettle the success it is today. In particular, I would like to thank Maria for taking the time to write this nice book as well as the many articles on the Pentaho wiki (for example, the Kettle tutorials), and her appreciated participation on the forum. Many thanks also go to my employer Pentaho, for their large investment in open source BI in general and Kettle in particular.

James Dixon is the Chief Geek and one of the co-founders of Pentaho Corporation—the leading commercial open source Business Intelligence company. He has worked in the business intelligence market since graduating in 1992 from Southampton University with a degree in Computer Science. He has served as Software Engineer, Development Manager, Engineering VP, and CTO at multiple business intelligence software companies. He regularly uses Pentaho Data Integration for internal projects and was involved in the architectural design of PDI V3.0.

He lives in Orlando, Florida, with his wife Tami and son Samuel.

I would like to thank my co-founders, my parents, and my wife Tami for all their support and tolerance of my odd working hours.

I would like to thank my son Samuel for all the opportunities he gives me to prove I'm not as clever as I think I am.

Will Gorman is an Engineering Team Lead at Pentaho. He works on a variety of Pentaho's products, including Reporting, Analysis, Dashboards, Metadata, and the BI Server. Will started his career at GE Research and earned his Masters degree in Computer Science at Rensselaer Polytechnic Institute in Troy, New York. Will is the author of Pentaho Reporting 3.5 for Java Developers (ISBN: 3193), published by Packt Publishing.

Gretchen Moran is a graduate of University of Wisconsin – Stevens Point with a Bachelor's degree in Computer Information Systems with a minor in Data Communications. Gretchen began her career as a corporate data warehouse developer in the insurance industry and joined Arbor Software/Hyperion Solutions in 1999 as a commercial developer for the Hyperion Analyzer and Web Analytics team. Gretchen has been a key player with Pentaho Corporation since its inception in 2004. As Community Leader and core developer, Gretchen managed the explosive growth of Pentaho's open source community for her first 2 years with the company. Gretchen has contributed to many of the Pentaho projects, including the Pentaho BI Server, Pentaho Data Integration, Pentaho Metadata Editor, Pentaho Reporting, Pentaho Charting, and others.

Thanks Doug, Anthony, Isabella and Baby Jack for giving me my favorite challenges and crowning achievements—being a wife and mom.

Preface

Pentaho Data Integration (aka Kettle) is an engine along with a suite of tools responsible for the processes of Extracting, Transforming, and Loading—better known as the ETL processes. PDI not only serves as an ETL tool, but it's also used for other purposes such as migrating data between applications or databases, exporting data from databases to flat files, data cleansing, and much more. PDI has an intuitive, graphical, drag-and-drop design environment, and its ETL capabilities are powerful. However, getting started with PDI can be difficult or confusing. This book provides the guidance needed to overcome that difficulty, covering the key features of PDI. Each chapter introduces new features, allowing you to gradually get involved with the tool.

By the end of the book, you will have not only experimented with all kinds of examples, but will also have built a basic but complete datamart with the help of PDI.

How to read this book

Although it is recommended that you read all the chapters, you don't need to. The book allows you to tailor the PDI learning process according to your particular needs.

The first four chapters, along with Chapter 7 and Chapter 10, cover the core concepts. If you don't know PDI and want to learn just the basics, reading those chapters would suffice. Besides, if you need to work with databases, you could include Chapter 8 in the roadmap.

If you already know the basics, you can improve your PDI knowledge by reading chapters 5, 6, and 11.

Finally, if you already know PDI and want to learn how to use it to load or maintain a datawarehouse or datamart, you will find all that you need in chapters 9 and 12.

While Chapter 13 is useful for anyone who is willing to take it further, all the appendices are valuable resources for anyone who reads this book.

What this book covers

Chapter 1, Getting started with Pentaho Data Integration serves as the most basic introduction to PDI, presenting the tool. The chapter includes instructions for installing PDI and gives you the opportunity to play with the graphical designer (Spoon). The chapter also includes instructions for installing a MySQL server.

Chapter 2, Getting Started with Transformations introduces one of the basic components of PDI—transformations. Then, it focuses on the explanation of how to work with files. It explains how to get data from simple input sources such as txt, csv, xml, and so on, do a preview of the data, and send the data back to any of these common output formats. The chapter also explains how to read command-line parameters and system information.

Chapter 3, Basic Data Manipulation explains the simplest and most commonly used ways of transforming data, including performing calculations, adding constants, counting, filtering, ordering, and looking for data.

Chapter 4, Controlling the Flow of Data explains different options that PDI offers to combine or split flows of data.

Chapter 5, Transforming Your Data with JavaScript Code and the JavaScript Step explains how JavaScript coding can help in the treatment of data. It shows why you need to code inside PDI, and explains in detail how to do it.

Chapter 6, Transforming the Row Set explains the ability of PDI to deal with some sophisticated problems, such as normalizing data from pivoted tables, in a simple fashion.

Chapter 7, Validating Data and Handling Errors explains the different options that PDI has to validate data, and how to treat the errors that may appear.

Chapter 8, Working with Databases explains how to use PDI to work with databases. The list of topics covered includes connecting to a database, previewing and getting data, and inserting, updating, and deleting data. As database knowledge is not presumed, the chapter also covers fundamental concepts of databases and the SQL language.

Chapter 9, Performing Advanced Operations with Databases explains how to perform advanced operations with databases, including those specially designed to load datawarehouses. A primer on datawarehouse concepts is also given in case you are not familiar with the subject.

Chapter 10, Creating Basic Task Flows serves as an introduction to processes in PDI. Through the creation of simple jobs, you will learn what jobs are and what they are used for.

Chapter 11, Creating Advanced Transformations and Jobs deals with advanced concepts that will allow you to build complex PDI projects. The list of covered topics includes nesting jobs, iterating on jobs and transformations, and creating subtransformations.

Chapter 12, Developing and Implementing a Simple Datamart presents a simple datamart project, and guides you to build the datamart by using all the concepts learned throughout the book.

Chapter 13, Taking it Further gives a list of best PDI practices and recommendations for going beyond.

Appendix A, Working with Repositories guides you step by step in the creation of a PDI database repository and then gives instructions to work with it.

Appendix B, Pan and Kitchen: Launching Transformations and Jobs from the Command Line is a quick reference for running transformations and jobs from the command line.

Appendix C, Quick Reference: Steps and Job Entries serves as a quick reference to steps and job entries used throughout the book.

Appendix D, Spoon Shortcuts is an extensive list of Spoon shortcuts useful for saving time when designing and running PDI jobs and transformations.

Appendix E, Introducing PDI 4 Features quickly introduces you to the architectural and functional features included in Kettle 4—the version that was under development while writing this book.

Appendix F, Pop Quiz Answers, contains answers to pop quiz questions.

What you need for this book

PDI is a multiplatform tool. This means no matter what your operating system is, you will be able to work with the tool. The only prerequisite is to have JVM 1.5 or a higher version installed. It is also useful to have Excel or Calc along with a nice text editor.

Having an Internet connection while reading is extremely useful as well. Several links are provided throughout the book that complement what is explained. Besides, there is the PDI forum where you may search for answers or post questions if you are stuck with something.

Who this book is for

This book is for software developers, database administrators, IT students, and everyone involved or interested in developing ETL solutions or, more generally, doing any kind of data manipulation. If you have never used PDI before, this will be a perfect book to start with.

You will find this book to be a good starting point if you are a database administrator, a data warehouse designer, an architect, or any person who is responsible for data warehouse projects and needs to load data into them.

You don't need to have any prior data warehouse or database experience to read this book. Fundamental database and data warehouse technical terms and concepts are explained in an easy-to-understand language.

Conventions

In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.

Code words in text are shown as follows: "You read the examination.txt file, and did some calculations to see how the students did."

New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in our text like this: "Edit the Sort rows step by double-clicking it, click the Get Fields button, and adjust the grid."

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.

To send us general feedback, simply drop an email to <[email protected]>, and mention the book title in the subject of your message.

If there is a book that you need and would like to see us publish, please send us a note in the SUGGEST A TITLE form on www.packtpub.com or email <[email protected]>.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Tip

Downloading the example code for the book

Visit http://www.packtpub.com/files/code/9546_Code.zip to directly download the example code.

The downloadable files contain instructions on how to use them.

Errata

Although we have taken every care to ensure the accuracy of our contents, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in text or code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration, and help us to improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/support, selecting your book, clicking on the let us know link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata added to any list of existing errata. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support.

Piracy

Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors, and our ability to bring you valuable content.

Questions

You can contact us at <[email protected]> if you are having a problem with any aspect of the book, and we will do our best to address it.

Chapter 1. Getting Started with Pentaho Data Integration

Pentaho Data Integration is an engine along with a suite of tools responsible for the processes of extracting, transforming, and loading—best known as the ETL processes. This book is meant to teach you how to use PDI.

In this chapter you will:

Learn what Pentaho Data Integration is
Install the software and start working with the PDI graphical designer
Install MySQL, a database engine that you will use when you start working with databases

Pentaho Data Integration and Pentaho BI Suite

Before introducing PDI, let's talk about Pentaho BI Suite. The Pentaho Business Intelligence Suite is a collection of software applications intended to create and deliver solutions for decision making. The main functional areas covered by the suite are:

Analysis: The analysis engine serves multidimensional analysis. It's provided by the Mondrian OLAP server and the JPivot library for navigation and exploring.
Reporting: The reporting engine allows designing, creating, and distributing reports in various known formats (HTML, PDF, and so on) from different kinds of sources. The reports created in Pentaho are based mainly on the JFreeReport library, but it's possible to integrate reports created with external reporting libraries such as Jasper Reports or BIRT.
Data Mining: Data mining is running data through algorithms in order to understand the business and do predictive analysis. Data mining is possible thanks to the Weka Project.
Dashboards: Dashboards are used to monitor and analyze Key Performance Indicators (KPIs). A set of tools incorporated into the BI Suite in the latest version allows users to create interesting dashboards, including graphs, reports, analysis views, and other Pentaho content, without much effort.
Data integration: Data integration is used to integrate scattered information from different sources (applications, databases, files) and make the integrated information available to the final user. Pentaho Data Integration—our main concern—is the engine that provides this functionality.

All this functionality can be used standalone as well as integrated. In order to run analysis, reports, and so on integrated as a suite, you have to use the Pentaho BI Platform. The platform has a solution engine, and offers critical services such as authentication, scheduling, security, and web services.

This set of software and services forms a complete BI Platform, which makes Pentaho Suite the world's leading open source Business Intelligence Suite.

Exploring the Pentaho Demo

Despite being out of the scope of this book, it's worth briefly introducing the Pentaho Demo. The Pentaho BI Platform Demo is a preconfigured installation that lets you explore several capabilities of the Pentaho platform. It includes sample reports, cubes, and dashboards for Steel Wheels. Steel Wheels is a fictional store that sells all kinds of scale replicas of vehicles.

The demo can be downloaded from http://sourceforge.net/projects/pentaho/files/. Under the Business Intelligence Server folder, look for the latest stable version. The file you have to download is named biserver-ce-3.5.2.stable.zip for Windows and biserver-ce-3.5.2.stable.tar.gz for other systems.

In the same folder you will find a file named biserver-getting_started-ce-3.5.0.pdf. The file is a guide that introduces you to the platform and gives you some guidance on how to install and run it. The guide even includes a mini tutorial on building a simple PDI input-output transformation.

Note

You can find more about Pentaho BI Suite at www.pentaho.org.

Pentaho Data Integration

Most of the Pentaho engines, including the engines mentioned earlier, were created as community projects and later adopted by Pentaho. The PDI engine is no exception—Pentaho Data Integration is the new denomination for the business intelligence tool born as Kettle.

Note

The name Kettle didn't come from the recursive acronym Kettle Extraction, Transportation, Transformation, and Loading Environment it has now, but from KDE Extraction, Transportation, Transformation, and Loading Environment, as the tool was planned to be written on top of KDE, as mentioned in the introduction of the book.

In April 2006 the Kettle project was acquired by the Pentaho Corporation and Matt Casters, Kettle's founder, also joined the Pentaho team as a Data Integration Architect.

When Pentaho announced the acquisition, James Dixon, the Chief Technology Officer, said:

We reviewed many alternatives for open source data integration, and Kettle clearly had the best architecture, richest functionality, and most mature user interface. The open architecture and superior technology of the Pentaho BI Platform and Kettle allowed us to deliver integration in only a few days, and make that integration available to the community.

By joining forces with Pentaho, Kettle benefited from a huge developer community, as well as from a company that would support the future of the project.

From that moment the tool has grown constantly. Every few months a new release is available, bringing users improvements in performance and existing functionality, new functionality, ease of use, and great changes in look and feel. The following is a timeline of the major events related to PDI since its acquisition by Pentaho:

June 2006: PDI 2.3 is released. Numerous developers had joined the project and there were bug fixes provided by people in various regions of the world. Among other changes, the version included enhancements for large scale environments and multilingual capabilities.
February 2007: Almost seven months after the last major revision, PDI 2.4 is released including remote execution and clustering support (more on this in Chapter 13), enhanced database support, and a single designer for the two main elements you design in Kettle—jobs and transformations.
May 2007: PDI 2.5 is released including many new features, the main feature being the advanced error handling.
November 2007: PDI 3.0 emerges totally redesigned. Its major library changed to gain massive performance. The look and feel also changed completely.
October 2008: PDI 3.1 comes with an easier-to-use tool, along with a lot of new functionalities as well.
April 2009: PDI 3.2 is released with a really large number of changes for a minor version—new functionality, visualization improvements, performance improvements, and a huge pile of bug fixes. The main change in this version was the incorporation of dynamic clustering (see Chapter 13 for details).
In 2010, PDI 4.0 will be released, delivering mostly improvements with regard to enterprise features such as version control.

Note

Most users still refer to PDI as Kettle, its former name. Therefore, the names PDI, Pentaho Data Integration, and Kettle will be used interchangeably throughout the book.

Using PDI in real world scenarios

Paying attention to its name, Pentaho Data Integration, you could think of PDI as a tool to integrate data.

If you look at its original name, K.E.T.T.L.E., then you must conclude that it is a tool used for ETL processes which, as you may know, are most frequently seen in data warehouse environments.

In fact, PDI not only serves as a data integrator or an ETL tool, but is such a powerful tool that it is common to see it used for those and for many other purposes. Here you have some examples.

Loading datawarehouses or datamarts

The loading of a datawarehouse or a datamart involves many steps, and there are many variants depending on business area or business rules. However, in every case, the process involves the following steps:

Extracting information from one or different databases, text files, and other sources. The extraction process may include the task of validating and discarding data that doesn't match expected patterns or rules.
Transforming the obtained data to meet the business and technical needs required on the target. Transformation implies tasks such as converting data types, doing some calculations, filtering irrelevant data, and summarizing.
Loading the transformed data into the target database. Depending on the requirements, the loading may overwrite the existing information, or may add new information each time it is executed.
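
PDI lets you design these stages graphically with steps and hops, so you never have to write this kind of code yourself. Purely to make the three stages concrete, here is a minimal sketch of them in plain Python; the file name, field names, rules, and target table (sales.csv, amount, product, sales_summary, dw.db) are hypothetical and only serve as an illustration.

import csv
import sqlite3

# Extract: read raw rows from a source file (hypothetical sales.csv).
with open("sales.csv", newline="") as src:
    rows = list(csv.DictReader(src))

# Transform: convert data types, discard rows that break the rules,
# filter irrelevant data, and summarize.
totals = {}
for row in rows:
    try:
        amount = float(row["amount"])      # data type conversion
    except (ValueError, KeyError):
        continue                           # discard data that doesn't match the rules
    if amount <= 0:
        continue                           # filter irrelevant data
    product = row.get("product", "unknown")
    totals[product] = totals.get(product, 0.0) + amount  # summarize

# Load: write the transformed data into the target database (SQLite here).
conn = sqlite3.connect("dw.db")
conn.execute("CREATE TABLE IF NOT EXISTS sales_summary (product TEXT, total REAL)")
conn.execute("DELETE FROM sales_summary")  # overwrite the existing information
conn.executemany("INSERT INTO sales_summary VALUES (?, ?)", totals.items())
conn.commit()
conn.close()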

Kettle comes ready to do every stage of this loading process. The following sample screenshot shows a simple ETL designed with Kettle:

Integrating data

Imagine two similar companies that need to merge their databases in order to have a unified view of the data, or a single company that has to combine information from a main ERP application and a CRM application, though they're not connected. These are just two of hundreds of examples where data integration is needed. Integrating data is not just a matter of gathering and mixing data; some conversion, validation, and transport of data has to be done. Kettle is meant to do all those tasks.

Data cleansing

Why do we need data to be correct and accurate? There are many reasons—for the efficiency of business, to generate trusted conclusions in data mining or statistical studies, to succeed when integrating data, and so on. Data cleansing is about ensuring that the data is correct and precise. This can be ensured by verifying if the data meets certain rules, discarding or correcting those that don't follow the expected pattern, setting default values for missing data, eliminating information that is duplicated, normalizing data to conform to minimum and maximum values, and so on—tasks that Kettle makes possible, thanks to its vast set of transformation and validation capabilities.
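
As a rough illustration of those rules outside PDI, the following sketch applies a few of them in plain Python: checking that a record meets a rule, setting a default for missing data, normalizing a value to a range, and removing duplicates. The field names and bounds (customer_id, country, an age between 0 and 120) are invented for this example and are not taken from the book.

def cleanse(records):
    """Apply a few typical cleansing rules to a list of dicts."""
    seen = set()
    clean = []
    for rec in records:
        # Verify the data meets a rule: discard records without a customer id.
        if not rec.get("customer_id"):
            continue
        # Set a default value for missing data.
        rec.setdefault("country", "unknown")
        # Normalize data to conform to minimum and maximum values (0 to 120 here).
        rec["age"] = min(max(rec.get("age", 0), 0), 120)
        # Eliminate information that is duplicated.
        if rec["customer_id"] in seen:
            continue
        seen.add(rec["customer_id"])
        clean.append(rec)
    return clean

sample = [
    {"customer_id": "C1", "age": 34},
    {"customer_id": "C1", "age": 34},                    # duplicate record
    {"customer_id": "", "age": 28},                      # fails the rule, discarded
    {"customer_id": "C2", "age": 300, "country": "AR"},  # age out of range
]
print(cleanse(sample))

In Kettle, each of these rules typically maps to a purpose-built step, such as Filter rows or the Data Validator covered in Chapter 7, rather than to hand-written code.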

Migrating information