Pentaho 3.2 Data Integration: Beginner's Guide

María Carina Roldán

Description

Pentaho Data Integration (a.k.a. Kettle) is a full-featured open source ETL (Extract, Transform, and Load) solution. Although PDI is a feature-rich tool, effectively capturing, manipulating, cleansing, transferring, and loading data can get complicated. This book is full of practical examples that will help you to take advantage of Pentaho Data Integration's graphical, drag-and-drop design environment. You will quickly get started with Pentaho Data Integration by following the step-by-step guidance in this book. The useful tips in this book will encourage you to exploit powerful features of Pentaho Data Integration and perform ETL operations with ease. Starting with the installation of the PDI software, this book will teach you all the key PDI concepts. Each chapter introduces new features, allowing you to gradually get involved with the tool. First, you will learn to work with plain files, and to do all kinds of data manipulation. Then, the book gives you a primer on databases and teaches you how to work with databases inside PDI. Not only that, you'll be given an introduction to data warehouse concepts and you will learn to load data in a data warehouse. After that, you will learn to implement simple and complex processes. Once you've learned all the basics, you will build a simple datamart that will serve to reinforce all the concepts learned through the book.



Table of Contents

Pentaho 3.2 Data Integration Beginner's Guide
Credits
Foreword
The Kettle Project
About the Author
About the Reviewers
Preface
How to read this book
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Errata
Piracy
Questions
1. Getting Started with Pentaho Data Integration
Pentaho Data Integration and Pentaho BI Suite
Exploring the Pentaho Demo
Pentaho Data Integration
Using PDI in real world scenarios
Loading datawarehouses or datamarts
Integrating data
Data cleansing
Migrating information
Exporting data
Integrating PDI using Pentaho BI
Pop quiz – PDI data sources
Installing PDI
Time for action – installing PDI
What just happened?
Pop quiz – PDI prerequisites
Launching the PDI graphical designer: Spoon
Time for action – starting and customizing Spoon
What just happened?
Spoon
Setting preferences in the Options window
Storing transformations and jobs in a repository
Creating your first transformation
Time for action – creating a hello world transformation
What just happened?
Directing the Kettle engine with transformations
Exploring the Spoon interface
Viewing the transformation structure
Running and previewing the transformation
Time for action – running and previewing the hello_world transformation
What just happened?
Previewing the results in the Execution Results window
Pop quiz – PDI basics
Installing MySQL
Time for action – installing MySQL on Windows
What just happened?
Time for action – installing MySQL on Ubuntu
What just happened?
Summary
2. Getting Started with Transformations
Reading data from files
Time for action – reading results of football matches from files
What just happened?
Input files
Input steps
Reading several files at once
Time for action – reading all your files at a time using a single Text file input step
What just happened?
Time for action – reading all your files at a time using a single Text file input step and regular expressions
What just happened?
Regular expressions
Troubleshooting reading files
Grids
Have a go hero – explore your own files
Sending data to files
Time for action – sending the results of matches to a plain file
What just happened?
Output files
Output steps
Some data definitions
Rowset
Streams
The Select values step
Have a go hero – extending your transformations by writing output files
Getting system information
Time for action – updating a file with news about examinations
What just happened?
Getting information by using Get System Info step
Data types
Date fields
Numeric fields
Running transformations from a terminal window
Time for action – running the examination transformation from a terminal window
What just happened?
Have a go hero – using different date formats
Have a go hero – formatting 99.55
Pop quiz – formatting data
XML files
Time for action – getting data from an XML file with information about countries
What just happened?
What is XML
PDI transformation files
Getting data from XML files
XPath
Configuring the Get data from XML step
Kettle variables
How and when you can use variables
Have a go hero – exploring XML files
Have a go hero – enhancing the output countries file
Have a go hero – documenting your work
Summary
3. Basic Data Manipulation
Basic calculations
Time for action – reviewing examinations by using the Calculator step
What just happened?
Adding or modifying fields by using different PDI steps
The Calculator step
The Formula step
Time for action – reviewing examinations by using the Formula step
What just happened?
Have a go hero – listing students and their examinations results
Pop quiz – concatenating strings
Calculations on groups of rows
Time for action – calculating World Cup statistics by grouping data
What just happened?
Group by step
Have a go hero – calculating statistics for the examinations
Have a go hero – listing the languages spoken by country
Filtering
Time for action – counting frequent words by filtering
What just happened?
Filtering rows using the Filter rows step
Have a go hero – playing with filters
Have a go hero – counting words and discarding those that are commonly used
Looking up data
Time for action – finding out which language people speak
What just happened?
The Stream lookup step
Have a go hero – counting words more precisely
Summary
4. Controlling the Flow of Data
Splitting streams
Time for action – browsing new PDI features by copying a dataset
What just happened?
Copying rows
Have a go hero – recalculating statistics
Distributing rows
Time for action – assigning tasks by distributing
What just happened?
Pop quiz – data movement (copying and distributing)
Splitting the stream based on conditions
Time for action – assigning tasks by filtering priorities with the Filter rows step
What just happened?
PDI steps for splitting the stream based on conditions
Time for action – assigning tasks by filtering priorities with the Switch/Case step
What just happened?
Have a go hero – listing languages and countries
Pop quiz – splitting a stream
Merging streams
Time for action – gathering progress and merging all together
What just happened?
PDI options for merging streams
Time for action – giving priority to Bouchard by using Append Stream
What just happened?
Have a go hero – sorting and merging all tasks
Have a go hero – trying to find missing countries
Summary
5. Transforming Your Data with JavaScript Code and the JavaScript Step
Doing simple tasks with the JavaScript step
Time for action – calculating scores with JavaScript
What just happened?
Using the JavaScript language in PDI
Inserting JavaScript code using the Modified Java Script Value step
Adding fields
Modifying fields
Turning on the compatibility switch
Have a go hero – adding and modifying fields to the contest data
Testing your code
Time for action – testing the calculation of averages
What just happened?
Testing the script using the Test script button
Have a go hero – testing the new calculation of the average
Enriching the code
Time for action – calculating flexible scores by using variables
What just happened?
Using named parameters
Using the special Start, Main, and End scripts
Using transformation predefined constants
Pop quiz – finding the 7 errors
Have a go hero – keeping the top 10 performances
Have a go hero – calculating scores with Java code
Reading and parsing unstructured files
Time for action – changing a list of house descriptions with JavaScript
What just happened?
Looking at previous rows
Have a go hero – enhancing the houses file
Have a go hero – fill gaps in the contest file
Avoiding coding by using purpose-built steps
Have a go hero – creating alternative solutions
Summary
6. Transforming the Row Set
Converting rows to columns
Time for action – enhancing a films file by converting rows to columns
What just happened?
Converting row data to column data by using the Row denormalizer step
Have a go hero – houses revisited
Aggregating data with a Row denormalizer step
Time for action – calculating total scores by performances by country
What just happened?
Using Row denormalizer for aggregating data
Have a go hero – calculating scores by skill by continent
Normalizing data
Time for action – enhancing the matches file by normalizing the dataset
What just happened?
Modifying the dataset with a Row Normalizer step
Summarizing the PDI steps that operate on sets of rows
Have a go hero – verifying the benefits of normalization
Have a go hero – normalizing the Films file
Have a go hero – calculating scores by judge
Generating a custom time dimension dataset by using Kettle variables
Time for action – creating the time dimension dataset
What just happened?
Getting variables
Time for action – getting variables for setting the default starting date
What just happened?
Using the Get Variables step
Have a go hero – enhancing the time dimension
Pop quiz – using Kettle variables inside transformations
Summary
7. Validating Data and Handling Errors
Capturing errors
Time for action – capturing errors while calculating the age of a film
What just happened?
Using PDI error handling functionality
Aborting a transformation
Time for action – aborting when there are too many errors
What just happened?
Aborting a transformation using the Abort step
Fixing captured errors
Time for action – treating errors that may appear
What just happened?
Treating rows coming to the error stream
Pop quiz – PDI error handling
Have a go hero – capturing errors while seeing who wins
Avoiding unexpected errors by validating data
Time for action – validating genres with a Regex Evaluation step
What just happened?
Validating data
Time for action – checking films file with the Data Validator
What just happened?
Defining simple validation rules using the Data Validator
Have a go hero – validating the football matches file
Cleansing data
Have a go hero – cleansing films data
Summary
8. Working with Databases
Introducing the Steel Wheels sample database
Connecting to the Steel Wheels database
Time for action – creating a connection with the Steel Wheels database
What just happened?
Connecting with Relational Database Management Systems
Pop quiz – defining database connections
Have a go hero – connecting to your own databases
Exploring the Steel Wheels database
Time for action – exploring the sample database
What just happened?
A brief word about SQL
Exploring any configured database with the PDI Database explorer
Have a go hero – exploring the sample data in depth
Have a go hero – exploring your own databases
Querying a database
Time for action – getting data about shipped orders
What just happened?
Getting data from the database with the Table input step
Using the SELECT statement for generating a new dataset
Making flexible queries by using parameters
Time for action – getting orders in a range of dates by using parameters
What just happened?
Adding parameters to your queries
Making flexible queries by using Kettle variables
Time for action – getting orders in a range of dates by using variables
What just happened?
Using Kettle variables in your queries
Pop quiz – database datatypes versus PDI datatypes
Have a go hero – querying the sample data
Sending data to a database
Time for action – loading a table with a list of manufacturers
What just happened?
Inserting new data into a database table with the Table output step
Inserting or updating data by using other PDI steps
Time for action – inserting new products or updating existent ones
What just happened?
Time for action – testing the update of existing products
What just happened?
Inserting or updating data with the Insert/Update step
Have a go hero – populating a films database
Have a go hero – creating the time dimension
Have a go hero – populating the products table
Pop quiz – Insert/Update step versus Table Output/Update steps
Pop quiz – filtering the first 10 rows
Eliminating data from a database
Time for action – deleting data about discontinued items
What just happened?
Deleting records of a database table with the Delete step
Have a go hero – deleting old orders
Summary
9. Performing Advanced Operations with Databases
Preparing the environment
Time for action – populating the Jigsaw database
What just happened?
Exploring the Jigsaw database model
Looking up data in a database
Doing simple lookups
Time for action – using a Database lookup step to create a list of products to buy
What just happened?
Looking up values in a database with the Database lookup step
Have a go hero – preparing the delivery of the products
Have a go hero – refining the transformation
Doing complex lookups
Time for action – using a Database join step to create a list of suggested products to buy
What just happened?
Joining data from the database to the stream data by using a Database join step
Have a go hero – rebuilding the list of customers
Introducing dimensional modeling
Loading dimensions with data
Time for action – loading a region dimension with a Combination lookup/update step
What just happened?
Time for action – testing the transformation that loads the region dimension
What just happened?
Describing data with dimensions
Loading Type I SCD with a Combination lookup/update step
Have a go hero – adding regions to the Region Dimension
Have a go hero – loading the manufacturers dimension
Have a go hero – loading a mini-dimension
Keeping a history of changes
Time for action – keeping a history of product changes with the Dimension lookup/update step
What just happened?
Time for action – testing the transformation that keeps a history of product changes
What just happened?
Keeping an entire history of data with a Type II slowly changing dimension
Loading Type II SCDs with the Dimension lookup/update step
Have a go hero – keeping a history just for the theme of a product
Have a go hero – loading a Type II SCD dimension
Pop quiz – loading slowly changing dimensions
Pop quiz – loading type III slowly changing dimensions
Summary
10. Creating Basic Task Flows
Introducing PDI jobs
Time for action – creating a simple hello world job
What just happened?
Executing processes with PDI jobs
Using Spoon to design and run jobs
Using the transformation job entry
Pop quiz – defining PDI jobs
Have a go hero – loading the dimension tables
Receiving arguments and parameters in a job
Time for action – customizing the hello world file with arguments and parameters
What just happened?
Using named parameters in jobs
Have a go hero – backing up your work
Running jobs from a terminal window
Time for action – executing the hello world job from a terminal window
What just happened?
Have a go hero – experiencing Kitchen
Using named parameters and command-line arguments in transformations
Time for action – calling the hello world transformation with fixed arguments and parameters
What just happened?
Have a go hero – saying hello again and again
Have a go hero – loading the time dimension from a job
Deciding between the use of a command-line argument and a named parameter
Have a go hero – analysing the use of arguments and named parameters
Running job entries under conditions
Time for action – sending a sales report and warning the administrator if something is wrong
What just happened?
Changing the flow of execution on the basis of conditions
Have a go hero – refining the sales report
Creating and using a file results list
Have a go hero – sharing your work
Summary
11. Creating Advanced Transformations and Jobs
Enhancing your processes with the use of variables
Time for action – updating a file with news about examinations by setting a variable with the name of the file
What just happened?
Setting variables inside a transformation
Have a go hero – enhancing the examination tutorial even more
Have a go hero – enhancing the jigsaw database update process
Have a go hero – executing the proper jigsaw database update process
Enhancing the design of your processes
Time for action – generating files with top scores
What just happened?
Pop quiz – using the Add Sequence step
Reusing part of your transformations
Time for action – calculating the top scores with a subtransformation
What just happened?
Creating and using subtransformations
Have a go hero – refining the subtransformation
Have a go hero – counting words more precisely (second version)
Creating a job as a process flow
Time for action – splitting the generation of top scores by copying and getting rows
What just happened?
Transferring data between transformations by using the copy/get rows mechanism
Have a go hero – modifying the flow
Nesting jobs
Time for action – generating the files with top scores by nesting jobs
What just happened?
Running a job inside another job with a job entry
Understanding the scope of variables
Pop quiz – deciding the scope of variables
Iterating jobs and transformations
Time for action – generating custom files by executing a transformation for every input row
What just happened?
Executing for each row
Have a go hero – processing several files at once
Have a go hero – building lists of products to buy
Have a go hero – e-mail students to let them know how they did
Summary
12. Developing and Implementing a Simple Datamart
Exploring the sales datamart
Deciding the level of granularity
Loading the dimensions
Time for action – loading dimensions for the sales datamart
What just happened?
Extending the sales datamart model
Have a go hero – loading the dimensions for the puzzles star model
Loading a fact table with aggregated data
Time for action – loading the sales fact table by looking up dimensions
What just happened?
Getting the information from the source with SQL queries
Translating the business keys into surrogate keys
Obtaining the surrogate key for a Type I SCD
Obtaining the surrogate key for a Type II SCD
Obtaining the surrogate key for the Junk dimension
Obtaining the surrogate key for the Time dimension
Pop quiz – modifying a star model and loading the star with PDI
Have a go hero – loading a puzzles fact table
Getting facts and dimensions together
Time for action – loading the fact table using a range of dates obtained from the command line
What just happened?
Time for action – loading the sales star
What just happened?
Have a go hero – enhancing the loading process of the sales fact table
Have a go hero – loading the puzzles sales star
Have a go hero – loading the facts once a month
Getting rid of administrative tasks
Time for action – automating the loading of the sales datamart
What just happened?
Have a go hero – creating a backup of your work automatically
Have a go hero – enhancing the automated process by sending an e-mail if an error occurs
Summary
13. Taking it Further
PDI best practices
Getting the most out of PDI
Extending Kettle with plugins
Have a go hero – listing the top 10 students by using the Head plugin step
Overcoming real world risks with some remote execution
Scaling out to overcome bigger risks
Pop quiz – remote execution and clustering
Integrating PDI and the Pentaho BI suite
PDI as a process action
PDI as a datasource
More about the Pentaho suite
PDI Enterprise Edition and Kettle Developer Support
Summary
A. Working with Repositories
Creating a repository
Time for action – creating a PDI repository
What just happened?
Creating repositories to store your transformations and jobs
Working with the repository storage system
Time for action – logging into a repository
What just happened?
Logging into a repository by using credentials
Defining repository user accounts
Creating transformations and jobs in repository folders
Creating database connections, partitions, servers, and clusters
Backing up and restoring a repository
Examining and modifying the contents of a repository with the Repository explorer
Migrating from a file-based system to a repository-based system and vice-versa
Summary
B. Pan and Kitchen: Launching Transformations and Jobs from the Command Line
Running transformations and jobs stored in files
Running transformations and jobs from a repository
Specifying command line options
Checking the exit code
Providing options when running Pan and Kitchen
Log details
Named parameters
Arguments
Variables
C. Quick Reference: Steps and Job Entries
Transformation steps
Job entries
D. Spoon Shortcuts
General shortcuts
Designing transformations and jobs
Grids
Repositories
E. Introducing PDI 4 Features
Agile BI
Visual improvements for designing transformations and jobs
Experiencing the mouse-over assistance
Time for action – creating a hop with the mouse-over assistance
What just happened?
Using the mouse-over assistance toolbar
Experiencing the sniff-testing feature
Experiencing the job drill-down feature
Experiencing even more visual changes
Enterprise features
Summary
F. Pop Quiz Answers
Chapter 1
PDI data sources
PDI prerequisites
PDI basics
Chapter 2
formatting data
Chapter 3
concatenating strings
Chapter 4
data movement (copying and distributing)
splitting a stream
Chapter 5
finding the seven errors
Chapter 6
using Kettle variables inside transformations
Chapter 7
PDI error handling
Chapter 8
defining database connections
database datatypes versus PDI datatypes
Insert/Update step versus Table Output/Update steps
filtering the first 10 rows
Chapter 9
loading slowly changing dimensions
loading type III slowly changing dimensions
Chapter 10
defining PDI jobs
Chapter 11
using the Add sequence step
deciding the scope of variables
Chapter 12
modifying a star model and loading the star with PDI
Chapter 13
remote execution and clustering
Index

Pentaho 3.2 Data Integration Beginner's Guide

Maria Carina Roldan

Pentaho 3.2 Data Integration Beginner's Guide

Copyright © 2010 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, Packt Publishing, nor its dealers or distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: April 2010

Production Reference: 1050410

Published by Packt Publishing Ltd.

32 Lincoln Road

Olton

Birmingham, B27 6PA, UK.

ISBN 978-1-847199-54-6

www.packtpub.com

Cover Image by Parag Kadam (<[email protected]>)

Credits

Author

María Carina Roldán

Reviewers

Jens Bleuel

Roland Bouman

Matt Casters

James Dixon

Will Gorman

Gretchen Moran

Acquisition Editor

Usha Iyer

Development Editor

Reshma Sundaresan

Technical Editors

Gaurav Datar

Rukhsana Khambatta

Copy Editor

Sanchari Mukherjee

Editorial Team Leader

Gagandeep Singh

Project Team Leader

Lata Basantani

Project Coordinator

Poorvi Nair

Proofreader

Sandra Hopper

Indexer

Rekha Nair

Graphics

Geetanjali Sawant

Production Coordinator

Shantanu Zagade

Cover Work

Shantanu Zagade

Foreword

If we look back at what has happened in the data integration market over the last 10 years, we can see a lot of change. In the first half of that decade there was an explosion in the number of data integration tools, and in the second half there was a big wave of consolidations. This consolidation wave put an ever growing amount of data integration power in the hands of only a few large, billion-dollar companies. For any person, company, or project in need of data integration, this meant either paying large amounts of money or hand-coding their solution.

During that exact same period, we saw web servers, programming languages, operating systems, and even relational databases turn into commodities in the ICT marketplace. This was driven, among other things, by the availability of open source software such as Apache, GNU, Linux, MySQL, and many others. For the ICT market, this meant that more services could be deployed at a lower cost. If you look closely at what has been going on in those last 10 years, you will notice that most companies increasingly deployed more ICT services to end-users. These services get more and more connected over an ever growing network. Pretty much anything, ranging from tiny mobile devices to huge cloud-based infrastructure, is being deployed, and all of those can contain data that is valuable to an organization.

The job of anyone who needs to integrate all this data is not easy. The complexity of information services technology usually increases exponentially with the number of systems involved. Because of this, integrating all these systems can be a daunting and scary task that is never complete. Any piece of code lives in what can be described as a software ecosystem that is always in a state of flux. As in nature, certain ecosystems evolve extremely fast while others change very slowly over time. However, as in nature, all ICT systems change. What is needed is another wave of commodification in the area of data integration and business intelligence in general. This is where Pentaho comes in.

Pentaho tries to provide answers to these problems by making the integration software available as open source, accessible, easy to use, and easy to maintain for users and developers alike. With every release of our software, we try to make things easier, better, and faster. However, even if things can be done through nice user interfaces, there is still a huge number of possibilities and options to choose from.

As the founder of the project I've always liked the fact that Kettle users had a lot of choice. Choice translates into creativity, and creativity often delivers good solutions that are comfortable to the person implementing them. However, this choice can be daunting to any beginning Kettle developer. With thousands of options to choose from, it can be very hard to get started.

This, above all, is the reason why I'm very happy to see this book come to life. It will be a great and indispensable help for everyone taking their first steps into the wonderful world of data integration with Kettle. As such, I hope you see this book as an open invitation to get started with Kettle in the wonderful world of data integration.

Matt Casters

Chief Data Integration at Pentaho

Kettle founder

The Kettle Project

Whether there is a migration to do, an ETL process to run, or a need for massively loading data into a database, you have several software tools, ranging from expensive and sophisticated to free open source and friendly ones, which help you accomplish the task.

Ten years ago, the scenario was clearly different. By 2000, Matt Casters, a Belgian business intelligence consultant, had been working for a while as a datawarehouse architect and administrator. As such, he was one of quite a number of people who, no matter whether the company they worked for was big or small, had to deal with the difficulties involved in bridging the gap between information technology and business needs. What made it even worse at that time was that ETL tools were prohibitively expensive and everything had to be crafted by hand. The last employer he worked for didn't think that writing a new ETL tool would be a good idea. This was one of the motivations for Matt to become an independent contractor and to start his own company. That was in June 2001.

At the end of that year, he told his wife that he was going to write a new piece of software for himself to do ETL tasks. It was going to take up some time here and there in the evenings and on weekends. Surprised, she asked how long it would take him to get it done. He replied that it would probably take five years, and that he would perhaps have something working in three.

Working on that started in early 2003. Matt's main goals for writing the software included learning about databases, ETL processes, and data warehousing. This would in turn improve his chances on a job market that was pretty volatile. Ultimately, it would allow him to work full time on the software.

Another important goal was to understand what the tool had to do. Matt wanted a scalable and parallel tool, and wanted to isolate rows of data as much as possible.

The last but not least goal was to pick the right technology to support the tool. The first idea was to build it on top of KDE, the popular Unix desktop environment. Trolltech, the people behind Qt, the core UI library of KDE, had released plans to create database drivers for popular databases. However, the lack of decent drivers for those databases drove Matt to change plans and use Java. He picked Java because he had some prior experience with it: he had written a Japanese Chess (Shogi) database program when Java 1.0 was released. To Sun's credit, this software still runs and is available at http://ibridge.be/shogi/.

After a year of development, the tool was capable of reading text files, reading from databases, and writing to databases, and it was very flexible. The experience with Java was not 100% positive though. The code had grown unstructured, crashes occurred all too often, and it was hard to get something going with the Java graphics library used at that time, the Abstract Window Toolkit (AWT); it looked bad and it was slow.

As for the library, Matt decided to start using the newly released Standard Widget Toolkit (SWT), which helped solve part of the problem. As for the rest, Kettle was a complete mess. It was time to ask for help. The help came in the form of Wim De Clercq, a senior enterprise Java architect, co-owner of Ixor (www.ixor.be), and also a friend of Matt's. At various intervals over the next few years, Wim involved himself in the project, giving Matt advice about good practices in Java programming. Listening to that advice meant performing massive amounts of code changes. As a consequence, it was not unusual to spend weekends doing nothing but refactoring code and fixing the thousands of errors that resulted. But, bit by bit, things kept going in the right direction.

At that same time, Matt also showed the results to his peers, colleagues, and other senior BI consultants to hear what they thought of Kettle. That was how he got in touch with the Flemish Traffic Centre (www.verkeerscentrum.be/verkeersinfo/kaart), where billions of rows of data had to be integrated from thousands of data sources all over Belgium. All of a sudden, he was being paid to deploy and improve Kettle to handle that job. The diversity of test cases at the traffic centre helped to improve Kettle dramatically. That was sometime in 2004, and Kettle was at version 1.2.

While working at the traffic centre, Matt also posted messages on Javaforge (www.javaforge.com) to let people know they could download a free copy of Kettle for their own use. He got a few reactions. Despite some of them being remarkably negative, most were positive. The most interesting response came from a nice guy called Jens Bleuel in Germany, who asked if it was possible to integrate third-party software into Kettle. In his specific case, he needed a connector to link Kettle with the German SAP software (www.sap.com). Kettle didn't have a plugin architecture, so Jens' question made Matt think about a plugin system, and that was the main motivation for developing version 2.0.

For various reasons, including the birth of Matt's son Sam and a lot of consultancy work, it took around a year to release Kettle version 2.0. It was a fairly complete release with advanced support for slowly changing dimensions and junk dimensions (Chapter 9 explains those concepts), the ability to connect to thirteen different databases, and, most importantly, support for plugins. Matt contacted Jens to let him know the news and Jens was really interested. It was a very memorable moment for Matt and Jens, as it took them only a few hours to get a new plugin going that read data from an SAP/R3 server. There was a lot of excitement, and they agreed to start promoting the sales of Kettle from the Kettle.be website and from Proratio (www.proratio.de), the company Jens worked for.

Those were days of improvements, requests, and people interested in the project. However, it became too much to handle. Doing development and sales all by themselves was no fun after a while. As such, Matt thought about open sourcing Kettle early in 2005, and by late summer he had made his decision. Jens and Proratio didn't mind, and the decision was final.

When they finally open sourced Kettle in December 2005, the response was massive. The downloadable package put up on Javaforge got downloaded around 35,000 times during the first week alone. The news spread all over the world pretty quickly.

What followed was a flood of messages, both private and on the forum. At its peak in March 2006, Matt got over 300 messages a day concerning Kettle.

In no time, he was answering questions like crazy, allowing people to join the development team, and working as a consultant at the same time. Added to this, the birth of his daughter Hannelore in February 2006 made it all too much to deal with.

Fortunately, good times came. While Matt was trying to handle all that, a discussion was taking place at the Pentaho forum (http://forums.pentaho.org/) concerning the ETL tool that Pentaho should support. They had selected Enhydra Octopus, a Java-based ETL software, but they didn't have a strong reliance on a specific tool.

While Jens was evaluating all sorts of open source BI packages, he came across that thread. Matt replied immediately, persuading the people at Pentaho to consider including Kettle. And he must have been convincing, because the answer came quickly and was positive. James Dixon, Pentaho founder and CTO, offered Kettle the possibility of becoming the premier and only ETL tool supported by Pentaho. Later on, Matt came in touch with one of the other Pentaho founders, Richard Daley, who offered him a job. That allowed Matt to focus full time on Kettle. Four years later, he's still happily working for Pentaho as chief architect for data integration, doing his best to deliver Kettle 4.0. Jens Bleuel, who has collaborated with Matt since the early versions, is now also part of the Pentaho team.

About the Author

María Carina was born in a small town in the Patagonia region of Argentina. She earned her Bachelor's degree in Computer Science at UNLP in La Plata and then moved to Buenos Aires, where she has lived since 1994, working in IT.

She has been working as a BI consultant for the last 10 years. At the beginning, she worked with the Cognos suite. However, over the last three years she has been dedicated full time to developing Pentaho BI solutions, both for local companies and for several Latin American companies, as well as, in recent months, for a French automotive company.

She is also an active contributor to the Pentaho community.

At present, she lives in Buenos Aires, Argentina, with her husband Adrián and children Camila and Nicolás.

Writing my first book in a foreign language while working a full-time job, not to mention bringing up two small kids, was definitely a big challenge. Now I can tell that it's not impossible.

I dedicate this book to my husband and kids; I'd like to thank them for all their support and tolerance over the last year. I'd also like to thank my colleagues and friends who gave me encouraging words throughout the writing process.

Special thanks to the people at Packt; working with them has been really pleasant.

I'd also like to thank the Pentaho community and developers for making Kettle the incredible tool it is. Thanks to the technical reviewers who, with their very critical eye, contributed to making this a book suited to its audience.

Finally, I'd like to thank Matt Casters who, despite his busy schedule, was willing to help me from the first moment he knew about this book.

About the Reviewers

Jens Bleuel is a Senior Consultant and Engineer at Pentaho. He is also working as a project leader, trainer, and product specialist in the services and support department. Before he joined Pentaho in mid 2007, he was a software developer and project leader, and his main business was data warehousing and architecture, along with designing and developing user-friendly tools. He studied business economics, attended a grammar school for electronics, and has been programming in a wide range of environments such as Assembler, C, Visual Basic, Delphi, .NET, and these days mainly Java. His customer focus is on the wholesale market and consumer goods industries. Jens is 40 years old and lives with his wife and two boys in Mainz, Germany (near the nice Rhine river). In his spare time, he practices Tai-Chi, Qigong, and photography.

Roland Bouman has been working in the IT industry since 1998, mostly as a database and web application developer. He has also worked for MySQL AB (later Sun Microsystems) as certification developer and as curriculum developer.

Roland mainly focuses on open source web technology, databases, and Business Intelligence. He's an active member of the MySQL and Pentaho communities and can often be found speaking at worldwide conferences and events such as the MySQL user conference, the O'Reilly Open Source conference (OSCON), and at Pentaho community events.

Roland is co-author of the MySQL 5.1 Cluster DBA Certification Study Guide (Vervante, ISBN: 595352502) and Pentaho Solutions: Business Intelligence and Data Warehousing with Pentaho and MySQL (Wiley, ISBN: 978-0-470-48432-6). He also writes on a regular basis for the Dutch Database Magazine (DBM).

Roland is @rolandbouman on Twitter and maintains a blog at http://rpbouman.blogspot.com/.

Matt Casters has been an independent senior BI consultant for almost two decades. In that period he led, designed, and implemented numerous data warehouses and BI solutions for large and small companies. In that capacity, he always had the need for ETL in some form or another. Almost out of pure necessity, he has been busy writing the ETL tool called Kettle (a.k.a. Pentaho Data Integration) for the past eight years. First, he developed the tool mostly on his own. Since the end of 2005 when Kettle was declared an open source technology, development took place with the help of a large community.

Since the Kettle project was acquired by Pentaho in early 2006, he has been Chief of Data Integration at Pentaho as the lead architect, head of development, and spokesperson for the Kettle community.

I would like to personally thank the complete community for their help in making Kettle the success it is today. In particular, I would like to thank Maria for taking the time to write this nice book as well as the many articles on the Pentaho wiki (for example, the Kettle tutorials), and her appreciated participation on the forum. Many thanks also go to my employer Pentaho, for their large investment in open source BI in general and Kettle in particular.

James Dixon is the Chief Geek and one of the co-founders of Pentaho Corporation—the leading commercial open source Business Intelligence company. He has worked in the business intelligence market since graduating in 1992 from Southampton University with a degree in Computer Science. He has served as Software Engineer, Development Manager, Engineering VP, and CTO at multiple business intelligence software companies. He regularly uses Pentaho Data Integration for internal projects and was involved in the architectural design of PDI V3.0.

He lives in Orlando, Florida, with his wife Tami and son Samuel.

I would like to thank my co-founders, my parents, and my wife Tami for all their support and tolerance of my odd working hours.

I would like to thank my son Samuel for all the opportunities he gives me to prove I'm not as clever as I think I am.

Will Gorman is an Engineering Team Lead at Pentaho. He works on a variety of Pentaho's products, including Reporting, Analysis, Dashboards, Metadata, and the BI Server. Will started his career at GE Research and earned his Masters degree in Computer Science at Rensselaer Polytechnic Institute in Troy, New York. Will is the author of Pentaho Reporting 3.5 for Java Developers (ISBN: 3193), published by Packt Publishing.

Gretchen Moran is a graduate of University of Wisconsin – Stevens Point with a Bachelor's degree in Computer Information Systems with a minor in Data Communications. Gretchen began her career as a corporate data warehouse developer in the insurance industry and joined Arbor Software/Hyperion Solutions in 1999 as a commercial developer for the Hyperion Analyzer and Web Analytics team. Gretchen has been a key player with Pentaho Corporation since its inception in 2004. As Community Leader and core developer, Gretchen managed the explosive growth of Pentaho's open source community for her first 2 years with the company. Gretchen has contributed to many of the Pentaho projects, including the Pentaho BI Server, Pentaho Data Integration, Pentaho Metadata Editor, Pentaho Reporting, Pentaho Charting, and others.

Thanks Doug, Anthony, Isabella and Baby Jack for giving me my favorite challenges and crowning achievements—being a wife and mom.

Preface

Pentaho Data Integration (aka Kettle) is an engine along with a suite of tools responsible for the processes of Extracting, Transforming, and Loading—better known as the ETL processes. PDI not only serves as an ETL tool, but it's also used for other purposes such as migrating data between applications or databases, exporting data from databases to flat files, data cleansing, and much more. PDI has an intuitive, graphical, drag-and-drop design environment, and its ETL capabilities are powerful. However, getting started with PDI can be difficult or confusing. This book provides the guidance needed to overcome that difficulty, covering the key features of PDI. Each chapter introduces new features, allowing you to gradually get involved with the tool.

By the end of the book, you will have not only experimented with all kinds of examples, but will also have built a basic but complete datamart with the help of PDI.

How to read this book

Although it is recommended that you read all the chapters, you don't need to. The book allows you to tailor the PDI learning process according to your particular needs.

The first four chapters, along with Chapter 7 and Chapter 10, cover the core concepts. If you don't know PDI and want to learn just the basics, reading those chapters would suffice. Besides, if you need to work with databases, you could include Chapter 8 in the roadmap.

If you already know the basics, you can improve your PDI knowledge by reading chapters 5, 6, and 11.

Finally, if you already know PDI and want to learn how to use it to load or maintain a datawarehouse or datamart, you will find all that you need in chapters 9 and 12.

While Chapter 13 is useful for anyone who is willing to take it further, all the appendices are valuable resources for anyone who reads this book.

What this book covers

Chapter 1, Getting started with Pentaho Data Integration serves as the most basic introduction to PDI, presenting the tool. The chapter includes instructions for installing PDI and gives you the opportunity to play with the graphical designer (Spoon). The chapter also includes instructions for installing a MySQL server.

Chapter 2, Getting Started with Transformations introduces one of the basic components of PDI—transformations. Then, it focuses on the explanation of how to work with files. It explains how to get data from simple input sources such as txt, csv, xml, and so on, do a preview of the data, and send the data back to any of these common output formats. The chapter also explains how to read command-line parameters and system information.

Chapter 3, Basic Data Manipulation explains the simplest and most commonly used ways of transforming data, including performing calculations, adding constants, counting, filtering, ordering, and looking for data.

Chapter 4, Controlling the Flow of Data explains different options that PDI offers to combine or split flows of data.

Chapter 5, Transforming Your Data with JavaScript Code and the JavaScript Step explains how JavaScript coding can help in the treatment of data. It shows why you need to code inside PDI, and explains in detail how to do it.

Chapter 6, Transforming the Row Set explains the ability of PDI to deal with some sophisticated problems, such as normalizing data from pivoted tables, in a simple fashion.

Chapter 7, Validating Data and Handling Errors explains the different options that PDI has to validate data, and how to treat the errors that may appear.

Chapter 8, Working with Databases explains how to use PDI to work with databases. The list of topics covered includes connecting to a database, previewing and getting data, and inserting, updating, and deleting data. As database knowledge is not presumed, the chapter also covers fundamental concepts of databases and the SQL language.

Chapter 9, Performing Advanced Operations with Databases explains how to perform advanced operations with databases, including those specially designed to load datawarehouses. A primer on datawarehouse concepts is also given in case you are not familiar with the subject.

Chapter 10, Creating Basic Task Flows serves as an introduction to processes in PDI. Through the creation of simple jobs, you will learn what jobs are and what they are used for.

Chapter 11, Creating Advanced Transformations and Jobs deals with advanced concepts that will allow you to build complex PDI projects. The list of covered topics includes nesting jobs, iterating on jobs and transformations, and creating subtransformations.

Chapter 12, Developing and Implementing a Simple Datamart presents a simple datamart project, and guides you to build the datamart by using all the concepts learned throughout the book.

Chapter 13, Taking it Further gives a list of best PDI practices and recommendations for going beyond.

Appendix A, Working with Repositories guides you step by step in the creation of a PDI database repository and then gives instructions to work with it.

Appendix B, Pan and Kitchen: Launching Transformations and Jobs from the Command Line is a quick reference for running transformations and jobs from the command line.

Appendix C, Quick Reference: Steps and Job Entries serves as a quick reference to steps and job entries used throughout the book.

Appendix D, Spoon Shortcuts is an extensive list of Spoon shortcuts useful for saving time when designing and running PDI jobs and transformations.

Appendix E, Introducing PDI 4 Features quickly introduces you to the architectural and functional features included in Kettle 4—the version that was under development while writing this book.

Appendix F, Pop Quiz Answers, contains answers to pop quiz questions.

What you need for this book

PDI is a multiplatform tool. This means no matter what your operating system is, you will be able to work with the tool. The only prerequisite is to have JVM 1.5 or a higher version installed. It is also useful to have Excel or Calc along with a nice text editor.

Having an Internet connection while reading is extremely useful as well. Several links are provided throughout the book that complement what is explained. Besides, there is the PDI forum where you may search for answers or post questions if you are stuck with something.

Who this book is for

This book is for software developers, database administrators, IT students, and everyone involved or interested in developing ETL solutions or, more generally, doing any kind of data manipulation. If you have never used PDI before, this will be a perfect book to start with.

You will find this book to be a good starting point if you are a database administrator, a data warehouse designer, an architect, or any person who is responsible for data warehouse projects and needs to load data into them.

You don't need to have any prior data warehouse or database experience to read this book. Fundamental database and data warehouse technical terms and concepts are explained in an easy-to-understand language.

Conventions

In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.

Code words in text are shown as follows: "You read the examination.txt file, and did some calculations to see how the students did."

New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in our text like this: "Edit the Sort rows step by double-clicking it, click the Get Fields button, and adjust the grid."

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.

To send us general feedback, simply drop an email to <[email protected]>, and mention the book title in the subject of your message.

If there is a book that you need and would like to see us publish, please send us a note in the SUGGEST A TITLE form on www.packtpub.com or email <[email protected]>.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Tip

Downloading the example code for the book

Visit http://www.packtpub.com/files/code/9546_Code.zip to directly download the example code.

The downloadable files contain instructions on how to use them.

Errata

Although we have taken every care to ensure the accuracy of our contents, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in text or code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration, and help us to improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/support, selecting your book, clicking on the let us know link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata added to any list of existing errata. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support.

Piracy

Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors, and our ability to bring you valuable content.

Questions

You can contact us at <[email protected]> if you are having a problem with any aspect of the book, and we will do our best to address it.

Chapter 1. Getting Started with Pentaho Data Integration

Pentaho Data Integration is an engine along with a suite of tools responsible for the processes of extracting, transforming, and loading—best known as the ETL processes. This book is meant to teach you how to use PDI.

In this chapter you will:

Learn what Pentaho Data Integration is
Install the software and start working with the PDI graphical designer
Install MySQL, a database engine that you will use when you start working with databases

Pentaho Data Integration and Pentaho BI Suite

Before introducing PDI, let's talk about Pentaho BI Suite. The Pentaho Business Intelligence Suite is a collection of software applications intended to create and deliver solutions for decision making. The main functional areas covered by the suite are:

Analysis: The analysis engine serves multidimensional analysis. It's provided by the Mondrian OLAP server and the JPivot library for navigation and exploring.
Reporting: The reporting engine allows designing, creating, and distributing reports in various known formats (HTML, PDF, and so on) from different kinds of sources. The reports created in Pentaho are based mainly on the JFreeReport library, but it's possible to integrate reports created with external reporting libraries such as Jasper Reports or BIRT.
Data Mining: Data mining is running data through algorithms in order to understand the business and do predictive analysis. Data mining is possible thanks to the Weka Project.
Dashboards: Dashboards are used to monitor and analyze Key Performance Indicators (KPIs). A set of tools incorporated into the BI Suite in the latest version allows users to create interesting dashboards, including graphs, reports, analysis views, and other Pentaho content, without much effort.
Data integration: Data integration is used to integrate scattered information from different sources (applications, databases, files) and make the integrated information available to the final user. Pentaho Data Integration—our main concern—is the engine that provides this functionality.

All this functionality can be used standalone as well as integrated. In order to run analysis, reports, and so on integrated as a suite, you have to use the Pentaho BI Platform. The platform has a solution engine, and offers critical services such as authentication, scheduling, security, and web services.

This set of software and services forms a complete BI Platform, which makes Pentaho Suite the world's leading open source Business Intelligence Suite.

Exploring the Pentaho Demo

Despite being out of the scope of this book, it's worth briefly introducing the Pentaho Demo. The Pentaho BI Platform Demo is a preconfigured installation that lets you explore several capabilities of the Pentaho platform. It includes sample reports, cubes, and dashboards for Steel Wheels. Steel Wheels is a fictional store that sells all kinds of scale replicas of vehicles.

The demo can be downloaded from http://sourceforge.net/projects/pentaho/files/. Under the Business Intelligence Server folder, look for the latest stable version. The file you have to download is named biserver-ce-3.5.2.stable.zip for Windows and biserver-ce-3.5.2.stable.tar.gz for other systems.

In the same folder you will find a file named biserver-getting_started-ce-3.5.0.pdf. The file is a guide that introduces you to the platform and gives you some guidance on how to install and run it. The guide even includes a mini tutorial on building a simple PDI input-output transformation.

Note

You can find more about Pentaho BI Suite at www.pentaho.org.

Pentaho Data Integration

Most of the Pentaho engines, including the engines mentioned earlier, were created as community projects and later adopted by Pentaho. The PDI engine is no exception—Pentaho Data Integration is the new denomination for the business intelligence tool born as Kettle.

Note

The name Kettle didn't come from the recursive acronym Kettle Extraction, Transportation, Transformation, and Loading Environment it has now, but from KDE Extraction, Transportation, Transformation, and Loading Environment, as the tool was planned to be written on top of KDE, as mentioned in the introduction of the book.

In April 2006 the Kettle project was acquired by the Pentaho Corporation and Matt Casters, Kettle's founder, also joined the Pentaho team as a Data Integration Architect.

When Pentaho announced the acquisition, James Dixon, the Chief Technology Officer, said:

We reviewed many alternatives for open source data integration, and Kettle clearly had the best architecture, richest functionality, and most mature user interface. The open architecture and superior technology of the Pentaho BI Platform and Kettle allowed us to deliver integration in only a few days, and make that integration available to the community.

By joining forces with Pentaho, Kettle benefited from a huge developer community, as well as from a company that would support the future of the project.

From that moment the tool has grown constantly. Every few months a new release is available, bringing users improvements in performance and existing functionality, new functionality, ease of use, and great changes in look and feel. The following is a timeline of the major events related to PDI since its acquisition by Pentaho:

June 2006: PDI 2.3 is released. Numerous developers had joined the project and there were bug fixes provided by people in various regions of the world. Among other changes, the version included enhancements for large scale environments and multilingual capabilities.
February 2007: Almost seven months after the last major revision, PDI 2.4 is released including remote execution and clustering support (more on this in Chapter 13), enhanced database support, and a single designer for the two main elements you design in Kettle—jobs and transformations.
May 2007: PDI 2.5 is released including many new features, the main feature being the advanced error handling.
November 2007: PDI 3.0 emerges totally redesigned. Its major library changed to gain massive performance. The look and feel also changed completely.
October 2008: PDI 3.1 comes with an easier-to-use tool, along with a lot of new functionalities as well.
April 2009: PDI 3.2 is released with a really large number of changes for a minor version—new functionality, visualization improvements, performance improvements, and a huge pile of bug fixes. The main change in this version was the incorporation of dynamic clustering (see Chapter 13 for details).
In 2010, PDI 4.0 will be released, delivering mostly improvements with regard to enterprise features such as version control.

Note

Most users still refer to PDI as Kettle, its former name. Therefore, the names PDI, Pentaho Data Integration, and Kettle will be used interchangeably throughout the book.

Using PDI in real world scenarios

Paying attention to its name, Pentaho Data Integration, you could think of PDI as a tool to integrate data.

If you look at its original name, K.E.T.T.L.E., then you must conclude that it is a tool used for ETL processes which, as you may know, are most frequently seen in data warehouse environments.

In fact, PDI not only serves as a data integrator or an ETL tool, but is such a powerful tool that it is common to see it used for those and for many other purposes. Here you have some examples.

Loading datawarehouses or datamarts

The loading of a datawarehouse or a datamart involves many steps, and there are many variants depending on business area or business rules. However, in every case, the process involves the following steps:

Extracting information from one or different databases, text files, and other sources. The extraction process may include the task of validating and discarding data that doesn't match expected patterns or rules.
Transforming the obtained data to meet the business and technical needs required on the target. Transformation implies tasks such as converting data types, doing some calculations, filtering irrelevant data, and summarizing.
Loading the transformed data into the target database. Depending on the requirements, the loading may overwrite the existing information, or may add new information each time it is executed.
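
PDI lets you design these stages graphically with steps and hops, so you never have to write this kind of code yourself. Purely to make the three stages concrete, here is a minimal sketch of them in plain Python; the file name, field names, rules, and target table (sales.csv, amount, product, sales_summary, dw.db) are hypothetical and only serve as an illustration.

import csv
import sqlite3

# Extract: read raw rows from a source file (hypothetical sales.csv).
with open("sales.csv", newline="") as src:
    rows = list(csv.DictReader(src))

# Transform: convert data types, discard rows that break the rules,
# filter irrelevant data, and summarize.
totals = {}
for row in rows:
    try:
        amount = float(row["amount"])      # data type conversion
    except (ValueError, KeyError):
        continue                           # discard data that doesn't match the rules
    if amount <= 0:
        continue                           # filter irrelevant data
    product = row.get("product", "unknown")
    totals[product] = totals.get(product, 0.0) + amount  # summarize

# Load: write the transformed data into the target database (SQLite here).
conn = sqlite3.connect("dw.db")
conn.execute("CREATE TABLE IF NOT EXISTS sales_summary (product TEXT, total REAL)")
conn.execute("DELETE FROM sales_summary")  # overwrite the existing information
conn.executemany("INSERT INTO sales_summary VALUES (?, ?)", totals.items())
conn.commit()
conn.close()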

Kettle comes ready to do every stage of this loading process. The following sample screenshot shows a simple ETL designed with Kettle:

Integrating data

Imagine two similar companies that need to merge their databases in order to have a unified view of the data, or a single company that has to combine information from a main ERP application and a CRM application, though they're not connected. These are just two of hundreds of examples where data integration is needed. Integrating data is not just a matter of gathering and mixing data; some conversion, validation, and transport of data has to be done. Kettle is meant to do all those tasks.

Data cleansing

Why do we need data to be correct and accurate? There are many reasons—for the efficiency of business, to generate trusted conclusions in data mining or statistical studies, to succeed when integrating data, and so on. Data cleansing is about ensuring that the data is correct and precise. This can be ensured by verifying if the data meets certain rules, discarding or correcting those that don't follow the expected pattern, setting default values for missing data, eliminating information that is duplicated, normalizing data to conform to minimum and maximum values, and so on—tasks that Kettle makes possible, thanks to its vast set of transformation and validation capabilities.
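
As a rough illustration of those rules outside PDI, the following sketch applies a few of them in plain Python: checking that a record meets a rule, setting a default for missing data, normalizing a value to a range, and removing duplicates. The field names and bounds (customer_id, country, an age between 0 and 120) are invented for this example and are not taken from the book.

def cleanse(records):
    """Apply a few typical cleansing rules to a list of dicts."""
    seen = set()
    clean = []
    for rec in records:
        # Verify the data meets a rule: discard records without a customer id.
        if not rec.get("customer_id"):
            continue
        # Set a default value for missing data.
        rec.setdefault("country", "unknown")
        # Normalize data to conform to minimum and maximum values (0 to 120 here).
        rec["age"] = min(max(rec.get("age", 0), 0), 120)
        # Eliminate information that is duplicated.
        if rec["customer_id"] in seen:
            continue
        seen.add(rec["customer_id"])
        clean.append(rec)
    return clean

sample = [
    {"customer_id": "C1", "age": 34},
    {"customer_id": "C1", "age": 34},                    # duplicate record
    {"customer_id": "", "age": 28},                      # fails the rule, discarded
    {"customer_id": "C2", "age": 300, "country": "AR"},  # age out of range
]
print(cleanse(sample))

In Kettle, each of these rules typically maps to a purpose-built step, such as Filter rows or the Data Validator covered in Chapter 7, rather than to hand-written code.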

Migrating information