Learning Pentaho Data Integration 8 CE - Third Edition

María Carina Roldán

Get up and running with the Pentaho Data Integration tool using this hands-on, easy-to-read guide

About This Book

  • Manipulate your data by exploring, transforming, validating, and integrating it using Pentaho Data Integration 8 CE
  • A comprehensive guide exploring the features of Pentaho Data Integration 8 CE
  • Connect to any database engine, explore the databases, and perform all kinds of operations on relational databases

Who This Book Is For

This book is a must-have for software developers, business intelligence analysts, IT students, or anyone involved or interested in developing ETL solutions. If you plan on using Pentaho Data Integration for doing any data manipulation task, this book will help you as well. This book is also a good starting point for data warehouse designers, architects, or anyone who is responsible for data warehouse projects and needs to load data into them.

What You Will Learn

  • Explore the features and capabilities of Pentaho Data Integration 8 Community Edition
  • Install and get started with PDI
  • Learn the ins and outs of Spoon, the graphical designer tool
  • Learn to get data from all kinds of data sources, such as plain files, Excel spreadsheets, databases, and XML files
  • Use Pentaho Data Integration to perform CRUD (create, read, update, and delete) operations on relational databases
  • Populate a data mart with Pentaho Data Integration
  • Use Pentaho Data Integration to organize files and folders, run daily processes, deal with errors, and more

In Detail

Pentaho Data Integration (PDI) is an intuitive and graphical environment packed with drag-and-drop design and powerful Extract-Transform-Load (ETL) capabilities. This book shows and explains the new interactive features of Spoon, the revamped look and feel, and the newest features of the tool, including the Transformation and Job Executors and the invaluable Metadata Injection capability.

We begin with the installation of the PDI software and then move on to cover all the key PDI concepts. Each chapter introduces new features, enabling you to gradually gain practice with the tool. First, you will learn to do all kinds of data manipulation and work with simple plain files. Then, the book teaches you how to work with relational databases inside PDI. Moreover, you will be given a primer on data warehouse concepts and you will learn how to load data into a data warehouse. Over the course of this book, you will become familiar with its intuitive, graphical, drag-and-drop design environment.

By the end of this book, you will have learned everything you need to know in order to meet your data manipulation requirements. Besides, you will be given best practices and advice for designing and deploying your projects.

Style and approach

A step-by-step guide filled with practical, real-world scenarios and examples.




Learning Pentaho Data Integration 8 CE

Third Edition

An end-to-end guide to exploring, transforming, and integrating your data across multiple sources

María Carina Roldán

BIRMINGHAM - MUMBAI

Learning Pentaho Data Integration 8 CE

Third Edition

 

 

Copyright © 2017 Packt Publishing

 

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

 

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

 

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

 

First published: April 2010

Second edition: October 2013

Production reference: 1011217

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.

ISBN 978-1-78829-243-6

 

www.packtpub.com

Credits

Author

María Carina Roldán

Copy Editor

Tasneem Fatehi

Reviewers

Diethard Steiner
Paulo Pires
Miguel Gaspar

Project Coordinator

Manthan Patel

Commissioning Editor

Amey Varangaonkar

Proofreader

Safis Editing

Acquisition Editor

Tushar Gupta

Indexer

Rekha Nair

Content Development Editor

Tejas Limkar

Graphics

Tania Dutta

Technical Editor

Sayli Nikalje

Production Coordinator

Shraddha Falebhai

About the Author

María Carina Roldán was born in Argentina; she earned her Bachelor's degree in Computer Science at the Universidad Nacional de La Plata (UNLP) and then moved to Buenos Aires, where she has lived since 1994.

She has worked as a BI consultant for almost twenty years. She started working with the Pentaho technology back in 2006. Since then, she has been dedicated full time to developing BI solutions, mainly as an ETL specialist using Pentaho tools. In all these years, she worked for different companies, including Assert Solutions, a Pentaho partner company based in Argentina, and Webdetails, a Portuguese company acquired by Pentaho in 2013. Currently, she works as an independent consultant.

Carina is the author of the first and second editions of Pentaho Data Integration Beginner's Guide, Packt Publishing, and the co-author of the first and second editions of Pentaho Data Integration Cookbook, Packt Publishing.


I'd like to thank my colleagues and friends who gave me encouraging words throughout the writing process. I would also like to thank the technical reviewers for the time and dedication that they put into reviewing the book. I dedicate this book to my kids, Camila and Nicolás.

About the Reviewers

Diethard Steiner is one of the very early Pentaho adopters. He has implemented business intelligence projects for various clients for more than a decade, covering everything from data integration to multidimensional cubes and dashboards. Over the last few years, he has gained in-depth experience in utilizing Pentaho tools in the big data world. These days, he is running his own independent consultancy company, Bissol Consulting Ltd, in London.

Diethard has been a very active Pentaho community member and regularly publishes articles on Pentaho and the wider business intelligence world on his GitHub blog (http://diethardsteiner.github.io/).


Paulo Pires is a Geographical Engineer who, in his early thirties, decided to apply his skills in the business intelligence area. He started working at Webdetails more than six years ago and specialized in building dashboards with CTools and Pentaho, along with doing some ETL work when needed.

Two years ago, Webdetails became a Pentaho company, and last month Hitachi Data Systems, Hitachi Insight Group, and Pentaho merged into a single company called Hitachi Vantara, where Paulo Pires is a Senior Consultant.


Miguel Gaspar started working at Webdetails about three years ago, some time before the acquisition of Webdetails by Pentaho. He was a consultant on the Implementation team, and his work involved developing dashboard solutions as part of services. He is now acting as the technical owner of some of the implementation projects as part of the Webdetails team in Pentaho.

He likes to be as professional as possible, but in an informal way. One of his favorite hobbies is learning, and his particular areas of interest are business analytics, predictive analysis and big data, augmented reality, and cloud computing. He likes to play, is a huge martial arts fan, and is also one of the worst soccer players ever. He is married and the parent of two young and lovely daughters, who would like to spend more time playing like crazies with him. He also likes to spend time with friends, or just having a drink and a good talk with someone, if possible with his family at his side. He really hates liars.


www.PacktPub.com

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www.packtpub.com/mapt

Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

Why subscribe?

Fully searchable across every book published by Packt

Copy and paste, print, and bookmark content

On demand and accessible via a web browser

Customer Feedback

Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/178829243X. If you'd like to join our team of regular reviewers, you can email us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!

Table of Contents

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Downloading the color images of this book

Errata

Piracy

Questions

Getting Started with Pentaho Data Integration

Pentaho Data Integration and Pentaho BI Suite

Introducing Pentaho Data Integration

Using PDI in real-world scenarios

Loading data warehouses or data marts

Integrating data

Data cleansing

Migrating information

Exporting data

Integrating PDI along with other Pentaho tools

Installing PDI

Launching the PDI Graphical Designer - Spoon

Starting and customizing Spoon

Exploring the Spoon interface

Extending the PDI functionality through the Marketplace

Introducing transformations

The basics about transformations

Creating a Hello World! Transformation

Designing a Transformation

Previewing and running a Transformation

Installing useful related software

Summary

Getting Started with Transformations

Designing and previewing transformations

Getting familiar with editing features

Using the mouseover assistance toolbar

Adding steps and creating hops

Working with grids

Designing transformations

Putting the editing features in practice

Previewing and fixing errors as they appear

Looking at the results in the execution results pane

The Logging tab

The Step Metrics tab

Running transformations in an interactive fashion

Understanding PDI data and metadata

Understanding the PDI rowset

Adding or modifying fields by using different PDI steps

Explaining the PDI data types

Handling errors

Implementing the error handling functionality

Customizing the error handling

Summary

Creating Basic Task Flows

Introducing jobs

Learning the basics about jobs

Creating a Simple Job

Designing and running jobs

Revisiting the Spoon interface and the editing features

Designing jobs

Getting familiar with the job design process

Looking at the results in the Execution results window

The Logging tab

The Job metrics tab

Enriching your work by sending an email

Running transformations from a Job

Using the Transformation Job Entry 

Understanding and changing the flow of execution

Changing the flow of execution based on conditions

Forcing a status with an abort Job or success entry

Changing the execution to be synchronous

Managing files

Creating a Job that moves some files

Selecting files and folders

Working with regular expressions

Summarizing the Job entries that deal with files

Customizing the file management

Knowing the basics about Kettle variables

Understanding the kettle.properties file

How and when you can use variables

Summary

Reading and Writing Files

Reading data from files

Reading a simple file

Troubleshooting reading files

Learning to read all kinds of files

Specifying the name and location of the file

Reading several files at the same time

Reading files that are compressed or located on a remote server

Reading a file whose name is known at runtime

Describing the incoming fields

Reading Date fields

Reading Numeric fields

Reading only a subset of the file

Reading the most common kinds of sources

Reading text files

Reading spreadsheets

Reading XML files

Reading JSON files

Outputting data to files

Creating a simple file

Learning to create all kinds of files and write data into them

Providing the name and location of an output file

Creating a file whose name is known only at runtime

Creating several files whose names depend on the content of the file

Describing the content of the output file

Formatting Date fields

Formatting Numeric fields

Creating the most common kinds of files

Creating text files

Creating spreadsheets

Creating XML files

Creating JSON files

Working with Big Data and cloud sources

Reading files from an AWS S3 instance

Writing files to an AWS S3 instance

Getting data from HDFS

Sending data to HDFS

Summary

Manipulating PDI Data and Metadata

Manipulating simple fields

Working with strings

Extracting parts of strings using regular expressions

Searching and replacing using regular expressions

Doing some math with Numeric fields

Operating with dates

Performing simple operations on dates

Subtracting dates with the Calculator step

Getting information relative to the current date

Using the Get System Info step

Performing other useful operations on dates

Getting the month names with a User Defined Java Class step

Modifying the metadata of streams

Working with complex structures

Working with XML

Introducing XML terminology

Getting familiar with the XPath notation

Parsing XML structures with PDI

Reading an XML file with the Get data from XML step

Parsing an XML structure stored in a field

PDI Transformation and Job files

Parsing JSON structures

Introducing JSON terminology

Getting familiar with the JSONPath notation

Parsing JSON structures with PDI

Reading a JSON file with the JSON input step

Parsing a JSON structure stored in a field

Summary

Controlling the Flow of Data

Filtering data

Filtering rows upon conditions

Reading a file and getting the list of words found in it

Filtering unwanted rows with a Filter rows step

Filtering rows by using the Java Filter step

Filtering data based on row numbers

Splitting streams unconditionally

Copying rows

Distributing rows

Introducing partitioning and clustering

Splitting the stream based on conditions

Splitting a stream based on a simple condition

Exploring PDI steps for splitting a stream based on conditions

Merging streams in several ways

Merging two or more streams

Customizing the way of merging streams

Looking up data

Looking up data with a Stream lookup step

Summary

Cleansing, Validating, and Fixing Data

Cleansing data

Cleansing data by example

Standardizing information

Improving the quality of data

Introducing PDI steps useful for cleansing data

Dealing with non-exact matches

Cleansing by doing a fuzzy search

Deduplicating non-exact matches

Validating data

Validating data with PDI

Validating and reporting errors to the log

Introducing common validations and their implementation with PDI

Treating invalid data by splitting and merging streams

Fixing data that doesn't match the rules

Summary

Manipulating Data by Coding

Doing simple tasks with the JavaScript step

Using the JavaScript language in PDI

Inserting JavaScript code using the JavaScript step

Adding fields

Modifying fields

Organizing your code

Controlling the flow using predefined constants

Testing the script using the Test script button

Parsing unstructured files with JavaScript

Doing simple tasks with the Java Class step

Using the Java language in PDI

Inserting Java code using the Java Class step

Learning to insert java code in a Java Class step

Data types equivalence

Adding fields

Modifying fields

Controlling the flow with the putRow() function

Testing the Java Class using the Test class button

Getting the most out of the Java Class step

Receiving parameters

Reading data from additional steps

Redirecting data to different target steps

Parsing JSON structures

Avoiding coding using purpose-built steps

Summary

Transforming the Dataset

Sorting data

Sorting a dataset with the sort rows step

Working on groups of rows

Aggregating data

Summarizing the PDI steps that operate on sets of rows

Converting rows to columns

Converting row data to column data using the Row denormaliser step

Aggregating data with a Row Denormaliser step

Normalizing data

Modifying the dataset with a Row Normaliser step

Going forward and backward across rows

Picking rows backward and forward with the Analytic Query step

Summary

Performing Basic Operations with Databases

Connecting to a database and exploring its content

Connecting with Relational Database Management Systems

Exploring a database with the Database Explorer

Previewing and getting data from a database

Getting data from the database with the Table input step

Using the Table input step to run flexible queries

Adding parameters to your queries

Using Kettle variables in your queries

Inserting, updating, and deleting data

Inserting new data into a database table

Inserting or updating data with the Insert / Update step

Deleting records of a database table with the Delete step

Performing CRUD operations with more flexibility

Verifying a connection, running DDL scripts, and doing other useful tasks

Looking up data in different ways

Doing simple lookups with the Database Value Lookup step

Making a performance difference when looking up data in a database

Performing complex database lookups

Looking for data using a Database join step

Looking for data using a Dynamic SQL row step

Summary

Loading Data Marts with PDI

Preparing the environment

Exploring the Jigsaw database model

Creating the database and configuring the environment

Introducing dimensional modeling

Loading dimensions with data

Learning the basics of dimensions

Understanding dimensions technical details

Loading a time dimension

Introducing and loading Type I slowly changing dimensions

Loading a Type I SCD with a combination lookup/update step

Introducing and loading Type II slowly changing dimension

Loading Type II SCDs with a dimension lookup/update step

Loading a Type II SCD for the first time

Loading a Type II SCD and verifying how history is kept

Explaining and loading Type III SCD and Hybrid SCD

Loading other kinds of dimensions

Loading a mini dimension

Loading junk dimensions

Explaining degenerate dimensions

Loading fact tables

Learning the basics about fact tables

Deciding the level of granularity

Translating the business keys into surrogate keys

Obtaining the surrogate key for Type I SCD

Obtaining the surrogate key for Type II SCD

Obtaining the surrogate key for the junk dimension

Obtaining the surrogate key for a time dimension

Loading a cumulative fact table

Loading a snapshot fact table

Loading a fact table by inserting snapshot data

Loading a fact table by overwriting snapshot data

Summary

Creating Portable and Reusable Transformations

Defining and using Kettle variables

Introducing all kinds of Kettle variables

Explaining predefined variables

Revisiting the kettle.properties file

Defining variables at runtime

Setting a variable with a constant value

Setting a variable with a value unknown beforehand

Setting variables with partial or total results of your flow

Defining and using named parameters

Using variables as fields of your stream

Creating reusable Transformations

Creating and executing sub-transformations

Creating and testing a sub-transformation

Executing a sub-transformation

Introducing more elaborate sub-transformations

Making the data flow between transformations

Transferring data using the copy/get rows mechanism

Executing transformations in an iterative way

Using Transformation executors

Configuring the executors with advanced settings

Getting the results of the execution of the inner transformation

Working with groups of data

Using variables and named parameters

Continuing the flow after executing the inner transformation

Summary

Implementing Metadata Injection

Introducing metadata injection

Explaining how metadata injection works

Creating a template Transformation

Injecting metadata

Discovering metadata and injecting it

Identifying use cases to implement metadata injection

Summary

Creating Advanced Jobs

Enhancing your processes with the use of variables

Running nested jobs

Understanding the scope of variables

Using named parameters

Using variables to create flexible processes

Using variables to name jobs and transformations

Using variables to name Job and Transformation folders

Accessing copied rows for different purposes

Using the copied rows to manage files in advanced ways

Using the copied rows as parameters of a Job or Transformation

Working with filelists

Maintaining a filelist

Using the filelist for different purposes

Attaching files in an email

Copying, moving, and deleting files

Introducing other ways to process the filelist

Executing jobs in an iterative way

Using Job executors

Configuring the executors with advanced settings

Getting the results of the execution of the job

Working with groups of data

Using variables and named parameters

Capturing the result filenames

Summary

Launching Transformations and Jobs from the Command Line

Using the Pan and Kitchen utilities

Running jobs and transformations

Checking the exit code

Supplying named parameters and variables

Using command-line arguments

Deciding between the use of a command-line argument and named parameters

Sending the output of executions to log files

Automating the execution

Summary

Best Practices for Designing and Deploying a PDI Project

Setting up a new project

Setting up the local environment

Defining a folder structure for the project

Dealing with external resources

Defining and adopting a versioning system

Best practices to design jobs and transformations

Styling your work

Making the work portable

Designing and developing reusable jobs and transformations

Maximizing the performance

Analyzing Steps Metrics

Analyzing performance graphs

Deploying the project in different environments

Modifying the Kettle home directory

Modifying the Kettle home in Windows

Modifying the Kettle home in Unix-like systems

Summary

Preface

Pentaho Data Integration (also known as Kettle) is an engine, along with a suite of tools, responsible for the processes of Extracting, Transforming, and Loading, better known as the ETL processes. Pentaho Data Integration (PDI) not only serves as an ETL tool, but it's also used for other purposes, such as migrating data between applications or databases, exporting data from databases to flat files, data cleansing, and much more. PDI has an intuitive, graphical, drag-and-drop design environment, and its ETL capabilities are powerful. However, getting started with PDI can be difficult and confusing. This book provides the guidance needed to overcome that difficulty, by covering the key features of PDI. Learning Pentaho Data Integration 8 CE explains the new interactive features of the graphical designer, Spoon, and its revamped look and feel. It also covers the newest features of the tool, including the Transformation and Job Executors and the invaluable metadata injection capability.

The content of the book is based on PDI 8 Community Edition (CE). However, it can be used with the Enterprise Edition (EE) as well. Besides, if you are currently working with an earlier version of the tool, you should know that most of the content is also valid for PDI 6 and PDI 7.

By the end of the book, not only will you have experimented with all kinds of examples, but you will also have gained knowledge about developing useful, portable, reusable, and well-designed processes.

What this book covers

Chapter 1, Getting Started with Pentaho Data Integration, serves as an introduction to PDI, presenting the tool. This chapter includes instructions for installing PDI and gives you the opportunity to play with the graphical designer (Spoon).

Chapter 2, Getting Started with Transformations, explains the fundamentals of working with transformations, including learning the simplest ways of transforming data and getting familiar with the process of designing, debugging, and testing a Transformation. This chapter also explains the basics of handling errors.

Chapter 3, Creating Basic Task Flows, serves as an introduction to the processes in PDI. Through the creation of simple Jobs, you will learn what Jobs are and what they are used for.

Chapter 4, Reading and Writing Files, explains how to get data from several file formats, such as spreadsheets, CSV files, and more. It also explains how to save data in the same kinds of formats.

Chapter 5, Manipulating PDI Data and Metadata, expands the set of operations learned in the previous chapters. Besides exploring new PDI steps for data manipulation, this chapter introduces the Select values step for manipulating metadata. It also explains how to get system information and predefined variables to use as part of the data flow. The chapter also explains how to read and write XML and JSON structures.

Chapter 6, Controlling the Flow of Data, explains different options that PDI offers to deal with more than one stream of data: it explains how to combine and split flows of data, filter data, and more.

Chapter 7, Cleansing, Validating, and Fixing Data, offers different ways for cleansing data, and also for dealing with invalid data, either by discarding it or by fixing it.

Chapter 8, Manipulating Data by Coding, explains how JavaScript and Java coding can help in the treatment of data. It shows why you may need to code inside PDI, and explains in detail how to do it.

Chapter 9, Transforming the Dataset, explains techniques for transforming the dataset as a whole; for example, aggregating data or normalizing pivoted tables.

Chapter 10, Performing Basic Operations with Databases, explains how to use PDI to work with databases. The list of topics in this chapter includes connecting to a database, and previewing and getting data. It also covers other basic operations, such as inserting data, looking up data, and more.

Chapter 11, Loading Data Marts with PDI, explains the details about loading simple data marts. It shows how to load common types of dimensions (SCD, Junk, Time, and so on) and also different types of fact tables.

Chapter 12, Creating Portable and Reusable Transformations, explains several techniques for creating versatile transformations that can be used and reused in different scenarios or with different sets of data.

Chapter 13, Implementing Metadata Injection, explains a powerful feature of PDI, which is basically about injecting metadata into a template Transformation at runtime. The Pentaho team has put huge effort into supporting this feature in the latest PDI versions, so it's worth explaining in detail how it works.

Chapter 14, Creating Advanced Jobs, explains techniques for creating complex processes; for example, iterating over Jobs or manipulating lists of files for different purposes.

Chapter 15, Launching Transformations and Jobs from the Command Line, is a reference not only for running transformations from a Terminal, but also for dealing with the output of the executions.

Chapter 16, Best Practices for Designing and Deploying a PDI Project, covers the setup of a new project and also the best practices that make it easier to develop, maintain, and deploy a project in different environments.

What you need for this book

PDI is a multiplatform tool. This means that no matter which operating system you have, you will be able to work with the tool. The only prerequisite is to have JVM 1.8 (Java 8) installed. You will also need an office suite, for example, OpenOffice or LibreOffice, and a good text editor, for example, Sublime Text 3 or Notepad++. Access to a relational database is recommended. Suggested engines are MySQL and PostgreSQL, but you can use others of your choice as well.

Having an internet connection while reading is extremely useful too. Several links are provided throughout the book that complement what is explained. Besides, there is the PDI forum, where you may search or post questions if you are stuck with something.

Who this book is for

This book is a must-have for software developers, business intelligence analysts, IT students, and everyone involved or interested in developing ETL solutions, or more generally, doing any kind of data manipulation. Those who have never used PDI will benefit the most from the book, but those who have will also find it useful. This book is also a good starting point for data warehouse designers, architects, or anyone who is responsible for data warehouse projects and needs to load data into them.

Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meanings.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: Unzip the downloaded file in a folder of your choice, as, for example, c:/util/kettle or /home/pdi_user/kettle.

A block of code is set as follows:

project_name,start_date,end_date
Project A,2016-01-10,2016-01-25
Project B,2016-04-03,2016-07-21
Project C,2017-01-15,???
Project D,2015-09-03,2015-12-20
Project E,2016-05-11,2016-05-31
Project F,2011-12-01,2013-11-30

Any command-line input or output is written as follows:

kitchen /file:c:/pdi_labs/hello_world.kjb

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: Open Spoon from the main menu and navigate to File | New | Transformation.

Warnings or important notes appear like this.
Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book-what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply email [email protected] and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files emailed directly to you. You can download the code files by following these steps:

Log in or register to our website using your email address and password.

Hover the mouse pointer on the SUPPORT tab at the top.

Click on Code Downloads & Errata.

Enter the name of the book in the Search box.

Select the book for which you're looking to download the code files.

Choose from the drop-down menu where you purchased this book from.

Click on Code Download.

Once the file is downloaded, make sure that you unzip or extract the folder using the latest version of:

WinRAR / 7-Zip for Windows

Zipeg / iZip / UnRarX for Mac

7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Learning-Pentaho-Data-Integration-8-CE. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Downloading the color images of this book

We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/LearningPentahoDataIntegration8CE_ColorImages.pdf.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books-maybe a mistake in the text or the code-we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title. To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the internet, do provide us with the location address or the website name immediately, so that we can pursue a remedy. Contact us at [email protected] with a link to the suspected pirated material. We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.

Getting Started with Pentaho Data Integration

Pentaho Data Integration (PDI) is an engine along with a suite of tools responsible for the processes of Extracting, Transforming, and Loading (also known as ETL processes). This book is meant to teach you how to use PDI.

In this chapter, you will:

Learn what Pentaho Data Integration is

Install the software and start working with the PDI graphical designer (Spoon)

Explore the Spoon interface

Set up your environment by installing other useful related software

Pentaho Data Integration and Pentaho BI Suite

Before introducing PDI, let's talk about Pentaho BI Suite. The Pentaho Business Intelligence Suite is a collection of software applications intended to create and deliver solutions for decision making. The main functional areas covered by the suite are:

Analysis: The analysis engine serves multidimensional analysis. It's provided by the Mondrian OLAP server.

Reporting: The reporting engine allows designing, creating, and distributing reports in various known formats (HTML, PDF, and so on), from different kinds of sources. In the Enterprise Edition of Pentaho, you can also generate interactive reports.

Data mining: Data mining is used for running data through algorithms in order to understand the business and do predictive analysis. Data mining is possible thanks to the Weka project.

Dashboards: Dashboards are used to monitor and analyze Key Performance Indicators (KPIs). CTools is a set of tools and components created to help the user build custom dashboards on top of Pentaho. There are specific CTools for different purposes, including the Community Dashboard Editor (CDE), a very powerful charting library (CCC), and a plugin for accessing data with great flexibility (CDA), among others. While the CTools allow you to develop advanced and custom dashboards, there is a Dashboard Designer, available only in Pentaho Enterprise Edition, that allows you to build dashboards in an easy way.

Data integration: Data integration is used to integrate scattered information from different sources (for example, applications, databases, and files) and make the integrated information available to the final user. PDI, the tool that we will learn to use throughout the book, is the engine that provides this functionality. PDI also interacts with the rest of the tools, for example, reading OLAP cubes, generating Pentaho Reports, and doing data mining with the R Script Executor and the CPython Script Executor.

All of these tools can be used standalone but also integrated. Pentaho tightly couples data integration with analytics in a modern platform: the PDI and Business Analytics Platform. This solution offers critical services, for example:

Authentication and authorization

Scheduling

Security

Web services

Scalability and failover

This set of software and services forms a complete BI Suite, which makes Pentaho the world's leading open source BI option on the market.

You can find out more about the platform at https://community.hds.com/community/products-and-solutions/pentaho/. There is also an Enterprise Edition with additional features and support. You can find more on this at http://www.pentaho.com/.

Introducing Pentaho Data Integration

Most of the Pentaho engines, including the engines mentioned earlier, were created as community projects and later adopted by Pentaho. The PDI engine is not an exception; Pentaho Data Integration is the new denomination for the business intelligence tool born as Kettle.

The name Kettle didn't come from the recursive acronym it has now, Kettle Extraction, Transportation, Transformation, and Loading Environment. It came from KDE Extraction, Transportation, Transformation and Loading Environment, since the tool was originally planned to be written on top of KDE, a Linux desktop environment.

In April 2006, the Kettle project was acquired by the Pentaho Corporation, and Matt Casters, the Kettle founder, also joined the Pentaho team as a data integration architect.

When Pentaho announced the acquisition, James Dixon, the Chief Technology Officer, said:

We reviewed many alternatives for open source data integration, and Kettle clearly had the best architecture, richest functionality, and most mature user interface. The open architecture and superior technology of the Pentaho BI Platform and Kettle allowed us to deliver integration in only a few days, and make that integration available to the community.

By joining forces with Pentaho, Kettle benefited from a huge developer community, as well as from a company that would support the future of the project.

From that moment, the tool has grown without pause. Every few months a new release is available, bringing users improvements in performance and existing functionality, new functionality, and ease of use, along with great changes in look and feel. The following is a timeline of the major events related to PDI since its acquisition by Pentaho:

June 2006: PDI 2.3 was released. Numerous developers had joined the project and there were bug fixes provided by people in various regions of the world. The version included, among other changes, enhancements for large-scale environments and multilingual capabilities.

November 2007: PDI 3.0 emerged totally redesigned. Its major library changed to gain massive performance improvements. The look and feel had also changed completely.

April 2009: PDI 3.2 was released with a really large amount of changes for a minor version: new functionality, visualization and performance improvements, and a huge amount of bug fixes.

June 2010: PDI 4.0 was released, delivering mostly improvements with regard to enterprise features, for example, version control. In the community version, the focus was on several visual improvements.

November 2013: PDI 5.0 was released, offering better previewing of data, easier looping, a lot of big data improvements, an improved plugin marketplace, and hundreds of bug fixes and feature enhancements, as in all releases. In its Enterprise version, it offered interesting low-level features, such as step load balancing, Job transactions, and restartability.

December 2015: PDI 6.0 was released with new features such as data services, data lineage, bigger support for Big Data, and several changes in the graphical designer for improving the PDI user experience. Some months later, PDI 6.1 was released, including metadata injection, a feature that enables the user to modify Transformations at runtime. Metadata injection had been available in earlier versions, but it was in 6.1 that Pentaho started to put a big effort into implementing this powerful feature.

November 2016: PDI 7.0 emerged with many improvements in the enterprise version, including data inspection capabilities, more support for Big Data technologies, and improved repository management. In the community version, the main change was expanded metadata injection support.

November 2017: Pentaho 8.0 was released. The highlights of this latest version are the optimization of processing resources, a better user experience, and enhanced connectivity to streaming data sources for real-time processing.

Using PDI in real-world scenarios

Paying attention to its name, Pentaho Data Integration, you could think of PDI as a tool to integrate data.

In fact, PDI not only serves as a data integrator or an ETL tool. PDI is such a powerful tool that it is common to see it used for these and for many other purposes. Here you have some examples.

Loading data warehouses or data marts

The loading of a data warehouse or a data mart involves many steps, and there are many variants depending on business area or business rules.

However, in every case, with no exception, the process involves the following steps:

Extracting information from one or more databases, text files, XML files, and other sources. The extract process may include the task of validating and discarding data that doesn't match expected patterns or rules.

Transforming the obtained data to meet the business and technical needs required on the target. Transforming includes tasks such as converting data types, doing some calculations, filtering irrelevant data, and summarizing.

Loading the transformed data into the target database or file store. Depending on the requirements, the loading may overwrite the existing information or may add new information each time it is executed.

Kettle comes ready to do every stage of this loading process. The following screenshot shows a simple ETL designed with the tool:

ETL process

Integrating data

Imagine two similar companies that need to merge their databases in order to have a unified view of the data, or a single company that has to combine information from a main Enterprise Resource Planning (ERP) application and a Customer Relationship Management (CRM) application, though they're not connected. These are just two of hundreds of examples where data integration is needed. The integration is not just a matter of gathering and mixing data; some conversions, validation, and transfer of data have to be done. PDI is meant to do all these tasks.

Data cleansing

Data cleansing is about ensuring that the data is correct and precise. This can be achieved by verifying if the data meets certain rules, discarding or correcting those which don't follow the expected pattern, setting default values for missing data, eliminating information that is duplicated, normalizing data to conform to minimum and maximum values, and so on. These are tasks that Kettle makes possible, thanks to its vast set of transformation and validation capabilities.

Migrating information

Think of a company, of any size, which uses a commercial ERP application. One day the owners realize that the licenses are consuming an important share of its budget. So they decide to migrate to an open source ERP. The company will no longer have to pay licenses, but if they want to change, they will have to migrate the information. Obviously, it is not an option to start from scratch or type the information by hand. Kettle makes the migration possible, thanks to its ability to interact with most kinds of sources and destinations, such as plain files, commercial and free databases, and spreadsheets, among others.

Exporting data

Data may need to be exported for numerous reasons:

To create detailed business reports

To allow communication between different departments within the same company

To deliver data from your legacy systems to obey government regulations, and so on

Kettle has the power to take raw data from the source and generate these kinds of ad hoc reports.

Integrating PDI along with other Pentaho tools

The previous examples show typical uses of PDI as a standalone application. However, Kettle may be used embedded as part of a process or a data flow. Some examples are preprocessing data for an online report, sending emails in a scheduled fashion, generating spreadsheet reports, feeding a dashboard with data coming from web services, and so on.

The use of PDI integrated with other tools is beyond the scope of this book. If you are interested, you can find more information on this subject in the Pentaho Data Integration Cookbook - Second Edition by Packt Publishing at https://www.packtpub.com/big-data-and-business-intelligence/pentaho-data-integration-cookbook-second-edition.

Installing PDI

In order to work with PDI, you need to install the software.

The only prerequisite to install the tool is to have JRE 8.0 installed. If you don't have it, download it from www.javasoft.com and install it before proceeding.
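If you are not sure whether a suitable Java version is already installed, you can check it from a Terminal window before proceeding. Any output reporting version 1.8 (that is, Java 8) is fine:

java -version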

Following are the instructions to install the PDI software, irrespective of the operating system you may be using:

Go to the Download page at http://sourceforge.net/projects/pentaho/files/Data Integration.

Choose the newest stable release. At this time, it is 8.0, as shown in the following screenshot:

PDI on SourceForge.net

Download the available zip file, which will serve you for all platforms.

Unzip the downloaded file in a folder of your choice, as, for example, c:/util/kettle or /home/pdi_user/kettle.

And that's all. You have installed the tool in just a few minutes.
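As a reference, on a Unix-like system the installation can also be done entirely from a Terminal window. The following is just a sketch; the zip file name is illustrative and depends on the exact build you downloaded:

cd /home/pdi_user
# the file name below is illustrative; use the one you actually downloaded
unzip pdi-ce-8.0.zip -d kettle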

The version of PDI that you just installed corresponds to the Community Edition (CE) of the tool. The book, however, can also be used for learning the Enterprise Edition (EE). Except for minor differences if you work with repositories, most of the examples in the book should work without changes. Also, if for any reason you have to use a previous version of PDI, the good news is that most of the content explained here also applies to PDI 6 and PDI 7.

Launching the PDI Graphical Designer - Spoon

Now that you've installed PDI, you're ready to start working with the data. That will be possible only inside a graphical environment. PDI has a desktop designer tool named Spoon. Let's launch Spoon and see what it looks like.

Starting and customizing Spoon

Spoon is PDI's desktop design tool. With Spoon, you design, preview, and test all your work, that is, transformations and jobs. When you see PDI screenshots, what you are really seeing are Spoon screenshots. The other PDI components, which you will learn about in the following chapters, are executed from Terminal windows.

Here is how you launch the tool:

Start Spoon. If your system is Windows, run Spoon.bat from within the PDI install directory. On other platforms, such as Unix, Linux, and so on, open a Terminal window and type spoon.sh.

The main window shows up. The Welcome! window appears with some useful links for you to see:

Welcome page
If Spoon doesn't start as expected, launch SpoonDebug.bat (or .sh) instead. This utility starts Spoon with a console output and gives you the option to redirect the output to a file. By inspecting this output, you will be able to find out what happened and fix the issue.
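For reference, on a Unix-like system the launch looks as follows, assuming the PDI install directory (the folder containing spoon.sh) is /home/pdi_user/kettle:

cd /home/pdi_user/kettle
./spoon.sh
# if Spoon doesn't start as expected, run the debug variant instead
./SpoonDebug.sh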

These simple steps would be enough to start working, but before that, it's advisable to customize Spoon to your needs. For doing that:

Click on Options... from the Tools menu. A window appears where you can change various general and visual characteristics. Uncheck the highlighted checkbox, as shown in the following screenshot:

Kettle options

Select the Look & Feel tab.

Change the Font for notes, Show Canvas Grid, and Preferred Language settings as shown in the following screenshot:

Look and Feel options

Click on the OK button.

Restart Spoon in order to apply the changes. You should not see the Welcome! window. You should see the following screen full of French words instead:

French as preferred language

As you can see, the Options window has a lot of settings. We changed only a few, just to show the feature. Feel free to change the settings according to your needs or preferences.

Remember to restart Spoon in order to see the changes applied.

In particular, take note of the following tip about the selected language.

If you choose a preferred language other than English, you should select a different language as an alternative. If you do so, every name or description not translated to your preferred language will be shown in the alternative language.

One of the settings that you changed was the appearance of the Welcome! window at startup. The Welcome! page is full of links to web resources, blogs, forums, books on PDI, and more. Following those links, you will be able to learn more and become active in the Pentaho community. You can reach that window anytime by navigating to the Help | Welcome Screen option.

The Welcome! page redirects you to the forum at https://forums.pentaho.com/forumdisplay.php?135-Data-Integration-Kettle. Since November 2017, there has also been a new collaboration space. You can reach the PDI space at https://community.hds.com/community/products-and-solutions/pentaho/data-integration.

Exploring the Spoon interface

As explained earlier, Spoon is the tool with which you create, preview, and run transformations. The following screenshot shows you the basic work areas: Main Menu, Main Toolbar, Steps Tree, Transformation Toolbar, and Canvas (Work Area). Note that there is a sample Transformation opened; it allows you to see how the tool looks when you are working with it:

Spoon interface
The terms Canvas and work area will be used interchangeably throughout the book.

The Steps Tree option is only available in Design view. There is also an area named View that shows the structure of the Transformation currently being edited. You can see that area by clicking on the View tab at the upper-left corner of the screen:

Spoon view area

Extending the PDI functionality through the Marketplace

Pentaho Data Integration is built on a pluggable architecture. This means that it can be extended to fulfill needs not included out of the box. The Marketplace—a plugin itself—emerged as a straightforward way for browsing and installing available plugins, developed by the community or even by Pentaho.

In PDI, you will find plugins for connecting to a particular database engine, for executing scripts, for transforming data in new ways, and more. According to their purpose, the plugins are classified into several types: big data, connectivity, and statistics, among others. In particular, there is a type named Experimental, which you will not use except for playing around. There is another type named Deprecated, which we don't recommend you use unless you need it for backward compatibility.

An important point to highlight about plugins is the maturity stage. The maturity classification model consists of two parallel lanes:

Community Lane: For community and customer-sponsored projects.

Customer Lane: For projects which are part of the official Pentaho offering. Projects in the Customer Lane can start as projects developed in the Community Lane that create value for Pentaho subscription customers.

There are four stages in each lane. To put it simply, stage 1 means that the plugin is under development (it is usually a lab experiment), while stage 4 indicates a mature state; a plugin in stage 4 is successfully adopted and could be used in production environments. Stages 2 and 3 are stages in between these two.

For a full explanation of the model and the maturity stages, you can refer to https://community.hds.com/docs/DOC-1009876.

That said, let's go back to Spoon. You can access the Marketplace page by clicking on Marketplace from the Tools menu. The page is quite simple, as shown in the following screenshot:

Marketplace

By default, you see the list of all the Available/Installed plugins. There is a secondary tab where you can filter just the installed ones.

Also, you can filter by plugin Type and by maturity Stage. And if you are looking for a particular plugin, there is also a Search textbox available.

Once in the Marketplace page, for every plugin you can see:

The name

The author

The maturity stage

The status: Available or Installed

The branch and version

A button for installing the plugin, or a check telling you that the plugin is already installed

If you click on the plugin name, a pop-up window shows up displaying the full description for the selected plugin, as shown in the following example:

Sample plugin in Marketplace

Besides browsing the list of plugins, you can install or uninstall them:

In order to install a plugin, there is an Install button in the plugin list and also in the pop-up window

If the plugin is already installed, the pop-up window will also offer the option for uninstalling it, as in the previous example

Note that some plugins are only available in Pentaho Enterprise Edition. For a particular plugin, you can find this information as part of its full description.

It's premature to decide if you need to install a plugin for your work. So let's put this subject aside for a while; we will get back to this feature later in the book.

Introducing transformations

Till now, you've just opened and customized the look and feel of Spoon. It's time to do some interesting tasks beyond looking around. As mentioned before, in PDI we basically work with two kinds of artifacts: transformations and jobs. In this section, we will introduce transformations. First of all, we will introduce some basic definitions. Then, we will design, preview, and run our first Transformation.

The basics about transformations

A Transformation is an entity made of steps linked by hops. These steps and hops build paths through which data flows: the data enters or is created in a step, the step applies some kind of Transformation to it, and finally, the data leaves that step. Therefore, it's said that a Transformation is data flow oriented. Graphically, steps are represented with small boxes, while hops are represented by directional arrows, as depicted in the following sample:

Steps and hops

A Transformation itself is neither a program nor an executable file. It is just plain XML. The Transformation contains metadata, which tells the Kettle engine what to do.
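To give you an idea of what that metadata looks like, here is a heavily trimmed sketch of a saved Transformation file (a .ktr file), using the steps of the Hello World example built later in this chapter. A real file contains many more elements, and the exact tags below are indicative only:

<transformation>
  <info>
    <name>hello world</name>
  </info>
  <order>
    <!-- hops: how data flows between steps -->
    <hop>
      <from>Data Grid</from>
      <to>UDJE</to>
      <enabled>Y</enabled>
    </hop>
  </order>
  <!-- one entry per step, with its type and configuration -->
  <step>
    <name>Data Grid</name>
    <type>DataGrid</type>
  </step>
  <step>
    <name>UDJE</name>
    <type>Janino</type>
  </step>
</transformation>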

A step is the minimal unit inside a Transformation. A big set of steps is available, either out of the box or through the Marketplace, as explained before. These steps are grouped in categories, as, for example, input, output, or transform. Each step is conceived to accomplish a specific function, going from a task as simple as reading a parameter to normalizing a dataset.

A hop is a graphical representation of data flowing between two steps: an origin and a destination. The data that flows through that hop constitutes the output data of the origin step and the input data of the destination step.

That's enough theory for now. Let's see it in practice.

Creating a Hello World! Transformation

In this section, we will design, preview, and run a simple Hello World! Transformation; simple, but good enough for our first practical example.

Designing a Transformation

Here are the steps to start working on our very first Transformation. All you need for starting is to have PDI installed:

Open Spoon. From the main menu, navigate to File | New | Transformation.

On the left of the screen, under the Design tab, you'll see a tree of Steps. Expand the Input branch by double-clicking on it.

Note that if you work in Mac OS, a single click is enough.

Then, left-click on the Data Grid icon and, without releasing the button, drag and drop the selected icon to the main canvas. The screen will look like the following screenshot:

Dragging and dropping a step
The dotted grid appeared as a consequence of the changes we made in the options window. Also, note that we changed the preferred language back to English.

Double-click on the Data Grid step you just put on the canvas, and fill the Meta tab as follows:

Configuring a metadata tab

Now select the Data tab and fill the grid with some names, as in the following screenshot. Then click on OK to close the window:

Filling a Data tab
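The screenshots are not reproduced here, so as a guide, here is a plausible way to fill the two grids. The field name and the sample values are illustrative, not taken from the original screenshots:

Meta tab (a single field):

name, String

Data tab (some sample rows):

Mary
John
Peter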

From the Steps tree, double-click on the Scripting branch, click on the User Defined Java Expression icon, and drag and drop it to the main canvas.

Put the mouse cursor over the Data Grid step and wait until a tiny toolbar shows up over the Data Grid icon, as shown next:

Mouseover assistance toolbar

Click on the output connector (the icon highlighted in the preceding image) and drag it towards the User Defined Java Expression (UDJE) step. A greyed hop is displayed.

When the mouse cursor is over the UDJE step, release the button. A link (a hop, from now on) is created from the Data Grid step to the UDJE step. The screen should look like this:

Connecting steps with a hop

Double-click the UDJE icon and fill the grid as shown. Then close the window:

Configuring a UDJE step
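The screenshot is not reproduced here either. A plausible way to fill the UDJE grid, consistent with the hello_message field mentioned below (and assuming the Data Grid field was called name, as in the earlier sketch), is a single row like this:

New field: hello_message
Java expression: "Hello, " + name + "!"
Value type: String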

Done! We have a draft for our first Transformation. A Data Grid with the names of a list of people, and a script step that builds the hello_message.

Before continuing, let's just add some color note to our work. This is totally optional, but as your work gets more complicated, it's highly recommended that you comment your transformations:

Right-click anywhere on the canvas to bring up a contextual menu.

In the menu, select the New note option. A note editor appears.

Type some description, such as Hello, World!. Select the Font style tab, choose some nice font and colors for your note, and then click on OK. This should be the final result:

Hello World Transformation

The final step is to save the work:

From the main menu, navigate to Edit | Settings.... A window appears to specify Transformation properties. Fill the Transformation name textbox with a simple name, such as hello world. Fill the Description textbox with a short description, such as My first transformation. Finally, provide a clearer explanation in the Extended description textbox, and then click on OK.

From the main menu, navigate to File | Save and save the Transformation in a folder of your choice with the name hello_world.

The next step is to preview the data produced and run the Transformation.

Previewing and running a Transformation

Now we will preview and run the Transformation created earlier. Note the difference between the two:

The Preview functionality allows you to see a sample of the data produced by selected steps

The Run option effectively runs the whole Transformation

In our Transformation, we will preview the output of the User Defined Java Expression step:

Select the User Defined Java Expression step by left-clicking on it.

Click on the Preview icon in the Transformation toolbar at the top of the main canvas:

Preview icon in the Transformation toolbar

The Transformation debug dialog window will appear. Click on the Quick Launch button.

A window will appear to preview the data generated by the Transformation, as shown in the following screenshot:

Previewing the Hello World Transformation

Close the preview window.

You can preview the output of any step in the Transformation at any point during your design process. You can also preview the data even if you haven't yet saved your work.

Once we have the Transformation ready, we can run it:

Click on the Run icon:

Run icon in the Transformation toolbar

A window named Run Options appears. Click on Run.

You need to save the Transformation before you run it. If you have modified the Transformation without saving it, you will be prompted to do so.

At the bottom of the screen, you should see a log with the result of the execution:

Sample execution results window

Whether you preview or run a Transformation, you'll get an Execution Results window showing what happened. You will learn more about this in Chapter 2, Getting Started with Transformations.

Installing useful related software

Before skipping to the next chapter, let's devote some time to the installation of extra software that will complement our work with PDI.

First of all, it is really important that you have a nice text editor. You will need it for preparing testing data, for reading files before ingesting them with PDI, for viewing data that comes out of transformations, and for reviewing logs. A couple of examples of good text editors are Notepad++ and Sublime Text.

You will be working with spreadsheets, so another useful piece of software is a spreadsheet editor, for example, OpenOffice Calc.

In Chapter 10, Performing Basic Operations with Databases, and Chapter 11, Loading Data Marts with PDI, you will work with databases. As PostgreSQL has become a widely used and popular open source database, it is the engine chosen for the database-related tutorials in this book. So, if you intend to work with databases from PDI, you will need access to a PostgreSQL database engine.