Get up and running with the Pentaho Data Integration tool using this hands-on, easy-to-read guide
This book is a must-have for software developers, business intelligence analysts, IT students, or anyone involved or interested in developing ETL solutions. If you plan on using Pentaho Data Integration for doing any data manipulation task, this book will help you as well. This book is also a good starting point for data warehouse designers, architects, or anyone who is responsible for data warehouse projects and needs to load data into them.
Pentaho Data Integration (PDI) is an intuitive and graphical environment packed with drag-and-drop design and powerful Extract-Transform-Load (ETL) capabilities. This book shows and explains the new interactive features of Spoon, the revamped look and feel, and the newest features of the tool, including the Transformation and Job Executors and the invaluable Metadata Injection capability.
We begin with the installation of the PDI software and then move on to cover all the key PDI concepts. Each chapter introduces new features, enabling you to gradually start practicing with the tool. First, you will learn to do all kinds of data manipulation and to work with simple plain files. Then, the book teaches you how to work with relational databases inside PDI. Moreover, you will be given a primer on data warehouse concepts and learn how to load data into a data warehouse. Throughout the book, you will become familiar with its intuitive, graphical, drag-and-drop design environment.
By the end of this book, you will have learned everything you need to know in order to meet your data manipulation requirements. Besides, you will be given best practices and advice for designing and deploying your projects.
A step-by-step guide filled with practical, real-world scenarios and examples.
BIRMINGHAM - MUMBAI
Copyright © 2017 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: April 2010
Second edition: October 2013
Production reference: 1011217
ISBN 978-1-78829-243-6
www.packtpub.com
Author
María Carina Roldán
Copy Editor
Tasneem Fatehi
Reviewers
Diethard Steiner
Paulo Pires
Miguel Gaspar
Project Coordinator
Manthan Patel
Commissioning Editor
Amey Varangaonkar
Proofreader
Safis Editing
Acquisition Editor
Tushar Gupta
Indexer
Rekha Nair
Content Development Editor
Tejas Limkar
Graphics
Tania Dutta
Technical Editor
Sayli Nikalje
Production Coordinator
Shraddha Falebhai
María Carina Roldán was born in Argentina; she earned her Bachelor's degree in Computer Science at the Universidad Nacional de La Plata (UNLP) and then moved to Buenos Aires, where she has lived since 1994.
She has worked as a BI consultant for almost twenty years. She started working with the Pentaho technology back in 2006. Since then, she has been dedicated full time to developing BI solutions, mainly as an ETL specialist using Pentaho tools. In all these years, she worked for different companies, including Assert Solutions, a Pentaho partner company based in Argentina, and Webdetails, a Portuguese company acquired by Pentaho in 2013. Currently, she works as an independent consultant.
Carina is the author of the first and second editions of Pentaho Data Integration Beginner's Guide, Packt Publishing, and the co-author of the first and second editions of Pentaho Data Integration Cookbook, Packt Publishing.
Diethard Steiner is one of the very early Pentaho adopters. He has implemented business intelligence projects for various clients for more than a decade, covering everything from data integration to multi-dimensional cubes and dashboards. Over the last few years, he has gained in-depth experience in utilizing Pentaho tools in the big data world. These days, he runs his own independent consultancy company, Bissol Consulting Ltd, in London.
Diethard has been a very active Pentaho community member and regularly publishes articles on Pentaho and the wider business intelligence world on his GitHub blog (http://diethardsteiner.github.io/).
Paulo Pires is a Geographical Engineer who, in his early thirties, decided to apply his skills in the business intelligence area. He started working at Webdetails more than 6 years ago and specialized in building dashboards with CTools and Pentaho, along with doing some ETL work when needed.
Two years ago, Webdetails became a Pentaho company, and last month Hitachi Data Systems, Hitachi Insight Group, and Pentaho merged into a single company called Hitachi Vantara, where Paulo Pires is a Senior Consultant.
Miguel Gaspar started working at Webdetails about 3 years ago, some time before the acquisition of Webdetails by Pentaho. He was a consultant on the Implementation team, and his work involved developing dashboard solutions as part of services. He is now acting as the technical owner of some of the implementation projects as part of the Webdetails team in Pentaho.
He likes to be as professional as possible, but in an informal way. One of his favorite hobbies is learning and his particular areas of interest are: business analytics, predictive analysis and big data, augmented reality, and cloud computing. He likes to play and is a huge martial arts fan and also one of the worst soccer players ever. He is married and a parent of two young and lovely daughters, who would like to spend more time playing like crazies with him. He also likes to spend time with friends or just having a drink and a good talk with someone else, if possible with his family at his side. He really hates liars.
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www.packtpub.com/mapt
Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/178829243X. If you'd like to join our team of regular reviewers, you can email us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
Getting Started with Pentaho Data Integration
Pentaho Data Integration and Pentaho BI Suite
Introducing Pentaho Data Integration
Using PDI in real-world scenarios
Loading data warehouses or data marts
Integrating data
Data cleansing
Migrating information
Exporting data
Integrating PDI along with other Pentaho tools
Installing PDI
Launching the PDI Graphical Designer - Spoon
Starting and customizing Spoon
Exploring the Spoon interface
Extending the PDI functionality through the Marketplace
Introducing transformations
The basics about transformations
Creating a Hello World! Transformation
Designing a Transformation
Previewing and running a Transformation
Installing useful related software
Summary
Getting Started with Transformations
Designing and previewing transformations
Getting familiar with editing features
Using the mouseover assistance toolbar
Adding steps and creating hops
Working with grids
Designing transformations
Putting the editing features in practice
Previewing and fixing errors as they appear
Looking at the results in the execution results pane
The Logging tab
The Step Metrics tab
Running transformations in an interactive fashion
Understanding PDI data and metadata
Understanding the PDI rowset
Adding or modifying fields by using different PDI steps
Explaining the PDI data types
Handling errors
Implementing the error handling functionality
Customizing the error handling
Summary
Creating Basic Task Flows
Introducing jobs
Learning the basics about jobs
Creating a Simple Job
Designing and running jobs
Revisiting the Spoon interface and the editing features
Designing jobs
Getting familiar with the job design process
Looking at the results in the Execution results window
The Logging tab
The Job metrics tab
Enriching your work by sending an email
Running transformations from a Job
Using the Transformation Job Entry
Understanding and changing the flow of execution
Changing the flow of execution based on conditions
Forcing a status with an abort Job or success entry
Changing the execution to be synchronous
Managing files
Creating a Job that moves some files
Selecting files and folders
Working with regular expressions
Summarizing the Job entries that deal with files
Customizing the file management
Knowing the basics about Kettle variables
Understanding the kettle.properties file
How and when you can use variables
Summary
Reading and Writing Files
Reading data from files
Reading a simple file
Troubleshooting reading files
Learning to read all kinds of files
Specifying the name and location of the file
Reading several files at the same time
Reading files that are compressed or located on a remote server
Reading a file whose name is known at runtime
Describing the incoming fields
Reading Date fields
Reading Numeric fields
Reading only a subset of the file
Reading the most common kinds of sources
Reading text files
Reading spreadsheets
Reading XML files
Reading JSON files
Outputting data to files
Creating a simple file
Learning to create all kinds of files and write data into them
Providing the name and location of an output file
Creating a file whose name is known only at runtime
Creating several files whose names depend on the content of the file
Describing the content of the output file
Formatting Date fields
Formatting Numeric fields
Creating the most common kinds of files
Creating text files
Creating spreadsheets
Creating XML files
Creating JSON files
Working with Big Data and cloud sources
Reading files from an AWS S3 instance
Writing files to an AWS S3 instance
Getting data from HDFS
Sending data to HDFS
Summary
Manipulating PDI Data and Metadata
Manipulating simple fields
Working with strings
Extracting parts of strings using regular expressions
Searching and replacing using regular expressions
Doing some math with Numeric fields
Operating with dates
Performing simple operations on dates
Subtracting dates with the Calculator step
Getting information relative to the current date
Using the Get System Info step
Performing other useful operations on dates
Getting the month names with a User Defined Java Class step
Modifying the metadata of streams
Working with complex structures
Working with XML
Introducing XML terminology
Getting familiar with the XPath notation
Parsing XML structures with PDI
Reading an XML file with the Get data from XML step
Parsing an XML structure stored in a field
PDI Transformation and Job files
Parsing JSON structures
Introducing JSON terminology
Getting familiar with the JSONPath notation
Parsing JSON structures with PDI
Reading a JSON file with the JSON input step
Parsing a JSON structure stored in a field
Summary
Controlling the Flow of Data
Filtering data
Filtering rows upon conditions
Reading a file and getting the list of words found in it
Filtering unwanted rows with a Filter rows step
Filtering rows by using the Java Filter step
Filtering data based on row numbers
Splitting streams unconditionally
Copying rows
Distributing rows
Introducing partitioning and clustering
Splitting the stream based on conditions
Splitting a stream based on a simple condition
Exploring PDI steps for splitting a stream based on conditions
Merging streams in several ways
Merging two or more streams
Customizing the way of merging streams
Looking up data
Looking up data with a Stream lookup step
Summary
Cleansing, Validating, and Fixing Data
Cleansing data
Cleansing data by example
Standardizing information
Improving the quality of data
Introducing PDI steps useful for cleansing data
Dealing with non-exact matches
Cleansing by doing a fuzzy search
Deduplicating non-exact matches
Validating data
Validating data with PDI
Validating and reporting errors to the log
Introducing common validations and their implementation with PDI
Treating invalid data by splitting and merging streams
Fixing data that doesn't match the rules
Summary
Manipulating Data by Coding
Doing simple tasks with the JavaScript step
Using the JavaScript language in PDI
Inserting JavaScript code using the JavaScript step
Adding fields
Modifying fields
Organizing your code
Controlling the flow using predefined constants
Testing the script using the Test script button
Parsing unstructured files with JavaScript
Doing simple tasks with the Java Class step
Using the Java language in PDI
Inserting Java code using the Java Class step
Learning to insert java code in a Java Class step
Data types equivalence
Adding fields
Modifying fields
Controlling the flow with the putRow() function
Testing the Java Class using the Test class button
Getting the most out of the Java Class step
Receiving parameters
Reading data from additional steps
Redirecting data to different target steps
Parsing JSON structures
Avoiding coding using purpose-built steps
Summary
Transforming the Dataset
Sorting data
Sorting a dataset with the sort rows step
Working on groups of rows
Aggregating data
Summarizing the PDI steps that operate on sets of rows
Converting rows to columns
Converting row data to column data using the Row denormaliser step
Aggregating data with a Row Denormaliser step
Normalizing data
Modifying the dataset with a Row Normaliser step
Going forward and backward across rows
Picking rows backward and forward with the Analytic Query step
Summary
Performing Basic Operations with Databases
Connecting to a database and exploring its content
Connecting with Relational Database Management Systems
Exploring a database with the Database Explorer
Previewing and getting data from a database
Getting data from the database with the Table input step
Using the Table input step to run flexible queries
Adding parameters to your queries
Using Kettle variables in your queries
Inserting, updating, and deleting data
Inserting new data into a database table
Inserting or updating data with the Insert / Update step
Deleting records of a database table with the Delete step
Performing CRUD operations with more flexibility
Verifying a connection, running DDL scripts, and doing other useful tasks
Looking up data in different ways
Doing simple lookups with the Database Value Lookup step
Making a performance difference when looking up data in a database
Performing complex database lookups
Looking for data using a Database join step
Looking for data using a Dynamic SQL row step
Summary
Loading Data Marts with PDI
Preparing the environment
Exploring the Jigsaw database model
Creating the database and configuring the environment
Introducing dimensional modeling
Loading dimensions with data
Learning the basics of dimensions
Understanding dimensions technical details
Loading a time dimension
Introducing and loading Type I slowly changing dimensions
Loading a Type I SCD with a combination lookup/update step
Introducing and loading Type II slowly changing dimension
Loading Type II SCDs with a dimension lookup/update step
Loading a Type II SCD for the first time
Loading a Type II SCD and verifying how history is kept
Explaining and loading Type III SCD and Hybrid SCD
Loading other kinds of dimensions
Loading a mini dimension
Loading junk dimensions
Explaining degenerate dimensions
Loading fact tables
Learning the basics about fact tables
Deciding the level of granularity
Translating the business keys into surrogate keys
Obtaining the surrogate key for Type I SCD
Obtaining the surrogate key for Type II SCD
Obtaining the surrogate key for the junk dimension
Obtaining the surrogate key for a time dimension
Loading a cumulative fact table
Loading a snapshot fact table
Loading a fact table by inserting snapshot data
Loading a fact table by overwriting snapshot data
Summary
Creating Portable and Reusable Transformations
Defining and using Kettle variables
Introducing all kinds of Kettle variables
Explaining predefined variables
Revisiting the kettle.properties file
Defining variables at runtime
Setting a variable with a constant value
Setting a variable with a value unknown beforehand
Setting variables with partial or total results of your flow
Defining and using named parameters
Using variables as fields of your stream
Creating reusable Transformations
Creating and executing sub-transformations
Creating and testing a sub-transformation
Executing a sub-transformation
Introducing more elaborate sub-transformations
Making the data flow between transformations
Transferring data using the copy/get rows mechanism
Executing transformations in an iterative way
Using Transformation executors
Configuring the executors with advanced settings
Getting the results of the execution of the inner transformation
Working with groups of data
Using variables and named parameters
Continuing the flow after executing the inner transformation
Summary
Implementing Metadata Injection
Introducing metadata injection
Explaining how metadata injection works
Creating a template Transformation
Injecting metadata
Discovering metadata and injecting it
Identifying use cases to implement metadata injection
Summary
Creating Advanced Jobs
Enhancing your processes with the use of variables
Running nested jobs
Understanding the scope of variables
Using named parameters
Using variables to create flexible processes
Using variables to name jobs and transformations
Using variables to name Job and Transformation folders
Accessing copied rows for different purposes
Using the copied rows to manage files in advanced ways
Using the copied rows as parameters of a Job or Transformation
Working with filelists
Maintaining a filelist
Using the filelist for different purposes
Attaching files in an email
Copying, moving, and deleting files
Introducing other ways to process the filelist
Executing jobs in an iterative way
Using Job executors
Configuring the executors with advanced settings
Getting the results of the execution of the job
Working with groups of data
Using variables and named parameters
Capturing the result filenames
Summary
Launching Transformations and Jobs from the Command Line
Using the Pan and Kitchen utilities
Running jobs and transformations
Checking the exit code
Supplying named parameters and variables
Using command-line arguments
Deciding between the use of a command-line argument and named parameters
Sending the output of executions to log files
Automating the execution
Summary
Best Practices for Designing and Deploying a PDI Project
Setting up a new project
Setting up the local environment
Defining a folder structure for the project
Dealing with external resources
Defining and adopting a versioning system
Best practices to design jobs and transformations
Styling your work
Making the work portable
Designing and developing reusable jobs and transformations
Maximizing the performance
Analyzing Steps Metrics
Analyzing performance graphs
Deploying the project in different environments
Modifying the Kettle home directory
Modifying the Kettle home in Windows
Modifying the Kettle home in Unix-like systems
Summary
Pentaho Data Integration (also known as Kettle) is an engine, along with a suite of tools, responsible for the processes of Extracting, Transforming, and Loading, better known as the ETL processes. Pentaho Data Integration (PDI) not only serves as an ETL tool but is also used for other purposes, such as migrating data between applications or databases, exporting data from databases to flat files, data cleansing, and much more. PDI has an intuitive, graphical, drag-and-drop design environment, and its ETL capabilities are powerful. However, getting started with PDI can be difficult and confusing. This book provides the guidance needed to overcome that difficulty, covering the key features of PDI. Learning Pentaho Data Integration 8 CE explains the new interactive features of the graphical designer, Spoon, and its revamped look and feel. It also covers the newest features of the tool, including the Transformation and Job Executors and the invaluable metadata injection capability.
The content of the book is based on PDI 8 Community Edition (CE). However, it can be used with the Enterprise Edition (EE) as well. Besides, if you are currently working with an earlier version of the tool, you should know that most of the content is also valid for PDI 6 and PDI 7.
By the end of the book, not only will you have experimented with all kinds of examples, but you will also have gained the knowledge needed to develop useful, portable, reusable, and well-designed processes.
Chapter 1, Getting Started with Pentaho Data Integration, serves as an introduction to PDI, presenting the tool. This chapter includes instructions for installing PDI and gives you the opportunity to play with the graphical designer (Spoon).
Chapter 2, Getting Started with Transformations, explains the fundamentals of working with transformations, including learning the simplest ways of transforming data and getting familiar with the process of designing, debugging, and testing a Transformation. This chapter also explains the basics of handling errors.
Chapter 3, Creating Basic Task Flows, serves as an introduction to the processes in PDI. Through the creation of simple Jobs, you will learn what Jobs are and what they are used for.
Chapter 4, Reading and Writing Files, explains how to get data from several file formats, such as spreadsheets, CSV files, and more. It also explains how to save data in the same kinds of formats.
Chapter 5, Manipulating PDI Data and Metadata, expands the set of operations learned in the previous chapters. Besides exploring new PDI steps for data manipulation, this chapter introduces the Select values step for manipulating metadata. It also explains how to get system information and predefined variables to be used as part of the data flow. The chapter also explains how to read and write XML and JSON structures.
Chapter 6, Controlling the Flow of Data, explains the different options that PDI offers to deal with more than one stream of data: it explains how to combine and split flows of data, filter data, and more.
Chapter 7, Cleansing, Validating, and Fixing Data, offers different ways of cleansing data, and also of dealing with invalid data, either by discarding it or by fixing it.
Chapter 8, Manipulating Data by Coding, explains how JavaScript and Java coding can help in the treatment of data. It shows why you may need to code inside PDI, and explains in detail how to do it.
Chapter 9, Transforming the Dataset, explains techniques for transforming the dataset as a whole; for example, aggregating data or normalizing pivoted tables.
Chapter 10, Performing Basic Operations with Databases, explains how to use PDI to work with databases. The list of topics in this chapter includes connecting to a database, and previewing and getting data. It also covers other basic operations, such as inserting data, looking up data, and more.
Chapter 11, Loading Data Marts with PDI, explains the details about loading simple data marts. It shows how to load common types of dimensions (SCD, Junk, Time, and so on) and also different types of fact tables.
Chapter 12, Creating Portable and Reusable Transformations, explains several techniques for creating versatile transformations that can be used and reused in different scenarios or with different sets of data.
Chapter 13, Implementing Metadata Injection, explains a powerful feature of PDI, which is basically about injecting metadata into a template Transformation at runtime. The Pentaho team has put a huge effort into supporting this feature in the latest PDI versions, so it's worth explaining in detail how it works.
Chapter 14, Creating Advanced Jobs, explains techniques for creating complex processes; for example, iterating over Jobs or manipulating lists of files for different purposes.
Chapter 15, Launching Transformations and Jobs from the Command Line, is a reference not only for running transformations from a Terminal, but also for dealing with the output of the executions.
Chapter 16, Best Practices for Designing and Deploying a PDI Project, covers the setup of a new project and also the best practices that make it easier to develop, maintain, and deploy a project in different environments.
PDI is a multiplatform tool. This means that no matter which operating system you have, you will be able to work with the tool. The only prerequisite is to have JVM 1.8 installed. You will also need an office suite, for example, OpenOffice or LibreOffice, and a good text editor, for example, Sublime Text 3 or Notepad++. Access to a relational database is recommended. The suggested engines are MySQL and PostgreSQL, but you can use other engines of your choice as well.
Having an internet connection while reading is extremely useful too. Several links are provided throughout the book that complement what is explained. Besides, there is the PDI forum, where you can search for answers or post questions if you get stuck with something.
This book is a must-have for software developers, business intelligence analysts, IT students, and everyone involved or interested in developing ETL solutions, or more generally, doing any kind of data manipulation. Those who have never used PDI will benefit the most from the book, but those who have will also find it useful. This book is also a good starting point for data warehouse designers, architects, or anyone who is responsible for data warehouse projects and needs to load data into them.
In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meanings.
Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: Unzip the downloaded file in a folder of your choice, as, for example, c:/util/kettle or /home/pdi_user/kettle.
A block of code is set as follows:
project_name,start_date,end_date
Project A,2016-01-10,2016-01-25
Project B,2016-04-03,2016-07-21
Project C,2017-01-15,???
Project D,2015-09-03,2015-12-20
Project E,2016-05-11,2016-05-31
Project F,2011-12-01,2013-11-30
Any command-line input or output is written as follows:
kitchen /file:c:/pdi_labs/hello_world.kjb
New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: Open Spoon from the main menu and navigate to File | New | Transformation.
Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply email [email protected] and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Now that you are the proud owner of a Packt book, we have a number of things to help you get the most from your purchase.
You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files emailed directly to you. You can download the code files by following these steps:
Log in or register to our website using your email address and password.
Hover the mouse pointer on the
SUPPORT
tab at the top.
Click on
Code Downloads & Errata
.
Enter the name of the book in the
Search
box.
Select the book for which you're looking to download the code files.
Choose from the drop-down menu where you purchased this book from.
Click on
Code Download
.
Once the file is downloaded, make sure that you unzip or extract the folder using the latest version of:
WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Learning-Pentaho-Data-Integration-8-CE. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/LearningPentahoDataIntegration8CE_ColorImages.pdf.
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books (maybe a mistake in the text or the code), we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title. To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy of copyrighted material on the internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the internet, do provide us with the location address or the website name immediately, so that we can pursue a remedy. Contact us at [email protected] with a link to the suspected pirated material. We appreciate your help in protecting our authors and our ability to bring you valuable content.
If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.
Pentaho Data Integration (PDI) is an engine along with a suite of tools responsible for the processes of Extracting, Transforming, and Loading (also known as ETL processes). This book is meant to teach you how to use PDI.
In this chapter, you will:
Learn what Pentaho Data Integration is
Install the software and start working with the PDI graphical designer (Spoon)
Explore the Spoon interface
Set up your environment by installing other useful related software
Before introducing PDI, let's talk about Pentaho BI Suite. The Pentaho Business Intelligence Suite is a collection of software applications intended to create and deliver solutions for decision making. The main functional areas covered by the suite are:
Analysis: The analysis engine serves multidimensional analysis. It's provided by the Mondrian OLAP server.
Reporting: The reporting engine allows designing, creating, and distributing reports in various known formats (HTML, PDF, and so on), from different kinds of sources. In the Enterprise Edition of Pentaho, you can also generate interactive reports.
Data mining: Data mining is used for running data through algorithms in order to understand the business and do predictive analysis. Data mining is possible thanks to the Weka project.
Dashboards: Dashboards are used to monitor and analyze Key Performance Indicators (KPIs). CTools is a set of tools and components created to help the user build custom dashboards on top of Pentaho. There are specific CTools for different purposes, including a Community Dashboard Editor (CDE), a very powerful charting library (CCC), and a plugin for accessing data with great flexibility (CDA), among others. While the CTools allow you to develop advanced and custom dashboards, there is a Dashboard Designer, available only in Pentaho Enterprise Edition, that allows you to build dashboards in an easy way.
Data integration: Data integration is used to integrate scattered information from different sources (for example, applications, databases, and files) and make the integrated information available to the final user. PDI, the tool that we will learn to use throughout the book, is the engine that provides this functionality. PDI also interacts with the rest of the tools, as, for example, reading OLAP cubes, generating Pentaho reports, and doing data mining with the R Executor script and the CPython Script Executor.
All of these tools can be used standalone but also integrated. Pentaho tightly couples data integration with analytics in a modern platform: the PDI and Business Analytics Platform. This solution offers critical services, for example:
Authentication and authorization
Scheduling
Security
Web services
Scalability and failover
This set of software and services forms a complete BI Suite, which makes Pentaho the world's leading open source BI option on the market.
Most of the Pentaho engines, including the engines mentioned earlier, were created as community projects and later adopted by Pentaho. The PDI engine is not an exception; Pentaho Data Integration is the new name for the business intelligence tool born as Kettle.
In April 2006, the Kettle project was acquired by the Pentaho Corporation, and Matt Casters, the Kettle founder, also joined the Pentaho team as a data integration architect.
When Pentaho announced the acquisition, James Dixon, the Chief Technology Officer, said:
By joining forces with Pentaho, Kettle benefited from a huge developer community, as well as from a company that would support the future of the project.
From that moment, the tool has grown without pause. Every few months a new release is available, bringing the user improvements in performance and existing functionality, new functionality, and ease of use, along with great changes in look and feel. The following is a timeline of the major events related to PDI since its acquisition by Pentaho:
June 2006: PDI 2.3 was released. Numerous developers had joined the project and there were bug fixes provided by people in various regions of the world. The version included, among other changes, enhancements for large-scale environments and multilingual capabilities.
November 2007: PDI 3.0 emerged totally redesigned. Its major library changed to gain massive performance improvements. The look and feel had also changed completely.
April 2009: PDI 3.2 was released with a really large amount of changes for a minor version: new functionality, visualization and performance improvements, and a huge amount of bug fixes.
June 2010: PDI 4.0 was released, delivering mostly improvements with regard to enterprise features, for example, version control. In the community version, the focus was on several visual improvements.
November 2013: PDI 5.0 was released, offering better previewing of data, easier looping, a lot of big data improvements, an improved plugin marketplace, and hundreds of bug fixes and feature enhancements, as in all releases. In its Enterprise version, it offered interesting low-level features, such as step load balancing, Job transactions, and restartability.
December 2015: PDI 6.0 was released with new features such as data services, data lineage, bigger support for Big Data, and several changes in the graphical designer for improving the PDI user experience. Some months later, PDI 6.1 was released, including metadata injection, a feature that enables the user to modify Transformations at runtime. Metadata injection had been available in earlier versions, but it was in 6.1 that Pentaho started to put a big effort into implementing this powerful feature.
November 2016: PDI 7.0 emerged with many improvements in the enterprise version, including data inspection capabilities, more support for Big Data technologies, and improved repository management. In the community version, the main change was expanded metadata injection support.
November 2017: Pentaho 8.0 was released. The highlights of this latest version are the optimization of processing resources, a better user experience, and enhanced connectivity to streaming data sources for real-time processing.
Paying attention to its name, Pentaho Data Integration, you could think of PDI as a tool to integrate data.
In fact, PDI not only serves as a data integrator or an ETL tool; it is such a powerful tool that it is common to see it used for these and for many other purposes. Here are some examples.
The loading of a data warehouse or a data mart involves many steps, and there are many variants depending on business area or business rules.
However, in every case, with no exception, the process involves the following steps:
Extracting information from one or more databases, text files, XML files, and other sources. The extract process may include the task of validating and discarding data that doesn't match expected patterns or rules.
Transforming the obtained data to meet the business and technical needs required on the target. Transforming includes tasks such as converting data types, doing some calculations, filtering irrelevant data, and summarizing.
Loading the transformed data into the target database or file store. Depending on the requirements, the loading may overwrite the existing information or may add new information each time it is executed.
Kettle comes ready to do every stage of this loading process. The following screenshot shows a simple ETL designed with the tool:
Imagine two similar companies that need to merge their databases in order to have a unified view of the data, or a single company that has to combine information from a main Enterprise Resource Planning (ERP) application and a Customer Relationship Management (CRM) application, though they're not connected. These are just two of hundreds of examples where data integration is needed. The integration is not just a matter of gathering and mixing data; some conversions, validation, and transfer of data have to be done. PDI is meant to do all these tasks.
Data cleansing is about ensuring that the data is correct and precise. This can be achieved by verifying if the data meets certain rules, discarding or correcting those which don't follow the expected pattern, setting default values for missing data, eliminating information that is duplicated, normalizing data to conform to minimum and maximum values, and so on. These are tasks that Kettle makes possible, thanks to its vast set of transformation and validation capabilities.
Think of a company, of any size, that uses a commercial ERP application. One day the owners realize that the licenses are consuming an important share of their budget. So they decide to migrate to an open source ERP. The company will no longer have to pay licenses, but if they want to change, they will have to migrate the information. Obviously, it is not an option to start from scratch or type the information by hand. Kettle makes the migration possible, thanks to its ability to interact with most kinds of sources and destinations, such as plain files, commercial and free databases, and spreadsheets, among others.
Data may need to be exported for numerous reasons:
To create detailed business reports
To allow communication between different departments within the same company
To deliver data from your legacy systems to obey government regulations, and so on
Kettle has the power to take raw data from the source and generate these kinds of ad hoc reports.
The previous examples show typical uses of PDI as a standalone application. However, Kettle may be used embedded as part of a process or a data flow. Some examples are preprocessing data for an online report, sending emails in a scheduled fashion, generating spreadsheet reports, feeding a dashboard with data coming from web services, and so on.
In order to work with PDI, you need to install the software.
Following are the instructions to install the PDI software, irrespective of the operating system you may be using:
Go to the Download page at http://sourceforge.net/projects/pentaho/files/Data Integration.
Choose the newest stable release. At this time, it is 8.0, as shown in the following screenshot:
Download the available zip file, which will serve you for all platforms.
Unzip the downloaded file in a folder of your choice, as, for example, c:/util/kettle or /home/pdi_user/kettle.
And that's all. You have installed the tool in just a few minutes.
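If you prefer to script the installation, on a Unix-like system it boils down to a couple of commands such as the following. The zip file name changes from release to release, so take it only as an example:

mkdir -p /home/pdi_user/kettle
unzip ~/Downloads/pdi-ce-8.0.0.0-28.zip -d /home/pdi_user/kettle
# the zip contains a data-integration folder holding Spoon, Pan, Kitchen, and the libraries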
Now that you've installed PDI, you're ready to start working with the data. That will be possible only inside a graphical environment. PDI has a desktop designer tool named Spoon. Let's launch Spoon and see what it looks like.
Spoon is PDI's desktop design tool. With Spoon, you design, preview, and test all your work, that is, transformations and jobs. When you see PDI screenshots, what you are really seeing are Spoon screenshots. The other PDI components, which you will learn about in the following chapters, are executed from Terminal windows.
Here is how you launch the tool:
Start Spoon. If your system is Windows, run Spoon.bat from within the PDI install directory. On other platforms, such as Unix, Linux, and so on, open a Terminal window and type spoon.sh.
The main window shows up. The Welcome! window appears with some useful links for you to see:
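For example, on Linux or macOS, and assuming PDI was unzipped under /home/pdi_user/kettle as in the earlier example, launching Spoon from a terminal would look like this:

cd /home/pdi_user/kettle/data-integration
sh spoon.sh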
These simple steps would be enough to start working, but before that, it's advisable to customize Spoon to your needs. To do that:
Click on Options... from the Tools menu. A window appears where you can change various general and visual characteristics. Uncheck the highlighted checkbox, as shown in the following screenshot:
Select the Look & Feel tab.
Change the Font for notes, Show Canvas Grid, and Preferred Language settings as shown in the following screenshot:
Click on the OK button.
Restart Spoon in order to apply the changes. You should not see the Welcome! window. You should see a screen full of French words instead, as in the following screenshot:
As you can see, the Options window has a lot of settings. We changed only a few, just to show the feature. Feel free to change the settings according to your needs or preferences.
In particular, take note of the following tip about the selected language.
One of the settings that you changed was the appearance of the Welcome! window at startup. The Welcome! page is full of links to web resources, blogs, forums, books on PDI, and more. Following those links, you will be able to learn more and become active in the Pentaho community. You can reach that window anytime by navigating to the Help | Welcome Screen option.
As explained earlier, Spoon is the tool with which you create, preview, and run transformations. The following screenshot shows you the basic work areas: Main Menu, Main Toolbar, Steps Tree, Transformation Toolbar, and Canvas (Work Area). Note that there is a sample Transformation opened; it allows you to see how the tool looks when you are working with it:
The Steps Tree option is only available in Design view. There is also an area named View that shows the structure of the Transformation currently being edited. You can see that area by clicking on the View tab at the upper-left corner of the screen:
Pentaho Data Integration is built on a pluggable architecture. This means that it can be extended to fulfill needs not included out of the box. The Marketplace—a plugin itself—emerged as a straightforward way for browsing and installing available plugins, developed by the community or even by Pentaho.
In PDI, you will find plugins for connecting to a particular database engine, for executing scripts, for transforming data in new ways, and more. According to the purpose, the plugins are classified into several types: big data, connectivity, and statistics, among others. In particular, there is a type named Experimental, which you will not use except for playing around. There is another type named Deprecated, which we don't recommend you use unless you need it for backward compatibility.
An important point to highlight about plugins is the maturity stage. The maturity classification model consists of two parallel lanes:
Community Lane: For Community and customer-sponsored projects.
Customer Lane: For projects which are part of the official Pentaho offering. Projects in the Customer Lane can start as projects developed in the Community Lane that create value for Pentaho subscription customers.
There are four stages in each lane. To put it simply, stage 1 means that the plugin is under development (it is usually a lab experiment), while stage 4 indicates a mature state; a plugin in stage 4 is successfully adopted and could be used in production environments. Stages 2 and 3 are stages in between these two.
That said, let's go back to Spoon. You can access the Marketplace page by clicking on Marketplace in the Tools menu. The page is quite simple, as shown in the following screenshot:
By default, you see the list of all the Available/Installed plugins. There is a secondary tab where you can filter just the installed ones.
Also, you can filter by plugin Type and by maturity Stage. And if you are looking for a particular plugin, there is also a Search textbox available.
Once in the Marketplace page, for every plugin you can see:
The name
The author
The maturity stage
The status: Available or Installed
The branch and version
A button for installing the plugin or a check telling that the plugin is already installed
If you click on the plugin name, a pop-up window shows up displaying the full description for the selected plugin, as shown in the following example:
Besides browsing the list of plugins, you can install or uninstall them:
In order to install a plugin, there is an Install button in the plugin list and also in the pop-up window
If the plugin is already installed, the pop-up window will also offer the option for uninstalling it, as in the previous example
It's premature to decide if you need to install a plugin for your work. So let's put this subject aside for a while; we will get back to this feature later in the book.
Till now, you've just opened and customized the look and feel of Spoon. It's time to do some interesting tasks beyond looking around. As mentioned before, in PDI we basically work with two kinds of artifacts: transformations and jobs. In this section, we will introduce transformations. First of all, we will introduce some basic definitions. Then, we will design, preview, and run our first Transformation.
A Transformation is an entity made of steps linked by hops. These steps and hops build paths through which data flows: the data enters or is created in a step, the step applies some kind of Transformation to it, and finally, the data leaves that step. Therefore, it's said that a Transformation is data flow oriented. Graphically, steps are represented with small boxes, while hops are represented by directional arrows, as depicted in the following sample:
A Transformation itself is neither a program nor an executable file. It is just plain XML. The Transformation contains metadata, which tells the Kettle engine what to do.
A step is a minimal unit inside a Transformation. A big set of steps is available, either out of the box or from the Marketplace, as explained before. These steps are grouped in categories, as, for example, input, output, or transform. Each step is conceived to accomplish a specific function, ranging from a simple task such as reading a parameter to normalizing a dataset.
A hop is a graphical representation of data flowing between two steps: an origin and a destination. The data that flows through that hop constitutes the output data of the origin step and the input data of the destination step.
That's enough theory for now. Let's see it in practice.
In this section, we will design, preview, and run a simple Hello World! Transformation; simple, but good enough for our first practical example.
Here are the steps to start working on our very first Transformation. All you need for starting is to have PDI installed:
Open Spoon. From the main menu, navigate to File | New | Transformation.
On the left of the screen, under the Design tab, you'll see a tree of Steps. Expand the Input branch by double-clicking on it.
Then, left-click on the Data Grid icon and, without releasing the button, drag and drop the selected icon to the main canvas. The screen will look like the following screenshot:
Double-click on the Data Grid step you just put on the canvas, and fill the Meta tab as follows:
Now select the Data tab and fill the grid with some names, as in the following screenshot. Then click on OK to close the window:
From the Steps tree, double-click on the Scripting branch, click on the User Defined Java Expression icon, and drag and drop it to the main canvas.
Put the mouse cursor over the Data Grid step and wait until a tiny toolbar shows up next to the Data Grid icon, as shown next:
Click on the output connector (the icon highlighted in the preceding image) and drag it towards the User Defined Java Expression (UDJE) step. A grayed-out hop is displayed.
When the mouse cursor is over the UDJE step, release the button. A link (a hop, from now on) is created from the Data Grid step to the UDJE step. The screen should look like this:
Double-click the UDJE icon and fill the grid as shown. Then close the window:
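In case the screenshots are not visible to you, the configuration assumed in this example is minimal, and the field names are only a suggestion: the Data Grid defines a single String field called name (with a few names typed in the Data tab), and the UDJE grid adds one new field derived from it, along these lines:

New field: hello_message
Java expression: "Hello, " + name + "!"
Value type: String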
Done! We have a draft for our first Transformation. A Data Grid with the names of a list of people, and a script step that builds the hello_message.
Before continuing, let's just add a color note to our work. This is totally optional, but as your work gets more complicated, it's highly recommended that you comment your transformations:
Right-click anywhere on the canvas to bring up a contextual menu.
In the menu, select the New note option. A note editor appears.
Type some description, such as Hello, World!. Select the Font style tab and choose some nice font and colors for your note, and then click on OK. This should be the final result:
The final step is to save the work:
From the main menu, navigate to Edit | Settings.... A window appears to specify Transformation properties. Fill the Transformation name textbox with a simple name, such as hello world. Fill the Description textbox with a short description, such as My first transformation. Finally, provide a clearer explanation in the Extended description textbox, and then click on OK.
From the main menu, navigate to File | Save and save the Transformation in a folder of your choice with the name hello_world.
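If you now open the hello_world file you just saved (Spoon gives it a .ktr extension) in a text editor, you can verify what was said before: a Transformation is just plain XML. The following is a heavily simplified, hand-written sketch of what you will find there; real files contain many more elements, and the exact element and type names may vary:

<transformation>
  <info>
    <name>hello world</name>
  </info>
  <step>
    <name>Data Grid</name>
    <type>DataGrid</type>
    <!-- the fields and rows entered in the step dialog go here -->
  </step>
  <step>
    <name>UDJE</name>
    <type>Janino</type>
    <!-- the Java expressions entered in the step dialog go here -->
  </step>
  <order>
    <hop>
      <from>Data Grid</from>
      <to>UDJE</to>
      <enabled>Y</enabled>
    </hop>
  </order>
</transformation>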
The next step is to preview the data produced and run the Transformation.
Now we will preview and run the Transformation created earlier. Note the difference between the two:
The Preview functionality allows you to see a sample of the data produced for selected steps
The Run option effectively runs the whole Transformation
In our Transformation, we will preview the output of the User Defined Java Expression step:
Select the User Defined Java Expression step by left-clicking on it.
Click on the Preview icon in the toolbar just above the main canvas:
The Transformation debug dialog window will appear. Click on the Quick Launch button.
A window will appear to preview the data generated by the Transformation, as shown in the following screenshot:
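Assuming the sample names loaded into the Data Grid and the expression suggested earlier, the preview grid would show something like this (your own data will differ):

name      hello_message
Maria     Hello, Maria!
John      Hello, John!
Zoe       Hello, Zoe!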
Close the preview window.
Once we have the Transformation ready, we can run it:
Click on the Run icon:
A window named Run Options appears. Click on Run.
At the bottom of the screen, you should see a log with the result of the execution:
Whether you preview or run a Transformation, you'll get an Execution Results window showing what happened. You will learn more about this in Chapter 2, Getting Started with Transformations.
Before skipping to the next chapter, let's devote some time to the installation of extra software that will complement our work with PDI.
First of all, it is really important that you have a nice text editor. You will need it for preparing testing data, for reading files before ingesting them with PDI, for viewing data that comes out of transformations, and for reviewing logs. A couple of examples of good text editors are Notepad++ and Sublime Text.
You will be working with spreadsheets, so another useful piece of software is a spreadsheet editor, as, for example, OpenOffice Calc.
In Chapter 10, Performing Basic Operations with Databases, and Chapter 11, Loading Data Marts with PDI, you will work with databases. As PostgreSQL has become a widely used and popular open source database, it was the database engine chosen for the database-related tutorials in this book. So, if you intend to work with databases from PDI, you will need access to a PostgreSQL database engine.
