


Table of Contents

Pentaho Data Integration 4 Cookbook
Credits
About the Authors
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers and more
Why Subscribe?
Free Access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Working with Databases
Introduction
Sample databases
Pentaho BI platform databases
Connecting to a database
Getting ready
How to do it...
How it works...
There's more...
Avoiding creating the same database connection over and over again
Avoiding modifying jobs and transformations every time a connection changes
Specifying advanced connection properties
Connecting to a database not supported by Kettle
Checking the database connection at run-time
Getting data from a database
Getting ready
How to do it...
How it works...
There's more...
See also
Getting data from a database by providing parameters
Getting ready
How to do it...
How it works...
There's more...
Parameters coming in more than one row
Executing the SELECT statement several times, each for a different set of parameters
See also
Getting data from a database by running a query built at runtime
Getting ready
How to do it...
How it works...
There's more...
See also
Inserting or updating rows in a table
Getting ready
How to do it...
How it works...
There's more...
Alternative solution if you just want to insert records
Alternative solution if you just want to update rows
Alternative way for inserting and updating
See also
Inserting new rows where a simple primary key has to be generated
Getting ready
How to do it...
How it works...
There's more...
Using the Combination lookup/update for looking up
See also
Inserting new rows where the primary key has to be generated based on stored values
Getting ready
How to do it...
How it works...
There's more...
See also
Deleting data from a table
Getting ready
How to do it...
How it works...
See also
Creating or altering a database table from PDI (design time)
Getting ready
How to do it...
How it works...
There's more...
See also
Creating or altering a database table from PDI (runtime)
How to do it...
How it works...
There's more...
See also
Inserting, deleting, or updating a table depending on a field
Getting ready
How to do it...
How it works...
There's more...
Insert, update, and delete all-in-one
Synchronizing after merge
See also
Changing the database connection at runtime
Getting ready
How to do it...
How it works...
There's more...
See also
Loading a parent-child table
Getting ready
How to do it...
How it works...
See also
2. Reading and Writing Files
Introduction
Reading a simple file
Getting ready
How to do it...
How it works...
There's more...
Alternative notation for a separator
About file format and encoding
About data types and formats
Altering the names, order, or metadata of the fields coming from the file
Reading files with fixed width fields
Reading several files at the same time
Getting ready
How to do it...
How it works...
There's more...
Reading unstructured files
Getting ready
How to do it...
How it works...
There's more...
Master/detail files
Log files
Reading files having one field by row
Getting ready
How to do it...
How it works...
There's more...
See also
Reading files with some fields occupying two or more rows
Getting ready
How to do it...
How it works...
See also
Writing a simple file
Getting ready
How to do it...
How it works...
There's more...
Changing headers
Giving the output fields a format
Writing an unstructured file
Getting ready
How to do it...
How it works...
There's more...
Providing the name of a file (for reading or writing) dynamically
Getting ready
How to do it...
How it works...
There's more...
Get System Info
Generating several files simultaneously with the same structure, but different names
Using the name of a file (or part of it) as a field
Getting ready
How to do it...
How it works...
Reading an Excel file
Getting ready
How to do it...
How it works...
See also
Getting the value of specific cells in an Excel file
Getting ready
How to do it...
How it works...
There's more...
Labels and values horizontally arranged
Looking for a given cell
Writing an Excel file with several sheets
Getting ready
How to do it...
How it works...
There's more...
See also
Writing an Excel file with a dynamic number of sheets
Getting ready
How to do it...
How it works...
See also
3. Manipulating XML Structures
Introduction
Reading simple XML files
Getting ready
How to do it...
How it works...
There's more...
XML data in a field
XML file name in a field
ECMAScript for XML
See also
Specifying fields by using XPath notation
Getting ready
How to do it...
How it works...
There's more...
Getting data from a different path
Getting data selectively
Getting more than one node when the nodes share their XPath notation
Saving time when specifying XPath
Validating well-formed XML files
Getting ready
How to do it...
How it works...
See also
Validating an XML file against DTD definitions
Getting ready
How to do it...
How it works...
There's more...
See also
Validating an XML file against an XSD schema
Getting ready
How to do it...
How it works...
There's more...
See also
Generating a simple XML document
Getting ready
How to do it...
How it works...
There's more...
Generating fields with XML structures
See also
Generating complex XML structures
Getting ready
How to do it...
How it works...
See also
Generating an HTML page using XML and XSL transformations
Getting ready
How to do it...
How it works...
There's more...
See also
4. File Management
Introduction
Copying or moving one or more files
Getting ready
How to do it...
How it works...
There's more...
Moving files
Detecting the existence of the files before copying them
Creating folders
See also
Deleting one or more files
Getting ready
How to do it...
How it works...
There's more...
Figuring out which files have been deleted
See also
Getting files from a remote server
Getting ready
How to do it...
How it works...
There's more...
Specifying files to transfer
Some considerations about connecting to an FTP server
Access via SFTP
Access via FTPS
Getting information about the files being transferred
See also
Putting files on a remote server
Getting ready
How to do it...
How it works...
There's more...
See also
Copying or moving a custom list of files
Getting ready
How to do it...
How it works...
See also
Deleting a custom list of files
Getting ready
How to do it...
How it works...
See also
Comparing files and folders
Getting ready
How to do it...
How it works...
There's more...
Comparing folders
Working with ZIP files
Getting ready
How to do it...
How it works...
There's more...
Avoiding zipping files
Avoiding unzipping files
See also
5. Looking for Data
Introduction
Looking for values in a database table
Getting ready
How to do it...
How it works...
There's more...
Taking some action when the lookup fails
Taking some action when there are too many results
Looking for non-existent data
See also
Looking for values in a database (with complex conditions or multiple tables involved)
Getting ready
How to do it...
How it works...
There's more...
See also
Looking for values in a database with extreme flexibility
Getting ready
How to do it...
How it works...
There's more...
See also
Looking for values in a variety of sources
Getting ready
How to do it...
How it works...
There's more...
Looking for alternatives when the Stream Lookup step doesn't meet your needs
Speeding up your transformation
Using the Value Mapper step for looking up from a short list of values
See also
Looking for values by proximity
Getting ready
How to do it...
How it works...
There's more...
Looking for values consuming a web service
Getting ready
How to do it...
How it works...
There's more...
See also
Looking for values over an intranet or Internet
Getting ready
How to do it...
How it works...
There's more...
See also
6. Understanding Data Flows
Introduction
Splitting a stream into two or more streams based on a condition
Getting ready
How to do it...
How it works...
There's more...
Avoiding the use of Dummy steps
Comparing against the value of a Kettle variable
Avoiding the use of nested Filter Rows steps
Overcoming the difficulties of complex conditions
Merging rows of two streams with the same or different structures
Getting ready
How to do it...
How it works...
There's more...
Making sure that the metadata of the streams is the same
Telling Kettle how to merge the rows of your streams
See also
Comparing two streams and generating differences
Getting ready
How to do it...
How it works...
There's more...
Using the differences to keep a table up to date
See also
Generating all possible pairs formed from two datasets
How to do it...
How it works...
There's more...
Getting variables in the middle of the stream
Limiting the number of output rows
See also
Joining two or more streams based on given conditions
Getting ready
How to do it...
How it works...
There's more...
See also
Interspersing new rows between existent rows
Getting ready
How to do it...
How it works...
See also
Executing steps even when your stream is empty
Getting ready
How to do it...
How it works...
There's more...
Processing rows differently based on the row number
Getting ready
How to do it...
How it works...
There's more...
Identifying specific rows
Identifying the last row in the stream
Avoiding using an Add sequence step to enumerate the rows
See also
7. Executing and Reusing Jobs and Transformations
Introduction
Sample transformations
Sample transformation: Hello
Sample transformation: Random list
Sample transformation: Sequence
Sample transformation: File list
Launching jobs and transformations
Executing a job or a transformation by setting static arguments and parameters
Getting ready
How to do it...
How it works...
There's more...
See also
Executing a job or a transformation from a job by setting arguments and parameters dynamically
Getting ready
How to do it...
How it works...
There's more...
See also
Executing a job or a transformation whose name is determined at runtime
Getting ready
How to do it...
How it works...
There's more...
See also
Executing part of a job once for every row in a dataset
Getting ready
How to do it...
How it works...
There's more...
Accessing the copied rows from jobs, transformations, and other entries
Executing a transformation once for every row in a dataset
Executing a transformation or part of a job once for every file in a list of files
See also
Executing part of a job several times until a condition is true
Getting ready
How to do it...
How it works...
There's more...
Implementing loops in a job
Using the JavaScript step to control the execution of the entries in your job
See also
Creating a process flow
Getting ready
How to do it...
How it works...
There's more...
Serializing/De-serializing data
Other means for transferring or sharing data between transformations
Moving part of a transformation to a subtransformation
Getting ready
How to do it...
How it works...
There's more...
8. Integrating Kettle and the Pentaho Suite
Introduction
A sample transformation
Creating a Pentaho report with data coming from PDI
Getting ready
How to do it...
How it works...
There's more...
Configuring the Pentaho BI Server for running PDI jobs and transformations
Getting ready
How to do it...
How it works...
There's more...
See also
Executing a PDI transformation as part of a Pentaho process
Getting ready
How to do it...
How it works...
There's more...
Specifying the location of the transformation
Supplying values for named parameters, variables and arguments
Keeping things simple when it's time to deliver a plain file
See also
Executing a PDI job from the Pentaho User Console
Getting ready
How to do it...
How it works...
There's more...
See also
Generating files from the PUC with PDI and the CDA plugin
Getting ready
How to do it...
How it works...
There's more...
Populating a CDF dashboard with data coming from a PDI transformation
Getting ready
How to do it...
How it works...
There's more...
See also
9. Getting the Most Out of Kettle
Introduction
Sending e-mails with attached files
Getting ready
How to do it...
How it works...
There's more...
Sending logs through an e-mail
Sending e-mails in a transformation
Generating a custom log file
Getting ready
How to do it...
How it works...
There's more...
Filtering the log file
Creating a clean log file
Isolating log files for different jobs or transformations
See also
Programming custom functionality
Getting ready
How to do it...
How it works...
There's more...
Data type equivalence
Generalizing your code
Looking up information with additional steps
Customizing logs
Scripting alternatives to the UDJC step
Generating sample data for testing purposes
How to do it...
How it works...
There's more...
Using Data grid step to generate specific data
Working with subsets of your data
See also
Working with Json files
Getting ready
How to do it...
How it works...
There's more...
Reading Json files dynamically
Writing Json files
Getting information about transformations and jobs (file-based)
Getting ready
How to do it...
How it works...
There's more...
Transformation XML nodes
Job XML nodes
Steps and entries information
See also
Getting information about transformations and jobs (repository-based)
Getting ready
How to do it...
How it works...
There's more...
Transformation tables
Job tables
Database connections tables
A. Data Structures
Book's data structure
Books
Authors
Museum's data structure
Museums
Cities
Outdoor data structure
Products
Categories
Steel Wheels structure
Index

Pentaho Data Integration 4 Cookbook

Pentaho Data Integration 4 Cookbook

Copyright © 2011 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: June 2011

Production Reference: 1170611

Published by Packt Publishing Ltd.

32 Lincoln Road

Olton

Birmingham, B27 6PA, UK.

ISBN 978-1-849515-24-5

www.packtpub.com

Cover Image by Ed Maclean (<[email protected]>)

Credits

Authors

Adrián Sergio Pulvirenti

María Carina Roldán

Reviewers

Jan Aertsen

Pedro Alves

Slawomir Chodnicki

Paula Clemente

Samatar Hassan

Nelson Sousa

Acquisition Editor

Usha Iyer

Development Editor

Neha Mallik

Technical Editors

Conrad Sardinha

Azharuddin Sheikh

Project Coordinator

Joel Goveya

Proofreaders

Stephen Silk

Aaron Nash

Indexer

Tejal Daruwale

Graphics

Nilesh Mohite

Production Coordinator

Kruthika Bangera

Cover Work

Kruthika Bangera

About the Authors

Adrián Sergio Pulvirenti was born in Buenos Aires, Argentina, in 1972. He earned his Bachelor's degree in Computer Sciences at UBA, one of the most prestigious universities in South America.

He has dedicated more than 15 years to developing desktop and web-based software solutions. Over the last few years he has been leading integration projects and development of BI solutions.

I'd like to thank my lovely kids Camila and Nicolas, who understood that I couldn't share the usual videogame sessions with them during the writing process. I'd also like to thank my wife, who introduced me to the Pentaho world.

María Carina Roldán was born in Esquel, Argentina, in 1970. She earned her Bachelor's degree in Computer Science at UNLP in La Plata; after that, she completed a postgraduate course in Statistics at the University of Buenos Aires (UBA) in Buenos Aires city, where she has lived since 1994.

She has worked as a BI consultant for more than 10 years. Over the last four years, she has been dedicated full time to developing BI solutions using Pentaho Suite. Currently she works for Webdetails, one of the main Pentaho contributors.

She is the author of Pentaho 3.2 Data Integration: Beginner's Guide, published by Packt Publishing in April 2010.

You can follow her on Twitter at @mariacroldan.

I'd like to thank those who have encouraged me to write this book: on one hand, the Pentaho community, who gave me rewarding feedback after the Beginner's book; on the other hand, my husband, who without hesitation agreed to write the book with me. Without them I'm not sure I would have embarked on a new book project.

I'd also like to thank the technical reviewers for the time and dedication that they have put in reviewing the book. In particular, thanks to my colleagues at Webdetails; it's a pleasure and a privilege to work with them every day.

About the Reviewers

Jan Aertsen has worked in IT and decision support for the past 10 years. Since the beginning of his career he has specialized in data warehouse design and business intelligence projects. He has worked on numerous global data warehouse projects within the fashion industry, retail, banking and insurance, telco and utilities, logistics, automotive, and public sector.

Jan holds the degree of Commercial Engineer in international business affairs from the Catholic University of Leuven (Belgium) and extended his knowledge in the field of business intelligence through a Master's in Artificial Intelligence.

In 1999 Jan started up the business intelligence activities at IOcore together with some of his colleagues, rapidly making this the most important revenue area of the Belgian affiliate. They quickly gained access to a range of customers such as KPN Belgium, Orange (now Base), Mobistar, and other Belgian telcos.

After this experience, Jan joined Cap Gemini Ernst & Young in Italy and rapidly became one of their top BI project managers. After managing some large BI projects (up to €1 million), Jan decided to leave the company and pursue his own ambitions.

In 2002, he founded kJube as an independent platform to develop his ambitions in the world of business intelligence. Since then this has resulted in collaborations with numerous companies such as Volvo, Fendi-LVMH, ING, MSC, Securex, SDWorx, Blinck, and Beate Uhse.

Over the years Jan has worked his way through every possible aspect of business intelligence, from KPI and strategy definition through budgeting, tool selection, and software acquisition, to project management and all implementation aspects with most of the available tools. He knows both the business side and the IT side of business intelligence, and is therefore one of the rare people able to give you sound, all-round, vendor-independent advice on business intelligence.

He continues to share his experiences in the field through his blog (blog.kjube.be) and can be contacted at <[email protected]>.

Pedro Alves is the founder of Webdetails. A physicist by training, he is a serious video gamer, volleyball player, open source enthusiast, and dad of two lovely children.

Since his early professional years he has been responsible for business software development, and his career led him to work as a consultant in several Portuguese companies.

In 2008 he decided it was time to build on his accumulated experience and share his knowledge of the Pentaho Business Intelligence platform on his own. He founded Webdetails and joined the Mozilla metrics team. Now he leads an international team of BI consultants and keeps nurturing Webdetails as a world reference in Pentaho BI solutions and community contribution. He is the Ctools (CDF, CDA, CDE, CBF, CST, CCC) architect and, on a daily basis, keeps developing and improving new components and features to extend and maximize Pentaho's capabilities.

Slawomir Chodnicki specializes in data warehousing and ETL, with a background in web development using various programming languages and frameworks. He has established his blog at http://type-exit.org to help fellow BI developers embrace the possibilities of PDI and other open source BI tools.

I would like to thank all regular members of the ##pentaho IRC channel for their endless patience and support regarding PDI related questions. Very special thanks go to María Carina and Adrián Sergio for creating the Kettle Cookbook and inviting me to be part of the project.

Paula Clemente was born in Sintra, Portugal, in 1983. Torn between the idea of spending her life caring for people and animals and spending quality time with computers, she started studying Computer Science at IST Engineering College—"the Portuguese MIT"—at a time when Internet social networking was synonymous with IRC. She graduated in 2008 after completing her Master's thesis on Business Process Management. Since then she has been proudly working as a BI Consultant for Webdetails, a Portuguese company specialized in delivering Pentaho BI solutions, which earned the Pentaho "Best Community Contributor 2011" award.

Samatar Hassan is an application developer focusing on data integration and business intelligence. He has been involved in the Kettle project since the year it was open sourced. He tries to help the community by contributing in different ways: taking on the French translation effort, participating in the forums, resolving bugs, and adding new features to the software.

He contributed to the "Pentaho Kettle Solutions" book, published by Wiley and written by Matt Casters, the founder of Kettle.

I would first like to thank Adrián Sergio and María Carina Roldán for taking the time to write this book. It is a great idea to show how to take advantage of Kettle through step-by-step recipes. Kettle users have their own ETL bible now.

Finally, I'd like to thank all community members. They are the real power of open source software.

Nelson Sousa is a business intelligence consultant at Webdetails. He's part of the Metrics team at Mozilla, where he helps develop and maintain Mozilla's Pentaho server and solution. He specializes in Pentaho dashboards using CDF, CDE, and CDA, and also in PDI, processing vast amounts of information that are integrated daily into the various dashboards and reports that are part of the Metrics team's day-to-day life.

www.PacktPub.com

Support files, eBooks, discount offers and more

You might want to visit www.PacktPub.com for support files and downloads related to your book.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

http://PacktLib.PacktPub.com

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can access, read and search across Packt's entire library of books. 

Why Subscribe?

Fully searchable across every book published by Packt
Copy and paste, print and bookmark content
On demand and accessible via web browser

Free Access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view nine entirely free books. Simply use your login credentials for immediate access.

We dedicate this book to our family and especially our adorable kids.

- María Carina and Adrián -

Preface

Pentaho Data Integration (PDI, also called Kettle), one of the leading data integration tools, is broadly used for all kinds of data manipulation, such as migrating data between applications or databases, exporting data from databases to flat files, data cleansing, and much more. Do you need quick solutions to the problems you face while using Kettle?

Pentaho Data Integration 4 Cookbook explains Kettle features in detail through clear and practical recipes that you can quickly apply to your solutions. The recipes cover a broad range of topics including processing files, working with databases, understanding XML structures, integrating with Pentaho BI Suite, and more.

Pentaho Data Integration 4 Cookbook shows you how to take advantage of all the aspects of Kettle through a set of practical recipes organized to find quick solutions to your needs. The initial chapters explain the details about working with databases, files, and XML structures. Then you will see different ways for searching data, executing and reusing jobs and transformations, and manipulating streams. Further, you will learn all the available options for integrating Kettle with other Pentaho tools.

Pentaho Data Integration 4 Cookbook has plenty of recipes with easy step-by-step instructions to accomplish specific tasks. There are examples and code that are ready for adaptation to individual needs.

Learn to solve data manipulation problems using the Pentaho Data Integration tool Kettle.

What this book covers

Chapter 1, Working with Databases, helps you deal with databases in Kettle. The recipes cover creating and sharing connections, loading tables under different scenarios, and creating dynamic SQL statements, among other topics.

Chapter 2, Reading and Writing Files, shows you not only the basics of reading and writing files, but also all the how-tos for dealing with files. The chapter includes parsing unstructured files, reading master/detail files, generating multi-sheet Excel files, and more.

Chapter 3, Manipulating XML Structures, teaches you how to read, write, and validate XML data. It covers both simple and complex XML structures.

Chapter 4, File Management, helps you to pick and configure the different options for copying, moving, and transferring lists of files or directories.

Chapter 5, Looking for Data, explains the different methods for searching for information in databases, text files, web services, and more.

Chapter 6, Understanding Data Flows, focuses on the different ways of combining, splitting, or manipulating streams or flows of data in simple and complex situations.

Chapter 7, Executing and Reusing Jobs and Transformations, explains, in a simple fashion, topics that are critical for building complex PDI projects: for example, building reusable jobs and transformations, iterating the execution of a transformation over a list of data, and transferring data between transformations.

Chapter 8, Integrating Kettle and the Pentaho Suite, explains how PDI, also known as Kettle, fits into the Pentaho Business Intelligence Suite. As part of the suite, it can be used in combination with other components, for example as the data source for reporting, or as part of a bigger process. This chapter shows you how to run Kettle jobs and transformations in that context.

Chapter 9, Getting the Most Out of Kettle, covers a wide variety of topics, such as customizing a log file, sending e-mails with attachments, and creating custom functionality.

Appendix, Data Structures, describes some structures used in several recipes throughout the book.

What you need for this book

PDI is a multiplatform tool, meaning that you will be able to install it no matter what your operating system is. The only prerequisite for working with PDI is to have JVM 1.5 or a higher version installed. It is also useful to have Excel or Calc, a nice text editor, and access to a database engine of your preference.
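Before installing PDI, you can confirm the JVM prerequisite from a terminal. The following sketch (plain POSIX shell; the message text is our own, not from the book) checks whether a `java` command is on the path and reports its version:

```shell
# Check that a JVM is available before installing PDI.
# PDI 4 requires Java 1.5 or newer; 'java -version' prints to stderr.
if command -v java >/dev/null 2>&1; then
  java -version 2>&1 | head -n 1
else
  echo "No JVM found: install Java 1.5 or newer before running PDI"
fi
```

If the reported version is below 1.5, install a newer JDK or JRE before launching Spoon.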

Having an Internet connection while reading is extremely useful as well. Several links are provided throughout the book that complement what is explained. In addition, there is the PDI forum, where you can search for answers or post questions if you get stuck.

Who this book is for

If you are a software developer or anyone involved or interested in developing ETL solutions, or in general, doing any kind of data manipulation, this book is for you. It does not cover PDI basics, SQL basics, or database concepts. You are expected to have a basic understanding of the PDI tool, SQL language, and databases.

Conventions

In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.

Code words in text are shown as follows: "Copy the .jar file containing the driver to the libext/JDBC directory inside the Kettle installation directory".

A block of code is set as follows:

NUMBER, LASTNAME, FIRSTNAME, EXT, OFFICE, REPORTS, TITLE
1188, Firrelli, Julianne,x2174,2,1143, Sales Manager
1619, King, Tom,x103,6,1088,Sales Rep

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

<request>
  <type>City</type>
  <query>Buenos aires, Argentina</query>
  <preferredScale>C</preferredScale>
</request>

New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: "Add a Delete file entry from the File management category".

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.

To send us general feedback, simply send an e-mail to <[email protected]>, and mention the book title via the subject of your message.

If there is a book that you need and would like to see us publish, please send us a note in the SUGGEST A TITLE form on www.packtpub.com or e-mail <[email protected]>.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the files e-mailed directly to you.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/support, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded on our website, or added to any list of existing errata, under the Errata section of that title. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support.

Piracy

Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors, and our ability to bring you valuable content.

Questions

You can contact us at <[email protected]> if you are having a problem with any aspect of the book, and we will do our best to address it.

Chapter 1. Working with Databases

In this chapter, we will cover:

Connecting to a database
Getting data from a database
Getting data from a database by providing parameters
Getting data from a database by running a query built at runtime
Inserting or updating rows in a table
Inserting new rows when a simple primary key has to be generated
Inserting new rows when the primary key has to be generated based on stored values
Deleting data from a table
Creating or altering a table from PDI (design time)
Creating or altering a table from PDI (runtime)
Inserting, deleting, or updating a table depending on a field
Changing the database connection at runtime
Loading a parent-child table

Introduction

Databases are broadly used by organizations to store and administer transactional data such as customer service history, bank transactions, purchases and sales, and so on. They also constitute the storage method for data warehouses, the repositories used in Business Intelligence solutions.

In this chapter, you will learn to deal with databases in Kettle. The first recipe tells you how to connect to a database, which is a prerequisite for all the other recipes. The rest of the chapter teaches you how to perform different operations and can be read in any order according to your needs.

Note

The focus of this chapter is on relational databases (RDBMS). Thus, the term database is used as a synonym for relational database throughout the recipes.

Sample databases

Throughout the chapter you will use a couple of sample databases. These databases can be created and loaded by running the scripts available at the book's website. The scripts are ready to run under MySQL.

Note

If you work with a different DBMS you may have to modify the scripts slightly.

For more information about the structure of the sample databases and the meaning of the tables and fields, please refer to Appendix, Data Structures. Feel free to adapt the recipes to different databases. You could try some well known databases; for example Foodmart (available as part of the Mondrian distribution at http://sourceforge.net/projects/mondrian/) or the MySQL sample databases (available at http://dev.mysql.com/doc/index-other.html).

Pentaho BI platform databases

As part of the sample databases used in this chapter you will use the Pentaho BI platform Demo databases. The Pentaho BI Platform Demo is a pre-configured installation that lets you explore the capabilities of the Pentaho platform. It relies on the following databases:

hibernate: Administrative information including user authentication and authorization data.

quartz: Repository for Quartz, the scheduler used by Pentaho.

sampledata: Data for Steel Wheels, a fictional company that sells all kinds of scale replicas of vehicles.

By default, all those databases are stored in Hypersonic (HSQLDB). The script for creating the databases in HSQLDB can be found at http://sourceforge.net/projects/pentaho/files. Under Business Intelligence Server | 1.7.1-stable look for pentaho_sample_data-1.7.1.zip.

It can also be found at svn://ci.pentaho.com/view/Platform/job/bi-platform-sample-data/.

These databases can be stored in other DBMSs as well. Scripts for creating and loading these databases in other popular DBMSs, such as MySQL or Oracle, can be found in Prashant Raju's blog, at http://www.prashantraju.com/projects/pentaho/.

Besides the scripts, you will find instructions for creating and loading the databases.

Tip

Prashant Raju, an expert Pentaho developer, provides several excellent tutorials related to the Pentaho platform. If you are interested in knowing more about Pentaho, it's worth taking a look at his blog.

Connecting to a database

If you intend to work with a database, whether reading, writing, looking up data, and so on, the first thing you will have to do is create a connection to that database. This recipe will teach you how to do this.

Getting ready

In order to create the connection, you will need to know the connection settings. At least you will need:

Host Name: Domain name or IP address of the database server.
Port Number
User Name
Password

It's recommended that you also have access to the database at the moment of creating the connection.

How to do it...

1. Open Spoon and create a new transformation.
2. Select the View option that appears in the upper-left corner of the screen, right-click the Database connections option, and select New. The Database Connection dialog window appears.
3. Under Connection Type, select the database engine that matches your DBMS.
4. Fill the Settings options and give the connection a name by typing it in the Connection Name: textbox. Your window should look like this:
5. Press the Test button. A message should appear informing you that the connection to your database is OK.

Note

If you get an error message instead, you should recheck the data entered, as well as the availability of the database server. The server might be down, or it might not be reachable from your machine.

How it works...

A database connection is the definition that allows you to access a database from Kettle. With the data you provide, Kettle can instantiate real database connections and perform the different operations related to databases. Once you define a database connection, you will be able to access that database and execute arbitrary SQL statements: create schema objects such as tables, execute SELECT statements, modify rows, and so on.

In this recipe you created the connection from the Database connections tree. You may also create a connection by pressing the New... button in the configuration window of any database-related step in a transformation or job entry in a job. Alternatively, there is also a wizard accessible from the Tools menu or by pressing F3.

Whichever method you choose, a settings window like the one you saw in the recipe shows up, allowing you to define the connection. This task includes:

Selecting a database engine (Connection type:)
Selecting the access method (Access:)

Note

Native (JDBC) is recommended but you can also use a predefined ODBC data source, a JNDI data source, or an Oracle OCI connection.

Providing the Host Name or IP
Entering the User Name and Password for accessing the database.

A database connection can only be created with a transformation or a job opened. Therefore, in the recipe you were asked to create a transformation. The same could have been achieved by creating a job instead.

There's more...

The recipe showed the simplest way to create a database connection. However, there is more to know about creating database connections.

Avoiding creating the same database connection over and over again

If you intend to use the same database in more than one transformation and/or job, it's recommended that you share the connection. You do this by right-clicking the database connection under the Database connections tree, and clicking on Share. This way the database connection will be available in all transformations and jobs. You can recognize shared database connections because their names appear in bold. As an example, take a look at the following sample screenshot:

The databases books and sampledata are shared; the others are not.

The information about shared connections is saved in a file named shared.xml located in the Kettle home directory.

No matter which Kettle storage method you use (repository or files), you can share connections. If you are working with the file method, namely ktr and kjb files, the information about shared connections is not only saved in the shared.xml file, but also saved as part of the transformation or job files, even if they don't use the connections.

Note

You can avoid saving all the connection data as part of your transformations and jobs by selecting the option Only save used connections to XML? in the Kettle options window.

Avoiding modifying jobs and transformations every time a connection changes

Instead of typing fixed values in the database connection definition, it's worth using variables. For example, instead of typing localhost as the host name, you can define a variable named HOST_NAME and type its variable notation, ${HOST_NAME} or %%HOST_NAME%%, in the Host Name textbox. If you later decide to move the database from the local machine to a server, you just have to change the value of the variable; there is no need to modify the transformations or jobs that use the connection.

This is especially useful when it's time to move your jobs and transformations between different environments: development, test, and so on.
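As a rough sketch, here is how a connection using such a variable could look when saved inside a transformation file. The element names follow Kettle's connection XML, but the fragment is simplified and the values books and dbuser are made up for this example:

```xml
<connection>
  <name>books</name>
  <server>${HOST_NAME}</server>
  <type>MYSQL</type>
  <access>Native</access>
  <database>books</database>
  <port>3306</port>
  <username>dbuser</username>
</connection>
```

When Kettle reads the file, ${HOST_NAME} is resolved from the variable's current value, so the same file works unchanged in development, test, and production.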

Specifying advanced connection properties

The recipe showed you how to provide the general properties needed to create a connection. You may need to specify additional options, for example a preferred schema name, or supply some parameters to be used when the connection is initialized. In order to do that, look for those options in the other tabs of the Database Connection window, beyond the General tab.

Connecting to a database not supported by Kettle

Kettle offers built-in support for a vast set of database engines. The list includes commercial databases (such as Oracle) and open source ones (such as PostgreSQL), traditional row-oriented databases (such as MS SQL Server) and modern column-oriented databases (such as Infobright), disk-based databases (such as Informix) and in-memory databases (such as HSQLDB). However, it can happen that you want to connect to a database that is not on that list. In that case, you may still be able to create a connection to it. First of all, you have to get a JDBC driver for that DBMS. Copy the jar file containing the driver to the libext/JDBC directory inside the Kettle installation directory. Then, create the connection, choosing Generic database as the connection type. In the Settings frame, specify the connection string, the driver class name, and the username and password. In order to find the values for these settings, you will have to refer to the driver documentation.
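Although Spoon hides it, a generic connection ultimately relies on plain JDBC. The following Java sketch only shows how the pieces fit together; the jdbc:example URL prefix and the port are hypothetical, and each real driver documents its own prefix and driver class name:

```java
public class GenericUrlSketch {

    // Assemble a JDBC connection string from the parts entered in the
    // Settings frame. "jdbc:example" is a hypothetical prefix; check the
    // driver documentation for the real one (e.g. jdbc:mysql for MySQL).
    static String buildUrl(String prefix, String host, int port, String database) {
        return prefix + "://" + host + ":" + port + "/" + database;
    }

    public static void main(String[] args) {
        String url = buildUrl("jdbc:example", "localhost", 5000, "mydb");
        System.out.println(url); // prints jdbc:example://localhost:5000/mydb
        // With the driver jar copied to libext/JDBC, Kettle can load the
        // named driver class and open connections through
        // java.sql.DriverManager using a URL of this form.
    }
}
```

The design point is that Kettle itself contains no engine-specific code for a generic connection: everything it needs is the URL, the class name, and the credentials.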

Checking the database connection at run-time

If you are not sure that the database connection will be accessible when a job or transformation runs from outside Spoon, you might precede all database-related operations with a Check Db connection job entry. The entry will return true or false depending on the result of checking one or more connections.

Getting data from a database

If you're used to working with databases, one of your main objectives while working with PDI will be getting data from your databases for transforming it, loading it into other databases, generating reports, and so on. Whatever operation you intend to achieve, the first thing you have to do after connecting to the database is to get that data and create a PDI dataset. In this recipe you will learn the simplest way to do that.

Getting ready

To follow these instructions you need to have access to any DBMS.

How to do it...

1. Create a transformation and drop a Table Input step into the canvas. You will find it in the Input category of steps.
2. From the Connection drop-down list, select the connection to the database where your data resides, or create it if it doesn't exist.
3. In the SQL text area, type the SQL statement that returns the data you need. So far you should have something like this:
4. Click on Preview. This brings up a sample list of rows so you can confirm that the data is as expected.
5. Press OK to close the Table Input configuration window, and you'll be ready to use the data for further manipulation.
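The SQL statement you type could be as simple as the following query, which assumes the CUSTOMERS table from the sampledata (Steel Wheels) database; adapt the table and column names to your own data:

```sql
SELECT CUSTOMERNUMBER,
       CUSTOMERNAME,
       CITY
FROM   CUSTOMERS
WHERE  COUNTRY = 'USA'
```

Each column in the SELECT list will become a field in the step's output stream.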

How it works...

The Table Input step you used in the recipe is the main PDI step to get data from a database. When you run or preview the transformation, Kettle executes the SQL and pushes the rows of data coming from the database into the output stream of the step. Each column of the SQL statement leads to a PDI field and each row generated by the execution of the statement becomes a row in the PDI dataset.

Once you get the data from the database, it will be available for any kind of manipulation inside the transformation.

There's more...

In order to save time, or in case you are not sure of the name of the tables or columns in the database, instead of typing the SQL statement press the Get SQL select statement... button. This will bring the Database Explorer window. This window allows you to explore the selected database. By expanding the database tree and selecting the table that interests you, you will be able to explore that table through the different options available under the Actions menu as shown below:

Double-clicking the name of the table will generate a SELECT statement to query that table. You will have the chance to include all the field names in the statement, or simply generate a SELECT * statement. After bringing the SQL to the Table Input configuration window, you will be able to modify it according to your needs.

Note

By generating this statement you will lose any statement already in the SQL text area.

See also

Connecting to a database. In order to get data from a database, you need to have a connection to it. This recipe explains how to do this.

Getting data from a database by providing parameters. This recipe explains a more flexible way to run database queries.

Getting data from a database by running a query built at runtime. This recipe explains an even more flexible method.