A complete guide to Pentaho Kettle, the Pentaho Data Integration toolset for ETL
This practical book is a complete guide to installing, configuring, and managing Pentaho Kettle. If you’re a database administrator or developer, you’ll first get up to speed on Kettle basics and how to apply Kettle to create ETL solutions—before progressing to specialized concepts such as clustering, extensibility, and data vault models. Learn how to design and build every phase of an ETL solution.
Get the most out of Pentaho Kettle and your data warehousing with this detailed guide—from simple single table data migration to complex multisystem clustered data integration tasks.
Page count: 1147
Year of publication: 2010
Table of Contents
Cover
Title Page
Copyright
Dedication
About the Authors
Credits
Acknowledgments
Introduction
The Origins of Kettle
About This Book
How This Book Is Organized
Prerequisites
On the Website
Further Resources
Part I: Getting Started
Chapter 1: ETL Primer
OLTP versus Data Warehousing
What Is ETL?
ETL, ELT, and EII
Data Integration Challenges
ETL Tool Requirements
Summary
Chapter 2: Kettle Concepts
Design Principles
The Building Blocks of Kettle Design
Parameters and Variables
Visual Programming
Summary
Chapter 3: Installation and Configuration
Kettle Software Overview
Installation
Configuration
Summary
Chapter 4: An Example ETL Solution—Sakila
Sakila
Prerequisites and Some Basic Spoon Skills
The Sample ETL Solution
Summary
Part II: ETL
Chapter 5: ETL Subsystems
Introduction to the 34 Subsystems
Summary
Chapter 6: Data Extraction
Kettle Data Extraction Overview
Working with ERP and CRM Systems
Data Profiling
CDC: Change Data Capture
Delivering Data
Summary
Chapter 7: Cleansing and Conforming
Data Cleansing
Error Handling
Auditing Data and Process Quality
Deduplicating Data
Scripting
Summary
Chapter 8: Handling Dimension Tables
Managing Keys
Loading Dimension Tables
Slowly Changing Dimensions
More Dimensions
Summary
Chapter 9: Loading Fact Tables
Loading in Bulk
Dimension Lookups
Fact Table Handling
Summary
Chapter 10: Working with OLAP Data
OLAP Benefits and Challenges
Working with Mondrian
Working with XML/A Servers
Working with Palo
Summary
Part III: Management and Deployment
Chapter 11: ETL Development Lifecycle
Solution Design
Agile Development
Testing and Debugging
Documenting the Solution
Summary
Chapter 12: Scheduling and Monitoring
Scheduling
Monitoring
Summary
Chapter 13: Versioning and Migration
Version Control Systems
Kettle Metadata
Managing Repositories
Version Migration System
Summary
Chapter 14: Lineage and Auditing
Batch-Level Lineage Extraction
Lineage
Logging and Operational Metadata
Summary
Part IV: Performance and Scalability
Chapter 15: Performance Tuning
Transformation Performance: Finding the Weakest Link
Improving Transformation Performance
Improving Job Performance
Summary
Chapter 16: Parallelization, Clustering, and Partitioning
Multi-Threading
Using Carte as a Slave Server
Clustering Transformations
Partitioning
Summary
Chapter 17: Dynamic Clustering in the Cloud
Dynamic Clustering
Cloud Computing
EC2
Summary
Chapter 18: Real-Time Data Integration
Introduction to Real-Time ETL
Transformation Streaming
Summary
Part V: Advanced Topics
Chapter 19: Data Vault Management
Introduction to Data Vault Modeling
Do You Need a Data Vault?
Data Vault Building Blocks
Transforming Sakila to the Data Vault Model
Loading the Data Vault: A Sample ETL Solution
Updating a Data Mart from a Data Vault
Summary
Chapter 20: Handling Complex Data Formats
Non-Relational and Non-Tabular Data Formats
Non-Relational Tabular Formats
Semi- and Unstructured Data
Key/Value Pairs
Summary
Chapter 21: Web Services
Web Pages and Web Services
Data Formats
XML Examples
SOAP Examples
JSON Example
RSS
Summary
Chapter 22: Kettle Integration
The Kettle API
Executing Existing Transformations and Jobs
Embedding Kettle
OEM Versions and Forks
Summary
Chapter 23: Extending Kettle
Plugin Architecture Overview
Transformation Step Plugins
The User-Defined Java Class Step
Job Entry Plugins
Partitioning Method Plugins
Repository Type Plugins
Database Type Plugins
Summary
Appendix A: The Kettle Ecosystem
Kettle Development and Versions
The Pentaho Community Wiki
Using the Forums
Jira
##pentaho
Appendix B: Kettle Enterprise Edition Features
Appendix C: Built-in Variables and Properties Reference
Internal Variables
Kettle Variables
Variables for Configuring VFS
Noteworthy JRE Variables
Index
Pentaho® Kettle Solutions: Building Open Source ETL Solutions with Pentaho Data Integration
Published by
Wiley Publishing, Inc.
10475 Crosspoint Boulevard
Indianapolis, IN 46256
www.wiley.com
Copyright © 2010 by Wiley Publishing, Inc., Indianapolis, Indiana
Published simultaneously in Canada
ISBN: 978-0-470-63517-9
ISBN: 9780470942420 (ebk)
ISBN: 9780470947524 (ebk)
Manufactured in the United States of America
10 9 8 7 6 5 4 3 2 1
No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.
Limit of Liability/Disclaimer of Warranty: The publisher and the author make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation warranties of fitness for a particular purpose. No warranty may be created or extended by sales or promotional materials. The advice and strategies contained herein may not be suitable for every situation. This work is sold with the understanding that the publisher is not engaged in rendering legal, accounting, or other professional services. If professional assistance is required, the services of a competent professional person should be sought. Neither the publisher nor the author shall be liable for damages arising herefrom. The fact that an organization or Web site is referred to in this work as a citation and/or a potential source of further information does not mean that the author or the publisher endorses the information the organization or Web site may provide or recommendations it may make. Further, readers should be aware that Internet Web sites listed in this work may have changed or disappeared between when this work was written and when it is read.
For general information on our other products and services please contact our Customer Care Department within the United States at (877) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.
Library of Congress Control Number: 2010932421
Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates, in the United States and other countries, and may not be used without written permission. Pentaho is a registered trademark of Pentaho, Inc. All other trademarks are the property of their respective owners. Wiley Publishing, Inc. is not associated with any product or vendor mentioned in this book.
For my wife and kids, Kathleen, Sam and Hannelore. Your love and joy keeps me sane in crazy times.
—Matt
For my wife, Annemarie, and my children, David, Roos, Anne and Maarten. Thanks for bearing with me—I love you!
—Roland
For my children Thomas and Lisa, and for Yvonne, to whom I owe more than words can express.
—Jos
About the Authors
Matt Casters has been an independent business intelligence consultant for many years and has implemented numerous data warehouses and BI solutions for large companies. For the last 8 years, Matt has kept himself busy with the development of an ETL tool called Kettle. This tool was open sourced in December 2005 and acquired by Pentaho early in 2006. Since then, Matt has held the position of Chief Data Integration at Pentaho, where his responsibility is to continue as lead developer for Kettle. Matt tries to help the Kettle community in any way possible; he answers questions on the forum and speaks occasionally at conferences all around the world. He has a blog at http://www.ibridge.be and you can follow his @mattcasters account on Twitter.
Roland Bouman has been working in the IT industry since 1998 and is currently working as a web and business intelligence developer. Over the years he has focused on open source software, in particular database technology, business intelligence, and web development frameworks. He’s an active member of the MySQL and Pentaho communities, and a regular speaker at international conferences, such as the MySQL User Conference and OSCON, and at Pentaho community events. Roland co-authored the MySQL 5.1 Cluster Certification Guide and Pentaho Solutions, and was a technical reviewer for a number of MySQL- and Pentaho-related book titles. He maintains a technical blog at http://rpbouman.blogspot.com and tweets as @rolandbouman on Twitter.
Jos van Dongen is a seasoned business intelligence professional and well-known author and presenter. He has been involved in software development, business intelligence, and data warehousing since 1991. Before starting his own consulting practice, Tholis Consulting, in 1998, he worked for a top tier systems integrator and a leading management consulting firm. Over the past years, he has successfully implemented BI and data warehouse solutions for a variety of organizations, both commercial and non-profit. Jos covers new BI developments for the Dutch Database Magazine and speaks regularly at national and international conferences. He authored one book on open source BI and is co-author of the book Pentaho Solutions. You can find more information about Jos on http://www.tholis.com or follow @josvandongen on Twitter.
Credits
Executive Editor
Robert Elliott
Project Editor
Sara Shlaer
Technical Editors
Jens Bleuel
Sven Boden
Kasper de Graaf
Daniel Einspanjer
Nick Goodman
Mark Hall
Samatar Hassan
Benjamin Kallmann
Bryan Senseman
Johannes van den Bosch
Production Editor
Daniel Scribner
Copy Editor
Nancy Rapoport
Editorial Director
Robyn B. Siesky
Editorial Manager
Mary Beth Wakefield
Marketing Manager
Ashley Zurcher
Production Manager
Tim Tate
Vice President and Executive Group Publisher
Richard Swadley
Vice President and Executive Publisher
Barry Pruett
Associate Publisher
Jim Minatel
Project Coordinator, Cover
Lynsey Stanford
Compositor
Maureen Forys, Happenstance Type-O-Rama
Proofreader
Nancy Bell
Indexer
Robert Swanson
Cover Designer
Ryan Sneed
Acknowledgments
This book is the result of the efforts of many individuals. By convention, authors receive explicit credit, and get to have their names printed on the book cover. But creating this book would not have been possible without a lot of hard work behind the scenes. We, the authors, would like to express our gratitude to a number of people who provided substantial contributions, and thus helped define and shape the final result that is Pentaho Kettle Solutions.
First, we’d like to thank those individuals who contributed directly to the material that appears in the book:
Ingo Klose suggested an elegant solution to generate keys starting from a given offset within a single transformation (this solution is discussed in Chapter 8, “Handling Dimension Tables,” subsection “Generating Surrogate Keys Based on a Counter,” shown in Figure 8-2).
Samatar Hassan provided text as well as working example transformations to demonstrate Kettle’s RSS capabilities. Samatar’s contribution is included almost completely and appears in the RSS section of Chapter 21, “Web Services.”
Thanks to Mike Hillyer and the MySQL documentation team for creating and maintaining the Sakila sample database, which is introduced in Chapter 4 and appears in many examples throughout this book.
Although only three authors appear on the cover, there was actually a fourth one: We cannot thank Kasper de Graaf of DIKW-Academy enough for writing the Data Vault chapter, which has benefited greatly from his deep expertise on this subject. Special thanks also to Johannes van den Bosch who did a great job reviewing Kasper’s work and gave another boost to the overall quality and clarity of the chapter.
Thanks to Bernd Aschauer and Robert Wintner, both from Aschauer EDV (http://www.aschauer-edv.at/en), for providing the examples and screenshots used in the section dedicated to SAP of Chapter 6, “Data Extraction.”
Daniel Einspanjer of the Mozilla Foundation provided sample transformations for Chapter 7, “Cleansing and Conforming.”
Thanks for your contributions. This book benefited substantially from your efforts.
Much gratitude goes out to all of our technical reviewers. Providing a good technical review is hard and time-consuming, and we have been very lucky to find a collection of such talented and seasoned Pentaho and Kettle experts willing to find some time in their busy schedules to provide us with the kind of quality review required to write a book of this size and scope.
