Pentaho Kettle Solutions - Matt Casters - E-Book

Description

A complete guide to Pentaho Kettle, the Pentaho Data Integration toolset for ETL

This practical book is a complete guide to installing, configuring, and managing Pentaho Kettle. If you’re a database administrator or developer, you’ll first get up to speed on Kettle basics and how to apply Kettle to create ETL solutions—before progressing to specialized concepts such as clustering, extensibility, and data vault models. Learn how to design and build every phase of an ETL solution.

  • Shows developers and database administrators how to use the open-source Pentaho Kettle for enterprise-level ETL processes (Extracting, Transforming, and Loading data)
  • Assumes no prior knowledge of Kettle or ETL, and brings beginners thoroughly up to speed at their own pace
  • Explains how to get Kettle solutions up and running, then follows the 34 ETL subsystems model, as created by the Kimball Group, to explore the entire ETL lifecycle, including all aspects of data warehousing with Kettle
  • Goes beyond routine tasks to explore how to extend Kettle and scale Kettle solutions using a distributed “cloud”

Get the most out of Pentaho Kettle and your data warehousing with this detailed guide—from simple single table data migration to complex multisystem clustered data integration tasks.

Page count: 1147

Year of publication: 2010

Table of Contents

Cover

Title Page

Copyright

Dedication

About the Authors

Credits

Acknowledgments

Introduction

The Origins of Kettle

About This Book

How This Book Is Organized

Prerequisites

On the Website

Further Resources

Part I: Getting Started

Chapter 1: ETL Primer

OLTP versus Data Warehousing

What Is ETL?

ETL, ELT, and EII

Data Integration Challenges

ETL Tool Requirements

Summary

Chapter 2: Kettle Concepts

Design Principles

The Building Blocks of Kettle Design

Parameters and Variables

Visual Programming

Summary

Chapter 3: Installation and Configuration

Kettle Software Overview

Installation

Configuration

Summary

Chapter 4: An Example ETL Solution—Sakila

Sakila

Prerequisites and Some Basic Spoon Skills

The Sample ETL Solution

Summary

Part II: ETL

Chapter 5: ETL Subsystems

Introduction to the 34 Subsystems

Summary

Chapter 6: Data Extraction

Kettle Data Extraction Overview

Working with ERP and CRM Systems

Data Profiling

CDC: Change Data Capture

Delivering Data

Summary

Chapter 7: Cleansing and Conforming

Data Cleansing

Error Handling

Auditing Data and Process Quality

Deduplicating Data

Scripting

Summary

Chapter 8: Handling Dimension Tables

Managing Keys

Loading Dimension Tables

Slowly Changing Dimensions

More Dimensions

Summary

Chapter 9: Loading Fact Tables

Loading in Bulk

Dimension Lookups

Fact Table Handling

Summary

Chapter 10: Working with OLAP Data

OLAP Benefits and Challenges

Working with Mondrian

Working with XML/A Servers

Working with Palo

Summary

Part III: Management and Deployment

Chapter 11: ETL Development Lifecycle

Solution Design

Agile Development

Testing and Debugging

Documenting the Solution

Summary

Chapter 12: Scheduling and Monitoring

Scheduling

Monitoring

Summary

Chapter 13: Versioning and Migration

Version Control Systems

Kettle Metadata

Managing Repositories

Version Migration System

Summary

Chapter 14: Lineage and Auditing

Batch-Level Lineage Extraction

Lineage

Logging and Operational Metadata

Summary

Part IV: Performance and Scalability

Chapter 15: Performance Tuning

Transformation Performance: Finding the Weakest Link

Improving Transformation Performance

Improving Job Performance

Summary

Chapter 16: Parallelization, Clustering, and Partitioning

Multi-Threading

Using Carte as a Slave Server

Clustering Transformations

Partitioning

Summary

Chapter 17: Dynamic Clustering in the Cloud

Dynamic Clustering

Cloud Computing

EC2

Summary

Chapter 18: Real-Time Data Integration

Introduction to Real-Time ETL

Transformation Streaming

Summary

Part V: Advanced Topics

Chapter 19: Data Vault Management

Introduction to Data Vault Modeling

Do You Need a Data Vault?

Data Vault Building Blocks

Transforming Sakila to the Data Vault Model

Loading the Data Vault: A Sample ETL Solution

Updating a Data Mart from a Data Vault

Summary

Chapter 20: Handling Complex Data Formats

Non-Relational and Non-Tabular Data Formats

Non-Relational Tabular Formats

Semi- and Unstructured Data

Key/Value Pairs

Summary

Chapter 21: Web Services

Web Pages and Web Services

Data Formats

XML Examples

SOAP Examples

JSON Example

RSS

Summary

Chapter 22: Kettle Integration

The Kettle API

Executing Existing Transformations and Jobs

Embedding Kettle

OEM Versions and Forks

Summary

Chapter 23: Extending Kettle

Plugin Architecture Overview

Transformation Step Plugins

The User-Defined Java Class Step

Job Entry Plugins

Partitioning Method Plugins

Repository Type Plugins

Database Type Plugins

Summary

Appendix A: The Kettle Ecosystem

Kettle Development and Versions

The Pentaho Community Wiki

Using the Forums

Jira

##pentaho

Appendix B: Kettle Enterprise Edition Features

Appendix C: Built-in Variables and Properties Reference

Internal Variables

Kettle Variables

Variables for Configuring VFS

Noteworthy JRE Variables

Index

Pentaho® Kettle Solutions: Building Open Source ETL Solutions with Pentaho Data Integration

Published by

Wiley Publishing, Inc.

10475 Crosspoint Boulevard

Indianapolis, IN 46256

www.wiley.com

Copyright © 2010 by Wiley Publishing, Inc., Indianapolis, Indiana

Published simultaneously in Canada

ISBN: 978-0-470-63517-9

ISBN: 9780470942420 (ebk)

ISBN: 9780470947524 (ebk)

Manufactured in the United States of America

10 9 8 7 6 5 4 3 2 1

No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.

Limit of Liability/Disclaimer of Warranty: The publisher and the author make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation warranties of fitness for a particular purpose. No warranty may be created or extended by sales or promotional materials. The advice and strategies contained herein may not be suitable for every situation. This work is sold with the understanding that the publisher is not engaged in rendering legal, accounting, or other professional services. If professional assistance is required, the services of a competent professional person should be sought. Neither the publisher nor the author shall be liable for damages arising herefrom. The fact that an organization or Web site is referred to in this work as a citation and/or a potential source of further information does not mean that the author or the publisher endorses the information the organization or Web site may provide or recommendations it may make. Further, readers should be aware that Internet Web sites listed in this work may have changed or disappeared between when this work was written and when it is read.

For general information on our other products and services please contact our Customer Care Department within the United States at (877) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

Library of Congress Control Number: 2010932421

Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates, in the United States and other countries, and may not be used without written permission. Pentaho is a registered trademark of Pentaho, Inc. All other trademarks are the property of their respective owners. Wiley Publishing, Inc. is not associated with any product or vendor mentioned in this book.

For my wife and kids, Kathleen, Sam and Hannelore. Your love and joy keeps me sane in crazy times.

—Matt

For my wife, Annemarie, and my children, David, Roos, Anne and Maarten. Thanks for bearing with me—I love you!

—Roland

For my children Thomas and Lisa, and for Yvonne, to whom I owe more than words can express.

—Jos

About the Authors

Matt Casters has been an independent business intelligence consultant for many years and has implemented numerous data warehouses and BI solutions for large companies. For the last 8 years, Matt has kept himself busy with the development of an ETL tool called Kettle. This tool was open sourced in December 2005 and acquired by Pentaho early in 2006. Matt has since taken up the position of Chief of Data Integration at Pentaho, where he continues to be the lead developer for Kettle. Matt tries to help the Kettle community in any way possible; he answers questions on the forum and speaks occasionally at conferences all around the world. He has a blog at http://www.ibridge.be and you can follow his @mattcasters account on Twitter.

Roland Bouman has been working in the IT industry since 1998 and is currently working as a web and business intelligence developer. Over the years he has focused on open source software, in particular database technology, business intelligence, and web development frameworks. He’s an active member of the MySQL and Pentaho communities, and a regular speaker at international conferences, such as the MySQL User Conference and OSCON, and at Pentaho community events. Roland co-authored the MySQL 5.1 Cluster Certification Guide and Pentaho Solutions, and was a technical reviewer for a number of MySQL- and Pentaho-related book titles. He maintains a technical blog at http://rpbouman.blogspot.com and tweets as @rolandbouman on Twitter.

Jos van Dongen is a seasoned business intelligence professional and well-known author and presenter. He has been involved in software development, business intelligence, and data warehousing since 1991. Before starting his own consulting practice, Tholis Consulting, in 1998, he worked for a top-tier systems integrator and a leading management consulting firm. Over the past years, he has successfully implemented BI and data warehouse solutions for a variety of organizations, both commercial and non-profit. Jos covers new BI developments for the Dutch Database Magazine and speaks regularly at national and international conferences. He authored one book on open source BI and is co-author of the book Pentaho Solutions. You can find more information about Jos at http://www.tholis.com or follow @josvandongen on Twitter.

Credits

Executive Editor

Robert Elliott

Project Editor

Sara Shlaer

Technical Editors

Jens Bleuel

Sven Boden

Kasper de Graaf

Daniel Einspanjer

Nick Goodman

Mark Hall

Samatar Hassan

Benjamin Kallmann

Bryan Senseman

Johannes van den Bosch

Production Editor

Daniel Scribner

Copy Editor

Nancy Rapoport

Editorial Director

Robyn B. Siesky

Editorial Manager

Mary Beth Wakefield

Marketing Manager

Ashley Zurcher

Production Manager

Tim Tate

Vice President and Executive Group Publisher

Richard Swadley

Vice President and Executive Publisher

Barry Pruett

Associate Publisher

Jim Minatel

Project Coordinator, Cover

Lynsey Stanford

Compositor

Maureen Forys, Happenstance Type-O-Rama

Proofreader

Nancy Bell

Indexer

Robert Swanson

Cover Designer

Ryan Sneed

Acknowledgments

This book is the result of the efforts of many individuals. By convention, authors receive explicit credit, and get to have their names printed on the book cover. But creating this book would not have been possible without a lot of hard work behind the scenes. We, the authors, would like to express our gratitude to a number of people who provided substantial contributions, and thus helped define and shape the final result that is Pentaho Kettle Solutions.

First, we’d like to thank those individuals who contributed directly to the material that appears in the book:

  • Ingo Klose suggested an elegant solution to generate keys starting from a given offset within a single transformation (this solution is discussed in Chapter 8, “Handling Dimension Tables,” subsection “Generating Surrogate Keys Based on a Counter,” shown in Figure 8-2).
  • Samatar Hassan provided text as well as working example transformations to demonstrate Kettle’s RSS capabilities. Samatar’s contribution is included almost completely and appears in the RSS section of Chapter 21, “Web Services.”
  • Thanks to Mike Hillyer and the MySQL documentation team for creating and maintaining the Sakila sample database, which is introduced in Chapter 4 and appears in many examples throughout this book.
  • Although only three authors appear on the cover, there was actually a fourth one: We cannot thank Kasper de Graaf of DIKW-Academy enough for writing the Data Vault chapter, which has benefited greatly from his deep expertise on this subject. Special thanks also to Johannes van den Bosch who did a great job reviewing Kasper’s work and gave another boost to the overall quality and clarity of the chapter.
  • Thanks to Bernd Aschauer and Robert Wintner, both from Aschauer EDV (http://www.aschauer-edv.at/en), for providing the examples and screenshots used in the section dedicated to SAP of Chapter 6, “Data Extraction.”
  • Daniel Einspanjer of the Mozilla Foundation provided sample transformations for Chapter 7, “Cleansing and Conforming.”

Thanks for your contributions. This book benefited substantially from your efforts.

Much gratitude goes out to all of our technical reviewers. Providing a good technical review is hard and time-consuming, and we have been very lucky to find a collection of such talented and seasoned Pentaho and Kettle experts willing to find some time in their busy schedules to provide us with the kind of quality review required to write a book of this size and scope.