Your all-in-one resource for using Pentaho with MySQL for Business Intelligence and Data Warehousing
Open-source Pentaho provides business intelligence (BI) and data warehousing solutions at a fraction of the cost of proprietary solutions. Now you can take advantage of Pentaho for your business needs with this practical guide written by two major participants in the Pentaho community. The book covers all components of the Pentaho BI Suite. You'll learn to install, use, and maintain Pentaho, and find plenty of background discussion that will bring you thoroughly up to speed on BI and Pentaho concepts.
* Of all available open source BI products, Pentaho offers the most comprehensive toolset and is the fastest growing open source product suite
* Explains how to build and load a data warehouse with Pentaho Kettle for data integration/ETL, manually create JFree (Pentaho reporting services) reports using direct SQL queries, and create Mondrian (Pentaho analysis services) cubes and attach them to a JPivot cube browser
* Reviews deploying reports, cubes, and metadata to the Pentaho platform in order to distribute BI solutions to end users
* Shows how to set up scheduling, subscription, and automatic distribution
The companion Web site provides complete source code examples, sample data, and links to related resources.
Page count: 922
Year of publication: 2011
Table of Contents
Title Page
Copyright
Dedication
About the Authors
Credits
Acknowledgments
Introduction
About This Book
How This Book Is Organized
On the Website
Further Resources
Part I: Getting Started with Pentaho
Chapter 1: Quick Start: Pentaho Examples
Getting Started with Pentaho
Working with the Examples
Running the Examples
Summary
Chapter 2: Prerequisites
Basic System Setup
Using Symbolic Links
Java Installation and Configuration
MySQL Installation
Database Tools
Summary
Chapter 3: Server Installation and Configuration
Server Configuration
Administrative Tasks
Summary
Chapter 4: The Pentaho BI Stack
Pentaho BI Stack Perspectives
The Pentaho Business Intelligence Server
Desktop Programs
Pentaho Enterprise Edition and Community Edition
Creating Action Sequences with Pentaho Design Studio
Summary
Part II: Dimensional Modeling and Data Warehouse Design
Chapter 5: Example Business Case: World Class Movies
World Class Movies: The Basics
The WCM Data
Managing the Business: The Purpose of Business Intelligence
Summary
Chapter 6: Data Warehouse Primer
Why Do You Need a Data Warehouse?
The Big Debate: Inmon Versus Kimball
Data Warehouse Architecture
Data Warehouse Challenges
Data Warehouse Trends
Summary
Chapter 7: Modeling the Business Using Star Schemas
What Is a Star Schema?
Querying Star Schemas
The Bus Architecture
Design Principles
Handling Dimension Changes
Advanced Dimensional Model Concepts
Summary
Chapter 8: The Data Mart Design Process
Requirements Analysis
Data Analysis
Developing the Model
Data Modeling with Power*Architect
Building the WCM Data Marts
Summary
Part III: ETL and Data Integration
Chapter 9: Pentaho Data Integration Primer
Data Integration Overview
Getting Started with Spoon
Summary
Chapter 10: Designing Pentaho Data Integration Solutions
Generating Dimension Table Data
Loading Data from Source Systems
Summary
Chapter 11: Deploying Pentaho Data Integration Solutions
Configuration Management
Running in the Deployment Environment
Summary
Part IV: Business Intelligence Applications
Chapter 12: The Metadata Layer
Metadata Overview
Pentaho Metadata Features
Creation and Maintenance of Metadata
Summary
Chapter 13: Using the Pentaho Reporting Tools
Reporting Architecture
Web-Based Reporting
Practical Uses of WAQR
Pentaho Report Designer
Summary
Chapter 14: Scheduling, Subscription, and Bursting
Scheduling
Background Execution and Subscription
Bursting
Summary
Chapter 15: OLAP Solutions Using Pentaho Analysis Services
Overview of Pentaho Analysis Services
MDX Primer
Creating Mondrian Schemas
Visualizing Mondrian Cubes with JPivot
Enhancing Performance Using the Pentaho Aggregate Designer
Summary
Chapter 16: Data Mining with Weka
Data Mining Primer
The Weka Workbench
Using Weka with Pentaho
Further Reading
Summary
Chapter 17: Building Dashboards
The Community Dashboard Framework
CDF Concepts and Architecture
Example: Customers and Websites Dashboard
Summary
Index
Pentaho® Solutions: Business Intelligence and Data Warehousing with Pentaho and MySQL®
Published by
Wiley Publishing, Inc.
10475 Crosspoint Boulevard
Indianapolis, IN 46256
www.wiley.com
Copyright © 2009 by Wiley Publishing, Inc., Indianapolis, Indiana
Published simultaneously in Canada
ISBN: 978-0-470-48432-6
No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.
Limit of Liability/Disclaimer of Warranty: The publisher and the author make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation warranties of fitness for a particular purpose. No warranty may be created or extended by sales or promotional materials. The advice and strategies contained herein may not be suitable for every situation. This work is sold with the understanding that the publisher is not engaged in rendering legal, accounting, or other professional services. If professional assistance is required, the services of a competent professional person should be sought. Neither the publisher nor the author shall be liable for damages arising herefrom. The fact that an organization or Web site is referred to in this work as a citation and/or a potential source of further information does not mean that the author or the publisher endorses the information the organization or Web site may provide or recommendations it may make. Further, readers should be aware that Internet Web sites listed in this work may have changed or disappeared between when this work was written and when it is read.
For general information on our other products and services please contact our Customer Care Department within the United States at (877) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Library of Congress Control Number: 2009930282
Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates, in the United States and other countries, and may not be used without written permission. Pentaho is a registered trademark of Pentaho, Inc. All other trademarks are the property of their respective owners. Wiley Publishing, Inc. is not associated with any product or vendor mentioned in this book.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.
With love, for Annemarie, David, Roos and Anne.
—Roland
To Yvonne, Thomas and Lisa, for the many evenings and weekends I spent working on this book. Sorry to have missed almost six months of your lives, but I promise to make it up!
—Jos
About the Authors
Roland Bouman has been working in the IT industry since 1998, mostly as a web and database application developer. Over the years, he has focused on open source Web technology, databases, and Business Intelligence. He is an active member of the MySQL and Pentaho communities, and was awarded the MySQL Forum Leader of the Year award in 2006. Roland is a regular speaker at conferences. He is also co-author of the MySQL 5.1 Cluster Certification Guide, and technical reviewer of a number of MySQL-related titles. You can follow his blog at http://rpbouman.blogspot.com/.
Jos van Dongen is a seasoned Business Intelligence professional and well-known author and presenter. He has been involved in software development, Business Intelligence and data warehousing since 1991. Before starting his own consulting practice, Tholis Consulting, in 1998, he worked for a top-tier systems integrator and a leading management consulting firm. Over the past years, he has successfully implemented several data warehouses for a variety of organisations, both for-profit and non-profit. Jos covers new BI developments for the Dutch Database Magazine and speaks regularly at national and international conferences. In addition to this book, he authored another book on open source BI. You can find more information about Jos on http://www.tholis.com.
Credits
Executive Editor
Robert Elliott
Project Editor
Sara Shlaer
Technical Editors
Tom Barber
Jens Bleuel
Jeroen Kuiper
Thomas Morgner
Senior Production Editor
Debra Banninger
Copy Editor
Nancy Rapoport
Editorial Manager
Mary Beth Wakefield
Production Manager
Tim Tate
Vice President and Executive Group Publisher
Richard Swadley
Vice President and Executive Publisher
Barry Pruett
Associate Publisher
Jim Minatel
Project Coordinator, Cover
Lynsey Stanford
Proofreader
Josh Chase, Word One
Scott Klemp, Word One
Indexer
J & J Indexing
Cover Image
Ryan Sneed
Cover Designer
Maciej Frolow / Brand X Pictures / jupiterimages
Acknowledgments
This book is the result of the work and ideas of many different people. We, the authors, happen to be the ones that get to put our names on the cover, but we couldn't have done it without the help of these people. Therefore, we'd like to use this opportunity to pay our respects.
One thing that characterizes healthy open source projects is the passion and level of involvement of the developers and software engineers that create the project. Despite their busy schedules, we found the developers from the Pentaho Corporation always willing to make the effort to explain a particular detail of their software. This makes them not only great software developers, but also valuable and respected members of the community. In particular, we'd like to thank Doug Moran, Gretchen Moran, Jens Bleuel, Julian Hyde, Matt Casters, and Thomas Morgner.
Good software never fails to grow a vibrant and lively community. This is even more true for open source software. In a fairly small amount of time, the Pentaho community has matured remarkably, giving rise to a group of Pentaho experts that not only write high quality blogs and help each other on the official Pentaho forums and the (unofficial) ##pentaho IRC channel on freenode.net, but also actively participate in and contribute to the development of the Pentaho product. We would like to thank everybody who helped us out on the forums and the IRC channel. In particular, we'd like to thank Daniel Einspanjer, Harris Ward, Nicholas Goodman, Prashant Raju, Tom Barber, and Yassine Elassad for their role in shaping this community. As is to be expected with an open source project like Pentaho, some community members double as product developers. Special thanks go out to Ingo Klose, and extra special thanks to Pedro Alves. Together, they created the Community Dashboard Framework, and Pedro was very helpful to us in explaining its architecture and design. Other people that deserve a special thank you note are Mark Hall, the lead developer of the Weka project, Kasper Sørensen, the architect of eobjects DataCleaner, and Ronald Damhof, for his valuable insights into the Data Vault modeling technique.
Finally, we'd like to thank Sara Shlaer and Bob Elliott for managing this project, and note the great work that both Sara and Nancy Rapoport did with the documents we delivered. We realize it took some extra effort to transform the writings of these two Dutch chaps into readable English text. The collaboration with everyone involved at Wiley has always been very efficient and pleasant; perhaps we will be able to work together again on another project.
—Roland Bouman and Jos van Dongen
First, I'd like to thank Jos, my co-author. Without him, writing this book would have been so time-consuming that I doubt it would have been practically feasible. And of course, I'd like to thank the readers of http://rpbouman.blogspot.com/. The idea to write this book developed over time and was fed mainly by the comments I received in response to a number of posts I dedicated to Pentaho. In addition, I received many encouraging comments and emails from my blog's readers during the writing process—there couldn't have been a better incentive to keep working to finish this book.
—Roland Bouman
My open source BI journey started almost three years ago when I published one of my first feature articles in the Dutch Database Magazine, titled “Pentaho, Promising Open Source BI Suite”. I couldn't imagine back then that this would ultimately lead to writing a complete book on the subject, but somehow we did! So let me thank my co-author Roland for his relentless efforts in getting this project off the ground. Frankly, I don't have a clue how he managed to get all this work done with a full time job and three little children at home to take care of. Special thanks also to Jeroen Kuiper, my good friend and former colleague, for shaping up the data warehouse sections of this book, and for being a very critical reviewer of the material.
—Jos van Dongen
Introduction
In 1958, IBM Research Fellow Hans Peter Luhn wrote a seminal paper for the IBM Journal of Research and Development called “A Business Intelligence System”. In this paper the term intelligence was defined as “the ability to apprehend the interrelationships of presented facts in such a way as to guide action towards a desired goal.” Nowadays this paper is generally considered the spark that started the development of Business Intelligence (BI) systems as we know them today.
For a long time the software market for supporting BI has been the domain of a handful of proprietary vendors who could charge huge sums of money for their specialized solutions. The late nineties marked an important turning point for the worldwide software market when open source solutions started to become viable alternatives for supporting mission-critical systems. First the various flavors of Linux came of age, and in 2001 and 2002 several new projects were initiated, all in the areas of data integration, reporting, analysis and data mining, the typical cornerstones of a modern BI solution. In 2004 both Jaspersoft and Pentaho were founded to offer complete BI suites that leveraged the existing stand-alone BI solutions. Since then, these companies have seen strong growth and market adoption, making open source BI solutions a serious alternative to the established proprietary vendors.
You might ask yourself now, what is Business Intelligence and why does it matter? In fact, the definition given by Luhn is still a valid one, but in 1993, Gartner Analyst Howard Dresner reintroduced the term Business Intelligence and defined it as follows:
Concepts and methods to improve business decision making by using fact-based support systems.
This is a somewhat less abstract definition compared to Luhn's, and one that is still used by most people to explain what BI is. The definition is not primarily technology-focused (“concepts and methods”), but it is the last part—“fact-based support systems”—that is the subject of this book. This book is all about creating a fact-based support system using the toolset delivered by Pentaho. To be able to make better decisions based on facts, you'll need to get this factual information from one or more information systems, integrate this data in a useful way, and present users with reports and analysis that will help them to understand past and present organizational performance.
The real value of BI systems lies in their use to support organizations to make better informed decisions that will lead to increased profitability, lowered costs, improved efficiency, market share growth, higher employee satisfaction, or whatever the goals of your organization might be. The added benefit of using the open source Pentaho solution to accomplish this is the great value for money and flexibility of the software. This allows any organization, for-profit or non-profit, big or small, to implement and use this software to make better decisions.
About This Book
The beginnings of Pentaho Solutions go back more than a year, to when we, the authors, noticed an increasing interest in open source and free software solutions, combined with a still-growing awareness that BI software is essential in measuring and improving an organization's performance.
Over the past decade, open source variants of more and more types of software have become commonly accepted and respected alternatives to their more costly and less flexible proprietary counterparts. The fact that software is open source is often mistaken for being free of cost, and though that might be true if you only look at the license costs, a BI solution cannot (and never will) be free of cost. There are costs associated with hardware, implementation, maintenance, training and migration, and if this is all summed up it turns out that licenses make up only a small portion of the total lifecycle cost of any software solution. Open source, however, is much more than a cheaper way of acquiring software. The fact that the source code is freely available to anyone ensures better code quality since it is more likely that bugs are found when more people have access to the source than just the core developers. The fact that open source software is built on open standards using standard programming languages (mostly Java) makes it extremely flexible and extensible. And the fact that most open source software is not tied to a particular operating system extends this flexibility and freedom even further.
What is usually lacking, though, is a good set of documentation and manuals. Most open source projects provide excellent quality software, but developers usually care more about getting great software out than delivering proper documentation. And although you can find many good sources of information about each piece of the Pentaho BI solutions, we felt there was a need for a single source of information to help the novice user on his or her way in discovering the Pentaho toolset and implementing the first solution. That is exactly what this book is for—to help you to build your first BI solutions using Pentaho, from the very beginning (discovering Pentaho) to the very end (building dashboards for end users).
Who Should Read This Book
This book is meant for anyone who wants to know how to deliver BI solutions using Pentaho. Maybe you are an IT manager looking for a cost efficient BI solution, an IT professional looking to broaden your skill set, or a BI or data warehouse consultant responsible for developing BI solutions in your organization. Maybe you're a software developer with a lot of experience building open source solutions but still new to the world of Business Intelligence. And maybe you're already an experienced BI or data warehouse developer with deep knowledge of one or more of the existing proprietary tools. In any case, we assume you have a hands-on mentality since this is a hands-on book. We do expect some familiarity with using computers to deliver information, installing software, and working with databases, but most of the topics will be explained right from the start. So if you are not a seasoned SQL expert, don't worry: we'll cover the basics of that query language to get you on your way. Of course the BI and data warehouse concepts are explained as well, but the primary focus is on how to transform these concepts into a working solution. That is exactly why the book is called Pentaho Solutions.
What You Will Need to Use This Book
In order to use this book, you only need two things: a computer and an Internet connection. All the software we discuss and use in this book is freely available over the Internet for download and use. The system requirements for the computer you will need are fairly moderate; in fact, any computer that is less than four years old will do the job just fine, as long as you have at least 1 Gigabyte of RAM installed and 2 Gigabytes of free disk space available for downloading and installing software.
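As a quick way to check the disk space requirement before downloading anything, a minimal sketch follows. It is not part of the Pentaho tooling; the function name `has_enough_disk` and the constant `REQUIRED_FREE_BYTES` are made up for illustration, and it assumes that "2 Gigabytes" means 2 GiB. It checks free disk space only, not installed RAM.

```python
import shutil

# Assumption: "2 Gigabytes" is read as 2 GiB (2 * 1024**3 bytes).
REQUIRED_FREE_BYTES = 2 * 1024**3


def has_enough_disk(path=".", required_bytes=REQUIRED_FREE_BYTES):
    """Return True if the filesystem holding `path` has at least `required_bytes` free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_bytes


if __name__ == "__main__":
    if has_enough_disk():
        print("Enough free disk space for the downloads.")
    else:
        print("Less than 2 GiB free; clear some space first.")
```

Pass the directory where you plan to unpack the software as `path` if it lives on a different disk or partition than the current directory.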
The various chapters contain URLs where you can find and download the software being used and the accompanying installation instructions. As for Pentaho, there are, apart from the actual source code of course, four versions of the software that you can use:
* GA releases: These are the stable builds of the software, usually not the most recent ones but surely the most reliable.
* Release candidates: The “almost ready” next versions of the software, possibly with a few minor bugs still in them.
* Milestone releases: These are created more frequently and allow you to work with recent versions introducing new features.
* Nightly builds: The most up-to-date versions of the software, but also the least stable ones.
When writing this book, we mostly worked with the nightly builds that generally precede the GA releases by three months or more. This means that when you read this book, the software used in this book is at least a milestone release or already GA. This allows you to work through the material using a stable, bug-free product and you can concentrate on building solutions, not fixing bugs.
Note
As this book goes to press, the next major release of Pentaho is expected in Fall 2009. The final version number for this release is not yet public knowledge, but the working version is currently designated as “Citrus.”
The complete list with download options is available online at http://wiki.pentaho.com/display/COM/Community+Edition+Downloads.
What You Will Learn from This Book
This book will teach you:
* What Business Intelligence is, and why you need it
* The components and products that form the Pentaho Business Intelligence suite, and how these products and components fulfill particular BI needs
* How to install and configure Pentaho, and how to connect it to a data warehouse
* How to design a data warehouse using open source tools
* How to build and load a data warehouse with Pentaho Data Integration (Kettle)
* How to set up a metadata layer to allow ad-hoc and self-service reporting without using direct SQL queries
* How to create reports using the Pentaho Reporting tools
* How to create Pentaho Analysis Services (Mondrian) cubes, and visualize them using the JPivot cube browser
* How to set up scheduling, subscription, and automatic distribution of BI content
* How to get started with Pentaho Data Mining (Weka)
* How to build dashboards using the Community Dashboard Framework for Pentaho
How This Book Is Organized
This book explains BI concepts, technologies, and solutions. We use a fictional online video sales and rental business (think Netflix) that appears throughout the book. For each distinct part, sample implementations are created using Pentaho. When the example relies on a database, we have taken care to ensure the sample code is compatible with the popular and ubiquitous MySQL database (version 5.1).
These samples provide the technical details necessary to understand how you can build BI solutions for real-world situations. The scope of these BI solutions is mainly on the level of the departmental data mart, which we believe is the most common business case for BI/data warehousing.
Part I: Getting Started with Pentaho
Part I is focused on gaining a quick and high-level understanding of the Pentaho software, its architecture, and its capabilities. In addition, this part introduces you to a number of supporting open source tools that can help in developing real-world data warehouses and BI applications.
Chapter 1: Quick Start: Pentaho Examples
Business Intelligence is a vast subject, and Pentaho is a complex piece of software. At the same time, it is easy to grasp why you would need it, and how it might apply to you if you follow along with a few examples. This chapter does exactly that: it provides a practical, hands-on demonstration of what you can do with BI and how Pentaho helps you do it.
Chapter 2: Prerequisites
In order to develop BI applications and the supporting architecture such as a data warehouse, you need several software products, such as data modeling tools and a database server. This chapter introduces you to a number of products that are key to the examples developed in this book, as well as a number of supportive tools to increase productivity. Like Pentaho, all products mentioned here are open source/free software. We are confident you'll find a number of valuable additions to your professional toolkit here.
Chapter 3: Server Installation and Configuration
Although this book does not provide a full reference to Pentaho administration and configuration, the most important configuration and installation tasks are described in this chapter. As such, this chapter is not so much about explaining concepts; rather, it is about knowing what configuration files to edit to set things up to your liking. You should at least read through this chapter once before building Pentaho applications. Much of the information here has the character of a reference. As we cover more aspects of Pentaho throughout the book, you may want to revisit this chapter to look up certain details concerning the Pentaho configuration.
Chapter 4: The Pentaho BI Stack
This chapter provides an overview of Pentaho, its components, its capabilities, and its architecture. It introduces you to important Pentaho concepts, such as action sequences and the solution repository. Although there is a fair amount of theory in this chapter, it also provides practical explanations, tying together a lot of material covered in the previous chapters.
Part II: Dimensional Modeling and Data Warehouse Design
Part II introduces you to key concepts and techniques concerning dimensional modeling and data warehousing. These concepts are made tangible by using a single example business case based on a (fictional) online DVD rental business, World Class Movies. By constantly and consciously mixing theory and practical, hands-on examples, we hope to lay a sound foundation for developing BI applications in the remainder of the book.
Chapter 5: Example Business Case: World Class Movies
In this chapter, we introduce the World Class Movies online DVD rental business. We provide a detailed explanation of its business and underlying OLTP database schema.
Chapter 6: Data Warehouse Primer
This chapter introduces and explains fundamental concepts of dimensional modeling and data warehousing. It explains the benefits of using a data warehouse and how a data warehouse differs from other types of databases. The chapter covers the history, current state, and future outlook of data warehouse technology and architecture.
Chapter 7: Modeling the Business Using Star Schemas
This chapter takes the dimensional modeling and data warehousing concepts from the previous chapter and applies them to the World Class Movies business case in order to develop the various parts of the data mart model. This model serves as the cornerstone for BI applications (which are developed in the next part of the book).
Chapter 8: The Data Mart Design Process
In this chapter, the logical design from the previous chapter is further developed, culminating in a series of star schemas that serve to achieve the physical implementation of the World Class Movies data warehouse, which is the basis of virtually all hands-on examples in the remainder of the book.
Part III: ETL and Data Integration
Part III is devoted to the process of filling the data warehouse using Pentaho Data Integration tools and features.
Chapter 9: Pentaho Data Integration Primer
This chapter provides an overview of all the tools that make up the Pentaho Data Integration (PDI) toolkit. It explains the architecture and introduces you to a number of concepts that are fundamental to ETL design within the Pentaho platform. At the same time, it provides you with the basic hands-on skills that will help you to use Pentaho Data Integration tools effectively to build ETL applications.
Chapter 10: Designing Pentaho Data Integration Solutions
Using the concepts and basic skills gained from the previous chapter, this chapter focuses on designing and building a hands-on practical solution to load the Orders data mart of the World Class Movies data warehouse. The example transformations are accompanied by an in-depth description of commonly used transformation steps.
Chapter 11: Deploying Pentaho Data Integration Solutions
This chapter focuses on managing and deploying Pentaho Data Integration solutions. In addition, it explains how individual transformations can be combined to build jobs. Several techniques for managing static resources such as database connections and files are discussed, along with some of the more advanced PDI features such as remote execution and clustering.
Part IV: Business Intelligence Applications
Part IV explains how to use the data warehouse to create BI content the end users care about.
Chapter 12: The Metadata Layer
This chapter introduces Pentaho metadata and the metadata editor. In addition to explaining metadata concepts and the purpose of metadata in BI solutions, this chapter provides detailed instructions for creating a metadata domain that can be used to create self-service reports.
Chapter 13: Using the Pentaho Reporting Tools
This chapter provides an in-depth tutorial about designing and deploying reports using the Pentaho Report Designer. You will learn how to create queries using the visual SQL designer and the metadata query tool, add parameters to the report for interactive analysis, and build great looking and insightful reports using tables, graphs and charts.
Chapter 14: Scheduling, Subscription, and Bursting
This chapter is all about automatic production and delivery of BI content. You will learn how to use Pentaho's built-in scheduler and how it ties into features such as subscription and bursting.
Chapter 15: OLAP Solutions Using Pentaho Analysis Services
This chapter explains the Pentaho OLAP components. In addition to explaining OLAP and MDX in general, this chapter helps you create Pentaho analysis cubes and views. The last part of this chapter introduces the aggregate designer that helps you to improve the performance of the Pentaho Analysis tool.
Chapter 16: Data Mining with Weka
In this chapter we introduce basic data mining concepts and practices such as clustering and classification using Weka, the Pentaho data mining component. We end this chapter with an example of how you can use a data mining model created with Weka in a Pentaho Data Integration transformation.
Chapter 17: Building Dashboards
This chapter explains the concepts underlying the Community Dashboard Framework. Using a step-by-step method, this chapter explains in detail how to combine a number of different Pentaho solution items and bring them together on one dashboard.
On the Website
All the example material used in the book is available for download from the companion website at Wiley (www.wiley.com/go/pentahosolutions) and at www.worldclassmovies.com. The downloads include the following items:
* Power*Architect data models for the databases in the book
* Data files for customers, products, and employees
* MySQL create scripts for the databases
* MySQL scripts for generating sales transactions
* All PDI jobs and transformations
* Metadata models for building reports
* Sample reports
* Mondrian schemas
* Dashboard definition files
* Action sequence examples
Further Resources
There are numerous books available on the specific topics covered in this book. Many chapters contain references for further reading and links to websites that contain additional information. If you are new to Business Intelligence and data warehousing in general (or want to keep up with the latest developments), here are some good places to start:
http://en.wikipedia.org/wiki/Business_intelligence
http://www.kimballgroup.com
http://b-eye-network.com
http://www.tdwi.org
We also encourage you to visit our websites, http://rpbouman.blogspot.com and www.tholis.com, where you can find our contact information in case you want to get in touch with us directly.
Part I
Getting Started with Pentaho
In This Part
Chapter 1: Quick Start: Pentaho Examples
Chapter 2: Prerequisites
Chapter 3: Server Installation and Configuration
Chapter 4: The Pentaho BI Stack
Chapter 1
Quick Start: Pentaho Examples
Pentaho is a powerful Business Intelligence Suite offering many features: reporting, OLAP pivot tables, dashboarding and more. In this book you will find a lot of detailed information about Pentaho's components, how they work and interact, the features they deliver, and how to use the Pentaho BI Suite to create solutions for real-world problems. However, it's a good idea to try and grasp the big picture before diving into the details.
This chapter helps you get started by showing you where to get the software and how to install and run it. The Pentaho BI Suite includes many examples demonstrating its features to give new users an idea of what kind of solutions you can build with it. Most of these examples work “out of the box” and are thus ideal for an introduction to the product. By reading this chapter, you'll get acquainted with Pentaho by looking at some examples.
Getting Started with Pentaho
In this section, we describe how to obtain the software, install it, and run it. To run the software, you need a regular desktop or laptop computer running any popular operating system, such as Ubuntu Linux, Mac OS X, or Microsoft Windows 7, XP, or Vista. To download the necessary software you will need an Internet connection with sufficient bandwidth to download tens to hundreds of megabytes.
Downloading and Installing the Software
The Pentaho BI Suite is open source software; you are free to use and distribute its programs, and if you like, you can study and even modify its source code. You may do all of this free of charge.
Pentaho is programmed in the Java programming language. Before you can run Java programs, you need to install Java. For Pentaho, you need at least Java version 1.5. You should also be able to use Java 1.6. We assume you already have a recent version of Java installed on your system. You can find more details on downloading and installing Java in Chapter 2.
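To see which Java version is installed, you can run java -version in a terminal. The following is a minimal sketch of such a check, assuming the legacy 1.x version strings that Java 1.5 and 1.6 report. The helper name java_version_ok is our own and is not part of Java or Pentaho.

```shell
#!/bin/sh
# Hypothetical helper (not part of Java or Pentaho): succeeds when a version
# string such as "1.6.0_20" denotes Java 1.5 or later.
java_version_ok() {
    major=$(printf '%s\n' "$1" | cut -d. -f1)
    minor=$(printf '%s\n' "$1" | cut -d. -f2)
    case "$major$minor" in
        ''|*[!0-9]*) return 1 ;;   # empty or not a version we can parse
    esac
    [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -ge 5 ]; }
}

# Only query the JVM when a java executable is actually on the PATH.
if command -v java >/dev/null 2>&1; then
    # java -version writes to stderr; the version is the first quoted string.
    installed=$(java -version 2>&1 | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)
    if java_version_ok "$installed"; then
        echo "Java $installed should be sufficient"
    else
        echo "Java '$installed' looks too old or could not be parsed; at least 1.5 is needed"
    fi
else
    echo "No java executable found on the PATH"
fi
```

If the check reports that no Java is found, or that the version is too old, Chapter 2 describes how to install a suitable version.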
You can download all of Pentaho's released software from the SourceForge website. The easiest way to find the software is to navigate to http://sourceforge.net/projects/pentaho/ and click the Download link. You will see a list of products you can download.
For now, you won't need all of the software—all you're interested in at the moment is the Business Intelligence Server. Click the Download link in the far right column. This takes you to a page containing a list of different versions of the software. Here you should take care to find the latest version of the generally available (GA) release, packaged in a way that is appropriate for your platform. For example, Microsoft Windows users should download the .zip compressed package, and users of UNIX-based systems should download the .tar.gz compressed package.
Note
In Pentaho's download pages on SourceForge, you can usually find at least the latest generally available (GA) release as well as a so-called milestone release of the new, upcoming version. If you really want to be on the bleeding edge of development, you can download nightly builds of the software from http://ci.pentaho.com/. For this book, we mostly worked with the nightly builds of the Citrus release, which was still being developed at the time of writing, but which should be available as a milestone or GA release by the time of publishing.
It is always a good idea to try out the milestone releases to keep track of future changes and additions. But beware that milestone releases are still in development; they are not intended for production use, and you may find bugs or experience usability issues. However, this is one of the best reasons why you should run milestone releases—by reporting any issues you experience, you can directly influence the improvement of the software for your own benefit (as well as that of all other users).
After downloading the .zip or .tar.gz compressed package, you must extract the actual software from the compressed package and copy it to some place you find convenient. Windows users can right-click the .zip file and choose Extract Here (in new folder) in the context menu. Alternatively, you can use a third-party program such as Peazip to extract the programs from the compressed package. Users of UNIX-like systems can open a terminal and extract the package from the command line.
Extraction should result in a single folder containing all of the Pentaho BI Server software. Windows users can place this folder anywhere they like, but it makes most sense to put it in the Program Files directory. For UNIX-like systems, the proper location depends on the exact UNIX flavor, but for checking out the examples, it is best to move the Pentaho Server directory to your home directory. In the rest of this chapter, we refer to the directory containing the Pentaho Server software as the Pentaho home directory or simply Pentaho home.
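For users of UNIX-like systems, extracting the package into the home directory from the command line might look like the sketch below. The archive name pentaho-download.tar.gz is a placeholder for whatever file you actually downloaded, and the extract_package helper is hypothetical, introduced here only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: unpack a .tar.gz package into a target directory.
extract_package() {
    archive=$1
    target=$2
    if [ ! -f "$archive" ]; then
        echo "Archive $archive not found" >&2
        return 1
    fi
    mkdir -p "$target" || return 1
    # -x extract, -z decompress gzip, -f read this file, -C unpack into target
    tar -xzf "$archive" -C "$target"
}

# Placeholder file name; substitute the package you actually downloaded.
if extract_package "pentaho-download.tar.gz" "$HOME"; then
    echo "Extracted into $HOME"
fi
```

Unpacking directly into the home directory with -C avoids a separate move step afterwards.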
Running the Software
Now that you have downloaded and installed the software, you can start using it.
Starting the Pentaho BI Server
In the Pentaho home directory, you will find a few scripts that can be used to start the server. Microsoft Windows users can double-click the script named start-pentaho.bat.
For UNIX-based systems, the script is called start-pentaho.sh. You may first need to allow this script to be executed. Modern Linux desktop environments such as GNOME and KDE will let you do this in the file's Properties dialog, which you can invoke from the file browser. For example, in Ubuntu Linux, you can right-click the file and choose Properties from the context menu to invoke the dialog. In the Permissions tab in the dialog, you can select a checkbox to allow the file to be executed, as illustrated in Figure 1.1.
Figure 1.1 Making the start-pentaho.sh script executable
Alternatively, you can open a terminal and change directory (using the cd command) to the Pentaho home directory. From there, you can use the following command to make all .sh scripts executable:
shell> chmod ug+x *.sh
Now you can simply start the script by double-clicking it (you may need to confirm in a dialog) or by typing it in the terminal:
shell> ./start-pentaho.sh
After starting the script, you will see quite a lot of output appear in the console. You should leave the terminal window in which you started the script open.
Note
The start-pentaho script does two things.
First, it starts an HSQLDB database server, which is used by the Pentaho server to store system data, as well as a sample database, which is used by most examples. By default, the HSQLDB database runs on port 9001. You should make sure no other server is running on that port.
Second, it starts a Tomcat server. By default, the Tomcat server listens on port 8080 for web requests. You should make sure no other server is running on that port, or the Pentaho BI Server will not be started successfully.
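One way to check beforehand whether the two default ports are free is sketched below. It assumes a netcat (nc) that supports the -z option is installed, which is not the case on every system. The port_in_use helper is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when something already accepts connections
# on the given local port. nc -z only probes the port; it sends no data.
port_in_use() {
    nc -z localhost "$1" >/dev/null 2>&1
}

if command -v nc >/dev/null 2>&1; then
    # 9001 is the HSQLDB default and 8080 is the Tomcat default.
    for port in 9001 8080; do
        if port_in_use "$port"; then
            echo "Port $port is already in use"
        else
            echo "Port $port looks free"
        fi
    done
else
    echo "nc is not installed; skipping the port check"
fi
```

Run this before starting the server. Once the Pentaho BI Server is running, both ports are expected to be reported as in use.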
Logging in
After starting the server you can start your Internet browser to connect to the server. You should be able to use any of the major browsers (such as Mozilla Firefox, Microsoft Internet Explorer, Apple Safari, Opera, or Google Chrome) to do this. Navigate your browser to the following address:
http://localhost:8080
You are automatically redirected to the following:
http://localhost:8080/pentaho/Login
Shortly, you should see a Welcome page for the Pentaho user console. From there, you can log in to the server by pressing the large orange Login button. If you press the button, a Login box appears. From there, you can select a username from the drop-down list. For now, log in as the user Joe, as shown in Figure 1.2.
Figure 1.2 The Pentaho welcome screen and login dialog
After selecting the username, you can press the Login button to actually log in.
Mantle, the Pentaho User Console
After confirming the login, you should see the Pentaho user console, as shown in Figure 1.3.
Figure 1.3 The Pentaho user console, also known as Mantle
In the user console, you'll find a few elements to control the Pentaho BI Server:
A menu bar, which is located at the top of the page and spans the page horizontally. Here you can find some standard menu items: File, View, Tools and Help.
A toolbar containing several buttons, located immediately beneath the menu.
A side pane, located on the left of the page, can be dynamically resized using the gray vertical bar at the far right of the pane. The pane can also be hidden/displayed in its entirety using the Toggle Browser button, which is the rightmost button on the toolbar.
The tree view that is visible in the upper half of the side pane is called the Repository Browser. In Figure 1.3, this is labelled Browse. You can use this to browse through all BI content available in the Pentaho BI Server.
A folder contents pane is located in the side pane, right beneath the solution repository browser. In Figure 1.3 this is labelled Files. It shows any contents of the selected folder in the solution repository (such as reports, dashboards and OLAP pivot tables) as a list of items. You can open an item by double-clicking it.
A workspace. This is the larger pane on the right. When you double-click an item in the folder contents pane, it will be displayed here using a tab interface.
Working with the Examples
The community edition of the Pentaho BI Server comes with two sets of examples:
BI Developer Examples
Steel Wheels
Each set of examples resides in its own Pentaho solution and is visible in the solution repository browser (see Figure 1.4).
Figure 1.4 Two example solutions included in the Pentaho BI Server
Both of these Pentaho solutions contain good examples to demonstrate the types of reports you can create with Pentaho. Both solutions use the same sample data set. The BI Developer Examples focus more on the technical aspect of accomplishing a particular task, whereas the Steel Wheels examples illustrate how to combine techniques to build an application to support a classic cars business. The Steel Wheels examples also pay more attention to customizing look and feel.
Using the Repository Browser
You can access all of the examples using the repository browser. (This is the top pane of the left side bar in the user console, labelled Browse.) The repository browser offers a tree view that can be used to open and close the folders in the repository. To open a folder and reveal its subfolders, simply click once on the plus icon immediately on the left side of the folder icon. The folder's subfolders will become visible right beneath the parent folder, and the icon left of the folder icon changes to display a minus, indicating the folder is currently expanded. To close a folder and hide its subfolders, click on the minus icon.
To view the contents of a folder, click the folder icon or the folder name that appears directly on the right of the folder icon. The folder title will display a gray highlighting and its contents will become visible in the folder contents pane directly beneath the repository browser (in Figure 1.3, this is labelled Files).
To open an item that appears in the Files pane, double-click it. This will open a new tab page in the workspace, showing the output created by the item.
Understanding the Examples
Although you can learn a lot from the examples by simply running them, you can learn even more if you can see how they were built. Especially if you are a Business Intelligence developer, you should consider examining the examples more closely using Pentaho Design Studio.
You'll learn the details about Pentaho Design Studio in Chapter 4, but you can follow these steps to get started quickly:
1. Download Pentaho Design Studio from the Pentaho downloads page at SourceForge.net.
2. Unzip the download to some location you find convenient.
3. Start Pentaho Design Studio. Microsoft Windows users can double-click PentahoDesignStudio.exe; users of UNIX-based systems can execute the PentahoDesignStudio binary file.
4. Use the main menu (File ⇒ Switch Workspace) to change the workspace to the directory where you installed the Pentaho BI Server. The program will restart. In the opening splash screen, choose Workbench.
5. Create a new project by choosing File ⇒ New ⇒ Project. In the dialog, expand the General folder and choose Project to create a plain project. Click Next.
6. In the next dialog, enter pentaho-solutions for the project name. Make sure that whatever you type here corresponds exactly to the name of the pentaho-solutions directory located in the home directory of the Pentaho BI Server. The Use Default Location checkbox should be selected, and the location should automatically point to the Pentaho BI Server home directory.
7. Confirm the dialog.
In the Navigator tab page in the left side pane in Pentaho Design Studio, you should now see the pentaho-solutions project folder (which corresponds exactly with the actual pentaho-solutions folder). You can expand this folder and browse through the Pentaho solution repository.
Double-clicking on any items inside the folders will usually load the file in a new tab page in the Pentaho Design Studio Workspace. You can learn a lot, especially from opening the .xaction files that are present throughout the repository. Refer to Chapter 4 for more details on these files.
Beware that the items that show up in the repository browser in the user console of the Pentaho BI Server usually have a label that is distinct from the actual file name. This complicates things a bit in case you're looking for the corresponding item in Pentaho Design Studio, as the navigator there only displays file names. To discover the corresponding file name for any item shown in the repository browser, right-click the item and choose Properties in the context menu. This will pop up a dialog with a few tabs. The actual file name is shown in the General tab.
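If you prefer the command line, you can also list the action sequence files directly from the file system. The sketch below assumes it is run from the Pentaho home directory, so that pentaho-solutions is a subdirectory of the current directory. The list_xactions helper is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print every action sequence file below a directory.
list_xactions() {
    find "$1" -type f -name '*.xaction' | sort
}

# Assumes the current directory is the Pentaho home directory.
if [ -d pentaho-solutions ]; then
    list_xactions pentaho-solutions
else
    echo "No pentaho-solutions directory in $(pwd)"
fi
```

To locate one particular item, you could pipe the output through grep with the file name shown in the General tab of the Properties dialog.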
Note
The .xaction extension indicates an action sequence. Action sequences are Pentaho-specific lightweight processes to run or deliver BI content. In this particular case, the action sequence simply calls a Pentaho report. Action sequences are coded in a specific XML-format and typically stored in .xaction files. Action sequences are discussed in more detail in Chapter 4.
Running the Examples
In the remainder of this chapter, we discuss a few items from these examples to give you a feel for what you can do with Pentaho solutions. For each item, we include references to the chapters of this book that relate to the example. We hope this will allow you to quickly get an overview of Pentaho's features and see how this book can help you master them.
Reporting Examples
Reporting is often one of the first requirements of any BI solution. Reporting is covered in detail in Chapter 13. Most of the reports discussed here are invoked from an action sequence; you can find more details on action sequences in Chapter 4.
The following sections examine a few of the reporting examples.
BI Developer Examples: Regional Sales - HTML
The Regional Sales - HTML example is one of the most straightforward reporting examples; as you would assume, it shows the sales figures for an example company broken down by region. You can find it in the Reporting folder in the BI Developer Examples set. The corresponding file name is JFree_Quad.xaction.
When you run the example, the report output is immediately shown in the workspace (see Figure 1.5).
Figure 1.5 The Regional Sales - HTML sample report
In the report output you see an organization detailed by region (Central), department (Executive Management, Finance) and then position title (SVP Partnerships, CEO, and so on). For the position title level, you see the actual data. In this case, the data pertains to sales and shows the actual and projected (budgeted) sales numbers in the first two columns and the variance in the third column. You also see a totals line that sums up the figures for the department level, and if you could scroll down further you would also see the totals for the regional level, followed by the figures for another region. All the way down at the bottom of the report you would see totals for the entire business.
Steel Wheels: Income Statement
The Income Statement example report from the Steel Wheels example set is another typical report with a self-explanatory name. You can find it in the Reporting folder beneath the Steel Wheels solution, and the corresponding file name is Income Statement.xaction. Figure 1.6 shows the report.
Figure 1.6 The Steel Wheels Income Statement report
A few differences from the Regional Sales report in the previous example are the styling and the output format. Although both reports were created with the Pentaho Report Designer, and both are rendered by the Pentaho reporting engine (which is the component responsible for interpreting reports and generating report output), they look quite different. Whereas the Regional Sales report outputs an HTML page, this report delivers a PDF file as output. In addition, this report shows adornments using a picture for a logo and a page background picture.
Steel Wheels: Top 10 Customers
In the previous section, we mentioned that the Income Statement report delivers output in the form of a PDF file, whereas the Regional Sales example outputs a plain web page. The Top 10 Customers report illustrates two more important features of the report output format. You can also find this report in the Reporting folder in the Steel Wheels example set, and its file name is Top Ten Customer ProductLine Analysis.xaction. Running this example does not immediately show the report output, but displays the dialog shown in Figure 1.7 instead.
Figure 1.7 The Top 10 Customers report
As indicated by the dialog, you can choose from as many as five different output formats. In the previous reporting examples, the desired output format was stored as part of the report, but there is nothing in the reporting engine that forces this. This allows users to choose whatever format is most appropriate for the purpose at hand.
The dialog shown in Figure 1.7 illustrates another important feature of Pentaho reporting. The user can choose to wait for the report output now, or to have the Pentaho BI Server run the report in the background. The latter option will execute the report, but does not wait for the output to be returned. Rather, the output will be stored in the user's personal storage space on the server. This feature is especially useful for long-running reports.
You can find more on background execution and related features such as scheduling and subscription in Chapter 14.
BI Developer Examples: button-single-parameter.prpt
The previous example reports were all called from action sequences. In the upcoming Citrus release, reports can also be called directly. Examples using this feature are all located in the Reporting folder in the BI Developer Examples set.
This example takes a closer look at the button-single-parameter.prpt example. When you start it, the report loads immediately in the workspace. However, the actual report output won't show until you press one of the Region buttons that appear in the Report Parameters section at the top of the page. Figure 1.8 illustrates what you might see after you press the Central button.
Figure 1.8 The button-single-parameter.prpt example
This example shows yet another feature of Pentaho, namely report parameters. Through parameters, the user can interact with the report and specify values to influence report behavior. Generally, this feature is used to allow the user to select only a portion of all possible report data.
In this example, there are two parameters. The Additional Title-Text parameter allows the user to specify a title that appears above all remaining report output. There is another parameter for Region, which allows the report to render output pertaining to only the specified region.
There are many more things you can do with report parameters, and these examples, as well as Chapter 13 of this book, should offer enough guidance for you to use this feature in a meaningful way.
Charting Examples
Whereas reports are great to communicate detailed information, they are less suitable for obtaining an overview of the data as a whole. For this purpose, charts and graphs usually work better. Charts are also better suited than reports to display trends over time.
The Pentaho BI Server ships with two different charting solutions:
JFreeChart: A 100% Java chart library.
Pentaho Flash Charts: A charting solution based on open flash charts (which requires Adobe Flash).
Pentaho reporting offers full integration with JFreeChart, and you will find detailed information on integrating charts with your reports in Chapter 13. You can find more information about JFreeChart charts and how to integrate them with dashboards in Chapter 17.
Steel Wheels: Chart Pick List
The Chart Pick List example is located in the Charts folder in the Steel Wheels example set. The corresponding file name is ChartComponent_ChartTypes.xaction. Executing the item loads a dialog in the workspace that allows you to choose a particular chart type. After picking the chart type, you can press the Run button to actually display the chart. Figure 1.9 shows how this works for a Pie Grid.
Figure 1.9 Pentaho charting using the JFreeChart Chart Pick List
Steel Wheels: Flash Chart List
Functionally, the Flash Chart List example is similar to the Chart Pick List example (which is based on JFreeChart). The difference is that the Flash Chart List example is based on the Open Flash Chart project. You can also find the Flash Chart List in the Charts folder within the Steel Wheels example set. The corresponding file name is pentahoxml_picker.xaction.
BI Developer Examples: Regional Sales - Line/Bar Chart
The Regional Sales - Line/Bar Chart example is located in the Reporting folder in the BI Developer Examples solution. The corresponding file is JFree_SQLQuery_ComboChart.xaction.
This example report displays a chart on the top of the page, and below that, a more detailed report shows the actual figures. In this case the chart is embedded into the report. The example report is shown in Figure 1.10.
Figure 1.10 Regional Sales - Line/Bar Chart example
Analysis Examples
Like reporting, analysis is another essential feature of all BI solutions. Reports are typically static (save for parameters) and mainly used to support decisions that affect the business at the operational level. Analysis tends to be a lot more dynamic, and is typically used by managers to support decisions at the tactical and strategic level.
One of the typical elements in analytical solutions is that they allow the user to dynamically explore the data in an ad-hoc manner. Typically, the data is first presented at a highly aggregated level, say, total sales per year, and then the user can drill down to a more detailed level, say, sales per month per region. Any interesting differences between regions and/or months can then be used to drill into a new direction until a new insight or understanding of the business is obtained, which could then be used to affect plans for new promotions, next season's product catalog, or development of new products. This, in a nutshell, is what analysis is for.
Closely related to typical analytical questions and solutions is the dimensional model. Ultimately, this is what allows viewing data in aggregated form and features such as drill up/down. You will find detailed information about the dimensional model in Chapters 6, 7, and 8 of this book. In Chapter 15, we discuss the practical implementation of analytical applications using Mondrian and JPivot. All analytical examples presented in this chapter are based on Mondrian/JPivot.
BI Developer Examples: Slice and Dice
The Slice and Dice example is located in the Analysis folder in the BI Developer Examples. Its corresponding file is called query1.xaction.
The Slice and Dice example is the most basic analysis example included with the Pentaho BI Server. Running it produces a dynamic crosstab, also known as a pivot table. The pivot table shows actual and budgeted sales figures, as well as actual versus budget variance. In the context of Analytics, figures like these are called measures or metrics. The measures can be split according to Region, Department, and Position. These headings are shown at the left side of the pivot table and represent dimensions, which are aspects that describe the context of the metrics.
A typical feature is that the pivot table not only shows the figures themselves but also totals, and that the totals can be computed at several levels of the dimensions (see Figure 1.11).
Figure 1.11 The Slice and Dice pivot table example
In Figure 1.11, you can see the columns for Region, Department, and Positions. The first row in the pivot table shows the results for All Regions, Departments, and Positions, and the figures are aggregated or “rolled up” along these dimensions. This represents the highest level of aggregation. Below that, you see that the data is split; in the first column, All Regions is split into Central, Eastern, Southern, and Western, forming the second-highest level of aggregation for the Region dimension. In the first row for each individual region, you see the data rolled up only across Department and Positions. For the Central region, the data is again split, this time showing all individual departments. Finally, for the Executive Management department, data is again split according to position.
The splitting and rolling up is achieved dynamically by clicking on the plus and minus icons that appear next to the labels identifying Region, Department, and Positions. For example, by clicking on the plus icon next to any of the All Departments labels appearing in the second column, you can drill down and see how the rolled-up total value for any of the Sales metrics can be split up. Clicking a minus icon will roll the values back together into the total again, thus drilling up.
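The idea of rolling up and drilling down can be sketched outside Pentaho with a few lines of shell and awk. The rows below are made-up sample figures, not data from the Pentaho sample database. The rollup helper is hypothetical, and the variance is assumed here to mean actual minus budget.

```shell
#!/bin/sh
# Made-up sample rows: region,department,actual,budget
data='Central,Finance,100,90
Central,Sales,200,220
Eastern,Finance,150,150
Eastern,Sales,50,40'

# rollup LEVEL: 0 aggregates everything, 1 splits by region,
# 2 splits by region and department.
rollup() {
    printf '%s\n' "$data" | awk -F, -v level="$1" '
        {
            key = "All"
            if (level >= 1) key = $1
            if (level >= 2) key = $1 "/" $2
            actual[key] += $3
            budget[key] += $4
        }
        END {
            # Print key, actual, budget, and variance (actual minus budget).
            for (k in actual)
                printf "%s %d %d %d\n", k, actual[k], budget[k], actual[k] - budget[k]
        }' | sort
}

rollup 0   # highest level of aggregation: a single "All" line
rollup 1   # drilled down one level: one line per region
rollup 2   # drilled down further: one line per region and department
```

Each higher level number corresponds to clicking a plus icon in the pivot table, and going back to a lower number corresponds to clicking a minus icon. An OLAP engine such as Mondrian does the same kind of grouping against the dimensional model rather than a flat list of rows.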
Steel Wheels Analysis Examples
In addition to the basic Slice and Dice example, you can find other interesting Analytics examples in the Analysis folder in the Steel Wheels example set. There you will find two examples:
Market Analysis By Year
Product Line Analysis
Like the basic Slice and Dice example, these examples display a pivot table, showing aggregated sales figures. In these examples, sales figures can be sliced along Product, Market (region), and Time.
Whereas the Slice and Dice example displayed only the measures on the horizontal axis, these examples show some more variety by placing the market on the horizontal axis. The Product Line Analysis example also places Time on the horizontal axis, beneath the Markets.
If you like, you can use alternative ways to set up the axes using the OLAP Navigator. You can invoke the OLAP Navigator by pressing the button with the cube icon on the toolbar that appears in the very top of the pages showing the analysis examples. The OLAP Navigator and a part of that toolbar are shown in Figure 1.12.
Figure 1.12 The OLAP Navigator
The OLAP Navigator shown in Figure 1.12 was taken from the Product Line Analysis example. In the top of the OLAP Navigator, you can see the caption Columns, and below that are two rows, Markets and Time. This corresponds directly with the Markets and Time shown along the horizontal axis of the pivot table. In the section below that, you see a Rows caption, with one row below it, Product. This corresponds with the products that are listed along the vertical axis of the pivot table. You can move the items in the Columns section to the Rows section and vice versa by clicking the small square in front of it.
There's a third section in the OLAP navigator labelled Filter. In this section, you find Customers, Measures, and Order Status. These items do not currently appear along one of the axes of the pivot table. You can move items from the Rows and Columns sections to the filter by clicking the filter icon. Moving items from the filter to either one of the axes is done by clicking the little square icon that corresponds to the axis to which you want to move the item.
We discuss the OLAP Navigator in detail in Chapter 15.
Dashboarding Examples
Dashboards are discussed in detail in Chapter 17. If you are interested in dashboards, you are strongly encouraged to check out the Community Dashboard Framework (CDF) dashboards examples included in the Pentaho BI Server. You can find them in the CDF folder in the BI Developer Examples solution.
A good way to start with Pentaho Dashboards is by navigating to the Samples subfolder of the CDF folder in the BI Developer Examples solution. Here you will find examples to use Charts, Reports, Analytic Pivot tables, and Maps in a dashboard, and see how you can tie these elements together.
Once you have a taste for what you can do with dashboards, you can read Chapter 17 and follow the detailed steps described there to build your own dashboard. When you are in the process of building your own dashboards, you will find the documentation included with the CDF examples indispensable. You can find detailed documentation in the Documentation subfolder of the CDF folder. The documentation found in the Component Reference folder will be an especially invaluable companion.
Other Examples
Many more examples are included in the Pentaho BI Server. These include examples to start ETL processes, to call web services, to send report output to a printer or by e-mail, and much more. However, we will not discuss these examples here. Many of these examples require additional setup, and others are not particularly instructive unless you have need for that particular feature. However, readers are encouraged to experiment with the remaining examples.
Summary
This chapter provided an introduction to the Pentaho software and walked you through some of the examples that are shipped with it. After installing the software and exploring the examples, you should have a good idea of what you can do with Pentaho. The rest of this book will teach you how to work with each part of Pentaho to create your own Pentaho solutions.
