* Cowritten by Ralph Kimball, the world's leading data warehousing authority, whose previous books have sold more than 150,000 copies
* Delivers real-world solutions for the most time- and labor-intensive portion of data warehousing: data staging, or the extract, transform, load (ETL) process
* Delineates best practices for extracting data from scattered sources, removing redundant and inaccurate data, transforming the remaining data into correctly formatted data structures, and then loading the end product into the data warehouse
* Offers proven time-saving ETL techniques, comprehensive guidance on building dimensional structures, and crucial advice on ensuring data quality
Contents
Cover
Half Title page
Title page
Copyright page
Acknowledgments
About the Authors
Introduction
Overview of the Book: Two Simultaneous Threads
The Planning & Design Thread
The Data Flow Thread
How the Book Is Organized
Who Should Read this Book
Summary
Part I: Requirements, Realities, and Architecture
Chapter 1: Surrounding the Requirements
Requirements
Architecture
The Mission of the Data Warehouse
The Mission of the ETL Team
Chapter 2: ETL Data Structures
To Stage or Not to Stage
Designing the Staging Area
Data Structures in the ETL System
Planning and Design Standards
Summary
Part II: Data Flow
Chapter 3: Extracting
Part 1: The Logical Data Map
Inside the Logical Data Map
Building the Logical Data Map
Integrating Heterogeneous Data Sources
Part 2: The Challenge of Extracting from Disparate Platforms
Mainframe Sources
Flat Files
XML Sources
Web Log Sources
ERP System Sources
Part 3: Extracting Changed Data
Summary
Chapter 4: Cleaning and Conforming
Defining Data Quality
Assumptions
Part 1: Design Objectives
Part 2: Cleaning Deliverables
Part 3: Screens and Their Measurements
Part 4: Conforming Deliverables
Summary
Chapter 5: Delivering Dimension Tables
The Basic Structure of a Dimension
The Grain of a Dimension
The Basic Load Plan for a Dimension
Flat Dimensions and Snowflaked Dimensions
Date and Time Dimensions
Big Dimensions
Small Dimensions
One Dimension or Two
Dimensional Roles
Dimensions as Subdimensions of Another Dimension
Degenerate Dimensions
Slowly Changing Dimensions
Type 1 Slowly Changing Dimension (Overwrite)
Type 2 Slowly Changing Dimension (Partitioning History)
Precise Time Stamping of a Type 2 Slowly Changing Dimension
Type 3 Slowly Changing Dimension (Alternate Realities)
Hybrid Slowly Changing Dimensions
Late-Arriving Dimension Records and Correcting Bad Data
Multivalued Dimensions and Bridge Tables
Ragged Hierarchies and Bridge Tables
Technical Note: Populating Hierarchy Bridge Tables
Using Positional Attributes in a Dimension to Represent Text Facts
Summary
Chapter 6: Delivering Fact Tables
The Basic Structure of a Fact Table
Guaranteeing Referential Integrity
Surrogate Key Pipeline
Fundamental Grains
Preparing for Loading Fact Tables
Factless Fact Tables
Augmenting a Type 1 Fact Table with Type 2 History
Graceful Modifications
Multiple Units of Measure in a Fact Table
Collecting Revenue in Multiple Currencies
Late Arriving Facts
Aggregations
Delivering Dimensional Data to OLAP Cubes
Summary
Part III: Implementation and Operations
Chapter 7: Development
Current Marketplace ETL Tool Suite Offerings
Current Scripting Languages
Time Is of the Essence
Using Database Bulk Loader Utilities to Speed Inserts
Managing Database Features to Improve Performance
Troubleshooting Performance Problems
Increasing ETL Throughput
Summary
Chapter 8: Operations
Scheduling and Support
Migrating to Production
Achieving Optimal ETL Performance
Purging Historic Data
Monitoring the ETL System
Tuning ETL Processes
ETL System Security
Short-Term Archiving and Recovery
Long-Term Archiving and Recovery
Summary
Chapter 9: Metadata
Defining Metadata
Business Metadata
Technical Metadata
ETL-Generated Metadata
Metadata Standards and Practices
Impact Analysis
Summary
Chapter 10: Responsibilities
Planning and Leadership
Managing the Project
Summary
Part IV: Real Time Streaming ETL Systems
Chapter 11: Real-Time ETL Systems
Why Real-Time ETL?
Defining Real-Time ETL
Challenges and Opportunities of Real-Time Data Warehousing
Real-Time Data Warehousing Review
Categorizing the Requirement
Real-Time ETL Approaches
Summary
Chapter 12: Conclusions
Deepening the Definition of ETL
The Future of Data Warehousing and ETL in Particular
Index
The Data Warehouse ETL Toolkit
Published by Wiley Publishing, Inc., 10475 Crosspoint Boulevard, Indianapolis, IN 46256, www.wiley.com
Copyright © 2004 by Wiley Publishing, Inc. All rights reserved.
Published simultaneously in Canada
ISBN: 0-7645-6757-8
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600. Requests to the Publisher for permission should be addressed to the Legal Department, Wiley Publishing, Inc., 10475 Crosspoint Blvd., Indianapolis, IN 46256, (317) 572-3447, fax (317) 572-4355, e-mail: [email protected].
Limit of Liability/Disclaimer of Warranty: The publisher and the author make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation warranties of fitness for a particular purpose. No warranty may be created or extended by sales or promotional materials. The advice and strategies contained herein may not be suitable for every situation. This work is sold with the understanding that the publisher is not engaged in rendering legal, accounting, or other professional services. If professional assistance is required, the services of a competent professional person should be sought. Neither the publisher nor the author shall be liable for damages arising herefrom. The fact that an organization or Website is referred to in this work as a citation and/or a potential source of further information does not mean that the author or the publisher endorses the information the organization or Website may provide or recommendations it may make. Further, readers should be aware that Internet Websites listed in this work may have changed or disappeared between when this work was written and when it is read.
For general information on our other products and services please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.
Library of Congress Cataloging-in-Publication Data
Kimball, Ralph. The data warehouse ETL toolkit : practical techniques for extracting, cleaning, conforming, and delivering data / Ralph Kimball, Joe Caserta. p. cm. Includes index. ISBN 0-7645-6757-8 (paper/website) 1. Data warehousing. 2. Database design. I. Caserta, Joe, 1965– II. Title. QA76.9.D37K53 2004 005.74—dc22 2004016909
Trademarks: Wiley, the Wiley Publishing logo, and related trade dress are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates. All other trademarks are the property of their respective owners. Wiley Publishing, Inc., is not associated with any product or vendor mentioned in this book.
Credits
Vice President and Executive Group Publisher: Richard Swadley
Vice President and Publisher: Joseph B. Wikert
Executive Editorial Director: Mary Bednarek
Executive Editor: Robert Elliott
Editorial Manager: Kathryn A. Malm
Development Editor: Adaobi Obi Tulton
Production Editor: Pamela Hanley
Media Development Specialist: Travis Silvers
Text Design & Composition: TechBooks Composition Services
Acknowledgments
First of all we want to thank the many thousands of readers of the Toolkit series of data warehousing books. We appreciate your wonderful support and encouragement to write a book about data warehouse ETL. We continue to learn from you, the owners and builders of data warehouses.
Both of us are especially indebted to Jim Stagnitto for encouraging Joe to start this book and giving him the confidence to go through with the project. Jim was a virtual third author with major creative contributions to the chapters on data quality and real-time ETL.
Special thanks are also due to Jeff Coster and Kim M. Knyal for significant contributions to the discussions of pre- and post-load processing and project managing the ETL process, respectively.
We had an extraordinary team of reviewers who crawled over the first version of the manuscript and made many helpful suggestions. It is always daunting to make significant changes to a manuscript that is “done” but this kind of deep review has been a tradition with the Toolkit series of books and was successful again this time. In alphabetic order, the reviewers included:
Wouleta Ayele, Bob Becker, Jan-Willem Beldman, Ivan Chong, Maurice Frank, Mark Hodson, Paul Hoffman, Qi Jin, David Lyle, Michael Martin, Joy Mundy, Rostislav Portnoy, Malathi Vellanki, Padmini Ramanujan, Margy Ross, Jack Serra-Lima, and Warren Thornthwaite.
We owe special thanks to our spouses Robin Caserta and Julie Kimball for their support throughout this project and our children Tori Caserta, Brian Kimball, Sara (Kimball) Smith, and grandchild(!) Abigail Smith who were very patient with the authors who always seemed to be working.
Finally, the team at Wiley Computer books has once again been a real asset in getting this book finished. Thank you Bob Elliott, Kevin Kent, and Adaobi Obi Tulton.
About the Authors
Ralph Kimball, Ph.D., founder of the Kimball Group, has been a leading visionary in the data warehouse industry since 1982 and is one of today’s most well-known speakers, consultants, teachers, and writers. His books include The Data Warehouse Toolkit (Wiley, 1996), The Data Warehouse Lifecycle Toolkit (Wiley, 1998), The Data Webhouse Toolkit (Wiley, 2000), and The Data Warehouse Toolkit, Second Edition (Wiley, 2002). He also has written for Intelligent Enterprise magazine since 1995, receiving the Readers’ Choice Award since 1999.
Ralph earned his doctorate in electrical engineering at Stanford University with a specialty in man-machine systems design. He was a research scientist, systems development manager, and product marketing manager at Xerox PARC and Xerox Systems’ Development Division from 1972 to 1982. For his work on the Xerox Star Workstation, the first commercial product with windows, icons, and a mouse, he received the Alexander C. Williams award from the IEEE Human Factors Society for systems design. From 1982 to 1986 Ralph was Vice President of Applications at Metaphor Computer Systems, the first data warehouse company. At Metaphor, Ralph invented the “capsule” facility, which was the first commercial implementation of the graphical data flow interface now in widespread use in all ETL tools. From 1986 to 1992 Ralph was founder and CEO of Red Brick Systems, a provider of ultra-fast relational database technology dedicated to decision support. In 1992 Ralph founded Ralph Kimball Associates, which became known as the Kimball Group in 2004. The Kimball Group is a team of highly experienced data warehouse design professionals known for their excellence in consulting, teaching, speaking, and writing.
Joe Caserta is the founder and Principal of Caserta Concepts, LLC. He is an influential data warehousing veteran whose expertise is shaped by years of industry experience and practical application of major data warehousing tools and databases. Joe was educated in database application development and design at Columbia University, New York.
Introduction
The Extract-Transform-Load (ETL) system is the foundation of the data warehouse. A properly designed ETL system extracts data from the source systems, enforces data quality and consistency standards, conforms data so that separate sources can be used together, and finally delivers data in a presentation-ready format so that application developers can build applications and end users can make decisions. This book is organized around these four steps.
The ETL system makes or breaks the data warehouse. Although building the ETL system is a back room activity that is not very visible to end users, it easily consumes 70 percent of the resources needed for implementation and maintenance of a typical data warehouse.
The ETL system adds significant value to data. It is far more than plumbing for getting data out of source systems and into the data warehouse.
Specifically, the ETL system:
Removes mistakes and corrects missing data
Provides documented measures of confidence in data
Captures the flow of transactional data for safekeeping
Adjusts data from multiple sources to be used together
Structures data to be usable by end-user tools
ETL is both a simple and a complicated subject. Almost everyone understands the basic mission of the ETL system: to get data out of the source and load it into the data warehouse. And most observers are increasingly appreciating the need to clean and transform data along the way. So much for the simple view. It is a fact of life that the next step in the design of the ETL system breaks into a thousand little subcases, depending on your own weird data sources, business rules, existing software, and unusual destination-reporting applications. The challenge for all of us is to tolerate the thousand little subcases but to keep perspective on the simple overall mission of the ETL system. Please judge this book by how well we meet this challenge!
The Data Warehouse ETL Toolkit is a practical guide for building successful ETL systems. This book is not a survey of all possible approaches! Rather, we build on a set of consistent techniques for delivery of dimensional data. Dimensional modeling has proven to be the most predictable and cost effective approach to building data warehouses. At the same time, because the dimensional structures are the same across many data warehouses, we can count on reusing code modules and specific development logic.
This book is a roadmap for planning, designing, building, and running the back room of a data warehouse. We expand the traditional ETL steps of extract, transform, and load into the more actionable steps of extract, clean, conform, and deliver, although we resist the temptation to change ETL into ECCD!
In this book, you’ll learn to:
Plan and design your ETL system
Choose the appropriate architecture from the many possible choices
Manage the implementation
Manage the day-to-day operations
Build the development/test/production suite of ETL processes
Understand the tradeoffs of various back-room data structures, including flat files, normalized schemas, XML schemas, and star join (dimensional) schemas
Analyze and extract source data
Build a comprehensive data-cleaning subsystem
Structure data into dimensional schemas for the most effective delivery to end users, business-intelligence tools, data-mining tools, OLAP cubes, and analytic applications
Deliver data effectively both to highly centralized and profoundly distributed data warehouses using the same techniques
Tune the overall ETL process for optimum performance
The preceding points are many of the big issues in an ETL system. But as much as we can, we provide lower-level technical detail for:
Implementing the key enforcement steps of a data-cleaning system for column properties, structures, valid values, and complex business rules
Conforming heterogeneous data from multiple sources into standardized dimension tables and fact tables
Building replicatable ETL modules for handling the natural time variance in dimensions, for example, the three types of slowly changing dimensions (SCDs)
Building replicatable ETL modules for multivalued dimensions and hierarchical dimensions, which both require associative bridge tables
Processing extremely large-volume fact data loads
Optimizing ETL processes to fit into highly constrained load windows
Converting batch and file-oriented ETL systems into continuously streaming real-time ETL systems
For illustrative purposes, Oracle is chosen as a common denominator when specific SQL code is shown. However, similar code that produces the same results can typically be written for DB2, Microsoft SQL Server, or any popular relational database system.
And perhaps as a side effect of all of these specific recommendations, we hope to share our enthusiasm for developing, deploying, and managing data warehouse ETL systems.
Overview of the Book: Two Simultaneous Threads
Building an ETL system is unusually challenging because it is so heavily constrained by unavoidable realities. The ETL team must live with the business requirements, the formats and deficiencies of the source data, the existing legacy systems, the skill sets of available staff, and the ever-changing (and legitimate) needs of end users. If these factors aren’t enough, the budget is limited, the processing-time windows are too narrow, and important parts of the business come grinding to a halt if the ETL system doesn’t deliver data to the data warehouse!
Two simultaneous threads must be kept in mind when building an ETL system: the Planning & Design thread and the Data Flow thread. At the highest level, they are pretty simple. Both of them progress in an orderly fashion from left to right in the diagrams. Their interaction makes life very interesting. In Figure Intro-1 we show the four steps of the Planning & Design thread, and in Figure Intro-2 we show the four steps of the Data Flow thread.
Figure Intro-1 The Planning & Design thread.
Figure Intro-2 The Data Flow thread.
To help you visualize where we are in these two threads, in each chapter we call out process checks. The following example would be used when we are discussing the requirements for data cleaning:
PROCESS CHECK Planning & Design:
Requirements/Realities → Architecture → Implementation → Test/Release
Data Flow: Extract → Clean → Conform → Deliver
The Planning & Design Thread
The first step in the Planning & Design thread is accounting for all the requirements and realities. These include:
Business needs
Data profiling and other data-source realities
Compliance requirements
Security requirements
Data integration
Data latency
Archiving and lineage
End user delivery interfaces
Available development skills
Available management skills
Legacy licenses
We expand these individually in Chapter 1, but we have to point out at this early stage how much each of these bullets affects the nature of your ETL system. For this step, as well as all the steps in both major threads, we point out the places in this book where we talk specifically about the given step.
The second step in this thread is the architecture step. Here is where we must make big decisions about the way we are going to build our ETL system. These decisions include:
Hand-coded versus ETL vendor tool
Batch versus streaming data flow
Horizontal versus vertical task dependency
Scheduler automation
Exception handling
Quality handling
Recovery and restart
Metadata
Security
The third step in the Planning & Design thread is system implementation. Let’s hope you have spent some quality time on the previous two steps before charging into the implementation! This step includes:
Hardware
Software
Coding practices
Documentation practices
Specific quality checks
The final step sounds like administration, but the design of the test and release procedures is as important as the more tangible designs of the preceding two steps. Test and release includes the design of the:
Development systems
Test systems
Production systems
Handoff procedures
Update propagation approach
System snapshotting and rollback procedures
Performance tuning
The Data Flow Thread
The Data Flow thread is probably more recognizable to most readers because it is a simple generalization of the old E-T-L extract-transform-load scenario. As you scan these lists, begin to imagine how the Planning & Design thread affects each of the following bullets. The extract step includes:
Reading source-data models
Connecting to and accessing data
Scheduling the source system, intercepting notifications and daemons
Capturing changed data
Staging the extracted data to disk
The clean step involves:
Enforcing column properties
Enforcing structure
Enforcing data and value rules
Enforcing complex business rules
Building a metadata foundation to describe data quality
Staging the cleaned data to disk
This step is followed closely by the conform step, which includes:
Conforming business labels (in dimensions)
Conforming business metrics and performance indicators (in fact tables)
Deduplicating
Householding
Internationalizing
Staging the conformed data to disk
Finally, we arrive at the payoff step where we deliver our wonderful data to the end-user application. We spend most of Chapters 5 and 6 on delivery techniques because, as we describe in Chapter 1, you still have to serve the food after you cook it! Data delivery from the ETL system includes:
Loading flat and snowflaked dimensions
Generating time dimensions
Loading degenerate dimensions
Loading subdimensions
Loading types 1, 2, and 3 slowly changing dimensions
Conforming dimensions and conforming facts
Handling late-arriving dimensions and late-arriving facts
Loading multi-valued dimensions
Loading ragged hierarchy dimensions
Loading text facts in dimensions
Running the surrogate key pipeline for fact tables
Loading three fundamental fact table grains
Loading and updating aggregations
Staging the delivered data to disk
In studying this last list, you may say, “But most of that list is modeling, not ETL. These issues belong in the front room.” We respectfully disagree. In our interviews with more than 20 data warehouse teams, more than half said that the design of the ETL system took place at the same time as the design of the target tables. These folks agreed that there were two distinct roles: data warehouse architect and ETL system designer. But these two roles often were filled by the same person! So this explains why this book carries the data all the way from the original sources into each of the dimensional database configurations.
The basic four-step data flow is overseen by the operations step, which extends from the beginning of the extract step to the end of the delivery step. Operations includes:
Scheduling
Job execution
Exception handling
Recovery and restart
Quality checking
Release
Support
Understanding how to think about these two fundamental threads (Planning & Design and Data Flow) is the real goal of this book.
How the Book Is Organized
To develop the two threads, we have divided the book into four parts:
I. Requirements, Realities and Architecture
II. Data Flow
III. Implementation and Operations
IV. Real Time Streaming ETL Systems
This book starts with the requirements, realities, and architecture steps of the planning & design thread because we must establish a logical foundation for the design of any kind of ETL system. The middle part of the book then traces the entire data flow thread from the extract step through to the deliver step. Then in the third part we return to implementation and operations issues. In the last part, we open the curtain on the exciting new area of real time streaming ETL systems.
Part I: Requirements, Realities, and Architecture
Part I sets the stage for the rest of the book. Even though most of us are eager to get started on moving data into the data warehouse, we have to step back to get some perspective.
Chapter 1: Surrounding the Requirements
The ETL portion of the data warehouse is a classically overconstrained design challenge. In this chapter we put some substance on the list of requirements that we want you to consider up front before you commit to an approach. We also introduce the main architectural decisions you must take a stand on (whether you realize it or not).
This chapter is the right place to define, as precisely as we can, the major vocabulary of data warehousing, at least as far as this book is concerned. These terms include:
Data warehouse
Data mart
ODS (operational data store)
EDW (enterprise data warehouse)
Staging area
Presentation area
We describe the mission of the data warehouse as well as the mission of the ETL team responsible for building the back room foundation of the data warehouse. We briefly introduce the basic four stages of Data Flow: extracting, cleaning, conforming, and delivering. And finally we state as clearly as possible why we think dimensional data models are the keys to success for every data warehouse.
Chapter 2: ETL Data Structures
Every ETL system must stage data in various permanent and semipermanent forms. When we say staging, we mean writing data to the disk, and for this reason the ETL system is sometimes referred to as the staging area. You might have noticed that we recommend at least some form of staging after each of the major ETL steps (extract, clean, conform, and deliver). We discuss the reasons for various forms of staging in this chapter.
We then provide a systematic description of the important data structures needed in typical ETL systems: flat files, XML data sets, independent DBMS working tables, normalized entity/relationship (E/R) schemas, and dimensional data models. For completeness, we mention some special tables including legally significant audit tracking tables used to prove the provenance of important data sets, as well as mapping tables used to keep track of surrogate keys. We conclude with a survey of metadata typically surrounding these types of tables, as well as naming standards. The metadata section in this chapter is just an introduction, as metadata is an important topic that we return to many times in this book.
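To make the mapping-table idea concrete, here is a minimal sketch of what a surrogate key mapping table might look like; the table and column names are our own illustration, not a design prescribed by the book.

-- Hedged sketch of a mapping table that tracks surrogate key assignments.
-- All object names are illustrative assumptions.
CREATE TABLE map_customer_key (
    customer_id    VARCHAR2(30) NOT NULL,   -- natural key from the source system
    source_system  VARCHAR2(20) NOT NULL,   -- which source supplied the natural key
    customer_key   NUMBER       NOT NULL,   -- surrogate key used in the warehouse
    effective_date DATE         NOT NULL,   -- when this assignment became current
    CONSTRAINT pk_map_customer_key
        PRIMARY KEY (customer_id, source_system, effective_date)
);

A table of this shape lets the ETL system translate incoming natural keys into warehouse surrogate keys without re-reading the full dimension.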
Part II: Data Flow
The second part of the book presents the actual steps required to effectively extract, clean, conform, and deliver data from various source systems into an ideal dimensional data warehouse. We start with instructions on selecting the system-of-record and recommend strategies for analyzing source systems. This part includes a major chapter on building the cleaning and conforming stages of the ETL system. The last two chapters then take the cleaned and conformed data and repurpose it into the required dimensional structures for delivery to the end-user environments.
Chapter 3: Extracting
This chapter begins by explaining what is required to design a logical data mapping after data analysis is complete. We urge you to create a logical data map, and we show how it should be laid out to prevent ambiguity in this mission-critical specification. The logical data map provides ETL developers with the functional specifications they need to build the physical ETL process.
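As a hedged illustration only, the logical data map could be captured in a simple working table; the table name and column set below are our own sketch rather than a fixed standard.

-- Illustrative working table for the logical data map; names are assumptions.
CREATE TABLE logical_data_map (
    target_table    VARCHAR2(30),
    target_column   VARCHAR2(30),
    table_type      VARCHAR2(20),    -- for example, dimension or fact
    scd_type        NUMBER,          -- 1, 2, or 3 when the target is a dimension
    source_database VARCHAR2(30),
    source_table    VARCHAR2(30),
    source_column   VARCHAR2(30),
    transformation  VARCHAR2(2000)   -- business rule or expression applied in transit
);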
A major responsibility of the data warehouse is to provide data from various legacy applications throughout the enterprise in a single cohesive repository. This chapter offers specific technical guidance for integrating the heterogeneous data sources found throughout the enterprise, including mainframes, relational databases, XML sources, flat files, Web logs, and enterprise resource planning (ERP) systems. We discuss the obstacles encountered when integrating these data sources and offer suggestions on how to overcome them. We introduce the notion of conforming data across multiple potentially incompatible data sources, a topic developed fully in the next chapter.
Chapter 4: Cleaning and Conforming
After data has been extracted, we subject it to cleaning and conforming. Cleaning means identifying and fixing the errors and omissions in the data. Conforming means resolving the labeling conflicts between potentially incompatible data sources so that they can be used together in an enterprise data warehouse.
This chapter makes an unusually serious attempt to propose specific techniques and measurements that you should implement as you build the cleaning and conforming stages of your ETL system. The chapter focuses on data-cleaning objectives, techniques, metadata, and measurements.
In particular, the techniques section surveys the key approaches to data profiling and data cleaning, and the measurements section gives examples of how to implement data-quality checks that trigger alerts, as well as how to provide guidance to the data-quality steward regarding the overall health of the data.
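For instance, a single column screen might be as simple as the sketch below; the staging table, error event table, and screen name are hypothetical and only meant to show the shape of such a check.

-- Hedged sketch of a null-rule screen; all object names are illustrative.
INSERT INTO error_event
    (event_date, screen_name, source_table, severity, error_count)
SELECT SYSDATE,
       'ORDERS_CUSTOMER_ID_NOT_NULL',
       'STG_ORDERS',
       'ERROR',
       COUNT(*)
FROM   stg_orders
WHERE  customer_id IS NULL
HAVING COUNT(*) > 0;     -- record an event only when the rule is actually violated

Each run of a screen like this adds one row of measurement, which the data-quality steward can trend over time.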
Chapter 5: Delivering Dimension Tables
This chapter and Chapter 6 are the payoff chapters in this book. We believe that the whole point of the data warehouse is to deliver data in a simple, actionable format for the benefit of end users and their analytic applications. Dimension tables are the context of a business’ measurements. They are also the entry points to the data because they are the targets for almost all data warehouse constraints, and they provide the meaningful labels on every row of output.
The ETL process that loads dimensions is challenging because it must absorb the complexities of the source systems and transform the data into simple, selectable dimension attributes. This chapter explains step-by-step how to load data warehouse dimension tables, including the most advanced ETL techniques. The chapter clearly illustrates how to:
Assign surrogate keys
Load Type 1, 2 and 3 slowly changing dimensions
Populate bridge tables for multivalued and complex hierarchical dimensions
Flatten hierarchies and selectively snowflake dimensions
We discuss the advanced administration and maintenance issues required to incrementally load dimensions, track the changes in dimensions using CRC codes, and contend with late-arriving data.
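As a hedged sketch of how the Type 2 and CRC techniques fit together (the object names, and the use of Oracle's ORA_HASH as a stand-in for a CRC, are our own assumptions, not the book's prescribed code), the core of such a load might look like this:

-- 1. Expire the current dimension row when the incoming attributes differ.
UPDATE dim_customer d
SET    d.row_end_date = TRUNC(SYSDATE) - 1,
       d.current_flag = 'N'
WHERE  d.current_flag = 'Y'
AND    EXISTS (SELECT 1
               FROM   stg_customer s
               WHERE  s.customer_id = d.customer_id
               AND    ORA_HASH(s.name || '|' || s.city || '|' || s.segment)
                      <> d.attribute_crc);

-- 2. Insert a new current row, with a fresh surrogate key, for each new or changed customer.
INSERT INTO dim_customer
    (customer_key, customer_id, name, city, segment,
     attribute_crc, row_start_date, row_end_date, current_flag)
SELECT seq_customer_key.NEXTVAL,
       s.customer_id, s.name, s.city, s.segment,
       ORA_HASH(s.name || '|' || s.city || '|' || s.segment),
       TRUNC(SYSDATE), DATE '9999-12-31', 'Y'
FROM   stg_customer s
WHERE  NOT EXISTS (SELECT 1
                   FROM   dim_customer d
                   WHERE  d.customer_id = s.customer_id
                   AND    d.current_flag = 'Y');

Comparing a stored hash of the Type 2 attributes against a hash of the incoming row avoids comparing every column individually.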
Chapter 6: Delivering Fact Tables
Fact tables hold the measurements of the business. In most data warehouses, fact tables are overwhelmingly larger than dimension tables, but at the same time they are simpler. In this chapter we explain the basic structure of all fact tables, including foreign keys, degenerate dimension keys, and the numeric facts themselves. We describe the role of the fact-table provider, the information steward responsible for the delivery of the fact tables to end-user environments.
Every fact table should be loaded with a surrogate key pipeline, which maps the natural keys of the incoming fact records to the correct contemporary surrogate keys needed to join to the dimension tables.
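A minimal sketch of such a pipeline, assuming hypothetical staging, dimension, and fact table names, is a single insert-select that swaps each natural key for its current surrogate key:

-- Hedged sketch of a surrogate key pipeline; all object names are illustrative.
INSERT INTO fact_sales
    (date_key, product_key, customer_key, order_number, sales_amount)
SELECT d.date_key,
       p.product_key,
       c.customer_key,
       s.order_number,                  -- degenerate dimension kept on the fact row
       s.sales_amount
FROM   stg_sales s
JOIN   dim_date     d ON d.calendar_date = s.sale_date
JOIN   dim_product  p ON p.product_id   = s.product_id  AND p.current_flag = 'Y'
JOIN   dim_customer c ON c.customer_id  = s.customer_id AND c.current_flag = 'Y';

Rows whose natural keys fail to find a dimension match would be caught and handled before this step, as part of guaranteeing referential integrity.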
We describe the three fundamental grains of fact tables, which are sufficient to support all data warehouse applications.
We discuss some unusual fact table varieties, including factless fact tables and fact tables whose sole purpose is to register the existence of complex events, such as automobile accidents.
Finally, we discuss the basic architecture of aggregations, which are physically stored summaries that, much like indexes, serve solely to improve performance.
Part III: Implementation and Operations
The third part of the book assumes the reader has analyzed his or her requirements, heeded the realities of his or her data and available resources, and visualized the flow of data from extraction to delivery. Keeping all this in mind, Part 3 describes in some detail the main approaches to system implementation and to organizing the operations of the ETL system. We discuss the role of metadata in the ETL system and finally the various responsibilities of the ETL team members.
Chapter 7: Development
Chapter 7 presents the techniques you’ll need to develop the initial data load for your data warehouse, such as recreating history for slowly changing dimensions and integrating historic offline data with current online transactions, as well as historic fact loading.
The chapter also provides estimation techniques to calculate the time it should take to complete the initial load, exposes vulnerabilities to long-running ETL processes, and suggests methods to minimize your risk.
Automating the ETL process is an obvious requirement of the data warehouse project, but how is it done? The order and dependencies between table loads is crucial to successfully load the data warehouse. This chapter reviews the fundamental functionality of ETL scheduling and offers criteria and options for executing the ETL schedule. Once the fundamentals are covered, topics such as enforcing referential integrity with the ETL and maintaining operational metadata are examined.
Chapter 8: Operations
We begin this chapter by showing the approaches to scheduling the various ETL system jobs, responding to alerts and exceptions, and finally running the jobs to completion with all dependencies satisfied.
We walk through the steps to migrate the ETL system to the production environment. Since the production environment of the ETL system must be supported like any other mission-critical application, we describe how to set up levels of support for the ETL system that must be utilized upon failure of a scheduled process.
We identify key performance indicators for rating ETL performance and explore how to monitor and capture the statistics. Once the ETL key performance indicators are collected, you are armed with the information you need to examine the components of the ETL system and look for opportunities to increase throughput as much as possible.
Chapter 9: Metadata
The ETL environment often assumes the responsibility of storing and managing the metadata for the entire data warehouse. After all, there is no better place than the ETL system for storing and managing metadata because the environment must know most aspects of the data to function properly. Chapter 9 defines the three types of metadata—business, technical, and process—and presents the elements within each type as they apply to the ETL system. The chapter offers techniques for producing, publishing, and utilizing the various types of metadata and also discusses the opportunity for improvement in this area of the data warehouse. We finish the chapter by discussing metadata standards and best practices and provide recommended naming standards for the ETL.
Chapter 10: Responsibilities
The technical aspects of the ETL process are only a portion of the ETL lifecycle. Chapter 10 is dedicated to the managerial aspects of the lifecycle required for a successful implementation. The chapter describes the duties and responsibilities of the ETL team and then goes on to outline a detailed project plan that can be implemented in any data warehouse environment. Once the basics of managing the ETL system are conveyed, the chapter dives into more-detailed project management activities such as project staffing, scope management, and team development. This somewhat nontechnical chapter provides the greatest benefit to ETL and data warehouse project managers. It describes the roles and skills needed for an effective team and offers a comprehensive ETL project plan that can be repeated for each phase of the data warehouse. The chapter also includes forms that managers need to lead their teams through the ETL lifecycle. Even if you are not a manager, this chapter is required reading to adequately understand how your role works with the other members of the ETL team.
Part IV: Real Time Streaming ETL Systems
Since real-time ETL is a relatively young technology, we are more likely to come up against unique requirements and solutions that have not yet been perfected. In this chapter, we share our experiences to provide insight on the latest challenges in real-time data warehousing and offer recommendations on overcoming them. The crux of real-time ETL is covered in this chapter, and the details of actual implementations are described.
Chapter 11: Real-Time ETL
In this chapter, we begin by defining the real-time requirement. Next, we review the different architecture options available today and appraise each. We end the chapter with a decision matrix to help you decide which real-time architecture is right for your specific data warehouse environment.
Chapter 12: Conclusions
The final chapter summarizes the unique contributions made in this book and provides a glimpse into the future for ETL and data warehousing as a whole.
Who Should Read this Book
Anyone who is involved or intends to be involved in a data-warehouse initiative should read this book. Developers, architects, and managers will benefit from this book because it contains detailed techniques for delivering a dimensionally oriented data warehouse and provides a project management perspective for all the back room activities.
Chapters 1, 2, and 10 offer a functional view of the ETL system that can easily be read by anyone on the data warehouse team but is intended for business sponsors and project managers. As you progress through the rest of the book, expect the technical level to increase, eventually reaching the point where the book becomes a developer’s handbook. This book is a definitive guide to the tasks required to load the dimensional data warehouse.
Summary
The goal of this book is to make the process of building an ETL system understandable, with specific checkpoints along the way. This book shows the often under-appreciated value the ETL system brings to data warehouse data. We hope you enjoy the book and find it valuable in your workplace. We intentionally remain vendor-neutral throughout the book so you can apply the techniques within to the technology of your liking. If this book accomplishes nothing else, we hope it gets you thinking and breaking new ground, challenging the vendors to extend their product offerings to incorporate the features the ETL team requires to bring the ETL system (and the data warehouse) to full maturity.
PART I
Requirements, Realities, and Architecture
CHAPTER 1
Surrounding the Requirements
Ideally, you must start the design of your ETL system with one of the toughest challenges: surrounding the requirements. By this we mean gathering in one place all the known requirements, realities, and constraints affecting the ETL system. We’ll refer to this list as the requirements, for brevity.
The requirements are mostly things you must live with and adapt your system to. Within the framework of your requirements, you will have many places where you can make your own decisions, exercise your judgment, and leverage your creativity, but the requirements are just what they are named. They are required. The first section of this chapter is intended to remind you of the relevant categories of requirements and give you a sense of how important the requirements will be as you develop your ETL system.
Following the requirements, we identify a number of architectural decisions you need to make at the beginning of your ETL project. These decisions are major commitments because they drive everything you do as you move forward with your implementation. The architecture affects your hardware, software, coding practices, personnel, and operations.
The last section describes the mission of the data warehouse. We also carefully define the main architectural components of the data warehouse, including the back room, the staging area, the operational data store (ODS), and the presentation area. We give a careful and precise definition of data marts and the enterprise data warehouse (EDW). Please read this chapter very carefully. The definitions and boundaries we describe here drive the whole logic of this book. If you understand our assumptions, you will see why our approach is more disciplined and more structured than any other data warehouse design methodology. We conclude the chapter with a succinct statement of the mission of the ETL team.
PROCESS CHECK
Planning & Design: Requirements/Realities → Architecture → Implementation → Test/Release
Data Flow: Haven’t started tracing the data flow yet.
Requirements
In this book’s introduction, we list the major categories of requirements we think are important. Although every one of these requirements can be a showstopper, business needs are the most fundamental and important.
Continue reading in the full edition!