Microsoft SQL Server 2008 Integration Services E-Book

Erik Veerman

Description

An authoritative guide to designing effective solutions for data cleansing, ETL, and file management with SQL Server 2008 Integration Services.

SQL Server Integration Services (SSIS) is the leading tool in the data warehouse industry, used for performing extraction, transformation, and load operations. After an overview of SSIS architecture, the authors walk you through a series of real-world problems and show various techniques for handling them:

Shows you how to design SSIS solutions for data cleansing, ETL, and file management

Demonstrates how to integrate data from a variety of data sources

Shows how to monitor SSIS performance

Demonstrates how to avoid common pitfalls involved with SSIS deployment

Explains how to ensure performance of the deployed solution and effectively handle unexpected system failures and outages

The companion Web site provides sample code and database scripts that readers can directly implement

This book shows you how to design, build, deploy, and manage solutions to real-world problems that SSIS administrators and developers face day-to-day.


Page count: 621

Publication year: 2010




Contents

Introduction

Chapter 1: SSIS Solution Architecture

Problem

Macro Challenge: Enterprise Data Spaghetti

Micro Challenge: Data-Processing Confusion

Problems with Execution and Troubleshooting

Infrastructure Challenges

Other Challenges

Design

Choosing the Right Tool

Overall Solution Architecture

Data Integration or Consolidation

Data Warehouse ETL

Project Planning and Team Development

Agile Benefits

Agile Cautions and Planning

Data Element Documentation

Package Design Patterns

Modular Packages

Master Packages

Server and Storage Hardware

Server Hardware

Development and Test Servers

ETL Collocation

Storage Hardware

Package Execution Location

Package Storage Location Versus Execution Location

Execute SQL Task and Bulk Insert Task Execution

Package Execution and the Data Flow

Design Review

Solution

Setting the Stage: Management and Deployment

Source Data: Files, Tables, and Data Cleansing

Data Warehouse ETL and Cube Processing

Advanced ETL: Scripting, High Availability, and Performance

Summary

Chapter 2: SSIS Management Framework Design

Problem

Challenges of Not Having a Framework Implementation

Different Development Methods

Changing Metadata

Getting the Right Information to the Right People at the Right Time

Reusability

Framework Implementation

Configuration Management Scheme

Logging and Auditing

Template Package

Framework Benefits and Value

Design

Configuration Management

Overall Design

Environment Variable Configuration

XML Configuration File

SQL Server Configuration

Logging and Auditing Mechanism

Storage

Tables

Stored Procedures

Template Package

Implementation Guidelines

Solution

Configuration Management

Logging and Auditing Mechanism

Storage and Tables

Stored Procedures

Template Package

Development

Installation

Other Considerations

Customizations

ETL Process Framework

Process Owner

Reporting

Summary

Chapter 3: Package Deployment and Storage Decisions

Problem

Standardization

Environment

Application

Desired Implementation

Design

Storage Methods

SSIS Service

File System Storage

SQL Server Storage

Storage Comparison

Deployment Methods

SSIS Deployment Wizard

Manual Deployment of Packages

DTUtil Scripting

Solution

Storage Methodology

Existing Infrastructure

Security

Package Makeup

Back to the Flowchart

Deployment Methodology

Storage Method

Existing Infrastructure

Deployment Needs

Back to the Flowchart

Total Methodology

Summary

Chapter 4: File-Handling and Processing Methods

Problem

Simple File Operations

Remote File Operations

File Criteria

File Iteration

File Notification

Design

File System Task

Source and Destination Connection

Attributes

Overwriting the Destination

Multiple Files

FTP Task

FTP Connection

HTTP

Foreach Loop

Applying Advanced Criteria

File Watcher Task

WMI Event Watcher Task

Solution

Summary

Chapter 5: Data Extraction Best Practices

Problem

Extraction Data Criteria

Source System Impact

Incremental Extraction

Deleted Rows

Staging Database

Data Lineage and Metadata

File Sources

Design

Package Connections and Source Adapters

Package Connections

Source Adapters

Incremental Data Extraction

Incremental Extraction Using a Change Identifier Value

Targeting Change Records through Dynamic Source Queries

Retrieving Incremental Identifier Values and Updating Package Variables

Capturing the Maximum Change Identifier Value

Incremental Extraction from SQL Server without a Trigger

Using SSIS to Handle All Aspects of an Incremental Extraction

Data Staging Method

Tracking Data Lineage Identifiers

Solution

Metadata-Driven Extraction Example

Metadata Tables

Control Flow

Looping through the Sources

Inside the Sources Loop

The Script

Read Variables

Open Connections

Get Max Change Indicator

Extract Changed Source Rows

Close Connections

Summary

Chapter 6: Data-Cleansing Design

Problem

Candidate Key Profiles

Column Length Distribution Profiles

Column Null Ratio Profiles

Column Pattern Profiles

Column Statistics Profiles

Column Value Distribution Profiles

Functional Dependency Profiles

Design

Using the Script Transform for Data Scrubbing

Using the Fuzzy Grouping Transform to De-duplicate Data

Using the Fuzzy Lookup Transform to Cleanse Data

Dealing with Multiple Record Types in a Single File

Using the Raw File

Solution

Summary

Chapter 7: Dimension Table ETL

Problem — Fundamental Dimension ETL

Dimensions: The Basics

Dimension ETL: The Challenge

Design — Fundamental Dimension ETL

Data Preparation

Dimension Change Types

Type 1 (Changing) Attribute: A Closer Look

Type 2 (Historical) Attribute: A Closer Look

Inferred Members

Solution — Fundamental Dimension ETL

Preparing Your Source Data for Dimension ETL

SSIS Slowly Changing Dimension Wizard

Advanced Properties and Additional Outputs of the SCD

Problem — Advanced Dimension ETL

SCD Wizard Advantages and Disadvantages

Dimension Volume and Complexity

Design — Advanced Dimension ETL

Optimizing the Built-in SCD

Index Optimizations

Update Optimizations

Snowflake Dimension Tables

Parent-Child Dimension ETL

Date Dimension ETL

Profile Dimension and Junk Dimension ETL

Creating a Custom Slowly Changing Package

Solution — Advanced Dimension ETL

Snowflake Dimensions

Parent-Child Dimensions

Profile and Junk Dimensions

Date Dimension ETL

Custom Dimension ETL

Determining Dimension Changes

Inserts and Updates

Summary

Chapter 8: Fact Table ETL

Problem

Fact Tables: The Basics

Fact Table ETL: The Challenge

Preparing the Data

Mapping Dimension Keys

Calculating Measures

Adding Metadata

Fact Table Updates

Fact Table Inserts

Design

Data Preparation

Data Preparation with SSIS Transformations

Data Preparation Examples

Acquiring the Dimension Surrogate Key in SSIS

Identifying the Dimension Surrogate Keys with SSIS

Surrogate Key Examples in SSIS

Measure Calculations

Measure Calculation Types

Handling Measure Calculations in SSIS

Managing Fact Table Changes

Approaches to Identifying Changed Fact Records

Fact Update Examples in SSIS

Optimizing Fact Table Inserts

Optimizing Inserts with Fast Load

Optimizing Inserts with Index Management

Solution

Internet and Reseller Sales Fact Table ETL

Fact Internet and Reseller Sales Extraction and Transform Process

Fact Internet and Reseller Sales Load Process

Snapshot Fact Table Example — Call Center Fact Table

Advanced Fact Table ETL Concepts

Handling Missing Dimension Lookups

Handling Late-Arriving Facts

Summary

Chapter 9: SSAS Processing Architecture

Problem

SSAS Object Population

Schedule

Partition Management

The Complete Package

Design

SSAS Objects and Processing Basics

Dimensions

Partitions

Mining Structures

SSIS Tasks and Components

Analysis Services Processing Task

Analysis Services Execute DDL Task

Execute Process Task with ASCMD

Data Flow Destinations for SSAS Objects

Script Task with AMO

Creating and Managing Partitions

Overall Design

Solution

Preparation for SSAS Integration

Process Dimensions Package

Process Task

Parallel XMLA

Process Partitions Package

Storing and Loading Metadata

SSAS Processing

Overall Solution

Summary

Chapter 10: Implementing Scale-Out ETL Process

Problem

Design

Design Components Overview

Central Common Services Server

File Processor and Pre-Aggregation Scale-Out Processes Servers

Design Details

File Management Tasks

Data File Management

Work Allocation Process

Scale-Out Source File Process

Work Reassignment Process

Data Aggregation Tasks

Hourly Data Pre-Aggregation Process

Hourly Data Aggregation Process

Daily Data Aggregation Process

Archival and Clean-up Processes

Data File Archival Process

Stage Table Clean-up Process

Design Conclusions

Solution

Central Server Services

Multiple File Processor and Pre-Aggregator Processes

Database Tables Required on the Central Server

Stored Procedures

Procedures on the Central Server

Procedures on the Staging Servers

SSIS Packages

File-Processing Server Packages

Central Server Packages

Summary

Chapter 11: Scripting Design Patterns

Problem — Advanced File Management

Script Task

Scenario

Design — Advanced File Management

Script Language

Accessing Objects

Custom Assemblies

Scripting

Solution — Advanced File Management

Create an External Assembly

Access the External Assembly

Archive the Files

Summary — Advanced File Management

Problem — Call Center Fact ETL

Reasons to Use Scripting

Scenario

Design — Call Center Fact ETL

Component Type

Output Type

Synchronous Outputs

Asynchronous Outputs

Other Component Types

Design Choice

Overridable Methods

Solution — Call Center Fact ETL

Package Setup

Script Component

Reusable Methods

Row-By-Row Processing

Process Input

Entire Script

Package Wrap-Up

Summary — Call Center Fact ETL

Summary

Chapter 12: SSIS Package Scaling

Problem

Identifying Task Durations

Identifying Data Flow Destination Issues

Identifying Transformation and Memory Bottlenecks

Identifying Data Flow Source Performance Issues

Design

Balancing SQL Operations with the Data Flow

Data Flow Advantages as Compared with SQL

SQL Advantages when Compared with the Data Flow

Applying SQL Code in Your SSIS Packages

SSIS Pipeline Architecture Concepts

Data Flow Design Example

SQL-Centric SSIS Process

Rewritten Data Flow–Centric SSIS Process

Solution

Tuning Your Data Flow

Use Blocking Transformations Sparingly

Limit Row-by-Row Operations

Manage Data Flow Backpressure

Pre-sort Sources as an Alternative to the Sort Transformation

Optimize the Lookup and Managing the Cache Size

Remove Unused Columns from the Data Flow

Be Mindful of the SSIS Logging Impact

Regulate Transactions

Setting Data Flow Properties

Up the EngineThreads Value

Optimize the Temporary Storage Locations

Leave RunInOptimizedMode as True

Tune Buffers

Database and Destination Optimization

Limiting Database Growth

Consider Dropping and Re-creating Indexes on Destination Tables

Using the OLE DB Destination Adapter

Use Advanced Oracle and Teradata Destination Adapters

Handling Partitioned Tables

Summary

Index

Microsoft® SQL Server® 2008 Integration Services: Problem–Design–Solution

Published by Wiley Publishing, Inc., 10475 Crosspoint Boulevard, Indianapolis, IN 46256

www.wiley.com

Copyright © 2010 by Wiley Publishing, Inc., Indianapolis, Indiana

Published by Wiley Publishing, Inc., Indianapolis, Indiana

Published simultaneously in Canada

ISBN: 978-0-470-52576-0

Manufactured in the United States of America

10 9 8 7 6 5 4 3 2 1

No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.

Limit of Liability/Disclaimer of Warranty: The publisher and the author make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation warranties of fitness for a particular purpose. No warranty may be created or extended by sales or promotional materials. The advice and strategies contained herein may not be suitable for every situation. This work is sold with the understanding that the publisher is not engaged in rendering legal, accounting, or other professional services. If professional assistance is required, the services of a competent professional person should be sought. Neither the publisher nor the author shall be liable for damages arising herefrom. The fact that an organization or Web site is referred to in this work as a citation and/or a potential source of further information does not mean that the author or the publisher endorses the information the organization or Web site may provide or recommendations it may make. Further, readers should be aware that Internet Web sites listed in this work may have changed or disappeared between when this work was written and when it is read.

For general information on our other products and services please contact our Customer Care Department within the United States at (877) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

Library of Congress Control Number: 2009935233

Trademarks: Wiley, the Wiley logo, Wrox, the Wrox logo, Programmer to Programmer, and related trade dress are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates, in the United States and other countries, and may not be used without written permission. All other trademarks are the property of their respective owners. Wiley Publishing, Inc., is not associated with any product or vendor mentioned in this book.

 

To my family: Amy, Meg, Nate, Kate, and Caleb

—Erik Veerman

About the Authors

Erik Veerman is a Mentor for Solid Quality Mentors, focusing on training, mentoring, and architecting solutions on the SQL Server Business Intelligence (BI) platform. His industry recognition includes Microsoft’s Worldwide BI Solution of the Year and SQL Server Magazine’s Innovator Cup winner. Veerman has designed dozens of BI solutions across a broad business spectrum — telecommunications, marketing, retail, commercial real estate, finance, supply chain, and information technology. His experience with high-volume, multi-terabyte environments and SQL Server 64-bit has enabled clients to scale their Microsoft-based BI solutions for optimal potential. As an expert in on-line analytical processing (OLAP) design, extract, transform, and load (ETL) processing, and dimensional modeling, Veerman is a presenter, author, and instructor. He led the ETL architecture and design for the first production implementation of SQL Server Integration Services (SSIS), and helped to drive the ETL standards and best practices for SSIS on Microsoft’s SQL Server 2005 reference initiative, Project REAL. Veerman is also co-author of Professional SQL Server 2008 Integration Services (Indianapolis: Wiley, 2008), Expert SQL Server 2005 Integration Services (Indianapolis: Wiley, 2007), Professional SQL Server 2005 Integration Services (Indianapolis: Wiley, 2006), and lead author for the Microsoft Press SQL Server 2005 and SQL Server 2008 series, focusing on Business Intelligence Development and Implementation. As a resident of Atlanta, Georgia, Veerman participates in the local Atlanta SQL Server User’s Group, a Professional Association of SQL Server (PASS) chapter.

Jessica M. Moss (SQL Server MVP) is a well-known practitioner, trainer, author, and speaker for Microsoft SQL Server Business Intelligence (BI). As a Mentor with Solid Quality Mentors, she has created numerous data warehousing solutions for a variety of companies in the retail, Internet, health services, finance, and energy industries, and has delivered training courses on SQL Server Integration Services (SSIS), SQL Server Reporting Services (SSRS), and SQL Server Analysis Services (SSAS). While working for a major clothing retailer, Moss participated in the SQL Server 2005 TAP program, where she developed best implementation practices for SSIS. Moss has authored technical content for multiple magazines and Web sites, and has spoken internationally at conferences such as the Professional Association for SQL Server (PASS) Community Summit, SQL Teach International Conference, and the devLink Technical Conference. As a strong proponent of developing user-to-user community relations, Moss organizes meetings for a User Group that discusses different technologies. She also actively participates in local User Groups and code camps in central Virginia. In addition, Moss volunteers her time to help educate people through the PASS organization.

Brian Knight (SQL Server MVP, MCITP, MCSE, MCDBA) is the owner and founder of Pragmatic Works. He is also the co-founder of SQLServerCentral.com and JumpstartTV.com. He runs the local SQL Server Users Group in Jacksonville, Florida (JSSUG). Brian is a contributing columnist for several technical magazines and does regular Webcasts at “Jumpstart TV.” He is the author of ten books on the SQL Server technology. Knight has spoken at conferences such as the Professional Association of SQL Server (PASS), SQL Connections, and TechEd, as well as many Code Camps. His blog can be found at www.pragmaticworks.com. Knight lives in Jacksonville, where he loves watching the National Football League Jaguars lose on a regular basis.

Jay Hackney is a mentor and consultant for Solid Quality Mentors, where he helps customers utilize Microsoft technology to build real-world solutions. He has 15 years of development and consulting experience on the Microsoft data platform, and has worked with SQL Server Integration Services (SSIS) since the earliest private beta. Since then, he has used SSIS to deliver data warehousing, extract, transform, and load (ETL), and data migration solutions for many well-known companies, and has presented on SSIS topics at events such as Professional Association for SQL Server (PASS) Summit, Microsoft Tech-Ed, and SQL Server product launches.

Rushabh Mehta is a Mentor for Solid Quality Mentors, and is also the Managing Director for Solid Quality India Pvt. Ltd. Mehta has actively worked with the Microsoft SQL Server and Business Intelligence (BI) platform since 1998. He has developed numerous BI solutions, including predictive-analysis systems, health-care solutions, multi-terabyte financial decision support systems, as well as designing and building an analytical platform to support the Microsoft Windows Azure platform. His experience spans a wide breadth of industrial sectors, including manufacturing, retail, finance, and government, working with clients such as Raymond James Financials, Jackson Hewitt, Publix, the U.S. Department of Veterans Affairs, the U.S. Department of Defense, and Microsoft. Mehta has also developed a number of best practices and standards for implementing BI solutions. He has also developed and delivered advanced training courses on Microsoft BI practices and technologies. An active speaker at a number of large conferences such as the Professional Association for SQL Server (PASS) and TechEd, Mehta also frequently speaks at SQL Server and BI User Groups around the world. Mehta is the Executive Vice President for PASS, the largest global user-driven community for SQL Server professionals.

Credits

Executive Editor: Robert Elliott

Project Editor: Kevin Shafer

Technical Editor: Paul S. Waters

Production Editor: Kathleen Wisor

Copy Editor: Paula Lowell

Editorial Director: Robyn B. Siesky

Editorial Manager: Mary Beth Wakefield

Production Manager: Tim Tate

Vice President and Executive Group Publisher: Richard Swadley

Vice President and Executive Publisher: Barry Pruett

Associate Publisher: Jim Minatel

Proofreader: Scott Klemp, Word One

Indexer: Robert Swanson

Cover Image: ©iStockphoto.com/wolv

Acknowledgments

I want to first thank Jessica, Brian, and Jay for adding their valuable expertise and experience to several of the chapters of this book, and sharing their approach and code practices from the SSIS framework, to the dynamic extraction logic, to data cleansing best practices. A big collective thanks to our clients! The unique projects and challenges helped shape and refine this content. I added up the years of experience for all of us, and these pages represent about 50 years of work with data processing and ETL — and that doesn’t include late nights and long weekends.

I also want to thank Paul Waters and Rushabh Mehta for their roles in this book as Technical Editor and contributing author, respectively. A big thanks to the Wrox team for their support, hard work, and valuable feedback. And a thanks to you, the reader, because I actually read the Amazon feedback and try to address the constructive criticism in each new work. Hopefully, the diligence has paid off. If so, help us and let the world know!

Last, but not least, thanks to my family for their support. A special thanks to Amy and her encouragement and hard work in so many ways.

— Erik Veerman

As a first-time book author, this acknowledgment holds special meaning to me. I will try to refrain from mentioning everyone I have ever met, and focus on a few special people that truly affected the outcome of this book.

First of all, a huge thank you is due to Erik Veerman, the lead author on this book. He not only convinced me to join this troupe of authors, but also took the time to patiently answer all of my questions about the process. Next, thank you to Jay Hackney, Brian Knight, Rushabh Mehta, Paul Waters, and Wrox for their authoring, technical, and editing contributions. I truly appreciated your support and help.

On a personal note, I would like to thank my friends and family, Mom, Dad, and Lianna, for their never-ending support and patience as I disappeared for days at a time. Finally, a few other people who I would like to recognize for their previous support are Fred Creasy, Georgia Willis, Kathi Walker, and Paul Reynolds — thank you.

— Jessica M. Moss

Thanks to everyone who made this book possible. As always, I owe a huge debt to my wife, Jenn, for putting up with my late nights, and to my children, Colton, Liam, and Camille, for being so patient with their tired dad who has always overextended. Thanks also to the makers of Guinness and other hard liquors for sustaining my ability to keep on writing technical books. Finally, thanks to all the User Group leaders out there who work so hard to help others become proficient in technology. You make a huge difference!

—Brian Knight

I’d like to thank Erik for this opportunity and his hard work on this book.

— Jay Hackney

Introduction

Our world is a world full of data and, as developers, database administrators (DBAs), or data architects, your job will no doubt require processing data — consolidating data, integrating system data, migrating data, cleansing data, transforming data, or loading data into a data warehouse. This book is about solving problems and building solutions with SQL Server 2008 Integration Services (SSIS). It focuses on design best practices for the top uses and challenges of SSIS.

In this book, you will learn about how to approach the following kinds of projects, and how to effectively work within SSIS to apply its functionality to these common project requirements and challenges:

Designing Extraction Transformation and Loading (ETL) solutions with SSIS to handle file management, data warehouse processing, systems integration, and data cleansing.

Resolving common SSIS package deployment and auditing challenges that often get in the way of designing a solid solution.

Building an SSIS-based solution that performs well, and can deal with failures or unplanned system outages.

As an author team, our vision for this book is very different from that of the other books on the shelves, which focus mainly on how to use the knobs and buttons in SSIS. Books such as Professional Microsoft SQL Server 2008 Integration Services (Indianapolis: Wiley, 2008) are very valuable to have on your bookshelf for reference and for learning SSIS. But this book is about going to the next level of solution design. Our goal in writing this book was to focus on solving problems, building solutions, and providing design best practices.

In summary, the difference between this SSIS book and all the others out there is that other books simply focus on the product features with little emphasis on solution design. If you go out and buy a new power saw, the manual is going to tell you how to angle the blade, set the right depth, and make a clean cut. But the manual doesn’t tell you how to build something. This book shows you how to build the furniture, not just how to use the saw.

To be sure, you must know how to use SSIS before you can build a solution. But going from knowledge to design requires guidance on the right approach, and how to avoid the common pitfalls.

This book empowers you with the confidence, the knowledge, and the understanding to make the right choices in your ETL design that enables easy administration, meets your data processing requirements, and performs well for current and future scalability.

Who This Book Is For

This book is targeted at three primary SSIS users: data architects, ETL developers, and DBAs responsible for ETL administration and support.

Because this book focuses on problems and solutions, a base understanding of SSIS is required. A couple of areas of the book walk you through the more advanced features of SSIS, but most of the book builds on top of a foundation of SSIS knowledge and experience. If you have taken an SSIS class, or have read another book and tried out SSIS, or you have built a few packages for various purposes, then that base knowledge will give you enough background for this book. But you should be up for the challenge of learning a new tool in the context of applying it!

The perfect reader of this book is someone who is in the early stages of a new ETL or data integration project (or a redesign project), and is eager to know how to approach the effort with the best practices when designing SSIS packages.

If you are supporting an existing project and must make some changes to aid in administration, deployment, or scalability, then you will also benefit from several of the chapters herein.

What This Book Covers

This book is an expanded, revised, and rewritten version of the Expert SQL Server 2005 Integration Services book (Indianapolis: Wiley, 2007). The authors of this book have expanded the coverage to address the current trends in ETL and SSIS, including creating a scaling-out execution model, performing advanced data profiling and cleansing, and handling file management and file processing.

This book also addresses some of the challenges in SSIS surrounding auditing, configurations, and execution management. Two chapters focus on solving these administrative challenges:

Chapter 2 covers building an SSIS package framework, and provides you with an auditing, configuration, and reporting structure that you can use in the packages of your solution.

Chapter 3 also helps you in the area of administration by addressing package deployment and storage. It answers the questions, “Where should I store my packages — in the file system or in SQL Server?” and “What is the best SSIS deployment approach for my environment?” The right answer is that “it depends,” and Chapter 3 walks you through all the dependencies, while leading you to the right answer for your environment.

Every ETL or data integration solution involves data extraction of one kind or another. Maybe you’re extracting data from flat files, or perhaps your source is coming from an enterprise resource planning (ERP) system, or a line-of-business application. Regardless, you must implement a data extraction methodology that is efficient, reduces the impact on the source, and adequately handles changes and data tracking. Chapter 5 dives into many of the data extraction areas, and even provides you with a dynamic data extraction approach.

Another area that this book covers is data warehouse ETL. In fact, one of the more common uses of SSIS is data warehousing ETL. Chapters 7, 8, and 9 are dedicated to solving the data warehouse ETL challenges all the way from dimension and fact table ETL, to processing cubes in SQL Server Analysis Services (SSAS).

Not to be overlooked, Chapter 11 examines control flow and data flow scripting, and provides examples of common (but complex) challenges that can be solved by scripting, including advanced file management and data processing. Chapter 12 provides performance troubleshooting steps and best practices on data flow design. This chapter also contrasts the right use of SQL commands versus the data flow.

Before you even begin diving into the details of an SSIS-based solution, you must start out on the right foot! Chapter 1 reviews solution architecture, project approach and methodology, package design patterns, the importance of I/O performance, and data element tracking. It answers the question, “Where do I begin and how do I approach an ETL solution?”

In all, this book presents a comprehensive picture of SSIS solution challenges and design best practices.

How This Book Is Structured

Before reviewing the chapter organization, let’s first note how each chapter is internally structured. Every chapter contains three primary elements — a “Problem” section, a “Design” section, and a “Solution” section. In fact, some chapters are structured with more than one Problem–Design–Solution grouping. Each collection of Problem–Design–Solution addresses the following:

“Problem”

— Discusses either the current challenges related to the area of focus, or the SSIS-specific problems when trying to design an aspect of the solution.

“Design”

— Presents the SSIS best practices, or outlines the design aspects and recommendations for a specific solution area.

“Solution”

— Ties all the aspects of the solution together, and gives either an example solution related to the area, or considers the SSIS solution approach, and factors involved in solving the problem outlined earlier.

This book is generally organized in the way that you would approach a data integration or ETL project. The chapter flow builds on the natural progression that a developer or administrator would go through when designing an SSIS solution.

After beginning with an overview of architecture, the book then moves into putting together the underlying support structure of a solution — the storage, deployment, and management framework. Next, the natural progression is to handle the source data, whether that is in files or extracted from a relational database management system (RDBMS), and often requiring a data-cleansing process. From there, the chapters delve into dimension and fact table loading, as well as the cube-processing steps. The final chapters address advanced data handling through scripting, and provide the package availability and performance that many solutions require.

The following chapter provides a solution overview and discusses planning:

Chapter 1, “SSIS Solution Architecture,” addresses the overall solution challenges and planning best practices.

The following chapters examine SSIS administration and the deployment foundation:

Chapter 2, “SSIS Management Framework Design,” reviews and demonstrates the importance of building an SSIS auditing and execution framework.

Chapter 3, “Package Deployment and Storage Decisions,” discusses the practices and decisions related to SSIS package deployment and storage.

The following chapters discuss file management and data extraction:

Chapter 4, “File Handling and Processing Methods,” covers how to deal with different file types, file sources, and file management.

Chapter 5, “Data Extraction Best Practices,” dives into how to manage data source extractions, including incremental and change data management.

The following chapter discusses data cleansing:

Chapter 6, “Data-Cleansing Design,” reviews data profiling in SSIS, and how to perform data cleansing of complex and dirty data.

The following chapters cover data warehouse ETL:

Chapter 7, “Dimension Table ETL,” covers the challenges related to dimension table data changes, and loading for different dimension table types.

Chapter 8, “Fact Table ETL,” addresses ETL for data warehouse fact tables.

Chapter 9, “SSAS Processing Architecture,” looks at how to handle SSAS cube processing in different ways, based on your processing requirements.

The following chapters examine advanced ETL concepts:

Chapter 10, “Implementing Scale-Out ETL Process,” looks at how to deal with package reliability, and provides an approach to scaling out an SSIS solution.

Chapter 11, “Scripting Design Patterns,” presents a few different solution examples geared toward handling advanced data and processing requirements with scripts.

Chapter 12, “SSIS Package Scaling,” presents troubleshooting steps, and provides best practices for package design with a focus on performance.

The Problem–Design–Solution format and order of the chapters together provide a well thought-out, organized, and systematic approach to SSIS solution design.

What You Need To Use This Book

Following is a breakdown of what you need to use this book:

Since this book is about SQL Server 2008 Integration Services, you will get more out of it if you have the Developer Edition of SQL Server 2008 installed, including the sample applications and the tools.

If you don’t have a licensed copy of SQL Server 2008 Developer Edition, you can download a trial version from the Microsoft Web site, http://download.microsoft.com. Search for “SQL Server 2008 Evaluation.”

The Adventure Works 2008 samples are required for several of the examples in the chapters. The samples are available on the Microsoft open source community site, www.codeplex.com. Search for “SQL Server 2008 Product Samples.”

The source code for the chapters is posted on the Wrox Web site (www.wrox.com). See the later section, “Source Code,” for directions on how to access the material. Each chapter has its own source code. In addition, a database called SSIS_PDS is included in the source code download, and has all the database objects required for the examples.

This book has been written for SQL Server 2008 with SP1 installed (released in April 2009).

Conventions

To help you get the most from the text and keep track of what’s happening, a number of conventions have been used throughout the book.

Boxes like this one hold important, not-to-be forgotten information that is directly relevant to the surrounding text.

Tips, hints, tricks, and asides to the current discussion are offset and placed in italics like this.

As for styles in the text:

Important new terms and important words are highlighted when we introduce them.

Keyboard strokes are shown like this: Ctrl+A.

File names, URLs, and code within the text are shown like this: persistence.properties.

Code is presented in the following two ways:

We use a monofont type with no highlighting for most code examples.

In code examples, we highlight new and important code with boldfaced text. The boldface is not used for code that is less important in the present context, or that has been shown before.

Source Code

As you work through the examples in this book, you may choose either to type in all the code manually, or use the source code files that accompany the book. All of the source code used in this book is available for download at www.wrox.com. Once at the site, simply locate the book’s title (either by using the Search box, or by using one of the title lists), and click the Download Code link on the book’s detail page to obtain all the source code for the book.

Because many books have similar titles, you may find it easiest to search by ISBN; for this book, the ISBN is 978-0-470-52576-0.

Once you download the code, just decompress it with your favorite compression tool. Alternately, you can go to the main Wrox code download page at www.wrox.com/dynamic/books/download.aspx to see the code available for this book and all other Wrox books.

Errata

We make every effort to ensure that there are no errors in the text or in the code. However, no one is perfect, and mistakes do occur. If you find an error in one of our books (such as a spelling mistake or faulty piece of code), we would be very grateful for your feedback. By sending in errata, you may save another reader hours of frustration, and, at the same time, you will be helping us provide even higher-quality information.

To find the errata page for this book, go to www.wrox.com and locate the title using the Search box or one of the title lists. Then, on the book details page, click the Book Errata link. On this page, you can view all errata that has been submitted for this book and posted by Wrox editors. A complete book list including links to each book’s errata is also available at www.wrox.com/misc-pages/booklist.shtml.

If you don’t spot “your” error on the Book Errata page, go to www.wrox.com/contact/techsupport.shtml and complete the form there to send us the error you have found. We’ll check the information and, if appropriate, post a message to the book’s errata page and fix the problem in subsequent editions of the book.

p2p.wrox.com

For author and peer discussion, join the P2P forums at p2p.wrox.com. The forums are a Web-based system for you to post messages relating to Wrox books and related technologies, and to interact with other readers and technology users. The forums offer a subscription feature that emails you when new posts are made on topics of your choosing. Wrox authors, editors, other industry experts, and your fellow readers are present on these forums.

At http://p2p.wrox.com you will find a number of different forums that will help you not only as you read this book, but also as you develop your own applications. To join the forums, just follow these steps:

1.  Go to p2p.wrox.com and click the Register link.

2.  Read the terms of use and click Agree.

3.  Complete the required information to join, as well as any optional information you wish to provide, and click Submit.

4.  You will receive an email with information describing how to verify your account and complete the joining process.

You can read messages in the forums without joining P2P, but to post your own messages, you must join.

Once you join, you can post new messages and respond to messages other users post. You can read messages at any time on the Web. If you would like to have new messages from a particular forum emailed to you, click the “Subscribe to this Forum” icon by the forum name in the forum listing.

For more information about how to use the Wrox P2P, be sure to read the P2P FAQs for answers to questions about how the forum software works, as well as many common questions specific to P2P and Wrox books. To read the FAQs, click the FAQ link on any P2P page.

1

SSIS Solution Architecture

Imagine that this is the first day of a new internal or client project. You will have responsibility for the data and processing layer of the solution, which involves processing data — a lot of data — from several sources, and then either integrating systems, or maybe consolidating data for reporting. Maybe your background is as a developer, a database administrator (DBA), or a data designer, and you know SSIS fairly well. But now they are calling you the “data and ETL architect.”

ETL is the acronym for Extraction, Transformation, and Loading, which is used to describe the data-processing layer in a data-integration or data warehouse solution.

The project manager approaches you and says that the Vice President of Technology has asked the team to give him an estimate of the infrastructure needed. Furthermore, the business owner wants a high-level overview of how the solution architecture will help the company achieve the business need most efficiently. The project manager also wants your thoughts on the best way to approach the solution, how the development should be coordinated between team members, and how deployment should be handled.

Three meetings are scheduled for tomorrow to address these things, and you’ve been asked to provide your feedback in those areas.

Where do you start? How should you approach the solution design with SSIS as the main technology? How should all the pieces work together?

This chapter examines how to lay the foundation for successful solutions based on SQL Server Integration Services (SSIS). And, in fact, this whole book is about SSIS solutions to real-world requirements and challenges. It addresses questions such as the following:

What are the problems and issues?

What are the design principles?

How do you put it all together for a complete and successful solution?

Before you dive into the technical challenges of a project, you must first step back and ensure that you are laying the right foundation. Jumping right in is tempting! But resist the urge, because you want to (and need to) set the precedents and patterns for the solution upfront. If you don’t, chances are you won’t be able to go back and make changes down the road.

As with all chapters in this book, this chapter is organized into the following three major sections:

“Problem”

— Coordinating and architecting an SSIS solution is not an easy task. The “Problem” section reveals some common challenges and common mistakes when planning and extending your ETL environment.

“Design”

— The “Design” section in this chapter examines the right approach to a project, and the long-term project aspects that you should set in motion early in a project.

“Solution” —

In many ways, the remainder of this book provides you with the solutions to make it all work together. This section launches you into the rest of the book, and shows how you can follow the chapters to build or redesign your SSIS solution.

Problem

Data and ETL projects have many challenges. Some challenges relate to data, some to enterprise integration, some to project coordination, and some to general expectations. This section begins by looking at the bigger picture of data within an organization, but then quickly looks at ETL projects and SSIS packages and execution.

Macro Challenge: Enterprise Data Spaghetti

Maybe your SSIS project is only a small solution in a bigger enterprise pond. The problem is that it can still cause a ripple effect when you tie it into your environment. Or, you can have challenges caused by an unwieldy enterprise environment when you try to implement your solution.

Figure 1-1 shows a not-so-nice telephone/electricity pole that illustrates the data nervous system of many organizations.

Figure 1-1

The problem with Figure 1-1 is that this mess didn’t happen overnight! It grew into this twisted, unorganized process because of poor planning, coordination, and execution. However, be aware that corporate politics often lead to this type of situation. Departments hire their own technical people and try to go around IT. Systems don’t talk to each other nicely. Project pressures (such as time and budget) cause designers to cut corners.

Following are a few reasons why this kind of tangled mess happens in an organization’s data processing, and examples of the many problems that this “unorganization” causes:

Debilitating dependency chains

— The biggest problem is that systems are often built on top of systems on top of systems. The core source data is connected through so many precedence links that any change takes more time and more administrative and development overhead. Systems at the source and in the middle of dependency chains become irreplaceable because of the amount of effort that switching to a new system would take.

Unknown and uncontrollable data processing

— The operations that process data are not centralized and, in many cases, are unknown, because departmental applications are created without coordination with other applications. Processes run at uncontrolled times, and may impact the systems they touch even during peak hours, which affects work efficiency.

Fragile enterprise changes

— Changes to applications are difficult and costly. They may break processes, or cause data integration or reporting applications to be inaccurate.

Delayed data access

— Even when the processes are somewhat controlled, the complicated system dependencies delay data availability, and nightly overhead processes often run into mid-morning schedules. When they break, customer perception and employee efficiency are affected.

The “Design” section later in this chapter discusses how to approach your SSIS-based ETL project in the right way, and ensure that you are helping to solve the problem, rather than adding to it.

Micro Challenge: Data-Processing Confusion

Another common problem with data processing occurs when the logic used to process data is overly complicated and confusing. Just like the macro enterprise problem, this problem is usually the result of changes over time, where logic is modified and appended. It usually comes in one of two ways:

Runaway stored procedures

— Procedures that run with complicated logic and lots of temporary tables, inserts, updates, and/or deletes can be difficult to manage, and are often inefficient. Supporting the procedures is also very difficult because the logic is hard to follow, and, many times, the developers or DBAs who wrote the code are unavailable. Overall, this type of process requires a lot of administration, and a lot of time is wasted simply following and learning the process.

Unmanageable packages

— SSIS packages can also be designed with difficult-to-follow logic and sometimes complex precedence with hundreds of components used in a single package. These kinds of packages have challenges similar to those of runaway stored procedures, such as troubleshooting and the learning curve required for the process.

Figure 1-2 shows the control flow of a package that has too many components to effectively manage. (The SSIS designer is zoomed to 50 percent to fit the package on the screen.)

Figure 1-2

The overly complex control flow shown in Figure 1-2 is similar to an overly complex data flow, where too many components are used, thus making the development, troubleshooting, and support difficult to manage. The “Design” section later in this chapter proposes a better approach for SSIS packages called the modular package approach.

In summary, both of these types of processes (runaway procedures and unmanageable packages) are very difficult to support, and not suited to team development, error handling, and scalability (all of which are addressed in Chapter 12).

Problems with Execution and Troubleshooting

A couple of other issues that often come up in an ETL or data-integration solution are poor process coordination and difficulty doing root cause analysis. If the “what happened?” question can’t be answered quickly and with confidence, then likely there is a problem with the overall solution execution and logging strategy.

Figure 1-3 shows the command-line output of an example SSIS package execution.

Figure 1-3

If you regularly find yourself working through output like this to figure out what went wrong, then you should consider implementing a better execution and auditing structure. This includes package execution in your development environment.

If you have just turned on the out-of-the-box SSIS logging and are capturing the output to a table, it still may not be enough. If you write custom queries against the SSIS logging table every time you need to figure out what happened, then you also need a better strategy.
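For reference, the kind of ad hoc query you end up rewriting again and again might look like the following sketch, which assumes the default SQL Server log provider writing to the SSIS 2008 dbo.sysssislog table. If running something like this is part of your daily routine, the management framework in Chapter 2 is the better strategy.

    -- Ad hoc digging through the default SSIS 2008 log table;
    -- executionid groups the events belonging to one package run.
    SELECT TOP (50) source, event, starttime, message
    FROM dbo.sysssislog
    WHERE event IN ('OnError', 'OnTaskFailed')
      AND starttime >= DATEADD(DAY, -1, GETDATE())
    ORDER BY starttime DESC;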

Infrastructure Challenges

When designing an SSIS ETL solution, how do you determine the infrastructure components such as server hardware, memory, processor cores, network switches, disk arrays, storage networks, and I/O controllers? Related to that, where should you run your SSIS packages taking into consideration sources, destinations, and other applications, while balancing hardware scalability and location within your network topology?

These questions are not trivial, and the answers depend on a lot of factors, including processing windows, source and destination availability, application impact and availability, network bandwidth, fault-tolerance requirements, and so on.

I/O is usually the biggest bottleneck, and the one most often overlooked. I/O (or, more precisely, disk I/O) is the throughput that the system can handle on the drive volumes. And this challenge is not just about trying to get the greatest throughput on a single drive. You must consider staging and temporary environments, logging, and current and historical data. And you must balance it all with hardware availability and budget.

The reason disk I/O is so important when considering a data-integration or ETL effort is because of the nature of what you are doing, including the following:

Bulk operations

— ETL and integration efforts typically process data in bulk, meaning that, when a process is kicked off (hourly, daily, or weekly), data is moved from sources to destinations with some transformation processes in-between. The processes usually move or integrate thousands or millions of records. That can be a lot of data that moves between systems, and it generates a lot of disk activity.

Source databases

— Processes that are extracting a lot of data from sources incur disk overhead either by the sheer volume of the data, or when complicated queries against large tables require temporary operations that use disks (such as the SQL Server TempDB).

Destination databases

— The nature of relational databases requires that data be stored to disk before a transaction is marked as complete. When inserting or updating a lot of data, the server must wait until the data is committed for the process to be complete.

Staging databases

— Staging databases and tables are a common intermediate step in an ETL process, and can be used for many reasons. Any time data is landed in a staging database and then extracted from it, you incur the overhead of both the insert and the select. These can, at times, occur simultaneously with inserts into the destination database, so the I/O is compounded.

File management

— A lot of ETL operations deal with files such as delimited files, XML, EDI, and so on. Large files require file management, archiving, and sometimes processing, and, therefore, incur disk I/O overhead.

The bottom line is that you will most likely have a disk I/O bottleneck in your data-processing operations, and you’ll need to plan and manage for that to meet your service level agreements (SLAs) and performance requirements.
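As a starting point for confirming where disk I/O pressure sits, the following sketch uses the sys.dm_io_virtual_file_stats dynamic management view available in SQL Server 2008. Files whose cumulative stall time is high relative to their I/O counts are the first candidates for attention when planning storage for staging, logging, and destination databases.

    -- Cumulative I/O and stall time per database file since the last
    -- SQL Server restart; sort by stall time to find the hot volumes.
    SELECT DB_NAME(vfs.database_id) AS database_name,
           mf.physical_name,
           vfs.num_of_reads + vfs.num_of_writes AS io_count,
           vfs.io_stall AS total_stall_ms
    FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
    JOIN sys.master_files AS mf
      ON mf.database_id = vfs.database_id
     AND mf.file_id = vfs.file_id
    ORDER BY vfs.io_stall DESC;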

Other Challenges

The list of common project challenges can go on and on, but here are a few more:

Data challenges

— Of course, you will have data challenges in your project anywhere from missing records to dirty data to bad data, and you will need to understand those problems as soon as possible so that you can set the expectations upfront about what can and what cannot be done about them. Although you can do a lot in SSIS (including fuzzy matching), magic is not on the component list — you can’t make up data that doesn’t exist. Don’t overpromise. Be realistic.

Corporate political challenges

— This book is about solving problems with a technology, namely SSIS. But, because you are trying to solve problems, you are going to be dealing with people. Everyone has an agenda, and a lot of times those agendas will not be in your project’s best interest. Watch out for people who are threatened because you are changing the way things are done (even when it is for the better), or because your solution will be replacing one of their legacy applications, or because they are territorial about their “sandbox.” You want to fix the enterprise spaghetti shown in Figure 1-1, but don’t forget that some people have their tribal knowledge and make their living by keeping the corporate data nervous system tangled.

Requirement and scope challenges

— Any project has scope creep. Just be careful about how the changes affect the project timeline, and don’t leave data validation until the last day. You’ll get burned!

Design

Now that you are scared, step back and take a deep breath. Designing an ETL process is doable, and, with the right approach in mind, you can be successful. This section discusses the overall design approach to an SSIS-based solution by examining the following:

Choosing the right tool

Solution architecture

Project planning

Package design patterns

Server and storage hardware

Package execution location

Choosing the Right Tool

This book is about applying SSIS. You are probably reading it because you assume that SSIS is the right tool for the job. That’s probably the case. However, be sure to consider what you are trying to accomplish, and ensure that SSIS is the right fit for it.

Think about all the different types of data-processing needs that you have across your organization:

Data synchronization between systems

Data extraction from enterprise resource planning (ERP) systems

Ad hoc reporting

Replication (both homogeneous and heterogeneous)

PDA data synchronization

Legacy system integration

Data warehouse ETL processing

Vendors and partner data files integration

Line-of-business data processing

Customer and employee directory synchronization

As you may know, when it comes to data processing, a lot of tools are out there. Some are created for specific situations (such as folder-synchronization tools), whereas other tools are designed to perform a variety of functions for different situations. So, the traditional question is this: Which tool can best meet the business and logical requirements to perform the tasks needed?

Consider the host of tools found in the ever-evolving Microsoft toolset. You can use Transact SQL (TSQL) to hand-code a data load, Host Integration Server to communicate with a heterogeneous data source, BizTalk to orchestrate messages in a transactional manner, or SSIS to load data in batches. Each of these tools plays a role in the data world.

Although overlaps exist, each tool has a distinct focus and target purpose. When you become comfortable with a technology, there’s always the tendency to want to apply that technology beyond its intended “sweet spot” when another tool would be better for the job. You’ve no doubt heard the phrase “when you’re a hammer, everything looks like a nail.” For example, C# developers may want to build an application to do something that SSIS could potentially do in an hour of development time. The challenge everyone faces entails time and capacity. There is no way everyone can be an expert across the board. Therefore, developers and administrators alike should be diligent about performing research on tools and technologies that complement each other, based on different situations.

For example, many organizations use BizTalk for a host of purposes beyond the handling of business-to-business communication and process workflow automation. These same organizations may be perplexed as to why BizTalk doesn’t scale to meet the needs of the organization’s terabyte data warehousing ETL. The easy answer is that the right tool for bulk Business Intelligence (BI) processing is an ETL tool such as SSIS.

Be Careful About Tool Selection

In some client environments, an ETL tool may be chosen without consideration for the availability of industry skills, support, or even the learning curve. Even though the tool could perform “magic,” it usually doesn’t come with a pocket magician — just the magic of emptying your corporate wallet. In many cases, thousands of dollars have been spent on an ETL tool that takes too long to master, implement, and support. Beyond the standard functionality questions you should ask about a tool, be sure to also consider the following:

Your internal skill sets

The trend of industry use of the tool

How easy the tool is to learn

The ease of supporting the tool

Overall Solution Architecture

The reality is that creating a perfectly efficient enterprise data ecosystem is impossible. But there are certainly levels of efficiency that can be gained when your SSIS solution is planned and implemented thoughtfully. Figure 1-4 contrasts with Figure 1-1 by showing a city’s central power station: organized and efficient.

Figure 1-4

The tendency when developing a new integration or ETL system is to get it done as quickly as possible. What often happens is that the overall architecture is not integrated well into an organization’s environment. Maybe some time is saved (and that is even questionable), but in the end, more time and money will be wasted.

A solution architecture should have several key data-processing objectives. The following apply to SSIS-based solutions, but also relate generically to any data-processing solution architecture:

The solution should coordinate with other data-centric solutions in the enterprise. Do not build a separate data silo, especially if your effort is a data warehouse or data mart — that causes multiple versions and variations of the data.

Source data that is required for the solution must be extracted as close to the source as possible, not tapped at the end of a long dependency chain. (Be sure to follow the previous bullet point.)

The solution should have a centralized administration and execution strategy that coordinates with other systems, or follows the practices of other corporate systems. This does not mean forgoing a scale-out architecture, only that the support structures be centralized.

Real-time execution auditing is also needed so that you know both what is happening and what has happened. This information goes a long way toward supporting a system. In addition, you should have a way to track data back to the source. This tracking is critical for both data validation and troubleshooting data errors.

The processes must have rollback layers. In other words, if your requirements necessitate a complicated or long process, don’t force the whole process to be re-run from the start when the last part breaks. Plan for restarting at interim points after the issues are identified. Doing so also enables you to easily compartmentalize changes in the ETL solution. (A minimal sketch of one restart-point approach follows this list.)
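To make the restart-point idea concrete, here is a minimal Transact-SQL sketch of one way it might be implemented. The table and procedure names (dbo.EtlRunStep, dbo.usp_StepShouldRun) are illustrative only, not part of any SSIS framework:

-- Illustrative restart-point log: each ETL step records its outcome,
-- so a failed run can resume at the first step that did not succeed.
CREATE TABLE dbo.EtlRunStep
(
    RunID     INT          NOT NULL,
    StepName  VARCHAR(100) NOT NULL,
    Succeeded BIT          NOT NULL,
    LoggedAt  DATETIME     NOT NULL DEFAULT (GETDATE()),
    CONSTRAINT PK_EtlRunStep PRIMARY KEY (RunID, StepName)
);
GO

-- Returns 1 if the step still needs to run for this RunID, or 0 if it
-- already completed successfully (so the master package can skip it).
CREATE PROCEDURE dbo.usp_StepShouldRun
    @RunID INT, @StepName VARCHAR(100), @ShouldRun BIT OUTPUT
AS
BEGIN
    SET NOCOUNT ON;
    IF EXISTS (SELECT 1 FROM dbo.EtlRunStep
               WHERE RunID = @RunID AND StepName = @StepName AND Succeeded = 1)
        SET @ShouldRun = 0;
    ELSE
        SET @ShouldRun = 1;
END;

An Execute SQL Task at the start of each modular package could call the procedure and feed the output value into a precedence constraint, so that re-running the master package skips the steps that already finished.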

These objectives represent the larger picture of an overall solution architecture. Other aspects, of course, are important and situational to what you are building in SSIS. Two common types of data-processing efforts are discussed in the following sections: system integration and data warehouse ETL. These fit well into the SSIS toolset.

Data Integration or Consolidation

One common use of SSIS is to integrate data between systems, or to synchronize data. For example, you may want to create a business-to-business portal site, and you may need the site to interface with the source data on the mainframe. In this case, you may get the data delivered in nightly extracts from the mainframe and load it into your SQL Server table.

Another very common ETL task that DBAs face is receiving files from File Transfer Protocol (FTP) servers (or from network shares) that must be processed and loaded into another system. This type of process involves moving files and then processing the data, which may involve de-duping (removing duplicates), combining files, cleaning bad data, and so on. Two systems may also need to talk to one another or pass business keys in order for records to be matched between environments.
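As a sketch of the de-duping step, assuming the incoming file has been bulk-loaded into a staging table (dbo.StageCustomer and its columns are hypothetical names), a common Transact-SQL pattern keeps only the most recent row per business key:

-- Delete duplicate staged rows, keeping the most recently modified
-- row for each business key (CustomerCode is assumed to identify a
-- customer in the source file).
WITH Ranked AS
(
    SELECT CustomerCode,
           ROW_NUMBER() OVER (PARTITION BY CustomerCode
                              ORDER BY ModifiedDate DESC) AS RowNum
    FROM dbo.StageCustomer
)
DELETE FROM Ranked
WHERE RowNum > 1;

The same result can be produced inside the data flow (for example, with a Sort transformation configured to discard duplicate rows), but for large files, pushing the work into the staging database often performs better.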

Figure 1-5 shows an example solution architecture that integrates data between different systems in an enterprise.

Figure 1-5

In this diagram, data from different systems is integrated. Often, a master data management service (sometimes called master dimension management) helps to synchronize entities between systems so that an organization’s data relates (for example, so that a customer record from the customer relationship management, or CRM, system can tie into the ERP sales system). This data process contains some aspects that are bidirectional, and other parts that perform extraction and loads. Data staging is used in this example to help integrate the data, and a data store is used to centralize many of the corporate data processes (which helps alleviate the long chains of system dependencies).
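To illustrate the key-matching idea, the following is a minimal sketch that assumes a hypothetical cross-reference table mapping each source system’s business key to a shared master key. These object names are placeholders, not part of the solution in this book:

-- Keep the master key map current as new CRM customers arrive.
-- Each source system's business key maps to one MasterCustomerMap row.
MERGE dbo.MasterCustomerMap AS tgt
USING (SELECT CrmCustomerKey, CustomerName
       FROM dbo.StageCrmCustomer) AS src
    ON tgt.SourceSystem = 'CRM'
   AND tgt.SourceKey    = src.CrmCustomerKey
WHEN NOT MATCHED BY TARGET THEN
    INSERT (SourceSystem, SourceKey, CustomerName)
    VALUES ('CRM', src.CrmCustomerKey, src.CustomerName);

The ERP side would be merged the same way, and a separate matching step (exact keys where they exist, or fuzzy matching on name and address where they don’t) would then link both systems’ rows to a single master customer identifier.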

Of course, other variations of system integration solutions exist, such as consolidation of different divisional data, especially when companies go through mergers and acquisitions.

Data Warehouse ETL

One of the more common uses of SSIS is for performing data warehouse ETL. Data warehousing focuses on decision support, or enabling better decision making through organized accessibility of information. As opposed to a transactional system such as a point of sale (POS), Human Resources (HR), or CRM system, which is designed to capture data through rapid transactions, a data warehouse is tuned for reporting and analysis. Because data warehousing focuses on the extraction, consolidation, and reporting of information to show trending and data summaries, the ETL part of a data warehouse is important and critical.

Processing ETL for data warehousing involves extracting data from source systems or files, performing transformation logic on the data (to correlate, cleanse, and consolidate), and then loading a data warehouse environment (for reporting and analysis). Figure 1-6 shows common data-processing architecture for a data warehouse ETL system.

Figure 1-6
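As a simple illustration of the extraction step in this kind of architecture, a common pattern is an incremental extract driven by a watermark. The sketch below uses the AdventureWorks Sales.SalesOrderHeader table as a stand-in source; the dbo.EtlWatermark table is a hypothetical construct, not a prescribed design:

-- Incremental extract: pull only the source rows changed since the
-- last successful load, using a per-table watermark.
DECLARE @LastLoad DATETIME, @NewWatermark DATETIME;

SELECT @LastLoad = LastExtractedAt
FROM dbo.EtlWatermark
WHERE TableName = 'Sales.SalesOrderHeader';

-- Capture the new high-water mark before extracting, so rows modified
-- during the load are picked up on the next run instead of skipped.
SELECT @NewWatermark = MAX(ModifiedDate)
FROM Sales.SalesOrderHeader;

SELECT SalesOrderID, CustomerID, OrderDate, TotalDue, ModifiedDate
FROM Sales.SalesOrderHeader
WHERE ModifiedDate > @LastLoad
  AND ModifiedDate <= @NewWatermark;

-- After the load succeeds, advance the watermark for the next run.
UPDATE dbo.EtlWatermark
SET LastExtractedAt = @NewWatermark
WHERE TableName = 'Sales.SalesOrderHeader';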

For those already versed in ETL concepts and practice, you may know that when it comes to developing a data warehouse ETL system, moving from theory to practice often presents the biggest challenge. Did you know that ETL typically takes up between 50 and 70 percent of the effort in a data warehousing project? That is quite a daunting statistic. It means that even though presenting the data is the end goal and the driving force for the business, the largest portion of developing a data warehouse is spent not on the presentation and organization of the data, but on the behind-the-scenes processing to get the data ready.

Project Planning and Team Development

This is not a project methodology book, but you should give some thought to your solution approach. Whether your overall objective is system integration or warehouse ETL, consider using an agile development methodology: an iterative approach in which you add features of the solution through smaller development cycles and refine requirements as the solution progresses.

Agile Benefits

Even if your solution does not involve a user interface (such as a system integration project), an agile approach enables you to tackle aspects of the solution in smaller development cycles and to troubleshoot data issues along the way. Following are some of the general benefits of this approach:

Project and task definition — The agile approach requires definition of the tasks in a project and the prioritization of those tasks, which are set and known by the technology team and the business or project owners. Tasks can change as a better understanding of the requirements develops.

Scope management — Given the clear definition of the activities in an iteration, the scope is set, and any deviation is known to, and agreed upon by, the parties involved. In essence, project communication is clearer for all parties: developer, management, and ownership.

Addressing of issues and obstacles — Part of the ongoing process involves identifying the areas of the project that can impede progress. These are highlighted and addressed early in the process.

Roles and responsibility clarity — Roles are clearly defined and tasks are assigned, which limits the possibility of team members spinning their wheels in the process.

Agile Cautions and Planning

However, you must exercise some caution: do not use an agile methodology to foster bad architecture practices. In other words, if you are using the agile approach but your (or your developers’) ultimate goal is simply to meet the tasks and deadlines in whatever way possible, you will fall into the trap of compounding the problems of your overall data nervous system (that is, the problems outlined in the earlier “Problem” section).

You must ensure that you have an overall solution architecture, and your agile tasks must fit in that plan and support the cause.

Therefore, whatever project methodology you use, be sure to push for an upfront plan and architecture. If you don’t, you will likely run into the tyranny of the urgent: you will get overwhelmed with tasks and, as you throw them together, your solution will get out of control, your code will get messy, and your stress will compound over time.

Following are a few things to consider in your development process:

Plan for an early proof of concept, and use it to iron out your data-processing architecture.

Set your SSIS package and database conventions upfront, including your auditing structure (as discussed in Chapter 2).

Estimate your data volumes in one of the initial development cycles so that you can purchase the right hardware. (A simple table-sizing query follows this list.)

Get your server storage ironed out upfront. Be sure to set expectations with the storage group or vendor early on in the process.
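For the data-volume estimate, one quick starting point is to size the candidate source tables directly. This sketch queries SQL Server’s dynamic management views (available in SQL Server 2005 and later) and requires VIEW DATABASE STATE permission:

-- Approximate row counts and reserved storage per table, summed
-- across all partitions and indexes; useful for rough capacity planning.
SELECT  s.name AS SchemaName,
        t.name AS TableName,
        SUM(CASE WHEN ps.index_id IN (0, 1) THEN ps.row_count ELSE 0 END) AS TableRows,
        SUM(ps.reserved_page_count) * 8 / 1024 AS ReservedMB
FROM sys.dm_db_partition_stats AS ps
JOIN sys.tables  AS t ON t.object_id = ps.object_id
JOIN sys.schemas AS s ON s.schema_id = t.schema_id
GROUP BY s.name, t.name
ORDER BY ReservedMB DESC;

Multiply the current sizes by expected growth and by the number of retained historical loads to get a first-cut storage figure to take to the storage group or vendor.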