An authoritative guide to designing effective solutions for data cleansing, ETL, and file management with SQL Server 2008 Integration Services. SQL Server Integration Services (SSIS) is the leading tool in the data warehouse industry, used for performing extraction, transformation, and load operations. After an overview of SSIS architecture, the authors walk you through a series of real-world problems and show various techniques for handling them.
Shows you how to design SSIS solutions for data cleansing, ETL, and file management
Demonstrates how to integrate data from a variety of data sources
Shows how to monitor SSIS performance
Demonstrates how to avoid common pitfalls involved with SSIS deployment
Explains how to ensure performance of the deployed solution and effectively handle unexpected system failures and outages
The companion Web site provides sample code and database scripts that readers can directly implement
This book shows you how to design, build, deploy, and manage solutions to real-world problems that SSIS administrators and developers face day-to-day.
Page count: 621
Publication year: 2010
Contents
Introduction
Chapter 1: SSIS Solution Architecture
Problem
Macro Challenge: Enterprise Data Spaghetti
Micro Challenge: Data-Processing Confusion
Problems with Execution and Troubleshooting
Infrastructure Challenges
Other Challenges
Design
Choosing the Right Tool
Overall Solution Architecture
Data Integration or Consolidation
Data Warehouse ETL
Project Planning and Team Development
Agile Benefits
Agile Cautions and Planning
Data Element Documentation
Package Design Patterns
Modular Packages
Master Packages
Server and Storage Hardware
Server Hardware
Development and Test Servers
ETL Collocation
Storage Hardware
Package Execution Location
Package Storage Location Versus Execution Location
Execute SQL Task and Bulk Insert Task Execution
Package Execution and the Data Flow
Design Review
Solution
Setting the Stage: Management and Deployment
Source Data: Files, Tables, and Data Cleansing
Data Warehouse ETL and Cube Processing
Advanced ETL: Scripting, High Availability, and Performance
Summary
Chapter 2: SSIS Management Framework Design
Problem
Challenges of Not Having a Framework Implementation
Different Development Methods
Changing Metadata
Getting the Right Information to the Right People at the Right Time
Reusability
Framework Implementation
Configuration Management Scheme
Logging and Auditing
Template Package
Framework Benefits and Value
Design
Configuration Management
Overall Design
Environment Variable Configuration
XML Configuration File
SQL Server Configuration
Logging and Auditing Mechanism
Storage
Tables
Stored Procedures
Template Package
Implementation Guidelines
Solution
Configuration Management
Logging and Auditing Mechanism
Storage and Tables
Stored Procedures
Template Package
Development
Installation
Other Considerations
Customizations
ETL Process Framework
Process Owner
Reporting
Summary
Chapter 3: Package Deployment and Storage Decisions
Problem
Standardization
Environment
Application
Desired Implementation
Design
Storage Methods
SSIS Service
File System Storage
SQL Server Storage
Storage Comparison
Deployment Methods
SSIS Deployment Wizard
Manual Deployment of Packages
DTUtil Scripting
Solution
Storage Methodology
Existing Infrastructure
Security
Package Makeup
Back to the Flowchart
Deployment Methodology
Storage Method
Existing Infrastructure
Deployment Needs
Back to the Flowchart
Total Methodology
Summary
Chapter 4: File-Handling and Processing Methods
Problem
Simple File Operations
Remote File Operations
File Criteria
File Iteration
File Notification
Design
File System Task
Source and Destination Connection
Attributes
Overwriting the Destination
Multiple Files
FTP Task
FTP Connection
HTTP
Foreach Loop
Applying Advanced Criteria
File Watcher Task
WMI Event Watcher Task
Solution
Summary
Chapter 5: Data Extraction Best Practices
Problem
Extraction Data Criteria
Source System Impact
Incremental Extraction
Deleted Rows
Staging Database
Data Lineage and Metadata
File Sources
Design
Package Connections and Source Adapters
Package Connections
Source Adapters
Incremental Data Extraction
Incremental Extraction Using a Change Identifier Value
Targeting Change Records through Dynamic Source Queries
Retrieving Incremental Identifier Values and Updating Package Variables
Capturing the Maximum Change Identifier Value
Incremental Extraction from SQL Server without a Trigger
Using SSIS to Handle All Aspects of an Incremental Extraction
Data Staging Method
Tracking Data Lineage Identifiers
Solution
Metadata-Driven Extraction Example
Metadata Tables
Control Flow
Looping through the Sources
Inside the Sources Loop
The Script
Read Variables
Open Connections
Get Max Change Indicator
Extract Changed Source Rows
Close Connections
Summary
Chapter 6: Data-Cleansing Design
Problem
Candidate Key Profiles
Column Length Distribution Profiles
Column Null Ratio Profiles
Column Pattern Profiles
Column Statistics Profiles
Column Value Distribution Profiles
Functional Dependency Profiles
Design
Using the Script Transform for Data Scrubbing
Using the Fuzzy Grouping Transform to De-duplicate Data
Using the Fuzzy Lookup Transform to Cleanse Data
Dealing with Multiple Record Types in a Single File
Using the Raw File
Solution
Summary
Chapter 7: Dimension Table ETL
Problem — Fundamental Dimension ETL
Dimensions: The Basics
Dimension ETL: The Challenge
Design — Fundamental Dimension ETL
Data Preparation
Dimension Change Types
Type 1 (Changing) Attribute: A Closer Look
Type 2 (Historical) Attribute: A Closer Look
Inferred Members
Solution — Fundamental Dimension ETL
Preparing Your Source Data for Dimension ETL
SSIS Slowly Changing Dimension Wizard
Advanced Properties and Additional Outputs of the SCD
Problem — Advanced Dimension ETL
SCD Wizard Advantages and Disadvantages
Dimension Volume and Complexity
Design — Advanced Dimension ETL
Optimizing the Built-in SCD
Index Optimizations
Update Optimizations
Snowflake Dimension Tables
Parent-Child Dimension ETL
Date Dimension ETL
Profile Dimension and Junk Dimension ETL
Creating a Custom Slowly Changing Package
Solution — Advanced Dimension ETL
Snowflake Dimensions
Parent-Child Dimensions
Profile and Junk Dimensions
Date Dimension ETL
Custom Dimension ETL
Determining Dimension Changes
Inserts and Updates
Summary
Chapter 8: Fact Table ETL
Problem
Fact Tables: The Basics
Fact Table ETL: The Challenge
Preparing the Data
Mapping Dimension Keys
Calculating Measures
Adding Metadata
Fact Table Updates
Fact Table Inserts
Design
Data Preparation
Data Preparation with SSIS Transformations
Data Preparation Examples
Acquiring the Dimension Surrogate Key in SSIS
Identifying the Dimension Surrogate Keys with SSIS
Surrogate Key Examples in SSIS
Measure Calculations
Measure Calculation Types
Handling Measure Calculations in SSIS
Managing Fact Table Changes
Approaches to Identifying Changed Fact Records
Fact Update Examples in SSIS
Optimizing Fact Table Inserts
Optimizing Inserts with Fast Load
Optimizing Inserts with Index Management
Solution
Internet and Reseller Sales Fact Table ETL
Fact Internet and Reseller Sales Extraction and Transform Process
Fact Internet and Reseller Sales Load Process
Snapshot Fact Table Example — Call Center Fact Table
Advanced Fact Table ETL Concepts
Handling Missing Dimension Lookups
Handling Late-Arriving Facts
Summary
Chapter 9: SSAS Processing Architecture
Problem
SSAS Object Population
Schedule
Partition Management
The Complete Package
Design
SSAS Objects and Processing Basics
Dimensions
Partitions
Mining Structures
SSIS Tasks and Components
Analysis Services Processing Task
Analysis Services Execute DDL Task
Execute Process Task with ASCMD
Data Flow Destinations for SSAS Objects
Script Task with AMO
Creating and Managing Partitions
Overall Design
Solution
Preparation for SSAS Integration
Process Dimensions Package
Process Task
Parallel XMLA
Process Partitions Package
Storing and Loading Metadata
SSAS Processing
Overall Solution
Summary
Chapter 10: Implementing Scale-Out ETL Process
Problem
Design
Design Components Overview
Central Common Services Server
File Processor and Pre-Aggregation Scale-Out Processes Servers
Design Details
File Management Tasks
Data File Management
Work Allocation Process
Scale-Out Source File Process
Work Reassignment Process
Data Aggregation Tasks
Hourly Data Pre-Aggregation Process
Hourly Data Aggregation Process
Daily Data Aggregation Process
Archival and Clean-up Processes
Data File Archival Process
Stage Table Clean-up Process
Design Conclusions
Solution
Central Server Services
Multiple File Processor and Pre-Aggregator Processes
Database Tables Required on the Central Server
Stored Procedures
Procedures on the Central Server
Procedures on the Staging Servers
SSIS Packages
File-Processing Server Packages
Central Server Packages
Summary
Chapter 11: Scripting Design Patterns
Problem — Advanced File Management
Script Task
Scenario
Design — Advanced File Management
Script Language
Accessing Objects
Custom Assemblies
Scripting
Solution — Advanced File Management
Create an External Assembly
Access the External Assembly
Archive the Files
Summary — Advanced File Management
Problem — Call Center Fact ETL
Reasons to Use Scripting
Scenario
Design — Call Center Fact ETL
Component Type
Output Type
Synchronous Outputs
Asynchronous Outputs
Other Component Types
Design Choice
Overridable Methods
Solution — Call Center Fact ETL
Package Setup
Script Component
Reusable Methods
Row-By-Row Processing
Process Input
Entire Script
Package Wrap-Up
Summary — Call Center Fact ETL
Summary
Chapter 12: SSIS Package Scaling
Problem
Identifying Task Durations
Identifying Data Flow Destination Issues
Identifying Transformation and Memory Bottlenecks
Identifying Data Flow Source Performance Issues
Design
Balancing SQL Operations with the Data Flow
Data Flow Advantages as Compared with SQL
SQL Advantages when Compared with the Data Flow
Applying SQL Code in Your SSIS Packages
SSIS Pipeline Architecture Concepts
Data Flow Design Example
SQL-Centric SSIS Process
Rewritten Data Flow–Centric SSIS Process
Solution
Tuning Your Data Flow
Use Blocking Transformations Sparingly
Limit Row-by-Row Operations
Manage Data Flow Backpressure
Pre-sort Sources as an Alternative to the Sort Transformation
Optimize the Lookup and Managing the Cache Size
Remove Unused Columns from the Data Flow
Be Mindful of the SSIS Logging Impact
Regulate Transactions
Setting Data Flow Properties
Up the EngineThreads Value
Optimize the Temporary Storage Locations
Leave RunInOptimizedMode as True
Tune Buffers
Database and Destination Optimization
Limiting Database Growth
Consider Dropping and Re-creating Indexes on Destination Tables
Using the OLE DB Destination Adapter
Use Advanced Oracle and Teradata Destination Adapters
Handling Partitioned Tables
Summary
Index
Microsoft® SQL Server® 2008 Integration Services: Problem–Design–Solution
Published by
Wiley Publishing, Inc.
10475 Crosspoint Boulevard
Indianapolis, IN 46256
www.wiley.com
Copyright © 2010 by Wiley Publishing, Inc., Indianapolis, Indiana
Published by Wiley Publishing, Inc., Indianapolis, Indiana
Published simultaneously in Canada
ISBN: 978-0-470-52576-0
Manufactured in the United States of America
10 9 8 7 6 5 4 3 2 1
No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.
Limit of Liability/Disclaimer of Warranty: The publisher and the author make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation warranties of fitness for a particular purpose. No warranty may be created or extended by sales or promotional materials. The advice and strategies contained herein may not be suitable for every situation. This work is sold with the understanding that the publisher is not engaged in rendering legal, accounting, or other professional services. If professional assistance is required, the services of a competent professional person should be sought. Neither the publisher nor the author shall be liable for damages arising herefrom. The fact that an organization or Web site is referred to in this work as a citation and/or a potential source of further information does not mean that the author or the publisher endorses the information the organization or Web site may provide or recommendations it may make. Further, readers should be aware that Internet Web sites listed in this work may have changed or disappeared between when this work was written and when it is read.
For general information on our other products and services please contact our Customer Care Department within the United States at (877) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.
Library of Congress Control Number: 2009935233
Trademarks: Wiley, the Wiley logo, Wrox, the Wrox logo, Programmer to Programmer, and related trade dress are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates, in the United States and other countries, and may not be used without written permission. All other trademarks are the property of their respective owners. Wiley Publishing, Inc., is not associated with any product or vendor mentioned in this book.
To my family: Amy, Meg, Nate, Kate, and Caleb
—Erik Veerman
About the Authors
Erik Veerman is a Mentor for Solid Quality Mentors, focusing on training, mentoring, and architecting solutions on the SQL Server Business Intelligence (BI) platform. His industry recognition includes Microsoft’s Worldwide BI Solution of the Year and SQL Server Magazine’s Innovator Cup winner. Veerman has designed dozens of BI solutions across a broad business spectrum — telecommunications, marketing, retail, commercial real estate, finance, supply chain, and information technology. His experience with high-volume, multi-terabyte environments and SQL Server 64-bit has enabled clients to scale their Microsoft-based BI solutions for optimal potential. As an expert in on-line analytical processing (OLAP) design, extract, transform, and load (ETL) processing, and dimensional modeling, Veerman is a presenter, author, and instructor. He led the ETL architecture and design for the first production implementation of SQL Server Integration Services (SSIS), and helped to drive the ETL standards and best practices for SSIS on Microsoft’s SQL Server 2005 reference initiative, Project REAL. Veerman is also co-author of Professional SQL Server 2008 Integration Services (Indianapolis: Wiley, 2008), Expert SQL Server 2005 Integration Services (Indianapolis: Wiley, 2006), Professional SQL Server 2005 Integration Services (Indianapolis: Wiley, 2007), and lead author for the Microsoft Press SQL Server 2005 and SQL Server 2008 series, focusing on Business Intelligence Development and Implementation. As a resident of Atlanta, Georgia, Veerman participates in the local Atlanta SQL Server User’s Group, a Professional Association of SQL Server (PASS) chapter.
Jessica M. Moss (SQL Server MVP) is a well-known practitioner, trainer, author, and speaker for Microsoft SQL Server Business Intelligence (BI). As a Mentor with Solid Quality Mentors, she has created numerous data warehousing solutions for a variety of companies in the retail, Internet, health services, finance, and energy industries, and has delivered training courses on SQL Server Integration Services (SSIS), SQL Server Reporting Services (SSRS), and SQL Server Analysis Services (SSAS). While working for a major clothing retailer, Moss participated in the SQL Server 2005 TAP program, where she developed best implementation practices for SSIS. Moss has authored technical content for multiple magazines and Web sites, and has spoken internationally at conferences such as the Professional Association for SQL Server (PASS) Community Summit, SQL Teach International Conference, and the devLink Technical Conference. As a strong proponent of developing user-to-user community relations, Moss organizes meetings for a User Group that discusses different technologies. She also actively participates in local User Groups and code camps in central Virginia. In addition, Moss volunteers her time to help educate people through the PASS organization.
Brian Knight (SQL Server MVP, MCITP, MCSE, MCDBA) is the owner and founder of Pragmatic Works. He is also the co-founder of SQLServerCentral.com and JumpstartTV.com. He runs the local SQL Server Users Group in Jacksonville, Florida (JSSUG). Brian is a contributing columnist for several technical magazines and does regular Webcasts at “Jumpstart TV.” He is the author of ten books on the SQL Server technology. Knight has spoken at conferences such as the Professional Association of SQL Server (PASS), SQL Connections, and TechEd, as well as many Code Camps. His blog can be found at www.pragmaticworks.com. Knight lives in Jacksonville, where he loves watching the National Football League Jaguars lose on a regular basis.
Jay Hackney is a mentor and consultant for Solid Quality Mentors, where he helps customers utilize Microsoft technology to build real-world solutions. He has 15 years of development and consulting experience on the Microsoft data platform, and has worked with SQL Server Integration Services (SSIS) since the earliest private beta. Since then, he has used SSIS to deliver data warehousing, extract, transform, and load (ETL), and data migration solutions for many well-known companies, and has presented on SSIS topics at events such as Professional Association for SQL Server (PASS) Summit, Microsoft Tech-Ed, and SQL Server product launches.
Rushabh Mehta is a Mentor for Solid Quality Mentors, and is also the Managing Director for Solid Quality India Pvt. Ltd. Mehta has actively worked with the Microsoft SQL Server and Business Intelligence (BI) platform since 1998. He has developed numerous BI solutions, including predictive-analysis systems, health-care solutions, multi-terabyte financial decision support systems, as well as designing and building an analytical platform to support the Microsoft Windows Azure platform. His experience spans a wide breadth of industrial sectors, including manufacturing, retail, finance, and government, working with clients such as Raymond James Financials, Jackson Hewitt, Publix, the U.S. Department of Veterans Affairs, the U.S. Department of Defense, and Microsoft. Mehta has also developed a number of best practices and standards for implementing BI solutions. He has also developed and delivered advanced training courses on Microsoft BI practices and technologies. An active speaker at a number of large conferences such as the Professional Association for SQL Server (PASS) and TechEd, Mehta also frequently speaks at SQL Server and BI User Groups around the world. Mehta is the Executive Vice President for PASS, the largest global user-driven community for SQL Server professionals.
Credits
Executive Editor: Robert Elliott
Project Editor: Kevin Shafer
Technical Editor: Paul S. Waters
Production Editor: Kathleen Wisor
Copy Editor: Paula Lowell
Editorial Director: Robyn B. Siesky
Editorial Manager: Mary Beth Wakefield
Production Manager: Tim Tate
Vice President and Executive Group Publisher: Richard Swadley
Vice President and Executive Publisher: Barry Pruett
Associate Publisher: Jim Minatel
Proofreader: Scott Klemp, Word One
Indexer: Robert Swanson
Cover Image: ©iStockphoto.com/wolv
Acknowledgments
I want to first thank Jessica, Brian, and Jay for adding their valuable expertise and experience to several of the chapters of this book, and sharing their approach and code practices from the SSIS framework, to the dynamic extraction logic, to data cleansing best practices. A big collective thanks to our clients! The unique projects and challenges helped shape and refine this content. I added up the years of experience for all of us, and these pages represent about 50 years of work with data processing and ETL — and that doesn’t include late nights and long weekends.
I also want to thank Paul Waters and Rushabh Mehta for their roles in this book as Technical Editor and contributing author, respectively. A big thanks to the Wrox team for their support, hard work, and valuable feedback. And a thanks to you, the reader, because I actually read the Amazon feedback and try to address the constructive criticism in each new work. Hopefully, the diligence has paid off. If so, help us and let the world know!
Last, but not least, thanks to my family for their support. A special thanks to Amy and her encouragement and hard work in so many ways.
— Erik Veerman
As a first-time book author, this acknowledgment holds special meaning to me. I will try to refrain from mentioning everyone I have ever met, and focus on a few special people that truly affected the outcome of this book.
First of all, a huge thank you is due to Erik Veerman, the lead author on this book. He not only convinced me to join this troupe of authors, but also took the time to patiently answer all of my questions about the process. Next, thank you to Jay Hackney, Brian Knight, Rushabh Mehta, Paul Waters, and Wrox for their authoring, technical, and editing contributions. I truly appreciated your support and help.
On a personal note, I would like to thank my friends and family, Mom, Dad, and Lianna, for their never-ending support and patience as I disappeared for days at a time. Finally, a few other people who I would like to recognize for their previous support are Fred Creasy, Georgia Willis, Kathi Walker, and Paul Reynolds — thank you.
— Jessica M. Moss
Thanks to everyone who made this book possible. As always, I owe a huge debt to my wife, Jenn, for putting up with my late nights, and to my children, Colton, Liam, and Camille, for being so patient with their tired dad who has always overextended. Thanks also to the makers of Guinness and other hard liquors for sustaining my ability to keep on writing technical books. Finally, thanks for all the User Group leaders out there who work so hard to help others become proficient in technology. You make a huge difference!
—Brian Knight
I’d like to thank Erik for this opportunity and his hard work on this book.
— Jay Hackney
Introduction
Our world is a world full of data and, as a developer, database administrator (DBA), or data architect, your job will no doubt require processing data — consolidating data, integrating system data, migrating data, cleansing data, transforming data, or loading data into a data warehouse. This book is about solving problems and building solutions with SQL Server 2008 Integration Services (SSIS). It focuses on the design best practices for the top uses and challenges of SSIS.
In this book, you will learn about how to approach the following kinds of projects, and how to effectively work within SSIS to apply its functionality to these common project requirements and challenges:
Designing Extraction, Transformation, and Loading (ETL) solutions with SSIS to handle file management, data warehouse processing, systems integration, and data cleansing.
Resolving common SSIS package deployment and auditing challenges that often get in the way of designing a solid solution.
Building an SSIS-based solution that performs well, and can deal with failures or unplanned system outages.
As an author team, our vision for this book is very different from that of the other books on the shelves, which focus mainly on how to use the knobs and buttons in SSIS. Books such as Professional Microsoft SQL Server 2008 Integration Services (Indianapolis: Wiley, 2008) are very valuable to have on your bookshelf for reference and for learning SSIS. But this book is about going to the next level of solution design. Our goal in writing this book was to focus on solving problems, building solutions, and providing design best practices.
In summary, the difference between this SSIS book and all the others out there is that other books simply focus on the product features with little emphasis on solution design. If you go out and buy a new power saw, the manual is going to tell you how to angle the blade, set the right depth, and make a clean cut. But the manual doesn’t tell you how to build something. This book shows you how to build the furniture, not just how to use the saw.
To be sure, you must know how to use SSIS before you can build a solution. But going from knowledge to design requires guidance on the right approach, and how to avoid the common pitfalls.
This book empowers you with the confidence, the knowledge, and the understanding to make the right choices in your ETL design, so that it enables easy administration, meets your data processing requirements, and performs well for current and future scalability.
Who This Book Is For
This book is targeted at three primary SSIS users: data architects, ETL developers, and DBAs responsible for ETL administration and support.
Because this book focuses on problems and solutions, a base understanding of SSIS is required. A couple of areas of the book walk you through the more advanced features of SSIS, but most of the book builds on top of a foundation of SSIS knowledge and experience. If you have taken an SSIS class, or have read another book and tried out SSIS, or you have built a few packages for various purposes, then that base knowledge will give you enough background for this book. But you should be up for the challenge of learning a new tool in the context of applying it!
The perfect reader of this book is someone who is in the early stages of a new ETL or data integration project (or a redesign project), and is eager to know how to approach the effort with the best practices when designing SSIS packages.
If you are supporting an existing project and must make some changes to aid in administration, deployment, or scalability, then you will also benefit from several of the chapters herein.
What This Book Covers
This book is an expanded, revised, and rewritten version of Expert SQL Server 2005 Integration Services (Indianapolis: Wiley, 2007). The authors of this book have expanded the coverage to address the current trends in ETL and SSIS, including creating a scale-out execution model, performing advanced data profiling and cleansing, and handling file management and file processing.
This book also addresses some of the challenges in SSIS surrounding auditing, configurations, and execution management. Two chapters focus on solving these administrative challenges:
Chapter 2 covers building an SSIS package framework, and provides you with an auditing, configuration, and reporting structure that you can use in the packages of your solution.
Chapter 3 also helps you in the area of administration by addressing package deployment and storage. It answers the questions, “Where should I store my packages — in the file system or in SQL Server?” and “What is the best SSIS deployment approach for my environment?” The right answer is that “it depends,” and Chapter 3 walks you through all the dependencies, while leading you to the right answer for your environment.
Every ETL or data integration solution involves data extraction of one kind or another. Maybe you’re extracting data from flat files, or perhaps your source is coming from an enterprise resource planning (ERP) system, or a line-of-business application. Regardless, you must implement a data extraction methodology that is efficient, reduces the impact on the source, and adequately handles changes and data tracking. Chapter 5 dives into many of the data extraction areas, and even provides you with a dynamic data extraction approach.
Another area that this book covers is data warehouse ETL. In fact, one of the more common uses of SSIS is data warehousing ETL. Chapters 7, 8, and 9 are dedicated to solving the data warehouse ETL challenges all the way from dimension and fact table ETL, to processing cubes in SQL Server Analysis Services (SSAS).
Not to be overlooked, Chapter 11 examines control flow and data flow scripting, and provides examples of common (but complex) challenges that can be solved by scripting, including advanced file management and data processing. Chapter 12 provides performance troubleshooting steps and best practices on data flow design. This chapter also contrasts the right use of SQL commands versus the data flow.
Before you even begin diving into the details of an SSIS-based solution, you must start out on the right foot! Chapter 1 reviews solution architecture, project approach and methodology, package design patterns, the importance of I/O performance, and data element tracking. It answers the question, “Where do I begin and how do I approach an ETL solution?”
In all, this book presents a comprehensive picture of SSIS solution challenges and design best practices.
How This Book Is Structured
Before reviewing the chapter organization, let’s first note how each chapter is internally structured. Every chapter contains three primary elements — a “Problem” section, a “Design” section, and a “Solution” section. In fact, some chapters are structured with more than one Problem–Design–Solution grouping. Each collection of Problem–Design–Solution addresses the following:
“Problem”
— Discusses either the current challenges related to the area of focus, or the SSIS-specific problems when trying to design an aspect of the solution.
“Design”
— Presents the SSIS best practices, or outlines the design aspects and recommendations for a specific solution area.
“Solution”
— Ties all the aspects of the solution together, and gives either an example solution related to the area, or considers the SSIS solution approach, and factors involved in solving the problem outlined earlier.
This book is generally organized in the way that you would approach a data integration or ETL project. The chapter flow builds on the natural progression that a developer or administrator would go through when designing an SSIS solution.
After beginning with an overview of architecture, the book then moves into putting together the underlying support structure of a solution — the storage, deployment, and management framework. Next, the natural progression is to handle the source data, whether that is in files or extracted from a relational database system (RDBMS), and often requiring a data-cleansing process. Next, the chapters delve into dimension and fact table loading, as well as the cube-processing steps. The final chapters address advanced data handling through scripting, and provide the package availability and performance that many solutions require.
The following chapter provides a solution overview and discusses planning:
Chapter 1,
“SSIS Solution Architecture,” addresses the overall solution challenges and planning best practices.
The following chapters examine SSIS administration and the deployment foundation:
Chapter 2,
“SSIS Management Framework Design,” reviews and demonstrates the importance of building an SSIS auditing and execution framework.
Chapter 3,
“Package Deployment and Storage Decisions,” discusses the practices and decisions related to SSIS package deployment and storage.
The following chapters discuss file management and data extraction:
Chapter 4,
“File-Handling and Processing Methods,” covers how to deal with different file types, file sources, and file management.
Chapter 5,
“Data Extraction Best Practices,” dives into how to manage data source extractions, including incremental and change data management.
The following chapter discusses data cleansing:
Chapter 6,
“Data-Cleansing Design,” reviews data profiling in SSIS, and how to perform data cleansing of complex and dirty data.
The following chapters cover data warehouse ETL:
Chapter 7,
“Dimension Table ETL,” covers the challenges related to dimension table data changes, and loading for different dimension table types.
Chapter 8,
“Fact Table ETL,” addresses ETL for data warehouse fact tables.
Chapter 9,
“SSAS Processing Architecture,” looks at how to handle SSAS cube processing in different ways, based on your processing requirements.
The following chapters examine advanced ETL concepts:
Chapter 10,
“Implementing Scale-Out ETL Process,” looks at how to deal with package reliability, and provides an approach to scaling out an SSIS solution.
Chapter 11,
“Scripting Design Patterns,” presents a few different solution examples geared toward handling advanced data and processing requirements with scripts.
Chapter 12,
“SSIS Package Scaling,” presents troubleshooting steps, and provides best practices for package design with a focus on performance.
The Problem–Design–Solution format and order of the chapters together provide a well thought-out, organized, and systematic approach to SSIS solution design.
What You Need To Use This Book
Following is a breakdown of what you need to use this book:
Since this book is about SQL Server 2008 Integration Services, you will get more out of it if you have the Developer Edition of SQL Server 2008 installed, including the sample applications and the tools.
If you don’t have a licensed copy of SQL Server 2008 Developer Edition, you can download a trial version from the Microsoft Web site,
http://download.microsoft.com
.
Search for “SQL Server 2008 Evaluation.”
The
Adventure Works 2008
samples are required for several of the examples in the chapters. The samples are available on Microsoft Open Source community site,
www.codeplex.com
.
Search for “SQL Server 2008 Product Samples.”
The source code for the chapters is posted on the Wrox Web site (
www.wrox.com
). See the later section, “Source Code,” for directions on how to access the material. Each chapter has the source code. In addition, a database called
SSIS_PDS
is included in the source code download, and has all the database objects required for the examples.
This book has been written for SQL Server 2008 with SP1 installed (which was released in April 2009).
Conventions
To help you get the most from the text and keep track of what’s happening, a number of conventions have been used throughout the book.
Boxes like this one hold important, not-to-be forgotten information that is directly relevant to the surrounding text.
Tips, hints, tricks, and asides to the current discussion are offset and placed in italics like this.
As for styles in the text:
Important new terms and important words are
highlighted
when we introduce them.
Keyboard strokes are shown like this: Ctrl+A.
File names, URLs, and code within the text are shown like this:
persistence.properties
.
Code is presented in the following two ways:
In code examples we highlight new and important code with boldfaced text. The boldface is not used for code that's less important in the present context, or has been shown before.
Source Code
As you work through the examples in this book, you may choose either to type in all the code manually, or use the source code files that accompany the book. All of the source code used in this book is available for download at www.wrox.com. Once at the site, simply locate the book’s title (either by using the Search box, or by using one of the title lists), and click the Download Code link on the book’s detail page to obtain all the source code for the book.
Because many books have similar titles, you may find it easiest to search by ISBN; for this book, the ISBN is 978-0-470-52576-0.
Once you download the code, just decompress it with your favorite compression tool. Alternately, you can go to the main Wrox code download page at www.wrox.com/dynamic/books/download.aspx to see the code available for this book and all other Wrox books.
Errata
We make every effort to ensure that there are no errors in the text or in the code. However, no one is perfect, and mistakes do occur. If you find an error in one of our books (such as a spelling mistake or faulty piece of code), we would be very grateful for your feedback. By sending in errata, you may save another reader hours of frustration, and, at the same time, you will be helping us provide even higher-quality information.
To find the errata page for this book, go to www.wrox.com and locate the title using the Search box or one of the title lists. Then, on the book details page, click the Book Errata link. On this page, you can view all errata that has been submitted for this book and posted by Wrox editors. A complete book list including links to each book’s errata is also available at www.wrox.com/misc-pages/booklist.shtml.
If you don’t spot “your” error on the Book Errata page, go to www.wrox.com/contact/techsupport.shtml and complete the form there to send us the error you have found. We’ll check the information and, if appropriate, post a message to the book’s errata page and fix the problem in subsequent editions of the book.
p2p.wrox.com
For author and peer discussion, join the P2P forums at p2p.wrox.com. The forums are a Web-based system for you to post messages relating to Wrox books and related technologies, and to interact with other readers and technology users. The forums offer a subscription feature to email you topics of interest of your choosing when new posts are made to the forums. Wrox authors, editors, other industry experts, and your fellow readers are present on these forums.
At http://p2p.wrox.com you will find a number of different forums that will help you not only as you read this book, but also as you develop your own applications. To join the forums, just follow these steps:
1. Go to p2p.wrox.com and click the Register link.
2. Read the terms of use and click Agree.
3. Complete the required information to join, as well as any optional information you wish to provide, and click Submit.
4. You will receive an email with information describing how to verify your account and complete the joining process.
You can read messages in the forums without joining P2P, but to post your own messages, you must join.
Once you join, you can post new messages and respond to messages other users post. You can read messages at any time on the Web. If you would like to have new messages from a particular forum emailed to you, click the “Subscribe to this Forum” icon by the forum name in the forum listing.
For more information about how to use the Wrox P2P, be sure to read the P2P FAQs for answers to questions about how the forum software works, as well as many common questions specific to P2P and Wrox books. To read the FAQs, click the FAQ link on any P2P page.
1
SSIS Solution Architecture
Imagine that this is the first day of a new internal or client project. You will have responsibility for the data and processing layer of the solution, which involves processing data — a lot of data — from several sources, and then either integrating systems, or maybe consolidating data for reporting. Maybe your background is as a developer, a database administrator (DBA), or a data designer, and you know SSIS fairly well. But now they are calling you the “data and ETL architect.”
ETL is the acronym for Extraction, Transformation, and Loading, which is used to describe the data-processing layer in a data-integration or data warehouse solution.
The project manager approaches you and says that the Vice President of Technology has asked the team to give him an estimate of the infrastructure needed. Furthermore, the business owner wants a high-level overview of how the solution architecture will help the company achieve the business need most efficiently. The project manager also wants your thoughts on the best way to approach the solution, how the development should be coordinated between team members, and how deployment should be handled.
Three meetings are scheduled for tomorrow to address these things, and you’ve been asked to provide your feedback in those areas.
Where do you start? How should you approach the solution design with SSIS as the main technology? How should all the pieces work together?
This chapter examines how to lay the foundation for successful solutions based on SQL Server Integration Services (SSIS). And, in fact, this whole book is about SSIS solutions to real-world requirements and challenges. It addresses questions such as the following:
What are the problems and issues?
What are the design principles?
How do you put it all together for a complete and successful solution?
Before you dive into the technical challenges of a project, you must first step back and ensure that you are laying the right foundation. Jumping right in is tempting! But resist the urge, because you want to (and need to) set the precedent and patterns for the solution upfront. If you don’t, chances are you won’t be able to go back and make changes down the road.
As with all chapters in this book, this chapter is organized into the following three major sections:
“Problem”
— Coordinating and architecting an SSIS solution is not an easy task. The “Problem” section reveals some common challenges and common mistakes when planning and extending your ETL environment.
“Design”
— The “Design” section in this chapter examines the right approach to a project, and the long-term project aspects that you should set in motion early in a project.
“Solution” —
In many ways, the remainder of this book provides you with the solutions to make it all work together. This section launches you into the rest of the book, and shows how you can follow the chapters to build or redesign your SSIS solution.
Problem
Data and ETL projects have many challenges. Some challenges relate to data, some to enterprise integration, some to project coordination, and some to general expectations. This section begins by looking at the bigger picture of data within an organization, but then quickly looks at ETL projects and SSIS packages and execution.
Macro Challenge: Enterprise Data Spaghetti
Maybe your SSIS project is only a small solution in a bigger enterprise pond. The problem is that it can still cause a ripple effect when you tie it into your environment. Or, you can have challenges caused by an unwieldy enterprise environment when you try to implement your solution.
Figure 1-1 shows a not-so-nice telephone/electricity pole that illustrates the data nervous system of many organizations.
Figure 1-1
The problem with Figure 1-1 is that this mess didn’t happen overnight! It grew into this twisted unorganized process because of poor planning, coordination, and execution. However, be aware that, a lot of the time, a corporation’s politics may lead to this type of situation. Departments hire their own technical people and try to go around IT. Systems don’t talk to each other nicely. Project pressures (such as time and budget) cause designers to cut corners.
Following are a few reasons why this kind of tangled mess happens in an organization’s data processing, and examples of the many problems that this “unorganization” causes:
Debilitating dependency chains
— The biggest problem is that systems are often built on top of systems on top of systems. The core source data has been connected through so many precedence links that every change takes extra time and administrative and development overhead. Systems at the source and in the middle of the dependency chain become irreplaceable because of the amount of effort that switching to a new system would take.
Unknown and uncontrollable data processing
— The operations that process data are not centralized and, in many cases, are unknown, because departmental applications are created without coordination with other applications. Processes run at uncontrolled times, and may impact the systems they touch even during peak hours, which affects work efficiency.
Fragile enterprise changes
— Changes to applications are difficult and costly. They may break processes, or cause data integration or reporting applications to be inaccurate.
Delayed data access
— Even when the processes are somewhat controlled, the complicated system dependencies cause delays in data availability, with nightly processes that often run into mid-morning schedules. When they break, customer perception and employee efficiency are affected.
The “Design” section later in this chapter discusses how to approach your SSIS-based ETL project in the right way, and ensure that you are helping to solve the problem, rather than adding to it.
Micro Challenge: Data-Processing Confusion
Another common problem with data processing occurs when the logic used to process data is overly complicated and confusing. Just like the macro enterprise problem, this problem usually is the result of changes over time, where logic is modified and appended. It usually comes in one of two forms:
Runaway stored procedures
— Procedures that run with complicated logic and lots of temporary tables, inserts, updates, and/or deletes can be difficult to manage, and are often inefficient. Supporting the procedures is also very difficult because the logic is difficult to follow, and, many times, the developers or DBAs who wrote the code are unavailable. Overall, this type of process requires a lot of administration and wasted time spent on following and learning the process.
Unmanageable packages
— SSIS packages can also be designed with difficult-to-follow logic and sometimes complex precedence with hundreds of components used in a single package. These kinds of packages have challenges similar to those of runaway stored procedures, such as troubleshooting and the learning curve required for the process.
Figure 1-2
shows the control flow of a package that has too many components to effectively manage. (The SSIS designer is zoomed in at 50 percent to fit on the screen.)
Figure 1-2
The overly complex control flow shown in Figure 1-2 is similar to an overly complex data flow, where too many components are used, thus making the development, troubleshooting, and support difficult to manage. The “Design” section later in this chapter proposes a better approach for SSIS packages called the modular package approach.
In summary, both of these types of processes (runaway procedures and unmanageable packages) are very difficult to support, and not suited to team development, error handling, and scalability (all of which are addressed in Chapter 12).
Problems with Execution and Troubleshooting
A couple of other issues that often come up in an ETL or data-integration solution are poor process coordination and difficulty doing root cause analysis. If the “what happened?” question can’t be answered quickly and with confidence, then likely there is a problem with the overall solution execution and logging strategy.
Figure 1-3 shows the command-line output of an example SSIS package execution.
Figure 1-3
If you find yourself spending time working through output like this to figure out what went wrong, then you should consider implementing a better execution and auditing structure. This includes package execution in your development environment.
If you have just turned on the out-of-the-box SSIS logging and are capturing the results to a table, that still may not be enough. If you write custom queries against the SSIS logging table every time you need to figure out what happened, then you also need a better strategy.
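For illustration, the following queries are a minimal sketch of the kind of ad hoc digging this implies, assuming the default SQL Server log provider in SSIS 2008 writing to its standard dbo.sysssislog table (the column names follow that provider’s defaults). A management framework such as the one in Chapter 2 replaces this guesswork with purpose-built auditing tables and reports.

-- Most recent execution captured by the default SQL Server log provider
DECLARE @executionid UNIQUEIDENTIFIER;
SELECT TOP (1) @executionid = executionid
FROM dbo.sysssislog
ORDER BY starttime DESC;

-- Errors raised during that execution
SELECT source, starttime, message
FROM dbo.sysssislog
WHERE executionid = @executionid
  AND event = 'OnError'
ORDER BY starttime;

-- Rough task durations from the OnPreExecute/OnPostExecute events
SELECT source,
       MIN(starttime) AS task_start,
       MAX(endtime) AS task_end,
       DATEDIFF(second, MIN(starttime), MAX(endtime)) AS duration_sec
FROM dbo.sysssislog
WHERE executionid = @executionid
  AND event IN ('OnPreExecute', 'OnPostExecute')
GROUP BY source
ORDER BY task_start;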
Infrastructure Challenges
When designing an SSIS ETL solution, how do you determine the infrastructure components such as server hardware, memory, processor cores, network switches, disk arrays, storage networks, and I/O controllers? Related to that, where should you run your SSIS packages taking into consideration sources, destinations, and other applications, while balancing hardware scalability and location within your network topology?
These questions are not trivial, and the answers depend on a lot of factors, including processing windows, source and destination availability, application impact and availability, network bandwidth, fault-tolerance requirements, and so on.
I/O is usually the biggest bottleneck, and the one most often overlooked. I/O (or, more precisely, disk I/O) is the throughput that the system can handle on the drive volumes. And this challenge is not just about trying to get the greatest throughput on a single drive. You must consider staging and temporary environments, logging, and current and historical data. And you must balance it all with hardware availability and budget.
The reason disk I/O is so important when considering a data-integration or ETL effort is because of the nature of what you are doing, including the following:
Bulk operations
— ETL and integration efforts typically process data in bulk, meaning that, when a process is kicked off (hourly, daily, or weekly), data is moved from sources to destinations with some transformation processes in-between. The processes usually move or integrate thousands or millions of records. That can be a lot of data that moves between systems, and it generates a lot of disk activity.
Source databases
— Processes that are extracting a lot of data from sources incur disk overhead either by the sheer volume of the data, or when complicated queries against large tables require temporary operations that use disks (such as the SQL Server
TempDB
).
Destination databases
— The nature of relational databases requires that data be stored to disk before a transaction is marked as complete. When inserting or updating a lot of data, the server must wait until the data is committed for the process to be complete.
Staging databases
— Staging databases and tables are a common intermediate step in an ETL process and can be used for many reasons. Any time that data is landed to a staging database and then extracted from the staging database, it has the overhead of both the insert and the select, which can, at times, be done simultaneously with inserts into the destination database and, therefore, the I/O is compounded.
File management
— A lot of ETL operations deal with files such as delimited files, XML, EDI, and so on. Large files require file management, archiving, and sometimes processing, and, therefore, incur disk I/O overhead.
The bottom line is that you will most likely have a disk I/O bottleneck in your data-processing operations, and you’ll need to plan and manage for that to meet your service level agreements (SLAs) and performance requirements.
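As one starting point for quantifying that bottleneck, the following query is a simple sketch (not part of the book’s sample code) that uses the sys.dm_io_virtual_file_stats dynamic management function, available in SQL Server 2005 and later, to show cumulative read/write volumes and I/O stall times per database file since the last restart. Pay particular attention to TempDB, staging, and destination databases while your packages are running.

-- Cumulative I/O and stall time per database file since the last restart
SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.physical_name,
       vfs.num_of_reads,
       vfs.num_of_writes,
       vfs.num_of_bytes_read / 1048576 AS mb_read,
       vfs.num_of_bytes_written / 1048576 AS mb_written,
       vfs.io_stall_read_ms,
       vfs.io_stall_write_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id
 AND mf.file_id = vfs.file_id
ORDER BY vfs.io_stall_read_ms + vfs.io_stall_write_ms DESC;

Files with large stall times relative to their read/write volumes point to the drive volumes that deserve attention when you lay out storage for staging, logging, and destination databases.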
Other Challenges
The list of common project challenges can go on and on, but here are a few more:
Data challenges
— Of course, you will have data challenges in your project anywhere from missing records to dirty data to bad data, and you will need to understand those problems as soon as possible so that you can set the expectations upfront about what can and what cannot be done about them. Although you can do a lot in SSIS (including fuzzy matching), magic is not on the component list — you can’t make up data that doesn’t exist. Don’t overpromise. Be realistic.
Corporate political challenges
— This book is about solving problems with a technology, namely SSIS. But, because you are trying to solve problems, you are going to be dealing with people. Everyone has an agenda, and a lot of times those agendas will not be in your project’s best interest. Watch out for people who are threatened because you are changing the way things are done (even when it is for the better), or because your solution will be replacing one of their legacy applications, or because they are territorial about their “sandbox.” You want to fix the enterprise spaghetti shown in
Figure 1-1
, but don’t forget that some people have their tribal knowledge and make their living by keeping the corporate data nervous system tangled.
Requirement and scope challenges
— Any project has scope creep. Just be careful about how the changes affect the project timeline, and don’t leave data validation until the last day. You’ll get burned!
Design
Now that you are scared, step back and take a deep breath. Designing an ETL process is doable, and, with the right approach in mind, you can be successful. This section discusses the overall design approach to an SSIS-based solution by examining the following:
Choosing the right tool
Solution architecture
Project planning
Package design patterns
Server and storage hardware
Package execution location
Choosing the Right Tool
This book is about applying SSIS. You are probably reading it because you assume that SSIS is the right tool for the job. That’s probably the case. However, be sure to step back, consider the task at hand, and confirm that SSIS is in fact the right fit for it.
Think about all the different types of data-processing needs that you have across your organization:
Data synchronization between systems
Data extraction from enterprise resource planning (ERP) systems
Ad hoc reporting
Replication (both homogeneous and heterogeneous)
PDA data synchronization
Legacy system integration
Data warehouse ETL processing
Vendors and partner data files integration
Line-of-business data processing
Customer and employee directory synchronization
As you may know, when it comes to data processing, a lot of tools are out there. Some are created for specific situations (such as folder synchronizing tools), whereas other tools are designed to perform a variety of functions for different situations. So, the traditional question is this: Which tool can best meet the business and logical requirements to perform the tasks needed?
Consider the host of tools found in the ever-evolving Microsoft toolset. You can use Transact SQL (TSQL) to hand-code a data load, Host Integration Server to communicate with a heterogeneous data source, BizTalk to orchestrate messages in a transactional manner, or SSIS to load data in batches. Each of these tools plays a role in the data world.
Although overlaps exist, each tool has a distinct focus and target purpose. When you become comfortable with a technology, there’s always the tendency to want to apply that technology beyond its intended “sweet spot” when another tool would be better for the job. You’ve no doubt heard the phrase “when all you have is a hammer, everything looks like a nail.” For example, C# developers may want to build an application to do something that SSIS could potentially do in an hour of development time. The challenge everyone faces is one of time and capacity. There is no way everyone can be an expert across the board. Therefore, developers and administrators alike should be diligent about performing research on tools and technologies that complement each other, based on different situations.
For example, many organizations use BizTalk for a host of purposes beyond the handling of business-to-business communication and process workflow automation. These same organizations may be perplexed as to why BizTalk doesn’t scale to meet the needs of the organization’s terabyte data warehousing ETL. The easy answer is that the right tool for bulk Business Intelligence (BI) processing is an ETL tool such as SSIS.
Be Careful About Tool Selection
In some client environments, an ETL tool may be chosen without consideration for the availability of industry skills, support, or even the learning curve. Even though the tool could perform “magic,” it usually doesn’t come with a pocket magician — just the magic of emptying your corporate wallet. In many cases, thousands of dollars have been spent on an ETL tool that takes too long to master, implement, and support. Beyond the standard functionality questions you should ask about a tool, be sure to also consider the following:
Your internal skill sets
The trend of industry use of the tool
How easy it is to learn
The ease of supporting the tool
Overall Solution Architecture
The reality is that creating a perfectly efficient enterprise data ecosystem is impossible. But there are certainly levels of efficiency that can be gained when your SSIS solution is planned and implemented thoughtfully. Figure 1-4 contrasts Figure 1-1 by showing a city’s central power station, organized and efficient.
Figure 1-4
The tendency when developing a new integration or ETL system is to get it done as quickly as possible. What often happens is that the overall architecture is not integrated well into an organization’s environment. Maybe some time is saved (and that is even questionable), but in the end, more time and money will be wasted.
A solution architecture should have several key data-processing objectives. The following apply to SSIS-based solutions, but also relate generically to any data-processing solution architecture:
The solution should coordinate with other data-centric solutions in the enterprise. Do not build a separate data silo, especially if your effort is a data warehouse or data mart — that causes multiple versions and variations of the data.
Source data that is required for the solution must be extracted as close to the source as possible, not pulled from the end of a long dependency chain. (Be sure to follow the previous bullet point.)
The solution should have a centralized administration and execution strategy that coordinates with other systems, or follows the practices of other corporate systems. This does not mean you must limit a scale-out architecture; it simply means that the support structures are centralized.
Real-time execution auditing is also needed so that you know what is happening and what has happened. This information goes a long way toward supporting a system. In addition, you should have a way to track data back to the source, which is critical both for data validation and for troubleshooting data errors.
The processes must have rollback layers. In other words, if your requirements necessitate a complicated or long process, don’t require the whole process to be re-run from the start if the last part breaks. Plan for restarting at interim points after the issues are identified. Doing so also enables you to compartmentalize changes in the ETL solution more easily. (A minimal sketch of an audit-and-restart pattern follows this list.)
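To make the auditing and restart objectives concrete, the following is a minimal T-SQL sketch of a batch log table and the statements a controlling process might run around each ETL step. The table name (dbo.BatchLog), its columns, and the package name are illustrative assumptions for this example, not part of any specific framework.

DECLARE @BatchLogID INT;

-- A hypothetical batch log table; in practice this would be created once, not per run.
CREATE TABLE dbo.BatchLog
(
    BatchLogID   INT IDENTITY(1,1) PRIMARY KEY,
    PackageName  NVARCHAR(260) NOT NULL,                     -- which package or step ran
    SourceSystem NVARCHAR(50)  NOT NULL,                     -- lineage back to the source
    StartTime    DATETIME      NOT NULL DEFAULT (GETDATE()),
    EndTime      DATETIME      NULL,
    RowsLoaded   INT           NULL,
    Status       VARCHAR(20)   NOT NULL DEFAULT ('Running')  -- Running, Succeeded, Failed
);

-- Before a step runs, record that it started and capture its log key.
INSERT INTO dbo.BatchLog (PackageName, SourceSystem)
VALUES (N'LoadCustomerStage', N'CRM');
SET @BatchLogID = SCOPE_IDENTITY();

-- After the step completes, close out the row so the run is fully auditable.
UPDATE dbo.BatchLog
SET EndTime = GETDATE(), RowsLoaded = 15000, Status = 'Succeeded'
WHERE BatchLogID = @BatchLogID;

-- On a restart, skip any step that already succeeded in today's run.
IF EXISTS (SELECT 1 FROM dbo.BatchLog
           WHERE PackageName = N'LoadCustomerStage'
             AND Status = 'Succeeded'
             AND StartTime >= CAST(GETDATE() AS DATE))
    PRINT 'Step already succeeded; skipping.';

In a real solution, this logging and restart logic typically belongs in a management framework (such as the one designed in Chapter 2) rather than in ad hoc scripts, but the principle is the same: every step records what it did, and a failed run can resume at the first step that has not yet succeeded.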
These objectives represent the larger picture of an overall solution architecture. Other aspects, of course, are important and situational to what you are building in SSIS. Two common types of data-processing efforts are discussed in the following sections: system integration and data warehouse ETL. These fit well into the SSIS toolset.
Data Integration or Consolidation
One common use of SSIS is to integrate data between systems, or to synchronize data. For example, you may want to create a business-to-business portal site, and you may need the site to interface with the source data on the mainframe. In this case, you may get the data delivered in nightly extracts from the mainframe and load it into your SQL Server table.
Another very common ETL task that DBAs face is receiving files from File Transfer Protocol (FTP) servers (or from network shares) that must be processed and loaded into another system. This type of process involves moving files and then processing the data, which may involve de-duping (removing duplicates), combining files, cleaning bad data, and so on. Two systems may also need to talk to one another, or pass business keys, in order for records to be matched between environments.
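To make the de-duping step concrete, the following is a minimal T-SQL sketch that removes duplicate rows from a hypothetical staging table after the delivered files have been loaded into it. The table name (dbo.CustomerStage), the business key (CustomerNumber), and the LoadDate column are assumptions for the example only.

-- Remove duplicates from the staging table, keeping the most recent copy
-- of each business key; LoadDate is assumed to be stamped on each row at load time.
;WITH Ranked AS
(
    SELECT CustomerNumber,
           LoadDate,
           ROW_NUMBER() OVER (
               PARTITION BY CustomerNumber      -- one surviving row per business key
               ORDER BY     LoadDate DESC       -- keep the most recently loaded copy
           ) AS RowRank
    FROM dbo.CustomerStage
)
DELETE FROM Ranked
WHERE RowRank > 1;                              -- every older duplicate is removed

The same result can also be achieved inside the SSIS data flow (for example, the Sort transformation can discard rows with duplicate sort key values), so where you de-dupe is largely a question of data volume and where the data already sits.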
Figure 1-5 shows an example solution architecture that integrates data between different systems in an enterprise.
Figure 1-5
In this diagram, data from different systems is integrated. Often, a master data management service (sometimes called master dimension management) helps synchronize entities between systems so that an organization’s data relates across those systems (for example, so that a customer record from the customer relationship management, or CRM, system can tie to the sales records in the enterprise resource planning, or ERP, system). Some aspects of this data process are bidirectional, and others perform extractions and loads. Data staging is used in this example to help integrate the data, and a data store is used to centralize many of the corporate data processes (which helps alleviate the long chains of system dependencies).
Of course, other variations of system integration solutions exist, such as consolidation of different divisional data, especially when companies go through mergers and acquisitions.
Data Warehouse ETL
One of the more common uses of SSIS is for performing data warehouse ETL. Data warehousing focuses on decision support, or enabling better decision making through organized accessibility of information. As opposed to a transactional system such as a point of sale (POS), Human Resources (HR), or CRM system, which is designed to allow rapid transactions that capture information, a data warehouse is tuned for reporting and analysis. Because data warehousing focuses on extracting, consolidating, and reporting information to show trends and data summaries, the ETL portion of a data warehouse is critical.
Processing ETL for data warehousing involves extracting data from source systems or files, performing transformation logic on the data (to correlate, cleanse, and consolidate), and then loading a data warehouse environment (for reporting and analysis). Figure 1-6 shows a common data-processing architecture for a data warehouse ETL system.
Figure 1-6
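To ground the load step, the following is a minimal T-SQL sketch of one common pattern: an upsert from a cleansed staging table into a dimension table using the MERGE statement introduced in SQL Server 2008. The table and column names are hypothetical, and the example overwrites changed attributes rather than tracking slowly changing dimension history.

-- Upsert cleansed staging rows into a hypothetical customer dimension.
MERGE dbo.DimCustomer AS dim
USING dbo.CustomerStage AS stg
    ON dim.CustomerNumber = stg.CustomerNumber            -- match on the business key
WHEN MATCHED THEN
    UPDATE SET dim.CustomerName = stg.CustomerName,       -- refresh changed attributes
               dim.City         = stg.City
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerNumber, CustomerName, City)            -- add brand-new customers
    VALUES (stg.CustomerNumber, stg.CustomerName, stg.City);

In an SSIS solution, this kind of load is often implemented in the data flow instead, with a Lookup transformation splitting new rows from changed rows, but the set-based form shows the shape of the work.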
If you are already versed in ETL concepts and practice, you may know that when it comes to developing a data warehouse ETL system, moving from theory to practice often presents the biggest challenge. Did you know that ETL typically takes up between 50 and 70 percent of the effort in a data warehousing project? That is quite a daunting statistic. What it means is that even though presenting the data is the end goal and the driving force for the business, the largest portion of developing a data warehouse is spent not on the presentation and organization of the data, but rather on the behind-the-scenes processing to get the data ready.
Project Planning and Team Development
This is not a project methodology book, but you should give some thought to your solution approach. Whether your overall objective is system integration or warehouse ETL, consider using an agile development methodology. An agile methodology is an iterative approach to development: you add features of the solution through smaller development cycles and refine requirements as the solution progresses.
Agile Benefits
Even if your solution does not involve a user interface (as with a system integration), an agile approach enables you to tackle aspects of the solution in smaller development cycles and to troubleshoot data issues along the way. Following are some of the general benefits of this approach:
Project and task definition — The agile approach requires definition of the tasks in a project and the prioritization of those tasks, which are set and known by the technology team and the business or project owners. Tasks can change as a better understanding of the requirements develops.
Scope management — Given the clear definition of the activities in an iteration, the scope is set, and any deviation is known to, and agreed upon by, the parties involved. In essence, project communication is clearer for all parties: developer, management, and ownership.
Addressing of issues and obstacles — Part of the ongoing process involves identifying the areas of the project that can impede progress. These are highlighted and addressed early in the process.
Roles and responsibility clarity — Roles are clearly defined and tasks are assigned, which limits the possibility of team members spinning their wheels in the process.
Agile Cautions and Planning
However, you must exercise some caution. Do not use an agile methodology to foster bad architecture practices. In other words, if you are using the agile approach and your (or your developers’) ultimate goal is simply to meet the tasks and deadlines in whatever way possible, you will fall into the trap of compounding the problems of your overall data nervous system (that is, the problems outlined in the earlier “Problem” section).
You must ensure that you have an overall solution architecture, and your agile tasks must fit in that plan and support the cause.
Therefore, whatever project methodology you use, be sure to push for an upfront plan and architecture. If you don’t, you will likely run into the tyranny-of-the-urgent problem: you become overwhelmed with tasks and, as you throw them together, your solution gets out of control, the code gets messy, and your stress compounds over time.
Following are a few things to consider in your development process:
Plan for an early proof-of-concept, and use it to iron out your data-processing architecture.
Set your SSIS package and database conventions upfront, including your auditing structure (as discussed in Chapter 2).
Estimate your data volumes in one of the initial development cycles so that you can purchase the right hardware.
Get your server storage ironed out upfront. Be sure to set expectations with the storage group or vendor early on in the process.
