SAP Data Services 4.x Cookbook

Ivan Shomnikov

Description

Want to cost-effectively deliver trusted information to all of your crucial business functions? SAP Data Services delivers one enterprise-class solution for data integration, data quality, data profiling, and text data processing. It boosts productivity with a single solution for data quality and data integration. SAP Data Services also enables you to move, improve, govern, and unlock big data.
This book will lead you through the SAP Data Services environment to efficiently develop ETL processes. To begin with, you’ll learn to install, configure, and prepare the ETL development environment. You will become familiar with the concepts of developing ETL processes with SAP Data Services. Starting from the smallest unit of work—the data flow—the chapters will lead you to the highest organizational unit—the Data Services job—revealing the advanced techniques of ETL design.
You will learn to import XML files by creating and implementing real-time jobs. It will then guide you through the ETL development patterns that enable the most effective performance when extracting, transforming, and loading data. You will also find out how to create validation functions and transforms.
Finally, the book will show you the benefits of data quality management with the help of another SAP solution—Information Steward.




Table of Contents

SAP Data Services 4.x Cookbook
Credits
About the Author
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Instant updates on new Packt books
Preface
What this book covers
What you need for this book
Who this book is for
Sections
Getting ready
How to do it…
How it works…
There's more…
See also
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
1. Introduction to ETL Development
Introduction
Preparing a database environment
Getting ready
How to do it…
How it works…
Creating a source system database
How to do it…
How it works…
There's more…
Defining and creating staging area structures
How to do it…
Flat files
RDBMS tables
How it works…
Creating a target data warehouse
Getting ready
How to do it…
How it works…
There's more…
2. Configuring the Data Services Environment
Introduction
Creating IPS and Data Services repositories
Getting ready…
How to do it…
How it works…
See also
Installing and configuring Information Platform Services
Getting ready…
How to do it…
How it works…
Installing and configuring Data Services
Getting ready…
How to do it…
How it works…
Configuring user access
Getting ready…
How to do it…
How it works…
Starting and stopping services
How to do it…
How it works…
See also
Administering tasks
How to do it…
How it works…
See also
Understanding the Designer tool
Getting ready…
How to do it…
How it works…
Executing ETL code in Data Services
Validating ETL code
Template tables
Query transform basics
The HelloWorld example
3. Data Services Basics – Data Types, Scripting Language, and Functions
Introduction
Creating variables and parameters
Getting ready
How to do it…
How it works…
There's more…
Creating a script
How to do it…
How it works…
Using string functions
How to do it…
Using string functions in the script
How it works…
There's more…
Using date functions
How to do it…
Generating current date and time
Extracting parts from dates
How it works…
There's more…
Using conversion functions
How to do it…
How it works…
There's more…
Using database functions
How to do it…
key_generation()
total_rows()
sql()
How it works…
Using aggregate functions
How to do it…
How it works…
Using math functions
How to do it…
How it works…
There's more…
Using miscellaneous functions
How to do it…
How it works…
Creating custom functions
How to do it…
How it works…
There's more…
4. Dataflow – Extract, Transform, and Load
Introduction
Creating a source data object
How to do it…
How it works…
There's more…
Creating a target data object
Getting ready
How to do it…
How it works…
There's more…
Loading data into a flat file
How to do it…
How it works…
There's more…
Loading data from a flat file
How to do it…
How it works…
There's more…
Loading data from table to table – lookups and joins
How to do it…
How it works…
Using the Map_Operation transform
How to do it…
How it works…
Using the Table_Comparison transform
Getting ready
How to do it…
How it works…
Exploring the Auto correct load option
Getting ready
How to do it…
How it works…
Splitting the flow of data with the Case transform
Getting ready
How to do it…
How it works…
Monitoring and analyzing dataflow execution
Getting ready
How to do it…
How it works…
There's more…
5. Workflow – Controlling Execution Order
Introduction
Creating a workflow object
How to do it…
How it works…
Nesting workflows to control the execution order
Getting ready
How to do it
How it works…
Using conditional and while loop objects to control the execution order
Getting ready
How to do it…
How it works…
There is more…
Using the bypassing feature
Getting ready…
How to do it…
How it works…
There is more…
Controlling failures – try-catch objects
How to do it…
How it works…
Use case example – populating dimension tables
Getting ready
How to do it…
How it works…
Mapping
Dependencies
Development
Execution order
Testing ETL
Preparing test data to populate DimSalesTerritory
Preparing test data to populate DimGeography
Using a continuous workflow
How to do it…
How it works…
There is more…
Peeking inside the repository – parent-child relationships between Data Services objects
Getting ready
How to do it…
How it works…
Get a list of object types and their codes in the Data Services repository
Display information about the DF_Transform_DimGeography dataflow
Display information about the SalesTerritory table object
See the contents of the script object
6. Job – Building the ETL Architecture
Introduction
Projects and jobs – organizing ETL
Getting ready
How to do it…
How it works…
Hierarchical object view
History execution log files
Executing/scheduling jobs from the Management Console
Using object replication
How to do it…
How it works…
Migrating ETL code through the central repository
Getting ready
How to do it…
How it works…
Adding objects to and from the Central Object Library
Comparing objects between the Local and Central repositories
There is more…
Migrating ETL code with export/import
Getting ready
How to do it…
Import/Export using ATL files
Direct export to another local repository
How it works…
Debugging job execution
Getting ready…
How to do it…
How it works…
Monitoring job execution
Getting ready
How to do it…
How it works…
Building an external ETL audit and audit reporting
Getting ready…
How to do it…
How it works…
Using built-in Data Services ETL audit and reporting functionality
Getting ready
How to do it…
How it works…
Auto Documentation in Data Services
How to do it…
How it works…
7. Validating and Cleansing Data
Introduction
Creating validation functions
Getting ready
How to do it…
How it works…
Using validation functions with the Validation transform
Getting ready
How to do it…
How it works…
Reporting data validation results
Getting ready
How to do it…
How it works…
Using regular expression support to validate data
Getting ready
How to do it…
How it works…
Enabling dataflow audit
Getting ready
How to do it…
How it works…
There's more…
Data Quality transforms – cleansing your data
Getting ready
How to do it…
How it works…
There's more…
8. Optimizing ETL Performance
Introduction
Optimizing dataflow execution – push-down techniques
Getting ready
How to do it…
How it works…
Optimizing dataflow execution – the SQL transform
How to do it…
How it works…
Optimizing dataflow execution – the Data_Transfer transform
Getting ready
How to do it…
How it works…
Why we used a second Data_Transfer transform object
When to use Data_Transfer transform
There's more…
Optimizing dataflow readers – lookup methods
Getting ready
How to do it…
Lookup with the Query transform join
Lookup with the lookup_ext() function
Lookup with the sql() function
How it works…
Query transform joins
lookup_ext()
sql()
Performance review
Optimizing dataflow loaders – bulk-loading methods
How to do it…
How it works…
When to enable bulk loading?
Optimizing dataflow execution – performance options
Getting ready
How to do it…
Dataflow performance options
Source table performance options
Query transform performance options
lookup_ext() performance options
Target table performance options
9. Advanced Design Techniques
Introduction
Change Data Capture techniques
Getting ready
No history SCD (Type 1)
Limited history SCD (Type 3)
Unlimited history SCD (Type 2)
How to do it…
How it works…
Source-based ETL CDC
Target-based ETL CDC
Native CDC
Automatic job recovery in Data Services
Getting ready
How to do it…
How it works…
There's more…
Simplifying ETL execution with system configurations
Getting ready
How to do it…
How it works…
Transforming data with the Pivot transform
Getting ready
How to do it…
How it works…
10. Developing Real-time Jobs
Introduction
Working with nested structures
Getting ready
How to do it…
How it works…
There is more…
The XML_Map transform
Getting ready
How to do it…
How it works…
The Hierarchy_Flattening transform
Getting ready
How to do it…
Horizontal hierarchy flattening
Vertical hierarchy flattening
How it works…
Querying result tables
Configuring Access Server
Getting ready
How to do it…
How it works…
Creating real-time jobs
Getting ready
Installing SoapUI
How to do it…
How it works…
11. Working with SAP Applications
Introduction
Loading data into SAP ERP
Getting ready
How to do it…
How it works…
IDoc
Monitoring IDoc load on the SAP side
Post-load validation of loaded data
There is more…
12. Introduction to Information Steward
Introduction
Exploring Data Insight capabilities
Getting ready
How to do it…
Creating a connection object
Profiling the data
Viewing profiling results
Creating a validation rule
Creating a scorecard
How it works…
Profiling
Rules
Scorecards
There is more…
Performing Metadata Management tasks
Getting ready
How to do it…
How it works…
Working with the Metapedia functionality
How to do it…
How it works…
Creating a custom cleansing package with Cleansing Package Builder
Getting ready
How to do it…
How it works…
There is more…
Index

SAP Data Services 4.x Cookbook

SAP Data Services 4.x Cookbook

Copyright © 2015 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: November 2015

Production reference: 1261115

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78217-656-5

www.packtpub.com

Credits

Author

Ivan Shomnikov

Reviewers

Andrés Aguado Aranda

Dick Groenhof

Bernard Timbal Duclaux de Martin

Sridhar Sunkaraneni

Meenakshi Verma

Commissioning Editor

Vinay Argekar

Acquisition Editors

Shaon Basu

Kevin Colaco

Content Development Editor

Merint Mathew

Technical Editor

Humera Shaikh

Copy Editors

Brandt D'mello

Shruti Iyer

Karuna Narayanan

Sameen Siddiqui

Project Coordinator

Francina Pinto

Proofreader

Safis Editing

Indexer

Monica Ajmera Mehta

Production Coordinator

Nilesh Mohite

Cover Work

Nilesh Mohite

About the Author

Ivan Shomnikov is an SAP analytics consultant specializing in the area of Extract, Transform, and Load (ETL). He has in-depth knowledge of the data warehouse life cycle processes (DWH design and ETL development) and extensive hands-on experience with both the SAP Enterprise Information Management (Data Services) technology stack and the SAP BusinessObjects reporting products stack (Web Intelligence, Designer, Dashboards).

Ivan has been involved in the implementation of complex BI solutions on the SAP BusinessObjects Enterprise platform in major New Zealand companies across different industries. He also has a strong background as an Oracle database administrator and developer.

This is my first experience of writing a book, and I would like to thank my partner and my son for their patience and support.

About the Reviewers

Andrés Aguado Aranda is a 26-year-old computer engineer from Spain. His experience has given him a strong technical background in databases, data warehousing, and business intelligence.

Since 2012, Andrés has held data-related positions in different business sectors, such as banking, public administration, and energy.

This book is my first stint as a reviewer, and it has been really interesting and valuable to me, both personally and professionally.

I would like to thank my family and friends for always being willing to help me when I needed it. I would also like to thank my former coworker and current friend, Antonio Martín-Cobos, a BI reporting analyst who really helped me get this opportunity.

Dick Groenhof started his professional career in 1990 after finishing his studies in business information science at Vrije Universiteit Amsterdam. Having worked as a software developer and service management consultant for the first part of his career, he has been active as a consultant in the business intelligence arena since 2005.

Dick has been a lead consultant on numerous SAP BI projects, designing and implementing successful solutions for his customers, who regard him as a trusted advisor. His core competences include both frontend (such as Web Intelligence, Crystal Reports, and SAP Design Studio) and backend tools (such as SAP Data Services and Information Steward). Dick is an early adopter of the SAP HANA platform, creating innovative solutions using HANA Information Views, Predictive Analysis Library, and SQLScript.

He is a Certified Application Associate in SAP HANA and SAP BusinessObjects Web Intelligence 4.1. Currently, Dick works as a senior HANA and big data consultant for a highly respected and innovative SAP partner in the Netherlands.

He is a strong believer in sharing his knowledge with regard to SAP HANA and SAP Data Services by writing blogs (at http://www.dickgroenhof.com and http://www.thenextview.nl/blog) and speaking at seminars.

Dick is happily married to Emma and is a very proud father of his son, Christiaan, and daughter, Myrthe.

Bernard Timbal Duclaux de Martin is a business intelligence architect and technical expert with more than 15 years of experience. He has been involved in several large business intelligence system deployments and administration in banking and insurance companies. In addition, Bernard has skills in modeling, data extraction, transformation, loading, and reporting design. He has authored four books, including two regarding SAP BusinessObjects Enterprise administration.

Meenakshi Verma has been a part of the IT industry since 1998. She is an experienced business systems specialist with CBAP and TOGAF certifications. Meenakshi is well-versed in a variety of tools and techniques used for business analysis, such as SAP BI, SAP BusinessObjects, Java/J2EE technologies, and others. She is currently based in Toronto, Canada, and works with a leading utility company.

Meenakshi has helped technically review many books published by Packt Publishing across various enterprise solutions. Her earlier works include JasperReports for Java Developers, Java EE 5 Development using GlassFish Application Server, Practical Data Analysis and Reporting with BIRT, EJB 3 Developer Guide, Learning Dojo, and IBM WebSphere Application Server 8.0 Administration Guide.

I'd like to thank my father, Mr. Bhopal Singh, and mother, Mrs. Raj Bala, for laying a strong foundation in me and giving me their unconditional love and support. I also owe thanks and gratitude to my husband, Atul Verma, for his encouragement and support throughout the reviewing of this book and many others; my ten-year-old son, Prieyaansh Verma, for giving me the warmth of his love despite my hectic schedules; and my brother, Sachin Singh, for always being there for me.

www.PacktPub.com

Support files, eBooks, discount offers, and more

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www2.packtpub.com/books/subscription/packtlib

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.

Why subscribe?

Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser

Free access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.

Instant updates on new Packt books

Get notified! Find out when new books are published by following @PacktEnterprise on Twitter or the Packt Enterprise Facebook page.

Preface

SAP Data Services delivers an enterprise-class solution to build data integration processes as well as perform data quality and data profiling tasks, allowing you to govern your data in a highly efficient way.

Some of the tasks that Data Services helps accomplish include: migrating data between databases or applications, extracting data from various source systems into flat files, data cleansing, data transformation using either common database-like functions or complex custom-built functions that are created using an internal scripting language, and of course, loading data into your data warehouse or external systems. SAP Data Services has an intuitive, user-friendly graphical interface, allowing you to access all its powerful Extract, Transform, and Load (ETL) capabilities from the single Designer tool. However, getting started with SAP Data Services can be difficult, especially for people who have little or no experience in ETL development. The goal of this book is to guide you through easy-to-understand examples of building your own ETL architecture. The book can also be used as a reference to perform specific tasks as it provides real-world examples of using the tool to solve data integration problems.

What this book covers

Chapter 1, Introduction to ETL Development, explains what Extract, Transform, and Load (ETL) processes are, and what role Data Services plays in ETL development. It includes the steps to configure the database environment used in recipes of the book.

Chapter 2, Configuring the Data Services Environment, explains how to install and configure all Data Services components and applications. It introduces the Data Services development GUI—the Designer tool—with the simple example of "Hello World" ETL code.

Chapter 3, Data Services Basics – Data Types, Scripting Language, and Functions, introduces the reader to Data Services internal scripting language. It explains various categories of functions that are available in Data Services, and gives the reader an example of how scripting language can be used to create custom functions.

Chapter 4, Dataflow – Extract, Transform, and Load, introduces the most important processing unit in Data Services, the dataflow object, and the most useful types of transformations that can be performed inside a dataflow. It gives the reader examples of extracting data from source systems and loading data into target data structures.

Chapter 5, Workflow – Controlling Execution Order, introduces another Data Services object, workflow, which is used to group other workflows, dataflows, and script objects into execution units. It explains the conditional and loop structures available in Data Services.

Chapter 6, Job – Building the ETL Architecture, brings the reader to the job object level and reviews the steps used in the development process to make a successful and robust ETL solution. It covers the monitoring and debugging functionality available in Data Services and embedded audit features.

Chapter 7, Validating and Cleansing Data, introduces the concepts of validation methods that can be applied to the data passing through ETL processes in order to cleanse and conform it according to the defined Data Quality standards.

Chapter 8, Optimizing ETL Performance, is the first of the advanced chapters, which explain complex ETL development techniques. This particular chapter helps the user understand how existing processes can be optimized further in Data Services in order to make sure that they run quickly and efficiently, consuming as few computer resources as possible with the least amount of execution time.

Chapter 9, Advanced Design Techniques, guides the reader through advanced data transformation techniques. It introduces concepts of Change Data Capture methods that are available in Data Services, pivoting transformations, and automatic recovery concepts.

Chapter 10, Developing Real-time Jobs, introduces the concept of nested structures and the transforms that work with nested structures. It covers the main aspects of how they can be created and used in Data Services real-time jobs. It also introduces a new Data Services component—Access Server.

Chapter 11, Working with SAP Applications, is dedicated to the topic of reading and loading data from SAP systems with the example of the SAP ERP system. It presents the real-life use case of loading data into the SAP ERP system module.

Chapter 12, Introduction to Information Steward, covers another SAP product, Information Steward, which accompanies Data Services and provides a comprehensive view of the organization's data, and helps validate and cleanse it by applying Data Quality methods.

What you need for this book

To use the examples given in this book, you will need to download and make sure that you are licensed to use the following software products:

SQL Server Express 2012
SAP Data Services 4.2 SP4 or higher
SAP Information Steward 4.2 SP4 or higher
SAP ERP (ECC)
SoapUI 5.2.0

Who this book is for

The book will be useful to application developers and database administrators who want to get familiar with ETL development using SAP Data Services. It can also be useful to ETL developers or consultants who want to improve and extend their knowledge of this tool, as well as to data and business analysts who want to take a peek at the backend of BI development. The only requirement of this book is that you are familiar with the SQL language and general database concepts. Knowledge of any kind of programming language will be a benefit as well.

Sections

In this book, you will find several headings that appear frequently (Getting ready, How to do it, How it works, There's more, and See also).

To give clear instructions on how to complete a recipe, we use these sections as follows:

Getting ready

This section tells you what to expect in the recipe, and describes how to set up any software or any preliminary settings required for the recipe.

How to do it…

This section contains the steps required to follow the recipe.

How it works…

This section usually consists of a detailed explanation of what happened in the previous section.

There's more…

This section consists of additional information about the recipe in order to make the reader more knowledgeable about the recipe.

See also

This section provides helpful links to other useful information for the recipe.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail <[email protected]>, and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Downloading the color images of this book

We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from: https://www.packtpub.com/sites/default/files/downloads/6565EN_Graphics.pdf.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at <[email protected]>, and we will do our best to address the problem.

Chapter 1. Introduction to ETL Development

In this chapter, we will cover:

Preparing a database environment
Creating a source system database
Defining and creating staging area structures
Creating a target data warehouse

Introduction

Simply put, Extract-Transform-Load (ETL) is an engine of any data warehouse. The nature of the ETL system is straightforward:

Extract data from operational databases/systems
Transform data according to the requirements of your data warehouse so that the different pieces of data can be used together
Apply data quality transformation methods in order to cleanse data and ensure that it is reliable before it gets loaded into a data warehouse
Load conformed data into a data warehouse so that end users can access it via reporting tools, using client applications directly, or with the help of SQL-based query tools

While your data warehouse delivery structures or data marts represent the frontend or, in other words, what users see when they access the data, the ETL system itself is a backbone backend solution that does all the work of moving data and getting it ready in time for users to use. Building the ETL system can be a really challenging task, and though it is not part of the data warehouse data structures, it is definitely the key factor in defining the success of the data warehouse solution as a whole. In the end, who wants to use a data warehouse where the data is unreliable, corrupted, or sometimes even missing? This is exactly what ETL is responsible for getting right.

The data structure types most often used in ETL development to move data between sources and targets are flat files, XML datasets, and DBMS tables, both in normalized schemas and dimensional data models. When choosing an ETL solution, you face a simple choice: building a handcoded ETL solution or using a commercial one.

The following are some advantages of a handcoded ETL solution:

A programming language allows you to build your own sophisticated transformations
You are more flexible in building the ETL architecture as you are not limited by the vendor's ETL abilities
Sometimes, it can be a cheap way of building a few simplistic ETL processes, whereas buying an ETL solution from a vendor can be overkill
You do not have to spend time learning the commercial ETL solution's architecture and functionality

Here are some advantages of a commercial ETL solution:

This is more often a simpler, faster, and cheaper development option, as a variety of existing tools allow you to build a very sophisticated ETL architecture quickly
You do not have to be a professional programmer to use the tool
It automatically manages ETL metadata by collecting, storing, and presenting it to the ETL developer, which is another important aspect of any ETL solution
It has a huge range of additional ready-to-use functionality, from built-in schedulers to various connectors to existing systems, built-in data lineages, impact analysis reports, and many others

In the majority of DWH projects, the commercial ETL solution from a specific vendor, in spite of the higher immediate cost, eventually saves you a significant amount of money on the development and maintenance of ETL code.

SAP Data Services is an ETL solution provided by SAP and is part of the Enterprise Information Management product stack, which also includes SAP Information Steward; we will review this in one of the last chapters of this book.

Preparing a database environment

This recipe will lead you through the steps of preparing the working environment: setting up a database environment to be used by ETL processes as the source system, the staging area, and the target system for the migrated and transformed data.

Getting ready

To start the ETL development, we need to think about three things: the system that we will source the data from, our staging area (for initial extracts and as a preliminary storage for data during subsequent transformation steps), and finally, the data warehouse itself, to which the data will be eventually delivered.

How to do it…

Throughout the book, we will use a 64-bit environment, so ensure that you download and install the 64-bit versions of software components. Perform the following steps:

1. Let's start by preparing our source system. For quick deployment, we will choose the Microsoft SQL Server 2012 Express database, which is available for download at http://www.microsoft.com/en-nz/download/details.aspx?id=29062.
2. Click on the Download button and select the SQLEXPRWT_x64_ENU.exe file in the list of files that are available for download. This package contains everything required for the installation and configuration of the database server: the SQL Server Express database engine and the SQL Server Management Studio tool.
3. After the download is complete, run the executable file and follow the instructions on the screen. The installation of SQL Server 2012 Express is extremely straightforward, and all options can be set to their default values. There is no need to create any default databases during or after the installation as we will do it a bit later.

How it works…

After you have completed the installation, you should be able to run the SQL Server Management Studio application and connect to your database engine using the settings provided during the installation process.

If you have done everything correctly, you should see the "green" state of your Database Engine connection in the Object Explorer window of SQL Server Management Studio, as shown in the following screenshot:

We need an "empty" installation of MS SQL Server 2012 Express because we will create all the databases we need manually in the next steps of this chapter. This database engine installation will host all our source, stage, and target relational data structures. This option allows us to easily build a test environment that is perfect for learning purposes in order to become familiar with ETL development using SAP Data Services.

In a real-life scenario, your source databases, staging area database, and DWH database/appliance will most likely reside on separate server hosts, and they may sometimes be from different vendors. So, the role of SAP Data Services is to link them together in order to migrate data from one system to another.

Defining and creating staging area structures

In this recipe, we will talk about ETL data structures that will be used in this book. Staging structures are important storage areas where extracted data is kept before it gets transformed or stored between the transformation steps. The staging area in general can be used to create backup copies of data or to run analytical queries on the data in order to validate the transformations made or the extract processes. Staging data structures can be quite different, as you will see. Which one to use depends on the tasks you are trying to accomplish, your project requirements, and the architecture of the environment used.

How to do it…

The most popular data structures that could be used in the staging area are flat files and RDBMS tables.

Flat files

One of the perks of using Data Services over a handcoded ETL solution is that Data Services allows you to easily read from and write information to a flat file.

Create the C:\AW\ folder, which will be used throughout this book to store flat files.

Note

Inserting data into a flat file is faster than inserting data into an RDBMS table. So, during ETL development, flat files are often used to reach two goals simultaneously: creating a backup copy of the data snapshot and providing you with the storage location for your preliminary data before you apply the next set of transformation rules.

Another common use of flat files is the ability to exchange data between systems that cannot communicate with each other in any other way.

Lastly, it is very cost-effective to store flat files (OS disk storage space is cheaper than DB storage space).

The main disadvantage of the flat files storage method is that the modification of data in a flat file can sometimes be a real pain, not to mention that it is much slower than modifying data in a relational DB table.

RDBMS tables

These ETL data structures will be used more often than others to stage the data that is going through the ETL transformation process.

Let's create two separate databases for relational tables, which will play the role of the ETL staging area in our future examples:

1. Open SQL Server Management Studio.
2. Right-click on the Databases icon and select the New Database… option.
3. On the next screen, input ODS as the database name, and specify 100 MB as the initial size value of the database file and 10 MB as that of the transactional log file.
4. Repeat the last two steps to create another database called STAGE.
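If you prefer scripting over clicking through the GUI, the same two databases can be created from a query window in SQL Server Management Studio. The following T-SQL is only a minimal sketch of the steps above: it assumes the default logical file names that SQL Server assigns when no file specification is given (ODS/ODS_log and STAGE/STAGE_log), so adjust the names if your instance is configured differently.

-- Create the two staging-area databases with default file locations.
CREATE DATABASE ODS;
CREATE DATABASE STAGE;
GO

-- Grow the files to the sizes used in the New Database... dialog:
-- 100 MB for the data file and 10 MB for the transaction log.
ALTER DATABASE ODS MODIFY FILE (NAME = ODS, SIZE = 100MB);
ALTER DATABASE ODS MODIFY FILE (NAME = ODS_log, SIZE = 10MB);
ALTER DATABASE STAGE MODIFY FILE (NAME = STAGE, SIZE = 100MB);
ALTER DATABASE STAGE MODIFY FILE (NAME = STAGE_log, SIZE = 10MB);
GO

Either approach produces the same result; the GUI route is used throughout the book, and the script is simply a convenience if you ever need to rebuild the environment.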

How it works…

Let's recap. The ETL staging area is a location to store the preliminary results of our ETL transformations and also a landing zone for the extracts from the source system.

Yes, Data Services allows you to extract data and perform all transformations in memory before loading the data into the target system. However, as you will see in later chapters, an ETL process that does everything in one "go" can be complex and difficult to maintain. Plus, if something goes wrong along the way, all the changes that the process has already performed will be lost and you may have to start the extraction/transformation process again. This obviously creates an extra workload on the source system because you have to query it again in order to get the data. Finally, big does not mean effective. We will show you how splitting your ETL process into smaller pieces helps you to create a well-performing sequence of dataflows.

The ODS database will be used as a landing zone for the data coming from source systems. The structure of the tables here will be identical to the structure of the source system tables.

The STAGE database will hold the relational tables used to store data between the data transformation steps.

We will also store some data extracted from a source database in a flat file format to demonstrate the ability of Data Services to work with them and show the convenience of this data storage method in the ETL system.

Chapter 2. Configuring the Data Services Environment

In this chapter, we will install and configure all the components required for SAP Data Services, covering the following topics:

Creating IPS and Data Services repositories
Installing and configuring Information Platform Services
Installing and configuring Data Services
Configuring user access
Starting and stopping services
Administering tasks
Understanding the Designer tool

Introduction

The same thing that makes SAP Data Services a great ETL development environment also makes it a non-trivial one to install and configure. Here, though, you have to remember that Data Services is an enterprise-class ETL solution that is able to solve the most complex ETL tasks.

See the following image for a very high-level Data Services architecture view. Data Services has two basic groups of components: client tools and server-based components:

Client tools include the following (there are more, but we mention the ones most often used):

The Designer tool: This is the client-based main GUI application for ETL development
Repository Manager: This is a client-based GUI application for Data Services to create, configure, and upgrade Data Services repositories

The main server-based components include the following ones:

IPS Services: This is used for user authentication, system configuration storage, and internal metadata management
Job Server: This is a core engine service that executes ETL code
Access Server: This is a real-time request-reply message broker, which implements real-time services in the Data Services environment
Web application server: This provides access to some Data Services administration and reporting tasks via the DS Management Console and Central Management Console web-based applications

In the course of the next few recipes, we will install, configure, and access all the components required to perform the majority of ETL development tasks. You will learn about their purposes and some useful tips that will help you effectively work in the Data Services environment throughout the book and in your future work.

Data Services installation supports all major OS and database environments. For learning purposes, we have chosen the Windows OS as it involves the least configuration on the user part. Both client tools and server components will be installed on the same Windows host.

Creating IPS and Data Services repositories

The IPS repository is a storage for environment and user configuration information and metadata collected by various services of IPS and Data Services. It has another name: the CMS database. This name should be quite familiar to those who have used SAP Business Intelligence software. Basically, IPS is a light version of the SAP BI product package. You will always use only one IPS repository per Data Services installation and most likely will deal with it only once: when configuring the environment at the very beginning. Most of the time, Data Services will be communicating with IPS services and the CMS database in the background, without you even noticing.

The Data Services repository is a different story. It is much closer to an ETL developer as it is a database that stores your developed code. In a multiuser development environment, every ETL developer usually has their own repository. Repositories can be of two types: central and local. They serve different purposes in the ETL lifecycle, and I will explain this in more detail in the upcoming chapters. Meanwhile, let's create our first local Data Services repository.

Getting ready…

Both repositories will be stored in the same SQL Server Express RDBMS ((local)\SQLEXPRESS) that we used to create our source OLTP database, ETL staging databases, and target data warehouse. So, at this point, you only need to have access to SQL Server Management Studio, and your SQL Server Express services need to be running.

How to do it…

This will consist of two major tasks:

Creating the databases:

1. Log in to SQL Server Management Studio and create two databases: IPS_CMS and DS_LOCAL_REPO.
2. Right now, your database list should look like this:

Configuring the ODBC layer: the installation requires that you create the ODBC data source for the IPS_CMS database.

1. Go to Control Panel | Administrative Tools | ODBC Data Sources (64-bit).
2. Open the System DSN tab and click on the Add… button.
3. Choose the name of the data source (SQL_IPS), the description (SQL Server Express), and the SQL Server you want to connect to through this ODBC data source ((local)\SQLEXPRESS). Then, click on Next.
4. Choose SQL Server authentication and select the checkbox Connect to SQL to obtain the default settings. Enter the login ID (the sa user) and password. Click on Next.
5. Select the checkbox and change the default database to IPS_CMS. Click on Next.
6. Skip the next screen by clicking on Next.
7. The final screen of the ODBC configuration should look like the following screenshot. Clicking on the Test Data Source button should then give you the message TESTS COMPLETED SUCCESSFULLY!
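If you would rather script the database part of this recipe, the two repository databases can be created with a couple of T-SQL statements instead of the GUI. The sketch below keeps all file settings at their defaults (the database names come from the recipe) and finishes with a quick check against sys.databases to confirm that both databases exist.

-- Create the empty repository databases used by IPS and Data Services.
CREATE DATABASE IPS_CMS;
CREATE DATABASE DS_LOCAL_REPO;
GO

-- Verify that both databases are now visible on the instance.
SELECT name, create_date
FROM sys.databases
WHERE name IN ('IPS_CMS', 'DS_LOCAL_REPO');

Note that the ODBC data source still has to be created through the ODBC Data Sources (64-bit) applet as described above, because the IPS installer connects to the CMS database through that DSN.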

How it works…

These two empty databases will be used by Data Services tools during installation and post-installation configuration tasks. All structures inside them will be created and populated automatically.

Usually, these databases are not meant to be accessed by users directly, but in the upcoming chapters, I will show you a few tricks on how to extract valuable information from them in order to troubleshoot potential problems, do a little bit of ETL metadata reporting, or run an extended search for ETL objects, which is not possible in the GUI of the Designer tool.
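As a small preview of those tricks (they are covered properly in the "Peeking inside the repository" recipe of Chapter 5), here is the kind of query you can run once the local repository has been created and populated. It is only a sketch based on the assumption that the repository stores its reusable objects in the usual AL_LANG table with a TYPE column; the exact table and column layout can differ between Data Services versions, so treat it as illustrative rather than definitive.

-- Count the objects stored in the local repository by their internal type code.
-- Assumes the AL_LANG object table of a local Data Services repository;
-- run it only after the repository has been created with the Repository Manager.
USE DS_LOCAL_REPO;

SELECT TYPE, COUNT(*) AS object_count
FROM AL_LANG
GROUP BY TYPE
ORDER BY object_count DESC;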

The ODBC layer configured for the IPS_CMS database allows you to access it from the IPS installation. When we install both IPS and Data Services, you will be able to connect to the databases directly from the Data Services applications, as Data Services has native drivers for various types of databases and also allows you to connect through ODBC layers if you want.

See also

References to a future chapter containing techniques mentioned in the preceding paragraph.

Installing and configuring Information Platform Services

The Information Platform Services (IPS) product package was added as a component into the Data Services bundle starting from the Data Services 4.x version. The reason for this was to make the Data Services architecture flexible and robust and introduce some extra functionality, that is, a user management layer to the existing SAP Data Services solution. As we mentioned before, IPS is a light version of SAP BI core services and has a lot of similar functionality.

In this recipe, we will perform the installation and basic configuration of IPS, which is a mandatory component for future Data Services installations.

Tip

As an option, you could always use the existing full enterprise SAP BI solution if you have it installed in your environment. However, this is generally considered a bad practice. Imagine that it is like storing all your eggs in one basket. Whenever you need to plan downtime for your BI system, you should keep in mind that it will affect your ETL environment as well, and you will not be able to run any Data Services jobs during this period. That is why IPS is installed to be used only by Data Services as a safer and more convenient option in terms of support and maintenance.

Getting ready…

Download the Information Platform Services installation package from the SAP support portal and unzip it to the location of your choice. The main requirement for installing IPS as well as Data Services in the next recipe is that your OS should have a 64-bit architecture.

How to do it…

1. Create an EIM folder in your C drive to store your installation in one place.
2. Launch the IPS installer by executing InstallIPS.exe.
3. Make sure that all your critical prerequisites have the Succeeded status on the Check Prerequisites screen. Continue to the next screen.
4. Choose C:\EIM\ as the installation destination folder. Continue to the next screen.
5. Choose the Full installation type. Continue to the next screen.
6. On Select Default or Existing Database, choose Configure an existing database and continue to the next screen.
7. Select Microsoft SQL Server using ODBC as the existing CMS database type.
8. Select No auditing database on the next screen and continue.
9. Choose Install the default Tomcat Java Web Application Server and automatically deploy web applications. Continue to the next screen.
10. For version management, choose Do not configure a version control system at this time.
11. On the next screen, specify the SIA name in the Node name field as IPS and SIA port as 6410.
12. Do not change the default CMS port, 6400.
13. On the CMS account configuration