Want to cost-effectively deliver trusted information to all of your crucial business functions? SAP Data Services delivers one enterprise-class solution for data integration, data quality, data profiling, and text data processing, boosting productivity with a single solution for data quality and data integration. SAP Data Services also enables you to move, improve, govern, and unlock big data.
This book will lead you through the SAP Data Services environment to efficiently develop ETL processes. To begin with, you'll learn to install, configure, and prepare the ETL development environment. You will become familiar with the concepts of developing ETL processes with SAP Data Services. Starting from the smallest unit of work, the data flow, the chapters will lead you to the highest organizational unit—the Data Services job, revealing the advanced techniques of ETL design.
You will learn to import XML files by creating and implementing real-time jobs. The book will then guide you through the ETL development patterns that enable the most effective performance when extracting, transforming, and loading data. You will also find out how to create validation functions and transforms.
Finally, the book will show you the benefits of data quality management with the help of another SAP solution—Information Steward.
Copyright © 2015 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: November 2015
Production reference: 1261115
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78217-656-5
www.packtpub.com
Author
Ivan Shomnikov
Reviewers
Andrés Aguado Aranda
Dick Groenhof
Bernard Timbal Duclaux de Martin
Sridhar Sunkaraneni
Meenakshi Verma
Commissioning Editor
Vinay Argekar
Acquisition Editors
Shaon Basu
Kevin Colaco
Content Development Editor
Merint Mathew
Technical Editor
Humera Shaikh
Copy Editors
Brandt D'mello
Shruti Iyer
Karuna Narayanan
Sameen Siddiqui
Project Coordinator
Francina Pinto
Proofreader
Safis Editing
Indexer
Monica Ajmera Mehta
Production Coordinator
Nilesh Mohite
Cover Work
Nilesh Mohite
Ivan Shomnikov is an SAP analytics consultant specializing in the area of Extract, Transform, and Load (ETL). He has in-depth knowledge of the data warehouse life cycle processes (DWH design and ETL development) and extensive hands-on experience with both the SAP Enterprise Information Management (Data Services) technology stack and the SAP BusinessObjects reporting products stack (Web Intelligence, Designer, Dashboards).
Ivan has been involved in the implementation of complex BI solutions on the SAP BusinessObjects Enterprise platform in major New Zealand companies across different industries. He also has a strong background as an Oracle database administrator and developer.
This is my first experience of writing a book, and I would like to thank my partner and my son for their patience and support.
Andrés Aguado Aranda is a 26-year-old computer engineer from Spain. His experience has given him a strong technical background in databases, data warehousing, and business intelligence.
Since 2012, Andrés has worked in data-related positions across different business sectors, such as banking, public administration, and energy.
This book is my first stint as a reviewer, and it has been really interesting and valuable to me, both personally and professionally.
I would like to thank my family and friends for always being willing to help me when I needed it. I would also like to thank my former coworker and now friend, Antonio Martín-Cobos, a BI reporting analyst who really helped me get this opportunity.
Dick Groenhof started his professional career in 1990 after finishing his studies in business information science at Vrije Universiteit Amsterdam. Having worked as a software developer and service management consultant for the first part of his career, he has been active as a consultant in the business intelligence arena since 2005.
Dick has been a lead consultant on numerous SAP BI projects, designing and implementing successful solutions for his customers, who regard him as a trusted advisor. His core competences include both frontend (such as Web Intelligence, Crystal Reports, and SAP Design Studio) and backend tools (such as SAP Data Services and Information Steward). Dick is an early adopter of the SAP HANA platform, creating innovative solutions using HANA Information Views, Predictive Analysis Library, and SQLScript.
He is a Certified Application Associate in SAP HANA and SAP BusinessObjects Web Intelligence 4.1. Currently, Dick works as senior HANA and big data consultant for a highly respected and innovative SAP partner in the Netherlands.
He is a strong believer in sharing his knowledge with regard to SAP HANA and SAP Data Services by writing blogs (at http://www.dickgroenhof.com and http://www.thenextview.nl/blog) and speaking at seminars.
Dick is happily married to Emma and is a very proud father of his son, Christiaan, and daughter, Myrthe.
Bernard Timbal Duclaux de Martin is a business intelligence architect and technical expert with more than 15 years of experience. He has been involved in several large business intelligence system deployments and administration in banking and insurance companies. In addition, Bernard has skills in modeling, data extraction, transformation, loading, and reporting design. He has authored four books, including two regarding SAP BusinessObjects Enterprise administration.
Meenakshi Verma has been a part of the IT industry since 1998. She is an experienced business systems specialist holding the CBAP and TOGAF certifications. Meenakshi is well-versed in a variety of tools and techniques used for business analysis, such as SAP BI, SAP BusinessObjects, Java/J2EE technologies, and others. She is currently based in Toronto, Canada, and works with a leading utility company.
Meenakshi has helped technically review many books published by Packt Publishing across various enterprise solutions. Her earlier works include JasperReports for Java Developers, Java EE 5 Development using GlassFish Application Server, Practical Data Analysis and Reporting with BIRT, EJB 3 Developer Guide, Learning Dojo, and IBM WebSphere Application Server 8.0 Administration Guide.
I'd like to thank my father, Mr. Bhopal Singh, and mother, Mrs. Raj Bala, for laying a strong foundation in me and giving me their unconditional love and support. I also owe thanks and gratitude to my husband, Atul Verma, for his encouragement and support throughout the reviewing of this book and many others; my ten-year-old son, Prieyaansh Verma, for giving me the warmth of his love despite my hectic schedules; and my brother, Sachin Singh, for always being there for me.
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www2.packtpub.com/books/subscription/packtlib
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.
If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.
Get notified! Find out when new books are published by following @PacktEnterprise on Twitter or the Packt Enterprise Facebook page.
SAP Data Services delivers an enterprise-class solution to build data integration processes as well as perform data quality and data profiling tasks, allowing you to govern your data in a highly-efficient way.
Some of the tasks that Data Services helps accomplish include migrating data between databases or applications, extracting data from various source systems into flat files, cleansing data, transforming data using either common database-like functions or complex custom functions built with the internal scripting language, and, of course, loading data into your data warehouse or external systems. SAP Data Services has an intuitive, user-friendly graphical interface, allowing you to access all its powerful Extract, Transform, and Load (ETL) capabilities from the single Designer tool. However, getting started with SAP Data Services can be difficult, especially for people who have little or no experience in ETL development. The goal of this book is to guide you through easy-to-understand examples of building your own ETL architecture. The book can also be used as a reference for performing specific tasks, as it provides real-world examples of using the tool to solve data integration problems.
Chapter 1, Introduction to ETL Development, explains what Extract, Transform, and Load (ETL) processes are, and what role Data Services plays in ETL development. It includes the steps to configure the database environment used in the recipes of this book.
Chapter 2, Configuring the Data Services Environment, explains how to install and configure all Data Services components and applications. It introduces the Data Services development GUI—the Designer tool—with the simple example of "Hello World" ETL code.
Chapter 3, Data Services Basics – Data Types, Scripting Language, and Functions, introduces the reader to the Data Services internal scripting language. It explains the various categories of functions available in Data Services and gives the reader an example of how the scripting language can be used to create custom functions.
Chapter 4, Dataflow – Extract, Transform, and Load, introduces the most important processing unit in Data Services, the dataflow object, and the most useful types of transformations that can be performed inside a dataflow. It gives the reader examples of extracting data from source systems and loading data into target data structures.
Chapter 5, Workflow – Controlling Execution Order, introduces another Data Services object, workflow, which is used to group other workflows, dataflows, and script objects into execution units. It explains the conditional and loop structures available in Data Services.
Chapter 6, Job – Building the ETL Architecture, brings the reader to the job object level and reviews the steps used in the development process to make a successful and robust ETL solution. It covers the monitoring and debugging functionality available in Data Services and embedded audit features.
Chapter 7, Validating and Cleansing Data, introduces validation methods that can be applied to the data passing through the ETL processes in order to cleanse and conform it according to the defined Data Quality standards.
Chapter 8, Optimizing ETL Performance, is the first of the advanced chapters, which explain complex ETL development techniques. This particular chapter helps the user understand how existing processes can be optimized further in Data Services to make sure that they run quickly and efficiently, consuming as few computing resources as possible in the least amount of execution time.
Chapter 9, Advanced Design Techniques, guides the reader through advanced data transformation techniques. It introduces concepts of Change Data Capture methods that are available in Data Services, pivoting transformations, and automatic recovery concepts.
Chapter 10, Developing Real-time Jobs, introduces the concept of nested structures and the transforms that work with them. It covers the main aspects of how they can be created and used in Data Services real-time jobs. It also introduces a new Data Services component—the Access Server.
Chapter 11, Working with SAP Applications, is dedicated to the topic of reading and loading data from SAP systems with the example of the SAP ERP system. It presents the real-life use case of loading data into the SAP ERP system module.
Chapter 12, Introduction to Information Steward, covers another SAP product, Information Steward, which accompanies Data Services and provides a comprehensive view of the organization's data, and helps validate and cleanse it by applying Data Quality methods.
To use the examples given in this book, you will need to download and make sure that you are licensed to use the following software products:
The book will be useful to application developers and database administrators who want to get familiar with ETL development using SAP Data Services, and to ETL developers or consultants who want to improve and extend their knowledge of this tool. Data and business analysts who want to take a peek at the backend of BI development will also find it useful. The only requirement of this book is that you are familiar with the SQL language and general database concepts. Knowledge of any kind of programming language will be a benefit as well.
In this book, you will find several headings that appear frequently (Getting ready, How to do it, How it works, There's more, and See also).
To give clear instructions on how to complete a recipe, we use these sections as follows:
This section tells you what to expect in the recipe, and describes how to set up any software or any preliminary settings required for the recipe.
This section contains the steps required to follow the recipe.
This section usually consists of a detailed explanation of what happened in the previous section.
This section consists of additional information about the recipe in order to make the reader more knowledgeable about the recipe.
This section provides helpful links to other useful information for the recipe.
Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.
To send us general feedback, simply e-mail <[email protected]>, and mention the book's title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from: https://www.packtpub.com/sites/default/files/downloads/6565EN_Graphics.pdf.
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at <[email protected]> with a link to the suspected pirated material.
We appreciate your help in protecting our authors and our ability to bring you valuable content.
If you have a problem with any aspect of this book, you can contact us at <[email protected]>, and we will do our best to address the problem.
In this chapter, we will cover:
Simply put, Extract-Transform-Load (ETL) is the engine of any data warehouse. The nature of the ETL system is straightforward:
While your data warehouse delivery structures or data marts represent the frontend or, in other words, what users see when they access the data, the ETL system itself is the backend backbone that does all the work of moving data and getting it ready in time for users to use. Building the ETL system can be a really challenging task, and though it is not part of the data warehouse data structures, it is definitely the key factor in defining the success of the data warehouse solution as a whole. In the end, who wants to use a data warehouse where the data is unreliable, corrupted, or sometimes even missing? This is exactly what ETL is responsible for getting right.
The data structure types most often used in ETL development to move data between sources and targets are flat files, XML datasets, and DBMS tables, both in normalized schemas and dimensional data models. When choosing an ETL solution, you face a simple choice: build a handcoded ETL solution or use a commercial one.
The following are some advantages of a handcoded ETL solution:
Here are some advantages of a commercial ETL solution:
In the majority of DWH projects, the commercial ETL solution from a specific vendor, in spite of the higher immediate cost, eventually saves you a significant amount of money on the development and maintenance of ETL code.
SAP Data Services is an ETL solution provided by SAP and is part of the Enterprise Information Management product stack, which also includes SAP Information Steward; we will review this in one of the last chapters of this book.
This recipe will lead you through the further steps of preparing the working environment, such as preparing a database environment to be used by ETL processes as the source, staging, and target systems for the migrated and transformed data.
To start the ETL development, we need to think about three things: the system that we will source the data from, our staging area (for initial extracts and as a preliminary storage for data during subsequent transformation steps), and finally, the data warehouse itself, to which the data will be eventually delivered.
Throughout the book, we will use a 64-bit environment, so ensure that you download and install the 64-bit versions of software components. Perform the following steps:
After you have completed the installation, you should be able to run the SQL Server Management Studio application and connect to your database engine using the settings provided during the installation process.
If you have done everything correctly, you should see the "green" state of your Database Engine connection in the Object Explorer window of SQL Server Management Studio, as shown in the following screenshot:
We need an "empty" installation of MS SQL Server 2012 Express because we will create all the databases we need manually in the next steps of this chapter. This database engine installation will host all our source, stage, and target relational data structures. This option allows us to easily build a test environment that is perfect for learning purposes in order to become familiar with ETL development using SAP Data Services.
In a real-life scenario, your source databases, staging area database, and DWH database/appliance will most likely reside on separate server hosts, and they may sometimes be from different vendors. So, the role of SAP Data Services is to link them together in order to migrate data from one system to another.
In this recipe, we will talk about ETL data structures that will be used in this book. Staging structures are important storage areas where extracted data is kept before it gets transformed or stored between the transformation steps. The staging area in general can be used to create backup copies of data or to run analytical queries on the data in order to validate the transformations made or the extract processes. Staging data structures can be quite different, as you will see. Which one to use depends on the tasks you are trying to accomplish, your project requirements, and the architecture of the environment used.
The most popular data structures that could be used in the staging area are flat files and RDBMS tables.
One of the perks of using Data Services over a handcoded ETL solution is that Data Services allows you to easily read information from and write it to a flat file.
Create the C:\AW\ folder, which will be used throughout this book to store flat files.
Inserting data into a flat file is faster than inserting data into an RDBMS table. So, during ETL development, flat files are often used to reach two goals simultaneously: creating a backup copy of the data snapshot and providing you with the storage location for your preliminary data before you apply the next set of transformation rules.
Another common use of flat files is the ability to exchange data between systems that cannot communicate with each other in any other way.
Lastly, it is very cost-effective to store flat files (OS disk storage space is cheaper than DB storage space).
The main disadvantage of the flat file storage method is that modifying data in a flat file can sometimes be a real pain, not to mention that it is much slower than modifying data in a relational DB table.
These ETL data structures will be used more often than others to stage the data that is going through the ETL transformation process.
Let's create two separate databases for relational tables, which will play the role of the ETL staging area in our future examples:
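If you would rather script this step than click through SQL Server Management Studio, the following minimal T-SQL sketch creates the two databases. It assumes the database names ODS and STAGE described in the next section and leaves file locations and sizes at the instance defaults:

```sql
-- Run against the (local)\SQLEXPRESS instance used throughout this book.
CREATE DATABASE ODS;    -- landing zone for data extracted from the source system
GO
CREATE DATABASE STAGE;  -- storage for data between transformation steps
GO
```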
Let's recap. The ETL staging area is a location to store the preliminary results of our ETL transformations and also a landing zone for the extracts from the source system.
Yes, Data Services allows you to extract data and perform all transformations in memory before loading to the target system. However, as you will see in later chapters, an ETL process that does everything in one "go" can be complex and difficult to maintain. Plus, if something goes wrong along the way, all the changes that the process has already performed will be lost and you may have to start the extraction/transformation process again. This obviously creates an extra workload on the source system because you have to query it again in order to get the data. Finally, big does not mean effective. We will show you how splitting your ETL process into smaller pieces helps you create a well-performing sequence of dataflows.
The ODS database will be used as a landing zone for the data coming from source systems. The structure of the tables here will be identical to the structure of the source system tables.
The STAGE database will hold the relational tables used to store data between the data transformation steps.
We will also store some data extracted from a source database in a flat file format to demonstrate the ability of Data Services to work with them and show the convenience of this data storage method in the ETL system.
In this chapter, we will install and configure all the components required for SAP Data Services, covering the following topics:
The same thing that makes SAP Data Services a great ETL development environment also makes it a non-trivial one to install and configure. Here, though, you have to remember that Data Services is an enterprise-class ETL solution that is able to solve the most complex ETL tasks.
See the following image for a very high-level Data Services architecture view. Data Services has two basic groups of components: client tools and server-based components:
Client tools include the following (there are more, but we mention the ones most often used):
The main server-based components include the following ones:
In the course of the next few recipes, we will install, configure, and access all the components required to perform the majority of ETL development tasks. You will learn about their purposes and some useful tips that will help you effectively work in the Data Services environment throughout the book and in your future work.
Data Services installation supports all major OS and database environments. For learning purposes, we have chosen the Windows OS as it involves the least configuration on the user part. Both client tools and server components will be installed on the same Windows host.
The IPS repository is a storage area for environment and user configuration information and metadata collected by various services of IPS and Data Services. It has another name: the CMS database. This name should be quite familiar to those who have used SAP Business Intelligence software. Basically, IPS is a light version of the SAP BI product package. You will always use only one IPS repository per Data Services installation and will most likely deal with it only once: when configuring the environment at the very beginning. Most of the time, Data Services will be communicating with IPS services and the CMS database in the background, without you even noticing.
The Data Services repository is a different story. It is much closer to an ETL developer, as it is a database that stores your developed code. In a multiuser development environment, every ETL developer usually has their own repository. Repositories can be of two types: central and local. They serve different purposes in the ETL lifecycle, and I will explain this in more detail in the upcoming chapters. Meanwhile, let's create our first local Data Services repository.
Both repositories will be stored in the same SQL Server Express RDBMS ((local)\SQLEXPRESS) that we used to create our source OLTP database, ETL staging databases, and target data warehouse. So, at this point, you only need access to SQL Server Management Studio, and your SQL Server Express services need to be running.
This will consist of two major tasks:
These two empty databases will be used by Data Services tools during installation and post-installation configuration tasks. All structures inside them will be created and populated automatically.
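As a sketch of this first task, the two empty databases can also be created with plain T-SQL. The name IPS_CMS matches the database referenced later in this recipe, while DS_LOCALREPO is only a placeholder name for the local Data Services repository database:

```sql
-- Run against the same (local)\SQLEXPRESS instance.
CREATE DATABASE IPS_CMS;       -- CMS database used by Information Platform Services
GO
CREATE DATABASE DS_LOCALREPO;  -- placeholder name for the local Data Services repository
GO
```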
Usually, these databases are not meant to be accessed by users directly, but in the upcoming chapters, I will show you a few tricks on how to extract valuable information from them in order to troubleshoot potential problems, do a little bit of ETL metadata reporting, or run an extended search for ETL objects, which is not possible in the GUI of the Designer tool.
The ODBC layer configured for the IPS_CMS database allows you to access it from the IPS installation. Once we have installed both IPS and Data Services, you will be able to connect to the databases directly from the Data Services applications, as Data Services has native drivers for various types of databases and also allows you to connect through ODBC layers if you want.
Refer to the later chapters of this book for the techniques mentioned in the preceding paragraph.
The Information Platform Services (IPS) product package was added as a component to the Data Services bundle starting from Data Services version 4.x. The reason for this was to make the Data Services architecture flexible and robust and to introduce some extra functionality, namely a user management layer, to the existing SAP Data Services solution. As mentioned before, IPS is a light version of the SAP BI core services and has a lot of similar functionality.
In this recipe, we will perform the installation and basic configuration of IPS, which is a mandatory component for future Data Services installations.
As an option, you could always use an existing full enterprise SAP BI solution if you have it installed in your environment. However, this is generally considered a bad practice; it is like putting all your eggs in one basket. Whenever you need to plan downtime for your BI system, you should keep in mind that it will affect your ETL environment as well, and you will not be able to run any Data Services jobs during this period. That is why IPS is installed to be used only by Data Services, as a safer and more convenient option in terms of support and maintenance.
Download the Information Platform Services installation package from the SAP support portal and unzip it to the location of your choice. The main requirement for installing IPS as well as Data Services in the next recipe is that your OS should have a 64-bit architecture.
