SAS is used for various functions in the development and maintenance of data warehouses, thanks to its reputation for being able to handle 'big data'.
This book will help you learn the pros and cons of storing data in SAS. As you progress, you’ll understand how to document and design extract-transform-load (ETL) protocols for SAS processes. Later, you’ll focus on how the use of SAS arrays and macros can help standardize ETL. The book will also help you examine approaches for serving up data using SAS and explore how connecting SAS to other systems can enhance the data warehouse user’s experience.
By the end of this data management book, you will have a fundamental understanding of the roles SAS can play in a warehouse environment, and be able to choose wisely when designing your data warehousing processes involving SAS.
Page count: 716
Year of publication: 2020
An advanced programming guide to designing and managing Data Warehouses using SAS
Monika Wahi
BIRMINGHAM—MUMBAI
Copyright © 2020 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Commissioning Editor: Richa Tripathi
Acquisition Editor: Karan Gupta
Senior Editor: Nitee Shetty
Content Development Editor: Ruvika Rao
Technical Editor: Gaurav Gala
Copy Editor: Safis Editing
Project Coordinator: Deeksha Thakkar
Proofreader: Safis Editing
Indexer: Tejal Daruwale Soni
Production Designer: Aparna Bhagat
First published: October 2020
Production reference: 1141020
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-78953-237-1
www.packt.com
This book is dedicated to my mother, Carol Wahi. Although happily retired now, she spent her career programming in COBOL with assembler, and teaching data management skills. Here's to the next generation of big data women!
Packt.com
Subscribe to our online digital library for full access to over 7,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
Spend less time learning and more time coding with practical eBooks and videos from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
Get a free eBook or video every month
Fully searchable for easy access to vital information
Copy and paste, print, and bookmark content
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at packt.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
Monika Wahi, MPH, CPH, is a well-published epidemiologist, biostatistician, informaticist, and data scientist. For over 20 years, Monika has worked at various governmental organizations and non-profits, and led consulting projects in academia and for governments both in the United States and internationally. She is President of DethWench Professional Services (DPS), which offers consulting and training in data science, specializing in public health and healthcare. Monika is proficient in SAS, R, Excel, and SQL, and is the author of many articles and online courses in data science and health data analytics.
I'd like to thank the following people at Packt who made this book possible: Karan Gupta, who was the first to express faith in me, and Afshaan Khan, Ruvika Rao, and Prajakta Naik, who worked tirelessly with me to improve my drafts. I'd also like to thank the reviewer, Sunil Gupta, an admirable SAS author himself, for his helpful advice and encouragement.
Sunil Gupta, MS, is an international speaker, best-selling author of five SAS books, and a global SAS and CDISC corporate trainer. Sunil has over 25 years' experience in the pharmaceutical industry. Most recently, Sunil has been involved in several CDISC and PhUSE working groups and has taught his CDISC online class at the University of California at San Diego.
In 2019, Sunil published his fifth book, Clinical Data Quality Checks for CDISC Compliance Using SAS, and, in 2011, he launched his unique SAS mentoring blog for smarter SAS searches. Sunil has an MS in bioengineering from Clemson University and a BS in applied mathematics from the College of Charleston.
If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
SAS is used for various functions in the development and maintenance of data warehouses because of its reputation for being able to handle so-called big data. SAS software has existed for a long time and has been implemented in many large, data-intensive environments, including data warehouses.
This book provides end-to-end coverage of the practical programming considerations to make when involving SAS in a data warehouse environment. Complete with step-by-step explanations of essential concepts, practical examples, and self-assessment questions, the book helps you begin to make decisions about the roles SAS should play in your data warehouse. It will teach you how to design arrays and macros to standardize extract-transform-load protocols, as well as how to develop strategies to optimally serve data warehouse customers.
You will learn the pros and cons of storing data in SAS, how to document and design ETL protocols for SAS processes, and how the use of SAS arrays and macros can help improve input/output (I/O) efficiency. You will also examine approaches to serving up data using SAS, and how to connect SAS to other systems to enhance the data warehouse user's experience. By the end of this book, you will have a foundational understanding of the roles SAS can play in a warehouse environment, and be able to choose wisely when designing your data warehousing processes involving SAS.
This book is aimed at programmers who use SAS products to work on a data lake, data mart, or data warehouse. It is also aimed at managers who lead projects that use SAS to maintain a data lake, data mart, or data warehouse. To benefit from this book, it helps to have a background in data projects that require serving data or reports to customers. Some experience of working with big datasets will also be helpful.
Chapter 1, Using SAS in a Data Mart, Data Lake, or Data Warehouse, explains the origins of SAS, and how data input/output (I/O) is managed in SAS. It also provides context for how SAS products are used today in modern data warehouses.
Chapter 2, Reading Big Data into SAS, covers how to read data in different formats into SAS. It also talks about SAS data formats, and packaging data for import and export in SAS.
Chapter 3, Helpful PROCs for Managing Data, provides an introduction to PROC CONTENTS, PROC SQL, and PROC PRINT, and describes how to deal with SAS formats and labels. It also provides different strategies for viewing data in SAS.
Chapter 4, Managing ETL in SAS, explains how to prepare an analytic environment, including developing naming conventions, and SAS format and label policies. It also describes the designation of data storage and user groups.
Chapter 5, Managing Data Reporting in SAS, introduces you to the output delivery system (ODS), and explains how the ODS is used for outputting graphics files from SAS. This chapter also covers how to use PROCs that were developed specifically for the ODS, including PROC TABULATE and PROC SGPLOT.
Chapter 6, Standardizing Coding Using SAS Arrays, explains how to do array processing in a SAS data warehouse, how to add conditions to arrays, and how to deal with naming conventions in arrays. In SAS, because of I/O limitations, the use of arrays is usually necessary in ETL code.
Chapter 7, Designing and Developing ETL Code in SAS, goes over how to plan ETL code, using PROC UNIVARIATE and PROC FREQ to study our data and help us plan how to serve up variables. The second part of the chapter focuses on how to develop optimal ETL code based on our plans.
Chapter 8, Using Macros to Automate ETL in SAS, describes how to convert data step code used in ETL to SAS macro language in order to automate the process. It also covers how to store and call macros, and how to use them to load transformed data.
Chapter 9, Debugging and Troubleshooting in SAS, covers debugging approaches in SAS. Advice for forming and formatting code is given, and special attention is given to debugging DO loop code and macros.
Chapter 10, Considering the User Needs of SAS Data Warehouses, describes a method by which to classify users, and then apply data stewardship policies that help ensure their needs are met. For analyst users, providing data access, foreign keys, and crosswalk variables is described. For developer users, providing data curation and other support is delineated.
Chapter 11, Connecting the SAS Data Warehouse to Other Systems, talks about serving SAS to other data systems, which is typically done asynchronously. Next, it describes connecting SAS to other data systems, which is typically done synchronously through an open database connectivity (ODBC) protocol using SAS/ACCESS.
Chapter 12, Using the ODS for Visualization in SAS, describes how using the ODS for visualization in SAS differs between print and the web. Next, it explains ways to serve SAS data to the web using the SAS Enterprise Guide aided by SAS Viya, and describes how to visualize SAS data in other programs, such as R and Tableau.
You will need access to a version of SAS. If you do not have access to a SAS server environment or PC SAS, you can use the free version of SAS, called SAS University Edition (available here: https://www.sas.com/en_us/software/university-edition/download-software.html). SAS University Edition is available for Windows, macOS, and Linux. All code examples have been tested using SAS University Edition on Windows, but they should work on any version of SAS.
Example data curation files in this book were developed using Microsoft Word, Excel, and PowerPoint. These files can be developed in the same or comparable software.
If you are using the digital version of this book, we advise you to type the code yourself or access the code via the GitHub repository (link available in the next section). Doing so will help you avoid any potential errors related to the copying/pasting of code.
You may benefit from following the author on YouTube (https://www.youtube.com/channel/UCCHcm7rOjf7Ruf2GA2Qnxow) and LinkedIn (https://www.linkedin.com/in/dethwench/), where she posts video tutorials and information about data science.
You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Mastering-SAS-Programming-for-Data-Warehousing. If the code is updated, the changes will be made in the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://static.packt-cdn.com/downloads/9781789532371_ColorImages.pdf.
There are a number of text conventions used throughout this book.
Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "We will map LIBNAME to X, with X being the folder where we put the dataset."
A block of code is set as follows:
LIBNAME X "/folders/myfolders/X";
PROC CONTENTS data=X.Chap5_1;
run;
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
ODS TRACE ON / label;
PROC UNIVARIATE data=X.chap5_1;
var _AGE80;
run;
ODS TRACE OFF;
Bold: Indicates a new term, an important word, or words that you see on screen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "If you are using SAS University Edition, the RESULTS tab will display the graphic."
Tips or important notes
Appear like this.
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata, select your book, click on the Errata Submission Form link, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in, and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packt.com.
This first section focuses on the basics of managing data in a data warehouse in SAS. First, we focus heavily on the process of data input/output (I/O) in SAS, which has not changed since SAS was originally created. Then, we see how to use data steps and PROCs, or SAS procedures, to read big data into SAS in various formats, given SAS's distinct data I/O processes.
After that, we are introduced to PROCs in SAS that can help manage data, especially with respect to I/O. These include PROCs that allow you to view and profile the dataset, including PROC CONTENTS and PROC PRINT.
Then, we see how to prepare for extract, transform, and load (ETL) processes by setting naming conventions, designating user groups, and setting other policies. Lastly, we are introduced to SAS's output delivery system (ODS) and see how reporting is done in SAS.
This section comprises the following chapters:
Chapter 1, Using SAS in a Data Mart, Data Lake, or Data Warehouse
Chapter 2, Reading Big Data into SAS
Chapter 3, Helpful PROCs for Managing Data
Chapter 4, Managing ETL in SAS
Chapter 5, Managing Data Reporting in SAS