Mastering SAS Programming for Data Warehousing - Monika Wahi - E-Book

Description

SAS is used for various functions in the development and maintenance of data warehouses, thanks to its reputation for handling ‘big data’.
This book will help you learn the pros and cons of storing data in SAS. As you progress, you’ll understand how to document and design extract-transform-load (ETL) protocols for SAS processes. Later, you’ll focus on how the use of SAS arrays and macros can help standardize ETL. The book will also help you examine approaches for serving up data using SAS and explore how connecting SAS to other systems can enhance the data warehouse user’s experience.
By the end of this data management book, you will have a fundamental understanding of the roles SAS can play in a warehouse environment, and be able to choose wisely when designing your data warehousing processes involving SAS.

You can read this e-book in the Legimi apps or in any app that supports the following formats:

EPUB
MOBI

Page count: 716

Year of publication: 2020


Mastering SAS Programming for Data Warehousing

An advanced programming guide to designing and managing Data Warehouses using SAS

Monika Wahi

BIRMINGHAM—MUMBAI

Mastering SAS Programming for Data Warehousing

Copyright © 2020 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Commissioning Editor: Richa Tripathi

Acquisition Editor: Karan Gupta

Senior Editor: Nitee Shetty

Content Development Editor: Ruvika Rao

Technical Editor: Gaurav Gala

Copy Editor: Safis Editing

Project Coordinator: Deeksha Thakkar

Proofreader: Safis Editing

Indexer: Tejal Daruwale Soni

Production Designer: Aparna Bhagat

First published: October 2020

Production reference: 1141020

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK.

ISBN 978-1-78953-237-1

www.packt.com

This book is dedicated to my mother, Carol Wahi. Although happily retired now, she spent her career programming in COBOL with assembler, and teaching data management skills. Here's to the next generation of big data women!

Contributors

About the author

Monika Wahi, MPH, CPH, is a well-published epidemiologist, biostatistician, informaticist, and data scientist. For over 20 years, Monika has worked at various governmental organizations and non-profits, and led consulting projects in academia and for governments both in the United States and internationally. She is President of DethWench Professional Services (DPS), which offers consulting and training in data science, specializing in public health and healthcare. Monika is proficient in SAS, R, Excel, and SQL, and is the author of many articles and online courses in data science and health data analytics.

I'd like to thank the following people at Packt who made this book possible: Karan Gupta, who was the first to express faith in me, and Afshaan Khan, Ruvika Rao, and Prajakta Naik, who worked tirelessly with me to improve my drafts. I'd also like to thank the reviewer, Sunil Gupta, an admirable SAS author himself, for his helpful advice and encouragement.

About the reviewer

Sunil Gupta, MS, is an international speaker, best-selling author of five SAS books, and a global SAS and CDISC corporate trainer. Sunil has over 25 years' experience in the pharmaceutical industry. Most recently, Sunil has been involved in several CDISC and PhUSE working groups and has taught his CDISC online class at the University of California at San Diego.

In 2019, Sunil published his fifth book, Clinical Data Quality Checks for CDISC Compliance Using SAS, and, in 2011, he launched his unique SAS mentoring blog for smarter SAS searches. Sunil has an MS in bioengineering from Clemson University and a BS in applied mathematics from the College of Charleston.

Table of Contents

Preface

Section 1: Managing Data in a SAS Data Warehouse

Chapter 1: Using SAS in a Data Mart, Data Lake, or Data Warehouse

Technical requirements

Using original versions of SAS

Initial SAS data handling

Early SAS data handling

SAS data handling improvements

Accessing data in SAS

Upgrading to mainframes

Transitioning to personal computers

Reading external files

Improving I/O

Developing warehouse environments

Using the WHERE clause

Using IF compared to WHERE

Sorting in SAS

Setting indexes on variables

Dealing with storage and memory issues

Avoiding memory issues

Accommodating Structured Query Language

Using PROC SQL

Using SAS today in a warehouse environment

Using SAS in the cloud

Using SAS in modern warehousing

Warehousing unstructured text

Using SAS components for warehousing

Using other applications with SAS

Connecting to Snowflake

Summary

Questions

Further reading

Chapter 2: Reading Big Data into SAS

Technical requirements

Reading data extracts into SAS

Understanding SAS datasets

Working with the WORK directory

Specifying LIBNAME

Reading in SAS datasets

Using the SAS XPT format

Storing data in XPT format

Creating an XPT file

Comparing PROC CPORT/CIMPORT to PROC COPY

Reading in XPTs using the XPORT engine

Working with other file formats

Reading non-SAS data formats

Using PROC IMPORT

Converting non-SAS data to SAS format

Dealing with difficult data

Understanding legacy data

Reading data with difficult formats

Specifying data locations in a fixed-width file

Troubleshooting reading data after transport

Summary

Questions

Further reading

Chapter 3: Helpful PROCs for Managing Data

Technical requirements

PROCs for understanding data

Using PROC CONTENTS to understand data

Documenting SAS data with codebooks

Using labels for variables

Adding user-defined formats to categorical variables

Using native SAS formats with numeric variables

Applying user-defined formats to continuous variables

Using labels and formats in processing

Using PROCs with labels and formats

Maintaining labels and formats

Alternatives to using labels and formats in a warehouse setting

Viewing data in SAS

Using PROC PRINT to view data

Using PROC SQL to view data

Using arithmetic operators in SAS

Viewing data through SAS windows

Summary

Questions

Further reading

Chapter 4: Managing ETL in SAS

Technical requirements

Setting up an analytic environment

Designating storage and user groups

Managing documentation storage

Setting naming conventions for datasets

Planning for data transformation

Understanding arrays in SAS

Setting naming conventions for variables

Setting naming conventions and style for code

Developing policy

Setting format and label policies

Setting data transfer policies

Setting other policies

Summary

Questions

Further reading

Chapter 5: Managing Data Reporting in SAS

Technical requirements

Using the ODS for data files

Identifying available tables in the ODS

Identifying internal tables in the log

Outputting internal tables using the ODS

Using the ODS for graphics files

Outputting graphics from analytic PROCs

Outputting graphics in different formats

Setting system options

SAS PROCs designed for reporting

Using PROC REPORT

Understanding the basics of PROC TABULATE

Preparing data for PROC TABULATE

Formulating PROC TABULATE code

Using PROC SGPLOT

Using PROC SGPANEL and PROC SGSCATTER

Using PROC TEMPLATE with PROC SGRENDER

Summary

Questions

Further reading

Section 2: Using SAS for Extract-Transform-Load (ETL) Protocols in a Data Warehouse

Chapter 6: Standardizing Coding Using SAS Arrays

Technical requirements

Understanding examples of arrays used to create variables

Scenarios where arrays are useful

Arrays as temporary objects

Using arrays to create variables

Conditions and index variables in array processing

Adding a condition to array processing

Creating index variables from array outputs

Documenting and standardizing array processing

Limitations of arrays

Naming limitations in SAS arrays

Naming limitations arrays impose on data storage

Difficulty in troubleshooting

Summary

Questions

Further reading

Chapter 7: Designing and Developing ETL Code in SAS

Technical requirements

Planning the ETL approach

Specifying data with a data dictionary

Understanding default PROC FREQ

Using options to manipulate PROC FREQ output

Using PROC UNIVARIATE for troubleshooting

Using PROC FREQ to troubleshoot continuous variables

Making plots for troubleshooting

Choosing variables to serve to users

Creating and maintaining formats for variables

Creating transformation code

Designing categorical grouping variables

Cleaning up continuous variables

Designing indicator variables

Considering dates and numerical variables

Exporting the transformed dataset

Summary

Questions

Further reading

Chapter 8: Using Macros to Automate ETL in SAS

Technical requirements

Creating macros out of data step code

Choosing to use macros and macro variables

Using macro variables with the %LET command

Using the log file with macro variables and macros

Making macros with PROCs

Making macros with data steps

Adding conditions to macros

Storing and calling macros

Storing and calling macros in the same code

Storing macros separately and calling them from code

Loading transformed data

Summary

Questions

Further reading

Chapter 9: Debugging and Troubleshooting in SAS

Technical requirements

Debugging data step code

Writing well-formed and well-formatted code

Using log information as guidance

Troubleshooting strategies for data steps

Debugging the do loop code

Using the original data step debugger

Using the data step debugger in SAS Enterprise Guide

Debugging SAS macros

Avoiding errors through the design process

Using %PUT to display values of macro variables

Setting system options to help with debugging macros

Summary

Questions

Further reading

Section 3: Using SAS When Serving Warehouse Data to Users

Chapter 10: Considering the User Needs of SAS Data Warehouses

Technical requirements

Needs of data warehouse users

Considering classes of data warehouse users

Considering the needs of each class of users

Data stewardship for serving warehouse users

Providing data access

Serving needs created through the warehouse structure

Adding, using, and serving up foreign keys

Crosswalking data over time

Data stewardship for serving warehouse developers

Managing a data stewardship committee

Providing curation and other support

Summary

Questions

Further reading

Chapter 11: Connecting the SAS Data Warehouse to Other Systems

Technical requirements

Serving SAS to other systems

Implementing de-identification policies

Serving up a star schema

Connecting to non-SAS data storage

Understanding SQL views

Using SAS to copy data from a remote data system

Leveraging PROC SQL views for data transfer

Exporting SAS data to non-SAS data storage

Innovations in integrating SAS in reporting functions

Summary

Questions

Further reading

Chapter 12: Using the ODS for Visualization in SAS

Technical requirements

The basics of using the ODS for data visualization

Using macros in reporting

Connecting to data in Snowflake

Serving SAS data to the web with the ODS

Interacting with SAS data over the web

Using the SAS Enterprise Guide

Using SAS Viya

Using SAS and R for visualizations

Reporting SAS data in Tableau

Considerations when reporting SAS warehouse data

Summary

Questions

Further reading

Assessments

Other Books You May Enjoy

Preface

SAS is used for various functions in the development and maintenance of data warehouses because of its reputation for handling so-called big data. SAS software has existed for a long time and has been implemented in many large, data-intensive environments, including data warehouses.

This book provides end-to-end coverage of the practical programming considerations to make when involving SAS in a data warehouse environment. Complete with step-by-step explanations of essential concepts, practical examples, and self-assessment questions, the book helps you begin to make decisions about the roles SAS should play in your data warehouse. It will teach you how to design arrays and macros to standardize extract-transform-load protocols, as well as how to develop strategies to optimally serve data warehouse customers.

You will learn the pros and cons of storing data in SAS, how to document and design ETL protocols for SAS processes, and how the use of SAS arrays and macros can help improve input/output (I/O) efficiency. You will also examine approaches to serving up data using SAS, and how to connect SAS to other systems to enhance the data warehouse user's experience. By the end of this book, you will have a foundational understanding of the roles SAS can play in a warehouse environment, and be able to choose wisely when designing your data warehousing processes involving SAS.

Who this book is for

This book is aimed at programmers using SAS products who are working on a data lake, data mart, or data warehouse. It is also aimed at managers leading projects that use SAS to maintain a data lake, data mart, or data warehouse. To benefit from this book, it is helpful to have a background in data projects that require serving data or reports to customers. Some experience of working with big datasets will also help.

What this book covers

Chapter 1, Using SAS in a Data Mart, Data Lake, or Data Warehouse, explains the origins of SAS, and how data input/output (I/O) is managed in SAS. It also provides context for how SAS products are used today in modern data warehouses.

Chapter 2, Reading Big Data into SAS, covers how to read data in different formats into SAS. It also talks about SAS data formats, and packaging data for import and export in SAS.

Chapter 3, Helpful PROCs for Managing Data, provides an introduction to PROC CONTENTS, PROC SQL, and PROC PRINT, and describes how to deal with SAS formats and labels. It also provides different strategies for viewing data in SAS.

Chapter 4, Managing ETL in SAS, explains how to prepare an analytic environment, including developing naming conventions, and SAS format and label policies. It also describes the designation of data storage and user groups.

Chapter 5, Managing Data Reporting in SAS, introduces you to the output delivery system (ODS), and explains how the ODS is used for outputting graphics files from SAS. This chapter also covers how to use PROCs that were developed specifically for the ODS, including PROC TABULATE and PROC SGPLOT.

Chapter 6, Standardizing Coding Using SAS Arrays, explains how to do array processing in a SAS data warehouse, how to add conditions to arrays, and how to deal with naming conventions in arrays. In SAS, because of I/O limitations, the use of arrays is usually necessary in ETL code.

Chapter 7, Designing and Developing ETL Code in SAS, goes over how to plan ETL code, using PROC UNIVARIATE and PROC FREQ to study our data and help us plan how to serve up variables. The second part of the chapter focuses on how to develop optimal ETL code based on our plans.

Chapter 8, Using Macros to Automate ETL in SAS, describes how to convert data step code used in ETL to SAS macro language in order to automate the process. It also covers how to store and call macros, and how to use them to load transformed data.

Chapter 9, Debugging and Troubleshooting in SAS, covers debugging approaches in SAS. Advice for forming and formatting code is given, and special attention is given to debugging do loop code and macros.

Chapter 10, Considering the User Needs of SAS Data Warehouses, describes a method by which to classify users, and then apply data stewardship policies that help ensure their needs are met. For analyst users, providing data access, foreign keys, and crosswalk variables is described. For developer users, providing data curation and other support is delineated.

Chapter 11, Connecting the SAS Data Warehouse to Other Systems, talks about serving SAS to other data systems, which is typically done asynchronously. Next, it describes connecting SAS to other data systems, which is typically done synchronously through an open database connectivity (ODBC) protocol using SAS/ACCESS.

Chapter 12, Using the ODS for Visualization in SAS, describes how using the ODS for visualization in SAS differs between print and the web. Next, it explains ways to serve SAS data to the web using the SAS Enterprise Guide aided by SAS Viya, and describes how to visualize SAS data in other programs, such as R and Tableau.

To get the most out of this book

You will need access to a version of SAS. If you do not have access to a SAS server environment or PC SAS, you can use the free version of SAS, called SAS University Edition (available here: https://www.sas.com/en_us/software/university-edition/download-software.html). SAS University Edition is available for Windows, macOS, and Linux. All code examples have been tested using SAS University Edition in Windows, but they should work on any version of SAS.

Example data curation files in this book were developed using Microsoft Word, Excel, and PowerPoint. These files can be developed in the same or comparable software.

If you are using the digital version of this book, we advise you to type the code yourself or access the code via the GitHub repository (link available in the next section). Doing so will help you avoid any potential errors related to the copying/pasting of code.

You may benefit from following the author on YouTube (https://www.youtube.com/channel/UCCHcm7rOjf7Ruf2GA2Qnxow) and LinkedIn (https://www.linkedin.com/in/dethwench/), where she posts video tutorials and information about data science.

Download the example code files

You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Mastering-SAS-Programming-for-Data-Warehousing. In case there's an update to the code, it will be updated on the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://static.packt-cdn.com/downloads/9781789532371_ColorImages.pdf.

Conventions used

There are a number of text conventions used throughout this book.

Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "We will map LIBNAME to X, with X being the folder where we put the dataset."

A block of code is set as follows:

LIBNAME X "/folders/myfolders/X";
PROC CONTENTS data=X.Chap5_1;
run;

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

ODS TRACE ON / label;
PROC UNIVARIATE data=X.chap5_1;
    var _AGE80;
run;
ODS TRACE OFF;

Bold: Indicates a new term, an important word, or words that you see on screen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "If you are using SAS University Edition, the RESULTS tab will display the graphic."

Tips or important notes

Appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report it to us. Please visit www.packtpub.com/support/errata, select your book, click on the Errata Submission Form link, and enter the details.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in, and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packt.com.

Section 1: Managing Data in a SAS Data Warehouse

This first section focuses on the basics of managing data in a data warehouse in SAS. First, we focus heavily on the process of data input/output (I/O) in SAS, which has not changed since SAS was originally created. Then, we see how to use data steps and PROCs, or SAS procedures, to read big data into SAS in various formats, given SAS's distinct data I/O processes.

After that, we are introduced to PROCs in SAS that can help manage data, especially with respect to I/O. These include PROCs that allow you to view and profile the dataset, including PROC CONTENTS and PROC PRINT.
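
As a rough illustration of what viewing and profiling a dataset can look like, here is a minimal sketch. It assumes the SASHELP.CLASS sample dataset that ships with SAS is available in your installation; it is not one of this book's example datasets, and the five-row limit is an arbitrary choice for the illustration.

```sas
/* Profile the dataset's structure: variable names, types, lengths, labels. */
/* SASHELP.CLASS is an assumed stand-in for a warehouse dataset.            */
PROC CONTENTS data=sashelp.class;
run;

/* View the first five observations to inspect the actual values. */
PROC PRINT data=sashelp.class (obs=5);
run;
```

Limiting PROC PRINT with the obs= dataset option keeps the output small, which matters more as datasets grow.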

Then, we see how to prepare for extract, transform, and load (ETL) processes by setting naming conventions, designating user groups, and setting other policies. Lastly, we are introduced to SAS's output delivery system (ODS) and see how reporting is done in SAS.

This section comprises the following chapters:

Chapter 1, Using SAS in a Data Mart, Data Lake, or Data Warehouse

Chapter 2, Reading Big Data into SAS

Chapter 3, Helpful PROCs for Managing Data

Chapter 4, Managing ETL in SAS

Chapter 5, Managing Data Reporting in SAS