Mastering SAS Programming for Data Warehousing E-Book

Monika Wahi

Publisher: Packt Publishing
Category: Science and new technologies
Language: English

Description

SAS is used for various functions in the development and maintenance of data warehouses, thanks to its reputation for being able to handle 'big data'.
This book will help you learn the pros and cons of storing data in SAS. As you progress, you’ll understand how to document and design extract-transform-load (ETL) protocols for SAS processes. Later, you’ll focus on how the use of SAS arrays and macros can help standardize ETL. The book will also help you examine approaches for serving up data using SAS and explore how connecting SAS to other systems can enhance the data warehouse user’s experience.
By the end of this data management book, you will have a fundamental understanding of the roles SAS can play in a warehouse environment, and be able to choose wisely when designing your data warehousing processes involving SAS.

Details

You can read this e-book in Legimi apps or in any app that supports the following formats:

EPUB

MOBI

Page count: 716

Year of publication: 2020

Sample excerpt

Mastering SAS Programming for Data Warehousing

An advanced programming guide to designing and managing Data Warehouses using SAS

Monika Wahi

BIRMINGHAM—MUMBAI

Mastering SAS Programming for Data Warehousing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Commissioning Editor: Richa Tripathi

Acquisition Editor: Karan Gupta

Senior Editor: Nitee Shetty

Content Development Editor: Ruvika Rao

Technical Editor: Gaurav Gala

Copy Editor: Safis Editing

Project Coordinator: Deeksha Thakkar

Proofreader: Safis Editing

Indexer: Tejal Daruwale Soni

Production Designer: Aparna Bhagat

First published: October 2020

Production reference: 1141020

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK.

ISBN 978-1-78953-237-1

www.packt.com

This book is dedicated to my mother, Carol Wahi. Although happily retired now, she spent her career programming in COBOL with assembler, and teaching data management skills. Here's to the next generation of big data women!

Packt.com

Subscribe to our online digital library for full access to over 7,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.

Why subscribe?

Spend less time learning and more time coding with practical eBooks and videos from over 4,000 industry professionals

Improve your learning with Skill Plans built especially for you

Get a free eBook or video every month

Fully searchable for easy access to vital information

Copy and paste, print, and bookmark content

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at packt.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

Contributors

About the author

Monika Wahi, MPH, CPH, is a well-published epidemiologist, biostatistician, informaticist, and data scientist. For over 20 years, Monika has worked at various governmental organizations and non-profits, and led consulting projects in academia and for governments both in the United States and internationally. She is President of DethWench Professional Services (DPS), which offers consulting and training in data science, specializing in public health and healthcare. Monika is proficient in SAS, R, Excel, and SQL, and is the author of many articles and online courses in data science and health data analytics.

I'd like to thank the following people at Packt who made this book possible: Karan Gupta, who was the first to express faith in me, and Afshaan Khan, Ruvika Rao, and Prajakta Naik, who worked tirelessly with me to improve my drafts. I'd also like to thank the reviewer, Sunil Gupta, an admirable SAS author himself, for his helpful advice and encouragement.

About the reviewer

Sunil Gupta, MS, is an international speaker, best-selling author of five SAS books, and a global SAS and CDISC corporate trainer. Sunil has over 25 years' experience in the pharmaceutical industry. Most recently, Sunil has been involved in several CDISC and PhUSE working groups and has taught his CDISC online class at the University of California at San Diego.

In 2019, Sunil published his fifth book, Clinical Data Quality Checks for CDISC Compliance Using SAS, and, in 2011, he launched his unique SAS mentoring blog for smarter SAS searches. Sunil has an MS in bioengineering from Clemson University and a BS in applied mathematics from the College of Charleston.

Packt is searching for authors like you

If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.

Table of Contents

Preface

Section 1: Managing Data in a SAS Data Warehouse

Chapter 1: Using SAS in a Data Mart, Data Lake, or Data Warehouse

Technical requirements

Using original versions of SAS

Initial SAS data handling

Early SAS data handling

SAS data handling improvements

Accessing data in SAS

Upgrading to mainframes

Transitioning to personal computers

Reading external files

Improving I/O

Developing warehouse environments

Using the WHERE clause

Using IF compared to WHERE

Sorting in SAS

Setting indexes on variables

Dealing with storage and memory issues

Avoiding memory issues

Accommodating Structured Query Language

Using PROC SQL

Using SAS today in a warehouse environment

Using SAS in the cloud

Using SAS in modern warehousing

Warehousing unstructured text

Using SAS components for warehousing

Using other applications with SAS

Connecting to Snowflake

Summary

Questions

Chapter 2: Reading Big Data into SAS

Technical requirements

Reading data extracts into SAS

Understanding SAS datasets

Working with the WORK directory

Specifying LIBNAME

Reading in SAS datasets

Using the SAS XPT format

Storing data in XPT format

Creating an XPT file

Comparing PROC CPORT/CIMPORT to PROC COPY

Reading in XPTs using the XPORT engine

Working with other file formats

Reading non-SAS data formats

Using PROC IMPORT

Converting non-SAS data to SAS format

Dealing with difficult data

Understanding legacy data

Reading data with difficult formats

Specifying data locations in a fixed-width file

Troubleshooting reading data after transport

Summary

Questions

Chapter 3: Helpful PROCs for Managing Data

Technical requirements

PROCs for understanding data

Using PROC CONTENTS to understand data

Documenting SAS data with codebooks

Using labels for variables

Adding user-defined formats to categorical variables

Using native SAS formats with numeric variables

Applying user-defined formats to continuous variables

Using labels and formats in processing

Using PROCs with labels and formats

Maintaining labels and formats

Alternatives to using labels and formats in a warehouse setting

Viewing data in SAS

Using PROC PRINT to view data

Using PROC SQL to view data

Using arithmetic operators in SAS

Viewing data through SAS windows

Summary

Questions

Chapter 4: Managing ETL in SAS

Technical requirements

Setting up an analytic environment

Designating storage and user groups

Managing documentation storage

Setting naming conventions for datasets

Planning for data transformation

Understanding arrays in SAS

Setting naming conventions for variables

Setting naming conventions and style for code

Developing policy

Setting format and label policies

Setting data transfer policies

Setting other policies

Summary

Questions

Chapter 5: Managing Data Reporting in SAS

Technical requirements

Using the ODS for data files

Identifying available tables in the ODS

Identifying internal tables in the log

Outputting internal tables using the ODS

Using the ODS for graphics files

Outputting graphics from analytic PROCs

Outputting graphics in different formats

Setting system options

SAS PROCs designed for reporting

Using PROC REPORT

Understanding the basics of PROC TABULATE

Preparing data for PROC TABULATE

Formulating PROC TABULATE code

Using PROC SGPLOT

Using PROC SGPANEL and PROC SGSCATTER

Using PROC TEMPLATE with PROC SGRENDER

Summary

Questions

Section 2: Using SAS for Extract-Transform-Load (ETL) Protocols in a Data Warehouse

Chapter 6: Standardizing Coding Using SAS Arrays

Technical requirements

Understanding examples of arrays used to create variables

Scenarios where arrays are useful

Arrays as temporary objects

Using arrays to create variables

Conditions and index variables in array processing

Adding a condition to array processing

Creating index variables from array outputs

Documenting and standardizing array processing

Limitations of arrays

Naming limitations in SAS arrays

Naming limitations arrays impose on data storage

Difficulty in troubleshooting

Summary

Questions

Chapter 7: Designing and Developing ETL Code in SAS

Technical requirements

Planning the ETL approach

Specifying data with a data dictionary

Understanding default PROC FREQ

Using options to manipulate PROC FREQ output

Using PROC UNIVARIATE for troubleshooting

Using PROC FREQ to troubleshoot continuous variables

Making plots for troubleshooting

Choosing variables to serve to users

Creating and maintaining formats for variables

Creating transformation code

Designing categorical grouping variables

Cleaning up continuous variables

Designing indicator variables

Considering dates and numerical variables

Exporting the transformed dataset

Summary

Questions

Chapter 8: Using Macros to Automate ETL in SAS

Technical requirements

Creating macros out of data step code

Choosing to use macros and macro variables

Using macro variables with the %LET command

Using the log file with macro variables and macros

Making macros with PROCs

Making macros with data steps

Adding conditions to macros

Storing and calling macros

Storing and calling macros in the same code

Storing macros separately and calling them from code

Loading transformed data

Summary

Questions

Chapter 9: Debugging and Troubleshooting in SAS

Technical requirements

Debugging data step code

Writing well-formed and well-formatted code

Using log information as guidance

Troubleshooting strategies for data steps

Debugging the do loop code

Using the original data step debugger

Using the data step debugger in SAS Enterprise Guide

Debugging SAS macros

Avoiding errors through the design process

Using %PUT to display values of macro variables

Setting system options to help with debugging macros

Summary

Questions

Section 3: Using SAS When Serving Warehouse Data to Users

Chapter 10: Considering the User Needs of SAS Data Warehouses

Technical requirements

Needs of data warehouse users

Considering classes of data warehouse users

Considering the needs of each class of users

Data stewardship for serving warehouse users

Providing data access

Serving needs created through the warehouse structure

Adding, using, and serving up foreign keys

Crosswalking data over time

Data stewardship for serving warehouse developers

Managing a data stewardship committee

Providing curation and other support

Summary

Questions

Chapter 11: Connecting the SAS Data Warehouse to Other Systems

Technical requirements

Serving SAS to other systems

Implementing de-identification policies

Serving up a star schema

Connecting to non-SAS data storage

Understanding SQL views

Using SAS to copy data from a remote data system

Leveraging PROC SQL views for data transfer

Exporting SAS data to non-SAS data storage

Innovations in integrating SAS in reporting functions

Summary

Questions

Chapter 12: Using the ODS for Visualization in SAS

Technical requirements

The basics of using the ODS for data visualization

Using macros in reporting

Connecting to data in Snowflake

Serving SAS data to the web with the ODS

Interacting with SAS data over the web

Using the SAS Enterprise Guide

Using SAS Viya

Using SAS and R for visualizations

Reporting SAS data in Tableau

Considerations when reporting SAS warehouse data

Summary

Questions

Assessments

Other Books You May Enjoy

Preface

SAS is used for various functions in the development and maintenance of data warehouses because of its reputation for handling so-called big data. SAS software has existed for a long time and has been implemented in many large, data-intensive environments, including data warehouses.

This book provides end-to-end coverage of the practical programming considerations to make when involving SAS in a data warehouse environment. Complete with step-by-step explanations of essential concepts, practical examples, and self-assessment questions, the book helps you begin to make decisions about the roles SAS should play in your data warehouse. It will teach you how to design arrays and macros to standardize extract-transform-load protocols, as well as how to develop strategies to optimally serve data warehouse customers.

You will learn the pros and cons of storing data in SAS, how to document and design ETL protocols for SAS processes, and how the use of SAS arrays and macros can help improve input/output (I/O) efficiency. You will also examine approaches to serving up data using SAS, and how to connect SAS to other systems to enhance the data warehouse user's experience. By the end of this book, you will have a foundational understanding of the roles SAS can play in a warehouse environment, and be able to choose wisely when designing your data warehousing processes involving SAS.

Who this book is for

This book is aimed at programmers using SAS products who are working on a data lake, data mart, or data warehouse. It is also aimed at managers heading up projects involving using SAS to maintain a data lake, data mart, or data warehouse. To benefit from this book, it is helpful to have a background in working on data projects that require serving data or reports to customers. Also, some experience of working with big datasets will be helpful in understanding this book.

What this book covers

Chapter 1, Using SAS in a Data Mart, Data Lake, or Data Warehouse, explains the origins of SAS and how data input/output (I/O) is managed in SAS. It also provides context for how SAS products are used today in modern data warehouses.

Chapter 2, Reading Big Data into SAS, covers how to read data in different formats into SAS. It also talks about SAS data formats, and packaging data for import and export in SAS.

Chapter 3, Helpful PROCs for Managing Data, provides an introduction to PROC CONTENTS, PROC SQL, and PROC PRINT, and describes how to deal with SAS formats and labels. It also provides different strategies for viewing data in SAS.

Chapter 4, Managing ETL in SAS, explains how to prepare an analytic environment, including developing naming conventions, and SAS format and label policies. It also describes the designation of data storage and user groups.

Chapter 5, Managing Data Reporting in SAS, introduces you to the output delivery system (ODS), and explains how the ODS is used for outputting graphics files from SAS. This chapter also covers how to use PROCs that were developed specifically for the ODS, including PROC TABULATE and PROC SGPLOT.

Chapter 6, Standardizing Coding Using SAS Arrays, explains how to do array processing in a SAS data warehouse, how to add conditions to arrays, and how to deal with naming conventions in arrays. In SAS, because of I/O limitations, the use of arrays is usually necessary in ETL code.

Chapter 7, Designing and Developing ETL Code in SAS, goes over how to plan ETL code, using PROC UNIVARIATE and PROC FREQ to study our data and help us plan how to serve up variables. The second part of the chapter focuses on how to develop optimal ETL code based on our plans.

Chapter 8, Using Macros to Automate ETL in SAS, describes how to convert data step code used in ETL to SAS macro language in order to automate the process. It also covers how to store and call macros, and how to use them to load transformed data.

Chapter 9, Debugging and Troubleshooting in SAS, covers debugging approaches in SAS. Advice for forming and formatting code is given, and special attention is given to debugging do loop code and macros.

Chapter 10, Considering the User Needs of SAS Data Warehouses, describes a method by which to classify users, and then apply data stewardship policies that help ensure their needs are met. For analyst users, providing data access, foreign keys, and crosswalk variables is described. For developer users, providing data curation and other support is delineated.

Chapter 11, Connecting the SAS Data Warehouse to Other Systems, talks about serving SAS to other data systems, which is typically done asynchronously. Next, it describes connecting SAS to other data systems, which is typically done synchronously through an open database connectivity (ODBC) protocol using SAS/ACCESS.

Chapter 12, Using the ODS for Visualization in SAS, describes how using the ODS for visualization in SAS differs between print and the web. Next, it explains ways to serve SAS data to the web using the SAS Enterprise Guide aided by SAS Viya, and describes how to visualize SAS data in other programs, such as R and Tableau.

To get the most out of this book

You will need access to a version of SAS. If you do not have access to a SAS server environment or PC SAS, you can use the free version of SAS, called SAS University Edition (available here: https://www.sas.com/en_us/software/university-edition/download-software.html). SAS University Edition is available for Windows, macOS, and Linux. All code examples have been tested using SAS University Edition in Windows, but they should work on any version of SAS.

Example data curation files in this book were developed using Microsoft Word, Excel, and PowerPoint. These files can be developed in the same or comparable software.

If you are using the digital version of this book, we advise you to type the code yourself or access the code via the GitHub repository (link available in the next section). Doing so will help you avoid any potential errors related to the copying/pasting of code.

You may benefit from following the author on YouTube (https://www.youtube.com/channel/UCCHcm7rOjf7Ruf2GA2Qnxow) and LinkedIn (https://www.linkedin.com/in/dethwench/), where she posts video tutorials and information about data science.

Download the example code files

You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Mastering-SAS-Programming-for-Data-Warehousing. In case there's an update to the code, it will be updated on the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://static.packt-cdn.com/downloads/9781789532371_ColorImages.pdf.

Conventions used

There are a number of text conventions used throughout this book.

Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "We will map LIBNAME to X, with X being the folder where we put the dataset."

A block of code is set as follows:

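/* Map the libref X to the folder that holds the dataset */
LIBNAME X "/folders/myfolders/X";
/* List the variables and attributes of the dataset */
PROC CONTENTS data=X.Chap5_1;
run;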

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

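/* Turn on ODS tracing so each output object is recorded in the log */
ODS TRACE ON / label;
PROC UNIVARIATE data=X.chap5_1;
    var _AGE80;
run;
/* Turn tracing off again */
ODS TRACE OFF;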

Bold: Indicates a new term, an important word, or words that you see on screen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "If you are using SAS University Edition, the RESULTS tab will display the graphic."

Tips or important notes

Appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata, selecting your book, clicking on the Errata Submission Form link, and entering the details.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in, and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packt.com.

Section 1: Managing Data in a SAS Data Warehouse

This first section focuses on the basics of managing data in a data warehouse in SAS. First, we focus heavily on the process of data input/output (I/O) in SAS, which has not changed since SAS was originally created. Then, we see how to use data steps and PROCs, or SAS procedures, to read big data into SAS in various formats, given SAS's distinct data I/O processes.

After that, we are introduced to PROCs in SAS that can help manage data, especially with respect to I/O. These include PROCs that allow you to view and profile the dataset, including PROC CONTENTS and PROC PRINT.

Then, we see how to prepare for extract, transform, and load (ETL) processes by setting naming conventions, designating user groups, and setting other policies. Lastly, we are introduced to SAS's output delivery system (ODS) and see how reporting is done in SAS.

This section comprises the following chapters:

Chapter 1, Using SAS in a Data Mart, Data Lake, or Data Warehouse

Chapter 2, Reading Big Data into SAS

Chapter 3, Helpful PROCs for Managing Data

Chapter 4, Managing ETL in SAS

Chapter 5, Managing Data Reporting in SAS
