
The Ultimate Guide to Snowpark E-Book

Shankar Narayanan SGS

Description

Snowpark is a powerful framework that helps you unlock numerous possibilities within the Snowflake Data Cloud. However, without proper guidance, leveraging the full potential of Snowpark with Python can be challenging. Packed with practical examples and code snippets, this book will be your go-to guide to using Snowpark with Python successfully.
The Ultimate Guide to Snowpark helps you develop an understanding of Snowflake Snowpark and how it enables you to implement workloads in data engineering, data science, and data applications within the Data Cloud. From configuration and coding styles to workloads such as data manipulation, collection, preparation, transformation, aggregation, and analysis, this guide will equip you with the right knowledge to make the most of this framework. You'll discover how to build, test, and deploy data pipelines and data science models. As you progress, you’ll deploy data applications natively in Snowflake and operate large language models (LLMs) using Snowpark container services.
By the end of this book, you'll be able to leverage Snowpark's capabilities and propel your career as a Snowflake developer to new heights.




The Ultimate Guide to Snowpark

Design and deploy Snowpark with Python for efficient data workloads

Shankar Narayanan SGS

Vivekanandan SS

The Ultimate Guide to Snowpark

Copyright © 2024 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Group Product Manager: Kaustubh Manglurkar

Publishing Product Manager: Apeksha Shetty

Book Project Manager: Hemangi Lotlikar

Content Development Editor: Manikandan Kurup

Technical Editor: Seemanjay Ameriya

Copy Editor: Safis Editing

Proofreader: Manikandan Kurup

Indexer: Hemangini Bari

Production Designer: Joshua Misquitta

Senior DevRel Marketing Executive: Nivedita Singh

First published: May 2024

Production reference: 1100524

Published by Packt Publishing Ltd.

Grosvenor House

11 St Paul’s Square

Birmingham

B3 1RB, UK

ISBN 978-1-80512-341-5

www.packtpub.com

To my caring mother, Geetha, and inspiring father, Ganapathy, who taught me to aim high. To my lovely sister, Deepa, who tolerated all my pranks yet continued to love me unconditionally.

To my life partner, Amrita – thank you for being my loving partner and motivating and supporting me in completing the book.

To Snowflake for creating the comprehensive data cloud platform and to all the technologists and developers whose work inspired the book.

– Shankar Narayanan SGS

I dedicate this book to the loving memory of my cherished parents, the Late S.R. Srinivasan and the Late S. Chandra. Their legacy of love and strength continues to shape my journey each day.

I am deeply grateful for the unwavering support and love of my brother, S.S. Sathish Kumar, and sister-in-law, R. Ranjani. They have stood by my side steadfastly through every high and low of my life.

My heartfelt thanks go to my beloved aunt, R. Rani, and to my little bundle of joy, Skandhaguru (Skandhu), who has loved me no matter what through this journey.

Finally, I extend my heartfelt gratitude to the remarkable transplant team at Kauvery Hospital, Chennai, for guiding me through my renal transplant journey, even as I write and publish this book.

– Vivekanandan SS

Foreword

The industry stands at a unique and critical juncture. Data plays an increasingly important role in every organization. The volumes and ecosystems of data continue to grow. AI and ML will continue to define and redefine business models and customer experiences. Complementing these forces is an emerging set of tools and technologies to help you explore these exciting new worlds.

With that framing in mind, I’m thrilled to introduce The Ultimate Guide to Snowpark: Design and deploy Snowflake Snowpark with Python for efficient data workloads, authored by the Data Superhero Shankar Narayanan SGS and his co-author Vivekanandan SS. This book arrives at a critical juncture in the evolution of data processing, where efficiency and scalability are paramount, and Snowpark stands as a beacon of innovation within Snowflake’s ecosystem.

The core premise of Snowpark is simple: how can we improve data processing tasks and data applications by moving the code and processing to the data? Snowpark represents a transformative approach to managing data workloads, offering developers and data scientists the tools to build robust, scalable data solutions directly within Snowflake. By integrating Python, Snowpark unleashes the power of one of the most popular programming languages, making it accessible not just to data engineers but also to a broader community of technologists seeking to leverage data in new and powerful ways.

This guide is more than just a technical manual; it is a gateway to mastering Snowpark. The authors have meticulously crafted a resource that balances foundational knowledge with advanced techniques, providing clear, actionable guidance complemented by practical examples and code snippets. Their deep understanding of Snowpark’s capabilities is evident in every chapter, making this book an indispensable tool for anyone eager to excel in the Snowflake environment.

As the Director of Product at Snowflake managing Snowpark, I am profoundly proud and grateful to the authors who invested their expertise and passion into empowering our user community. This book does an exceptional job of demystifying complex concepts and lays down a roadmap for deploying sophisticated data engineering and science projects that are both innovative and practical.

To the readers, whether you are starting your journey with Snowflake or looking to expand your existing skills, The Ultimate Guide to Snowpark will equip you with the knowledge to transform your ideas into reality, elevate your projects, and lead the way in data-driven innovation.

Enjoy the read, and may it inspire you to push the boundaries of what is possible with Snowpark and Snowflake. I can’t wait to see what you can create, build, and unlock.

Jeff Hollan

Director of Product, Snowflake

Contributors

About the authors

Shankar Narayanan SGS is a principal architect at Microsoft, with over a decade of diverse experience leading and delivering large-scale data and cloud implementations for Fortune 500 companies across various industries. He has successfully implemented the Snowflake Data Cloud for many organizations, leading customers to adopt Snowflake.

He holds bachelor’s and master’s degrees in computer science and many certifications in multi-cloud platforms and Snowflake. He is an award-winning blogger, contributing to various technical publications and open source projects. He has been selected as the SAP community topic leader. He has been chosen as one of the Snowflake Data Heroes by Snowflake and the recipient of a Top 10 Data and Analytics Professional award by OnCon.

Vivekanandan SS spearheads the GenAI enablement team at Verizon, leveraging over a decade of expertise in data science and big data. His professional journey spans building analytics solutions and products across diverse domains, and he is proficient in cloud analytics and data warehousing.

He holds a bachelor's degree in industrial engineering from Anna University, a long-distance program in big data analytics from IIM Bangalore, and a master's in data science from Eastern University. As a seasoned trainer specializing in Snowflake and GenAI, he imparts his knowledge and also serves as a data science guest faculty member and advisor for various educational institutes. His solutions rank in the top 1 percentile of Kaggle kernels globally.

About the reviewers

Preston Blackburn is a machine learning engineer with a background in data engineering. He has worked at multiple start-ups, specializing in Snowflake consulting, and has built libraries that extend Snowpark's functionality. Preston excels in developing internal developer tools, accelerating data modernization efforts, and architecting robust ML pipelines. His dedication to pushing the boundaries of technology drives innovation and ensures his clients stay at the forefront of industry advancements.

Balamurugan Kannaiyan is a highly accomplished data engineering leader, with around two decades of experience specializing in cloud technologies (AWS, Azure, and Snowflake), data management, and advanced analytics.

He currently leads the data engineering team at a Texas public sector organization, leveraging Snowflake's cutting-edge capabilities to build high-performance data applications. Bala brings a distinguished track record from prior roles within the public sector and Fortune 500 companies in designing scalable, cloud-distributed systems. Further solidifying his expertise, Bala holds a bachelor of engineering degree from the prestigious Anna University alongside numerous certifications and accreditations in Snowflake, Azure, Databricks, and Oracle.

Table of Contents

Preface

Part 1: Snowpark Foundation and Setup

1

Discovering Snowpark

Introducing Snowpark

Leveraging Python for Snowpark

Capabilities of Snowpark for Python

Why Python for Snowpark

Understanding Snowpark for different workloads

Data science and ML

Data engineering

Data governance and security

Data applications

Realizing the value of using Snowpark

Summary

2

Establishing a Foundation with Snowpark

Technical requirements

Configuring the Snowpark development environment

Snowpark Python worksheet

Snowpark development in a local environment

Operating with Snowpark

The Python Engine

Client APIs

UDFs

Establishing a project structure for Snowpark

Summary

Part 2: Snowpark Data Workloads

3

Simplifying Data Processing Using Snowpark

Technical requirements

Data ingestion

Important note on datasets

Ingesting a CSV file into Snowflake

Ingesting JSON into Snowflake

Ingesting Parquet files into Snowflake

Ingesting images into Snowpark

Data exploration and transformation

Data exploration

Data transformations

Data grouping and analysis

Data grouping

Data analysis

Summary

4

Building Data Engineering Pipelines with Snowpark

Technical requirements

Developing resilient data pipelines with Snowpark

Traditional versus modern data pipelines

Data engineering with Snowpark

Implementing programmatic ELT with Snowpark

Deploying efficient DataOps in Snowpark

Developing a data engineering pipeline

Overview of tasks in Snowflake

Compute models for tasks

Task graphs

Managing tasks and task graphs with Python

Implementing logging and tracing in Snowpark

Event tables

Setting up logging in Snowpark

Handling exceptions in Snowpark

Setting up tracing in Snowpark

Comparison of logs and traces

Summary

5

Developing Data Science Projects with Snowpark

Technical requirements

Data science in Data Cloud

Data science and ML concepts

The Data Cloud paradigm

Why Snowpark for data science and ML?

Introduction to Snowpark ML

End-to-end ML with Snowpark

Exploring and preparing data

Missing value analysis

Outlier analysis

Correlation analysis

Leakage variables

Feature engineering

Training ML models in Snowpark

The efficiency of Snowpark ML

Summary

6

Deploying and Managing ML Models with Snowpark

Technical requirements

Deploying ML models in Snowpark

Snowpark ML model registry

Managing Snowpark model data

Snowpark Feature Store

Benefits of Feature Store

Feature stores versus data warehouses

When to utilize versus when to avoid feature stores

Summary

Part 3: Snowpark Applications

7

Developing a Native Application with Snowpark

Technical requirements

Introduction to the Native Apps Framework

Snowflake’s native application landscape

Native App Framework components

Streamlit in Snowflake

Benefits of Native Apps

Developing the native application

The Streamlit editor

Running the Streamlit application

Developing with the Native App Framework

Publishing the native application

Setting the default release directive

Creating a listing for your application

Managing the native application

Viewing installed applications

Viewing README for applications

Managing access to the application

Removing an installed application

Summary

8

Introduction to Snowpark Container Services

Technical requirements

Introduction to Snowpark Container Services

Data security in Snowpark Container Services

Components of Snowpark Containers

Setting up Snowpark Container Services

Creating Snowflake objects

Setting up the services

Setting up the filter service

Building the Docker image

Deploying the service

Setting up a Snowpark Container Service job

Setting up the job

Deploying the job

Executing the job

Deploying LLMs with Snowpark

Preparing the LLM

Registering the model

Deploying the model to Snowpark Container Services

Running the model

Summary

Index

Other Books You May Enjoy

Part 1: Snowpark Foundation and Setup

In this part, we will explore the fundamental and advanced features of the Snowpark framework in Python. This part focuses on the Snowpark platform and the setup required to get started using Snowpark.

This part includes the following chapters:

Chapter 1, Discovering Snowpark

Chapter 2, Establishing a Foundation with Snowpark

1

Discovering Snowpark

Snowpark is a major recent innovation from Snowflake that provides an intuitive set of libraries and runtimes for querying and processing data at scale in Snowflake. This chapter guides you through Snowpark so that you understand its unique capabilities. In addition, it helps you learn how to use Python with Snowpark and apply it to workloads such as data engineering, data science, and data applications. By the end of this chapter, you will have grasped Snowpark’s capabilities and benefits, including faster data processing, scalability, and reduced costs.

In this chapter, we’re going to cover the following main topics:

Introducing Snowpark

Leveraging Python for Snowpark

Understanding Snowpark for different workloads

Realizing the value of using Snowpark

Introducing Snowpark

Snowflake, founded in 2012, began its journey to the data cloud by completely re-engineering the world of data and rethinking how a reliable, secure, high-performance, and scalable data-processing system should be architected for the cloud. It started by offering cloud-based data warehousing through a managed Software as a Service (SaaS) platform to load, analyze, and process large volumes of data. Snowflake’s success lies in the fact that it is a cloud-native managed solution built on top of the major public cloud providers (Amazon Web Services, Microsoft Azure, and Google Cloud Platform), giving organizations a reliable, secure, high-performance, and scalable data-processing system without the need to deploy hardware or install or configure any software.

As with any cloud data warehouse, Snowflake supports American National Standards Institute (ANSI) SQL as the language of choice. Although SQL is a powerful declarative language that allows users to ask questions of data, it is constrained to data warehouse workloads, limiting support for advanced workloads such as data science and data engineering. These workloads require developers to write solutions in other programming languages, leading them to move data out of Snowflake to perform them.

Snowflake’s solution to this challenge is Snowpark, an innovative developer framework that streamlines the process of building complex data pipelines. With Snowpark, data scientists and developers can directly interact with Snowflake using their preferred programming language, enabling them to quickly and securely deploy machine learning (ML) models, execute data pipelines, and develop data applications on Snowflake’s virtual compute warehouse in a serverless manner without having to transfer data outside of Snowflake.
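As a minimal sketch of what this interaction looks like, the snippet below builds a Snowpark session from a set of connection parameters. The account and credential values are placeholders you would replace with your own, and the import is deferred into the helper so the configuration itself can be inspected without a live Snowflake environment.

```python
# Placeholder connection parameters -- substitute your own account details.
connection_parameters = {
    "account": "your_account_identifier",
    "user": "your_user",
    "password": "your_password",
    "role": "SYSADMIN",
    "warehouse": "COMPUTE_WH",
    "database": "DEMO_DB",
    "schema": "PUBLIC",
}

def create_snowpark_session(params):
    # Imported here so the configuration above can be built and validated
    # even where snowflake-snowpark-python is not installed.
    from snowflake.snowpark import Session
    return Session.builder.configs(params).create()
```

Calling `create_snowpark_session(connection_parameters)` returns a live session object against which DataFrames, UDFs, and stored procedures can then be created.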

Snowpark enables data teams to collaborate on data by natively supporting DataFrame-style programming in Python, Scala, or Java, exposing deeply integrated interfaces in these languages to augment Snowflake’s original SQL language and minimize the complexity of managing different environments for advanced data pipelines. This lets developers leverage Snowflake’s robust and scalable computing power to ship code to the data without exporting it outside Snowflake into other environments.

In this section, we covered a brief introduction to Snowpark and learned how it fits into the Snowflake ecosystem and how it helps developers. The following section will cover how to leverage Python for Snowpark.

Leveraging Python for Snowpark

In June 2022, Snowflake made a significant announcement, revealing the much-anticipated Snowpark for Python. Since this release, Python has rapidly emerged as the preferred programming language for Snowpark, providing users with a more extensive range of options for programming data in Snowflake. Moreover, Snowpark has simplified the management of data architectures, enabling users to operate more quickly and efficiently.

Snowpark for Python is a cutting-edge, enterprise-grade, open-source innovation integrated into the Snowflake data cloud. As a result, the platform delivers a seamless, unified experience for data scientists and developers. The Snowpark for Python package is built upon the Snowflake Python connector: the connector enables users to execute SQL commands and other essential functions in Snowflake, while Snowpark for Python empowers users to build more advanced data applications.

For instance, the platform permits users to run user-defined functions (UDFs), external functions, and stored procedures directly within Snowflake. This powerful new functionality enables data scientists, engineers, and developers to create robust and secure data pipelines and ML models within Snowflake. As a result, they can leverage the platform’s superior performance, elasticity, and security features to deliver advanced insights and drive meaningful business outcomes. Overall, Snowpark for Python represents a significant step forward for Snowflake, offering users enhanced functionality and flexibility while retaining the platform’s exceptional performance and security features.
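To illustrate the UDF capability, the sketch below keeps the business logic as an ordinary Python function, so it can be unit-tested locally, and shows a helper that would register it with Snowpark's `session.udf.register` API. The session, function name, and the example conversion itself are assumptions chosen for this illustration.

```python
def fahrenheit_to_celsius(temp_f: float) -> float:
    # Pure Python logic -- testable without any Snowflake connection.
    return (temp_f - 32) * 5.0 / 9.0

def register_f_to_c_udf(session):
    # Hypothetical registration step: `session` is a Snowpark Session.
    # Registering turns the plain function into a UDF callable from SQL.
    from snowflake.snowpark.types import FloatType
    return session.udf.register(
        func=fahrenheit_to_celsius,
        name="F_TO_C",                # example name for this sketch
        input_types=[FloatType()],
        return_type=FloatType(),
        replace=True,
    )
```

Once registered, the function could be invoked from SQL, for example as `SELECT F_TO_C(TEMP_F) FROM WEATHER`, assuming a `WEATHER` table exists in your schema.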

Snowpark for Python supports pre-vetted open-source packages through integration with the Anaconda environment that executes on an Anaconda-powered sandbox inside Snowflake’s virtual compute warehouses, which provides a familiar interface for the developers. The integrated Anaconda package manager is valuable for developers as it comes with a comprehensive set of curated open-source packages and supports resolving dependencies between different packages and versions. It is a huge time-saver and helps prevent developers from dealing with “dependency hell.”

Capabilities of Snowpark for Python

Snowpark for Python is generally available across all cloud instances of Snowflake. It helps accelerate different workloads and comes with a rich set of capabilities, as follows:

It allows developers to write Python code within Snowflake, enabling them to directly leverage the power of Python libraries and frameworks in Snowflake

It supports popular open-source Python libraries such as pandas, NumPy, SciPy, and scikit-learn, along with other libraries, allowing developers to perform complex data analysis and ML tasks directly within Snowflake

It also provides access to external data sources such as AWS S3, Azure Blob storage, and Google Cloud Storage, allowing developers to work with data stored outside Snowflake

It provides seamless integration with Snowflake’s SQL engine, allowing developers to write queries using functional programming methods with Python that compile to SQL

It also supports distributed processing, allowing developers to scale their Python code to handle large datasets and complex logic

It enables developers to build custom UDFs that can be used within SQL queries, allowing for greater flexibility and customization of data processing workflows

Snowpark provides a Python development environment within Snowflake, allowing developers to write, test, and debug Python code directly within the Snowflake UI

It enables developers to work with various data formats such as CSV, JSON, Parquet, and Avro, providing flexibility in data processing and analysis

It provides a unified data processing experience that works with SQL and Python in a single environment

It enables developers to create custom data pipelines using Python code, making it easier to integrate Snowflake with other data sources and data processing tools

It can handle real-time and batch data processing, making it easier to build data-intensive workloads

It provides a robust framework built on Snowflake that ensures data privacy and compliance with industry standards such as the Health Insurance Portability and Accountability Act (HIPAA), General Data Protection Regulation (GDPR), and System and Organization Controls (SOC)

Snowpark supports enriching data by leveraging Snowflake Marketplace

Snowpark for Python packs many capabilities that help developers use it efficiently for various workloads and use cases within Snowflake.

Why Python for Snowpark

Although Snowpark supports Python, Scala, and Java, this book will focus only on Python, the de facto standard for Snowpark development. Python’s high-level built-in data structures, with dynamic typing and binding, make it ideal for data operations and have driven its growing popularity. In addition, the language is flexible and easy for developers to learn. Its power lies in its rich open-source ecosystem, which is well supported by a growing list of popular packages.

Python is a general-purpose, versatile programming language for different purposes, such as data engineering, data science, and data applications. It enables developers to learn a single programming language for all their needs.

Snowflake is also heavily investing in Python to make it easier for data scientists, engineers, and application developers to build even more in the data cloud without governance trade-offs.

In this section, we covered the capabilities of Snowpark for Python and why Python is a preferred language for developing Snowpark. The following section will cover how Snowpark can be used for different workloads.

Understanding Snowpark for different workloads

The release of Snowpark transformed Snowflake into a complete data platform designed to support various workloads. Snowpark supports multiple workloads, such as data science and ML, data engineering, and data applications.

Data science and ML

Python is the favorite language of data scientists. Snowpark for Python supports popular libraries and frameworks such as pandas, NumPy, and scikit-learn, making it the ideal framework for data scientists to perform ML development in Snowflake. In addition, data scientists can use the DataFrames API to interact with data inside Snowflake and perform batch training and inference inside Snowflake. Developers can also use Snowpark for feature engineering, ML model inference, and end-to-end ML pipelines. Snowpark also provides the Snowpark ML library to support data science and ML in Snowpark.
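To make the batch-inference idea concrete, here is a toy scoring function with purely illustrative weights, not the output of any real training run. In Snowpark, logic like this, typically wrapping a real model loaded from a stage, would be registered as a (vectorized) UDF and applied to a Snowpark DataFrame so inference runs inside Snowflake.

```python
import math

# Illustrative weights only -- a real model would be trained and loaded from a stage.
WEIGHTS = {"tenure_months": 0.03, "monthly_spend": 0.01}
BIAS = -1.5

def churn_probability(row: dict) -> float:
    """Logistic score for one record; in Snowpark this would run inside
    a UDF applied to each row of a Snowpark DataFrame."""
    z = BIAS + sum(WEIGHTS[k] * row.get(k, 0.0) for k in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))
```

Because the scoring logic is a plain function, it can be unit-tested locally before being registered and run at scale in Snowflake.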

Data engineering

Data cleansing and ELT workloads are complex, and building data pipelines with SQL alone is difficult; this is where Snowpark can be of great benefit. Snowpark lets developers factor code for readability and reuse while providing better support for unit tests. In addition, with the support of Anaconda, developers can use open-source Python libraries for building reliable data pipelines. The other major challenge with data processing is that the infrastructure requires significant manual effort and maintenance. Snowpark solves this problem by being highly performant, enabling data engineers to work with large datasets quickly and efficiently, building complex data pipelines, and processing large volumes of data without performance issues.
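The testability point can be illustrated with plain Python: factoring each pipeline step into its own function makes every step independently unit-testable. Here the steps operate on lists of dictionaries purely for illustration; in a Snowpark pipeline each function would instead take and return a Snowpark DataFrame, and the column names and tax rate are made-up example values.

```python
def drop_missing_amounts(rows):
    # Step 1: cleanse -- discard records without an amount.
    return [r for r in rows if r.get("amount") is not None]

def add_tax(rows, rate=0.08):
    # Step 2: transform -- derive a tax column (example rate).
    return [{**r, "tax": round(r["amount"] * rate, 2)} for r in rows]

def run_pipeline(rows):
    # Compose the steps; each remains testable on its own.
    return add_tax(drop_missing_amounts(rows))
```

This factoring is exactly what SQL-only pipelines make hard: each transformation gets a name, a signature, and its own unit tests.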

Data governance and security

Snowpark supports developing solutions that incorporate data governance and security. Data governance is critical and augments the data science and data engineering use cases. Snowpark simplifies the governance posture by helping organizations understand and improve data quality. Developers can quickly create a function to perform data tests and detect anomalies. Snowpark can utilize the data classification capability to detect personally identifiable information (PII) and classify data that is critical to an organization. Custom functions developed in Snowpark can mask sensitive data such as credit card numbers using the robust dynamic data masking feature while retaining the existing security model in Snowflake.
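As a sketch of the masking idea, the function below hides all but the last four digits of a card number while preserving separators. The formatting convention is one illustrative choice; in practice, logic like this could back a Snowpark UDF used alongside Snowflake's built-in dynamic data masking policies.

```python
def mask_card_number(card_number: str, visible: int = 4) -> str:
    """Replace all but the trailing `visible` digits with '*',
    preserving non-digit separators such as dashes and spaces."""
    digits_seen = sum(c.isdigit() for c in card_number)
    to_mask = max(digits_seen - visible, 0)
    out, masked = [], 0
    for c in card_number:
        if c.isdigit() and masked < to_mask:
            out.append("*")
            masked += 1
        else:
            out.append(c)
    return "".join(out)
```

Keeping the masking rule in one tested function makes it easy to apply consistently across pipelines while the underlying security model stays in Snowflake.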

Data applications

Snowpark helps teams develop dynamic data applications that run directly on Snowflake without moving the data outside. Using Streamlit, a powerful open-source library that Snowflake acquired, developers can build native applications in the familiar Python environment. Interactive ML-powered applications can be developed and shared with users securely using role-based access controls, entirely on Snowflake’s governed platform, taking advantage of its scale, performance, and governance. The Snowflake Native Application Framework provides a streamlined path to monetize apps through Snowflake Marketplace, where you can make your app available to other Snowflake customers and open new revenue opportunities.

Snowpark supports different workloads and makes Snowflake a complete data cloud solution. The following section will highlight Snowpark’s technical and business benefits.

Realizing the value of using Snowpark

The traditional big data approach has been in the industry for a long time and is unsuitable for modern cloud-based scalable workloads. Traditional architecture has many challenges, such as the following:

De-coupling the compute and data into separate systems

Running separate processing clusters for different languages

Complexity in managing the system

Data silos and data duplication

Lack of unified security and governance

Snowflake solves the traditional system’s challenges using Snowpark, providing tremendous value to the data ecosystem and Snowflake users. The following diagram shows the difference between a traditional approach and Snowflake’s streamlined approach:

Figure 1.1 – Traditional versus Snowflake approach

As you can see from the difference between the two approaches, Snowpark’s streamlined approach benefits both the business and developers by providing a flexible, efficient, and cost-effective way to build data solutions that scale with business needs. Some of the significant values of using Snowpark are as follows:

Snowpark can access data programmatically through the DataFrame APIs, making data ingestion and integration consistent, as you can integrate various structured and unstructured data

Snowpark standardizes the approach to data processing since the data pipelines are in Python code; they can be tested and deployed and are easier to understand and interpret