Snowpark is a powerful framework that helps you unlock numerous possibilities within the Snowflake Data Cloud. However, without proper guidance, leveraging the full potential of Snowpark with Python can be challenging. Packed with practical examples and code snippets, this book will be your go-to guide to using Snowpark with Python successfully.
The Ultimate Guide to Snowpark helps you develop an understanding of Snowflake Snowpark and how it enables you to implement workloads in data engineering, data science, and data applications within the Data Cloud. From configuration and coding styles to workloads such as data manipulation, collection, preparation, transformation, aggregation, and analysis, this guide will equip you with the right knowledge to make the most of this framework. You'll discover how to build, test, and deploy data pipelines and data science models. As you progress, you’ll deploy data applications natively in Snowflake and operate large language models (LLMs) using Snowpark container services.
By the end of this book, you'll be able to leverage Snowpark's capabilities and propel your career as a Snowflake developer to new heights.
You can read the e-book in Legimi apps or any app that supports the following format:
Page count: 259
Year of publication: 2024
The Ultimate Guide to Snowpark
Design and deploy Snowpark with Python for efficient data workloads
Shankar Narayanan SGS
Vivekanandan SS
Copyright © 2024 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Group Product Manager: Kaustubh Manglurkar
Publishing Product Manager: Apeksha Shetty
Book Project Manager: Hemangi Lotlikar
Content Development Editor: Manikandan Kurup
Technical Editor: Seemanjay Ameriya
Copy Editor: Safis Editing
Proofreader: Manikandan Kurup
Indexer: Hemangini Bari
Production Designer: Joshua Misquitta
Senior DevRel Marketing Executive: Nivedita Singh
First published: May 2024
Production reference: 1100524
Published by Packt Publishing Ltd.
Grosvenor House
11 St Paul’s Square
Birmingham
B3 1RB, UK
ISBN 978-1-80512-341-5
www.packtpub.com
To my caring mother, Geetha, and inspiring father, Ganapathy, who taught me to aim high. To my lovely sister, Deepa, who tolerated all my pranks yet continued to love me unconditionally.
To my life partner, Amrita – thank you for being my loving partner and motivating and supporting me in completing the book.
To Snowflake for creating the comprehensive data cloud platform and to all the technologists and developers whose work inspired the book.
– Shankar Narayanan SGS
I dedicate this book to the loving memory of my cherished parents, the Late S.R. Srinivasan and the Late S. Chandra. Their legacy of love and strength continues to shape my journey each day.
I am deeply grateful for the unwavering support and love of my brother, S.S. Sathish Kumar, and sister-in-law, R. Ranjani. They have stood by my side steadfastly through every high and low of my life.
My heartfelt thanks go to my beloved aunt, R. Rani, and to my little bundle of joy, Skandhaguru (Skandhu), who has loved me no matter what through this journey.
Finally, I extend my heartfelt gratitude to the remarkable transplant team at Kauvery Hospital, Chennai, for guiding me through my renal transplant journey, even as I write and publish this book.
– Vivekanandan SS
This is a unique and critical juncture in the industry. Data plays an increasingly important role in every organization. The volumes and ecosystems of data continue to grow. AI and ML will continue to define and redefine business models and customer experiences. Complementing these forces is an emerging set of tools and technologies to help you explore these exciting new worlds.
With that framing in mind, I’m thrilled to introduce The Ultimate Guide to Snowpark: Design and deploy Snowpark with Python for efficient data workloads, authored by the Data Superhero Shankar Narayanan SGS and his co-author Vivekanandan SS. This book arrives at a critical juncture in the evolution of data processing, where efficiency and scalability are paramount, and Snowpark stands as a beacon of innovation within Snowflake’s ecosystem.
The core premise of Snowpark is simple: how can we improve data processing tasks and data applications by moving the code and processing to the data? Snowpark represents a transformative approach to managing data workloads, offering developers and data scientists the tools to build robust, scalable data solutions directly within Snowflake. By integrating Python, Snowpark unleashes the power of one of the most popular programming languages, making it accessible not just to data engineers but also to a broader community of technologists seeking to leverage data in new and powerful ways.
This guide is more than just a technical manual; it is a gateway to mastering Snowpark. The authors have meticulously crafted a resource that balances foundational knowledge with advanced techniques, providing clear, actionable guidance complemented by practical examples and code snippets. Their deep understanding of Snowpark’s capabilities is evident in every chapter, making this book an indispensable tool for anyone eager to excel in the Snowflake environment.
As the Director of Product at Snowflake managing Snowpark, I am profoundly proud and grateful to the authors who invested their expertise and passion into empowering our user community. This book does an exceptional job of demystifying complex concepts and lays down a roadmap for deploying sophisticated data engineering and science projects that are both innovative and practical.
To the readers, whether you are starting your journey with Snowflake or looking to expand your existing skills, The Ultimate Guide to Snowpark will equip you with the knowledge to transform your ideas into reality, elevate your projects, and lead the way in data-driven innovation.
Enjoy the read, and may it inspire you to push the boundaries of what is possible with Snowpark and Snowflake. I can’t wait to see what you can create, build, and unlock.
Jeff Hollan
Director of Product, Snowflake
Shankar Narayanan SGS is a principal architect at Microsoft, with over a decade of diverse experience leading and delivering large-scale data and cloud implementations for Fortune 500 companies across various industries. He has successfully implemented the Snowflake Data Cloud for many organizations, leading customers to adopt Snowflake.
He holds bachelor’s and master’s degrees in computer science and many certifications in multi-cloud platforms and Snowflake. He is an award-winning blogger, contributing to various technical publications and open source projects. He has been selected as the SAP community topic leader. He has been chosen as one of the Snowflake Data Heroes by Snowflake and the recipient of a Top 10 Data and Analytics Professional award by OnCon.
Vivekanandan SS spearheads the GenAI enablement team at Verizon, leveraging over a decade of expertise in data science and big data. His professional journey spans building analytics solutions and products across diverse domains, and he is proficient in cloud analytics and data warehouses.
He holds a bachelor’s degree in industrial engineering from Anna University, a distance-learning program in big data analytics from IIM Bangalore, and a master’s in data science from Eastern University. As a seasoned trainer, he imparts his knowledge, specializing in Snowflake and GenAI, and is also a data science guest faculty member and advisor for various educational institutes. His solutions rank in the top 1 percentile of Kaggle kernels globally.
Preston Blackburn is a machine learning engineer with a background in data engineering. He has worked at multiple start-ups, specializing in Snowflake consulting and built libraries that extend Snowpark functionality. Preston excels in developing internal developer tools, accelerating data modernization efforts, and architecting robust ML pipelines. His dedication to pushing the boundaries of technology drives innovation and ensures his clients stay at the forefront of industry advancements.
Balamurugan Kannaiyan is a highly accomplished data engineering leader, with around two decades of experience specializing in cloud technologies (AWS, Azure, and Snowflake), data management, and advanced analytics.
He currently leads the data engineering team at a Texas public-sector organization, leveraging Snowflake’s cutting-edge capabilities to build high-performance data applications. Bala brings a distinguished track record from prior roles within the public sector and Fortune 500 companies in designing scalable, cloud-distributed systems. Further solidifying his expertise, Bala holds a bachelor of engineering degree from the prestigious Anna University alongside numerous certifications and accreditations in Snowflake, Azure, Databricks, and Oracle.
In this part, we will explore the fundamental and advanced features of the Snowpark framework in Python. This part focuses on the Snowpark platform and the setup required to get started using Snowpark.
This part includes the following chapters:
Chapter 1, Discovering Snowpark
Chapter 2, Establishing a Foundation with Snowpark
Snowpark is a recent major innovation released by Snowflake that provides an intuitive set of libraries and runtimes for querying and processing data at scale in Snowflake. This chapter aims to guide you through Snowpark and its unique capabilities. In addition, the chapter helps you learn how to utilize Python with Snowpark and apply it to various workloads such as data engineering, data science, and data applications. By the end of this chapter, you will have grasped Snowpark’s capabilities and benefits, including faster data processing, scalability, and reduced costs.
In this chapter, we’re going to cover the following main topics:
Introducing Snowpark
Leveraging Python for Snowpark
Understanding Snowpark for different workloads
Realizing the value of using Snowpark
Snowflake, founded in 2012, started its journey to the data cloud by completely re-engineering the world of data and rethinking how a reliable, secure, high-performance, and scalable data-processing system should be architected for the cloud. It started by offering cloud-based data warehousing through a managed Software as a Service (SaaS) platform to load, analyze, and process large volumes of data. The success of Snowflake lies in the fact that it is a cloud-native managed solution built on top of major public cloud providers such as Amazon Web Services, Microsoft Azure, and Google Cloud Platform, automatically providing organizations with a reliable, secure, high-performance, and scalable data-processing system without the need to deploy hardware or install or configure any software.
As with any cloud data warehouse, Snowflake supports American National Standards Institute (ANSI) SQL as the language of choice. Although SQL is a powerful declarative language that allows users to ask questions of data, it is constrained to data warehouse workloads. This limits support for advanced workloads such as data science and data engineering, which require developers to write solutions in other programming languages and has traditionally forced them to move data out of Snowflake to perform these workloads.
Snowflake’s solution to this challenge is Snowpark, an innovative developer framework that streamlines the process of building complex data pipelines. With Snowpark, data scientists and developers can directly interact with Snowflake using their preferred programming language, enabling them to quickly and securely deploy machine learning (ML) models, execute data pipelines, and develop data applications on Snowflake’s virtual compute warehouse in a serverless manner without having to transfer data outside of Snowflake.
Snowpark enables data teams to collaborate on the data by natively supporting work with DataFrame style programming in Python, Scala, or Java, exposing deeply integrated interfaces in these languages to augment Snowflake’s original SQL language and minimizing the complexity of having to manage different environments for advanced data pipelines. This has led developers to leverage Snowflake’s robust and scalable computing power to ship code to the data without exporting it outside Snowflake into other environments.
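The DataFrame-style programming described above can be sketched in Python. This is a minimal, hedged sketch, not the book's code: it assumes the snowflake-snowpark-python package is installed, that `connection_parameters` holds valid credentials, and that an `ORDERS` table with `ORDER_ID` and `AMOUNT` columns exists.

```python
def dataframe_example(connection_parameters: dict):
    # Hedged sketch: assumes valid credentials; the ORDERS table and its
    # columns are hypothetical placeholders.
    from snowflake.snowpark import Session
    from snowflake.snowpark.functions import col

    session = Session.builder.configs(connection_parameters).create()

    # Operations are lazy: Snowpark builds SQL and pushes the work to
    # Snowflake's compute -- the data never leaves the platform.
    large_orders = (
        session.table("ORDERS")
        .filter(col("AMOUNT") > 100)
        .select("ORDER_ID", "AMOUNT")
    )
    large_orders.show()  # triggers execution in Snowflake
    session.close()
```

Nothing is computed until an action such as `show()` is called; the chained operations are translated into a single SQL statement executed on a virtual warehouse.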
In this section, we covered a brief introduction to Snowpark and learned how it fits into the Snowflake ecosystem and how it helps developers. The following section will cover how to leverage Python for Snowpark.
In June 2022, Snowflake made a significant announcement, revealing the much-anticipated Snowpark for Python. Python has rapidly emerged as the preferred programming language for Snowpark, providing users with a more extensive range of options for programming data in Snowflake. Moreover, Snowpark has simplified the management of data architectures, enabling users to operate more quickly and efficiently.
Snowpark for Python is a cutting-edge, enterprise-grade, open-source innovation integrated into the Snowflake Data Cloud. As a result, the platform delivers a seamless, unified experience for data scientists and developers. The Snowpark for Python package is built upon the Snowflake Python connector: the connector enables users to execute SQL commands and other essential functions in Snowflake, while Snowpark for Python empowers users to build more advanced data applications.
For instance, the platform permits users to run user-defined functions (UDFs), external functions, and stored procedures directly within Snowflake. This powerful new functionality enables data scientists, engineers, and developers to create robust and secure data pipelines and ML models within Snowflake. As a result, they can leverage the platform’s superior performance, elasticity, and security features to deliver advanced insights and drive meaningful business outcomes. Overall, Snowpark for Python represents a significant step forward for Snowflake, offering users enhanced functionality and flexibility while retaining the platform’s exceptional performance and security features.
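As a sketch of the UDF workflow this paragraph describes, the logic is written as ordinary Python and then registered with a session. The function name, types, and registration options below are illustrative assumptions, not the book's code.

```python
def sentiment_label(score: float) -> str:
    # Plain Python logic; once registered, it executes inside Snowflake.
    return "positive" if score >= 0.5 else "negative"


def register_sentiment_udf(session):
    # Hedged sketch: `session` is an open snowflake.snowpark.Session.
    from snowflake.snowpark.types import FloatType, StringType

    return session.udf.register(
        func=sentiment_label,
        name="SENTIMENT_LABEL",
        input_types=[FloatType()],
        return_type=StringType(),
        replace=True,
    )
```

Once registered, the function can be called from plain SQL (for example, `SELECT SENTIMENT_LABEL(0.8)`), so Python logic becomes available to every SQL user in the account.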
Snowpark for Python supports pre-vetted open-source packages through integration with the Anaconda environment that executes on an Anaconda-powered sandbox inside Snowflake’s virtual compute warehouses, which provides a familiar interface for the developers. The integrated Anaconda package manager is valuable for developers as it comes with a comprehensive set of curated open-source packages and supports resolving dependencies between different packages and versions. It is a huge time-saver and helps prevent developers from dealing with “dependency hell.”
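Requesting curated packages for the server-side sandbox is a one-line call on the session. A hedged sketch, assuming an open session and that the named packages are available in Snowflake's Anaconda channel:

```python
def configure_packages(session):
    # Hedged sketch: `session` is an open snowflake.snowpark.Session.
    # The resolver picks compatible versions from the Anaconda channel,
    # so dependency conflicts are handled server-side.
    session.add_packages("numpy", "pandas", "scikit-learn")
```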
Snowpark for Python is generally available across all cloud instances of Snowflake. It helps accelerate different workloads and comes with a rich set of capabilities, as follows:
It allows developers to write Python code within Snowflake, enabling them to directly leverage the power of Python libraries and frameworks in Snowflake
It supports popular open-source Python libraries such as pandas, NumPy, SciPy, and scikit-learn, along with other libraries, allowing developers to perform complex data analysis and ML tasks directly within Snowflake
It also provides access to external data sources such as AWS S3, Azure Blob storage, and Google Cloud Storage, allowing developers to work with data stored outside Snowflake
It provides seamless integration with Snowflake’s SQL engine, allowing developers to write queries using functional programming methods with Python that compile to SQL
It also supports distributed processing, allowing developers to scale their Python code to handle large datasets and complex logic
It enables developers to build custom UDFs that can be used within SQL queries, allowing for greater flexibility and customization of data processing workflows
Snowpark provides a Python development environment within Snowflake, allowing developers to write, test, and debug Python code directly within the Snowflake UI
It enables developers to work with various data formats such as CSV, JSON, Parquet, and Avro, providing data processing and analysis flexibility
It provides a unified data processing experience that works with SQL and Python in a single environment
It enables developers to create custom data pipelines using Python code, making it easier to integrate Snowflake with other data sources and data processing tools
It can handle real-time and batch data processing, making it easier to build data-intensive workloads
It provides a robust framework built on Snowflake that ensures data privacy and compliance with industry standards such as the Health Insurance Portability and Accountability Act (HIPAA), the General Data Protection Regulation (GDPR), and System and Organization Controls (SOC)
Snowpark supports enriching data by leveraging Snowflake Marketplace
Snowpark for Python packs many capabilities that help developers use it efficiently for various workloads and use cases within Snowflake.
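As one example of the format support in the list above, here is a hedged sketch of reading staged CSV files with an explicit schema; the stage name `@MY_STAGE`, the path, and the columns are hypothetical placeholders.

```python
def load_weather_csv(session):
    # Hedged sketch: assumes a stage named MY_STAGE containing CSV files
    # with two columns; all names here are illustrative placeholders.
    from snowflake.snowpark.types import (
        DoubleType, StringType, StructField, StructType,
    )

    schema = StructType([
        StructField("CITY", StringType()),
        StructField("TEMP_C", DoubleType()),
    ])
    # Returns a lazy DataFrame over the staged files.
    return session.read.schema(schema).csv("@MY_STAGE/weather/")
```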
Although Snowpark supports Python, Scala, and Java, this book will focus only on Python, the de facto standard for Snowpark development. Python’s high-level built-in data structures, together with dynamic typing and binding, make it ideal for data operations. In addition, the language is very flexible and easy for developers to learn. Its power lies in its rich open-source ecosystem, which is well supported with a growing list of popular packages.
Python is a general-purpose, versatile programming language for different purposes, such as data engineering, data science, and data applications. It enables developers to learn a single programming language for all their needs.
Snowflake is also heavily investing in Python to make it easier for data scientists, engineers, and application developers to build even more in the data cloud without governance trade-offs.
In this section, we covered the capabilities of Snowpark for Python and why Python is a preferred language for developing Snowpark. The following section will cover how Snowpark can be used for different workloads.
The release of Snowpark transformed Snowflake into a complete data platform designed to support various workloads. Snowpark supports multiple workloads, such as data science and ML, data engineering, and data applications.
Python is the favorite language of data scientists. Snowpark for Python supports popular libraries and frameworks such as pandas, NumPy, and scikit-learn, making it an ideal framework for data scientists to perform ML development in Snowflake. In addition, data scientists can use the DataFrame API to interact with data inside Snowflake and perform batch training and inference there. Developers can also use Snowpark for feature engineering, ML model inference, and end-to-end ML pipelines. Snowpark also provides the Snowpark ML library to support data science and ML in Snowpark.
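A minimal sketch of the batch-training pattern described above, assuming an open session, a hypothetical `FEATURES` table with `X1`, `X2`, and `LABEL` columns, and scikit-learn available through the Anaconda integration:

```python
def train_model(session):
    # Hedged sketch: pulls features via the DataFrame API and trains a
    # scikit-learn model; table and column names are placeholders.
    from sklearn.linear_model import LogisticRegression

    pdf = session.table("FEATURES").select("X1", "X2", "LABEL").to_pandas()
    model = LogisticRegression()
    model.fit(pdf[["X1", "X2"]], pdf["LABEL"])
    return model
```

Wrapping this in a stored procedure would let the training itself run on Snowflake's compute instead of the client.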
Data cleansing and ELT workloads are complex, and building data pipelines with SQL alone is difficult; this is where Snowpark can be of great benefit. Snowpark lets developers factor code for readability and reuse while providing better support for unit tests. In addition, with the support of Anaconda, developers can use open-source Python libraries to build reliable data pipelines. The other major challenge with data processing is that the infrastructure requires significant manual effort and maintenance. Snowpark solves this problem by being highly performant, enabling data engineers to work with large datasets quickly and efficiently, build complex data pipelines, and process large volumes of data without performance issues.
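The factoring-for-reuse point can be sketched as a small transformation function: because it takes and returns a DataFrame, it can be composed into pipelines and unit-tested in isolation. The table and column names are assumptions, not the book's code.

```python
def clean_customers(df):
    # Hedged sketch: `df` is a snowflake.snowpark.DataFrame with NAME and
    # COUNTRY columns; returns a new lazy DataFrame, so steps compose.
    from snowflake.snowpark.functions import col, trim, upper

    return (
        df.with_column("NAME", trim(col("NAME")))
          .with_column("COUNTRY", upper(col("COUNTRY")))
          .filter(col("NAME") != "")
    )
```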
Snowpark supports developing solutions that incorporate data governance and security. Data governance is critical and augments the data science and data engineering use cases. Snowpark simplifies the governance posture by helping organizations understand and improve data quality. Developers can quickly create a function to perform data tests and detect anomalies. Snowpark can utilize the data classification capability to detect personally identifiable information (PII) and classify data that is critical to an organization. Custom functions developed in Snowpark can mask sensitive data such as credit card numbers using the robust dynamic data masking feature while retaining the existing security model in Snowflake.
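As an illustration of the custom masking function mentioned above, the core logic is plain Python that keeps only the last four digits; registering it as a UDF would follow the same pattern as any other Snowpark function. This sketch is not Snowflake's dynamic data masking feature itself, just an example transformation such a function might apply.

```python
import re


def mask_card_number(card_number: str) -> str:
    # Keep the last four digits and mask the rest; non-digits are dropped.
    digits = re.sub(r"\D", "", card_number)
    if len(digits) <= 4:
        return digits
    return "*" * (len(digits) - 4) + digits[-4:]
```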
Snowpark helps the team develop dynamic data applications that run directly on Snowflake without moving the data outside. Using Streamlit, a powerful open-source library that Snowflake acquired, developers can build native applications using the familiar Python environment. Interactive ML-powered applications can be developed and shared with users securely utilizing role-based access controls entirely on Snowflake’s governed platform, taking advantage of its scale, performance, and governance. The Snowflake Native Application Framework provides a streamlined path to monetize apps through Snowflake Marketplace, where you can make your app available to other Snowflake customers and open new revenue opportunities.
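A hedged sketch of a Streamlit in Snowflake page: inside a deployed app, `get_active_session()` returns the app's Snowpark session, so queries run on Snowflake's governed compute. The `SALES` table and its columns are illustrative assumptions.

```python
def render_sales_page():
    # Hedged sketch: runs inside a Streamlit in Snowflake app, where
    # get_active_session() is available; table/columns are placeholders.
    import streamlit as st
    from snowflake.snowpark.context import get_active_session

    session = get_active_session()
    st.title("Daily Sales")

    # Aggregate in Snowflake, then hand a small result to the chart.
    totals = session.table("SALES").group_by("DAY").sum("AMOUNT")
    st.bar_chart(totals.to_pandas(), x="DAY", y="SUM(AMOUNT)")
```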
Snowpark supports different workloads and makes Snowflake a complete data cloud solution. The following section will highlight Snowpark’s technical and business benefits.
The traditional big data approach has been in the industry for a long time and is unsuitable for modern cloud-based scalable workloads. Traditional architecture has many challenges, such as the following:
De-coupling the compute and data into separate systems
Running separate processing clusters for different languages
Complexity in managing the system
Data silos and data duplication
Lack of unified security and governance
Snowflake solves these challenges with Snowpark, providing tremendous value to the data ecosystem and Snowflake users. The following diagram shows the difference between a traditional approach and Snowflake’s streamlined approach:
Figure 1.1 – Traditional versus Snowflake approach
As you can see from the difference between both approaches, Snowpark’s streamlined approach benefits both the business and the developers by providing a flexible, efficient, and cost-effective way to build data that scales with the business needs. Some of the significant values of using Snowpark are as follows:
Snowpark can access data programmatically through the DataFrame API, making data ingestion and integration consistent, as you can work with various structured and unstructured data
Snowpark standardizes the approach to data processing: since data pipelines are written in Python code, they can be tested and deployed, and they are easier to understand and interpret