Pachyderm is an open source project that enables data scientists to run reproducible data pipelines and scale them to an enterprise level. This book will teach you how to implement Pachyderm to create collaborative data science workflows and reproduce your ML experiments at scale.
You’ll begin your journey by exploring the importance of data reproducibility and comparing different data science platforms. Next, you’ll explore how Pachyderm fits into the picture and its significance, followed by learning how to install Pachyderm locally on your computer or a cloud platform of your choice. You’ll then discover the architectural components and Pachyderm's main pipeline principles and concepts. The book demonstrates how to use Pachyderm components to create your first data pipeline and advances to cover common operations involving data, such as uploading data to and from Pachyderm to create more complex pipelines. Based on what you've learned, you'll develop an end-to-end ML workflow, before trying out the hyperparameter tuning technique and the different supported Pachyderm language clients. Finally, you’ll learn how to use a SaaS version of Pachyderm with Pachyderm Notebooks.
By the end of this book, you will have learned all aspects of running your data pipelines in Pachyderm and managing them on a day-to-day basis.
Learn how to build version-controlled, end-to-end data pipelines using Pachyderm 2.0
Svetlana Karslioglu
BIRMINGHAM—MUMBAI
Copyright © 2022 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Publishing Product Manager: Reshma Raman
Senior Editor: David Sugarman
Content Development Editor: Nathanya Dias
Technical Editor: Rahul Limbachiya
Copy Editor: Safis Editing
Project Coordinator: Aparna Ravikumar Nair
Proofreader: Safis Editing
Indexer: Pratik Shirodkar
Production Designer: Sinhayna Bais
Marketing Coordinator: Shifa Ansari
First published: March 2022
Production reference: 1070222
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-80107-448-3
www.packt.com
Svetlana Karslioglu is a seasoned documentation professional with over 10 years of experience in top Silicon Valley companies. During her tenure at Pachyderm, she authored much of the open source documentation for Pachyderm and was also in charge of the documentation infrastructure. Throughout her career, she has spoken at local conferences and given talks advocating for open infrastructure and unbiased research in artificial intelligence. When Svetlana is not busy writing books, she spends time with her three children and her husband, Murat.
This book is dedicated to my husband, Murat, who supported me through each step of the process, tirelessly watched our three kids, and cooked breakfasts, lunches, and dinners while I was busy writing this book. My husband gave me the most valuable gift of all – time to fulfill my dreams. For this, I thank you, Muratjim.
Soumo Chakraborty is a data architect at Kyndryl – Data & AI team. He helps clients continually succeed by using big data and AI services. He started 15 years ago as a trainee desktop engineer and life has been about continuous learning for him – leading to his current role. He is very passionate about learning new skills around the cloud, MLOps, AIOps, and data engineering, and putting them into business use cases. Soumo loves traveling and dreams of touring the world. During his leisure time, he and his 6-year-old watch YouTube videos of the Amazon rainforest and write tech blogs.
I dedicate this effort to my parents, who taught me to keep moving, my spouse, Richa, for teaching me to challenge new heights, Manoj, a mentor for life, and to all my well-wishers.
Manoj Palaniswamy is a senior technical staff member at Kyndryl and plays a technical leadership role within the Application, Data and AI practice. He has been working in the IT industry for more than 17 years and his domain expertise includes enterprise AI strategy, data management, AIOps, MLOps, AI platform engineering, and hybrid cloud IT infrastructure and analytics. He has led many cross-cultural technical teams across the globe on complex technical projects and closely works with clients to translate their business requirements into technical solutions. Manoj holds two patents in the area of machine learning and workload optimization on VMs.
Hendrik Vincent Koops is a senior data scientist at RTL Netherlands, working on AI multimedia projects. He received a B.A. degree in audio and sound design with honors in 2008 and an M.A. degree in music composition in 2008, both at the HKU University of the Arts Utrecht. He then received B.S. (2012) and M.S. (2014) degrees in artificial intelligence at Utrecht University. After completing research at the Department of Electrical and Computer Engineering at Carnegie Mellon University, he received a Ph.D. degree in computer science from Utrecht University in 2019. At RTL Netherlands, he is responsible for developing scalable audio-visual machine learning solutions to make video content more discoverable, searchable, and valuable.
This section introduces the basics of Pachyderm, as well as describing the importance of data reproducibility for an enterprise-level data science platform. You will learn what the main pillars of the Pachyderm solution are, including repositories, datums, jobs, and the most important of them all – the pipeline. This section also briefly talks about the ethics of AI in terms of reproducibility.
This section comprises the following chapters:
Chapter 1, The Problem of Data Reproducibility
Chapter 2, Pachyderm Basics
Chapter 3, Pachyderm Pipeline Specification

Today, machine learning algorithms are used everywhere. They are integrated into our day-to-day lives, and we use them without noticing. While we are rushing to work, planning a vacation, or visiting a doctor's office, the models are at work, at times making important decisions about us. If we are unsure what the model is doing and how it makes decisions, how can we be sure that its decisions are fair and just? Pachyderm profoundly cares about the reproducibility of data science experiments and puts data lineage, reproducibility, and version control at its core. But before we proceed, let's discuss why reproducibility is so important.
This chapter explains the concepts of reproducibility, ethical Artificial Intelligence (AI), and Machine Learning Operations (MLOps), as well as providing an overview of the existing data science platforms and how they compare to Pachyderm.
In this chapter, we're going to cover the following main topics:
Why is reproducibility important?
The reproducibility crisis in science
Demystifying MLOps
Types of data science platforms
Explaining ethical AI

First of all, let's define AI, ML, and data science.
Data science is a field of study that involves collecting and preparing large amounts of data to extract knowledge and produce insights.
AI is more of an umbrella term for technology that enables machines to mimic the behavior of human beings. Machine learning is a subset of AI that is based on the idea that an algorithm can learn based on past experiences.
Now, let's define reproducibility. A data science experiment is considered reproducible if other data scientists can repeat it with a comparable outcome on a similar dataset and problem. And although reproducibility has been a pillar of scientific research for decades, it has only recently become an important topic in data science.
Not only is a reproducible experiment more likely to be free of errors, but it also takes the experiment further and allows others to build on top of it, contributing to knowledge transfer and speeding up future discoveries.
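At the level of code, the most basic prerequisite for a reproducible experiment is controlling randomness. The following minimal sketch (a generic illustration, not an example from Pachyderm or this book's code) shows how fixing a random seed makes a toy experiment repeat with an identical result:

```python
import random

def run_experiment(seed):
    # A deterministic RNG: the same seed always yields the same draws
    rng = random.Random(seed)
    data = [rng.gauss(0, 1) for _ in range(1000)]
    # The toy "result" of the experiment: the sample mean
    return sum(data) / len(data)

# Two runs with the same seed match bit for bit -- the experiment is reproducible
assert run_experiment(42) == run_experiment(42)
```

In real projects, the seed is only one ingredient; pinned library versions, recorded parameters, and versioned data (Pachyderm's focus) are the others.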
It's not a secret that data science has become one of the hottest topics in the last 10 years. Many big tech companies have opened numerous high-paying data scientist, data engineer, and data analyst positions. With that, the demand to join the profession has been rising exponentially. According to the AI Index 2019 Annual Report published by the Stanford Institute for Human-Centered Artificial Intelligence (HAI), the number of AI papers has grown threefold in the last 20 years. You can read more about this report on the Stanford University HAI website: https://hai.stanford.edu/blog/introducing-ai-index-2019-report.
Figure 1.1 – AI publications trend, from the AI Index 2019 Annual Report (p. 5)
Almost every learning platform and university now offers a data science or AI program, and these programs never lack students. Thousands of people of all backgrounds, from software developers to CEOs, take ML classes to keep up with the rapidly growing industry.
The number of AI conferences has been steadily growing as well. Even in the pandemic world, where in-person events have become impossible, the AI community has continued to meet in a virtual format. Such flagship conferences as Neural Information Processing Systems (NeurIPS) and International Conference on Machine Learning (ICML), which typically attract more than 10,000 visitors, took place online with significant attendance.
According to some predictions, the AI market size will increase to more than $350 billion by 2025. The market grew from $12 billion to $58 billion from 2020 to 2021 alone. The Silicon Valley tech giants are fiercely battling to achieve dominance in the space, while smaller players emerge to get their share of the market. The number of AI start-ups worldwide is steadily growing, with billions being invested in them each year.
The following graph shows the growth of AI-related start-ups in recent years:
Figure 1.2 – Total private investment in AI-related start-ups worldwide, from the AI Index 2019 Annual Report (p. 88)
The total private investment in AI start-ups grew by more than 30 times in the last 10 years.
And another interesting metric from the same source is the number of AI patents published between 2015 and 2018:
Figure 1.3 – Total number of AI patents (2015-2018), from the AI Index 2019 Annual Report (p. 32)
The United States is leading in the number of published patents among other countries.
These trends boost the economy and industry but inevitably affect the quality of submitted AI papers, processes, practices, and experiments. That's why a proper process is needed to ensure the validation of data science models. The replication of experiments is an important part of a data science model's quality control.
Next, let's learn what a model is.
A data science or AI model is a simplified representation of a process that also suggests possible results. Whether it is a weather-prediction algorithm or a website attendance calculator, a model provides the most probable outcome and helps us make informed decisions. When a data scientist creates a model, they need to decide which critical parameters to include in that model, because they cannot include everything. Therefore, a model is a simplified version of a process, and that is where sacrifices are made based on the data scientist's or organization's definition of success.
The following diagram demonstrates a data model:
Figure 1.4 – Data science model
Every model needs a continuous data flow to improve and perform correctly. Consider the Amazon Go stores where shoppers' behavior is analyzed by multiple cameras inside the store. The models that ensure safety in the store are trained continuously on real-life customer behavior. These models had to learn that sometimes shoppers might pick up an item and then change their mind and put it back; sometimes shoppers can drop an item on the floor, damaging the product, and so on. The Amazon Go store model is likely good because it has access to a lot of real data, and it improves over time. However, not all models have access to real data, and that's when a synthetic dataset can be used.
A synthetic dataset is a dataset that was generated artificially by a computer. The problem with synthetic data is that it is only as good as the algorithm that generated it. Often, such data misrepresents the real world. In some cases, such as when users' privacy prevents data scientists from using real data, usage of a synthetic dataset is justified; in other cases, it can lead to negative results.
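To make this concrete, here is a hypothetical sketch of generating a synthetic dataset with Python's standard library; the function name and parameters are illustrative, not taken from any real study. The key point is that the "world" the data describes is entirely defined by the generator's assumptions:

```python
import random

def make_synthetic_heights(n, mean=170.0, stddev=8.0, seed=0):
    """Generate n synthetic adult heights (in cm) from a normal distribution.

    The dataset is only as good as these assumptions: real heights are not
    perfectly normal and vary across populations, so the generator encodes
    the author's beliefs about the world, not the world itself.
    """
    rng = random.Random(seed)
    return [rng.gauss(mean, stddev) for _ in range(n)]

sample = make_synthetic_heights(10_000)
# The sample looks plausible, but any model trained on it learns
# the generator's assumptions, not reality
print(len(sample))  # 10000
```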
IBM's Watson was an ambitious project that promised to revolutionize healthcare by diagnosing patients, based on a provided list of symptoms, in a matter of seconds. This invention could greatly speed up the diagnosis process, and in some places on this planet where people have no access to healthcare, a system like that could save many lives. Unfortunately, despite the original promise to replace doctors, Watson turned out to be a recommendation system that can assist in diagnosing, but nothing more than that. One of the reasons is that Watson was trained on a synthetic dataset and not on real patient data.
There are cases when detecting issues in a trained model can be especially difficult. Take the example of an image recognition algorithm developed by the University of Washington that was built to identify whether an image had a husky portrayed in it or a wolf. The model was seemingly working really well, predicting the correct result with almost 90% accuracy. However, when the scientists dug a bit deeper into the algorithm and data, they learned that the model was basing its predictions on the background. The majority of images with huskies had grass in the background, while the majority of images with wolves had snow in the background.
How can you ensure that a data science process in your company adheres to the principles of reproducibility? Here is a list of the main principles of reproducibility:
Use open data: The data that is used for training models should not be a black box. It has to be available to other data scientists in an unmodified state.
Train the model on many examples: The information about the experiments, and about how many examples the model was trained on, must be available for review.
Rigorously document the process: The process of data modifications, statistical failures, and experiment performance must be thoroughly documented so that the author and other data scientists can reproduce the experiment in the future.

Let's consider a few examples where reproducibility, collaboration, and open data principles were not part of the experiment process.
A few years ago, a group of scientists at Duke University became wildly popular when they emerged with an ambitious claim of predicting the course of lung cancer based on data collected from patients. The medical community was very excited about the prospect of such a discovery. However, another group of scientists, at the MD Anderson Cancer Center in Houston, found severe errors in that research when they tried to reproduce the original result. They discovered mislabeling in the chemotherapy prediction model, mismatches between genes and gene-expression data, and other issues that would make correct treatment prescriptions based on the model's calculations significantly less likely. While the flaws were eventually unraveled, it took almost 3 years and more than 2,000 working hours for the researchers to get to the bottom of the problem, which could have easily been avoided had proper research practices been established in the first place.
Now let's look at how AI can go wrong based on a chatbot example. You might remember the infamous Microsoft chatbot called Tay. Tay was a robot who could learn from his conversations with internet users. When Tay went live, his first conversations were friendly, but overnight his language changed, and he started to post harmful, racist, and overall inappropriate responses. He learned from the users who taught him to be rude, and as the bot was designed to mirror human behavior, he did what he was created for. Why was he not racist from the very beginning, you might ask? The answer is that he was trained on clean, cherry-picked data that did not include vulgar and abusive language. But we cannot control the web and what people post, and the bot did not have any sense of morals programmed into it. This experiment raised many questions about AI ethics and how we can ensure that the AI that we build does not turn on us one day.
The new generation of chatbots is built on the recently released GPT-3 model. These chatbots are trained with neural networks that, during training, form associations that cannot be broken. Although they use seemingly more advanced technology than their predecessors, these chatbots might still easily become racist and hateful depending on the data they are trained on. If a bot is trained on misogynistic and hateful conversations, it will be offensive and will likely reply inappropriately.
As you can see, data science, AI, and machine learning are powerful technologies that help us solve many difficult problems, but at the same time, they can endanger their users and have devastating consequences. The data science community needs to work on devising better ways of minimizing adverse outcomes by establishing proper standards and processes to ensure the quality of data science experiments and AI software.
Now that we've seen why reproducibility is so important, let's look at what consequences it has on the scientific community and data science.
The reproducibility crisis is a problem that has been around for more than a decade. Because data science is a discipline close to science, it is important to review the issues many scientists have outlined in the past and correlate them with the similar problems that the data science space is facing today.
One of the most important issues is replication—the ability to reproduce the results of a scientific experiment has been one of the founding principles of good research. In other words, if an experiment can be reproduced, it is valid; if not, it could be a one-time occurrence that does not represent a real phenomenon. Unfortunately, in recent years, more and more research papers in sociology, medicine, biology, and other areas of science cannot withstand retesting against an increased number of samples, even if these papers were published in well-known and trustworthy scientific journals, such as Nature. This tendency could lead to public mistrust in science, and in AI as part of it.
As was mentioned previously, because of the popularity and growth of the AI industry, the number of AI papers has increased multiple times. Unfortunately, the quality of these papers does not grow with the number of papers itself.
Nature recently conducted a survey among scientists, asking them whether they feel that there is a reproducibility crisis in science. The majority of scientists agreed that there is, with the pressure to publish results frequently leading to false positives. Researchers need sponsorship, and sponsors need to see results to invest additional money in the research, which results in many published papers of declining credibility. Ultimately, the fight for grants and bureaucracy are often named as the main causes of the lack of a reproducibility process in labs.
The research papers that were questioned for integrity have the following common attributes:
No code or data was publicly shared for other researchers to attempt to replicate the results.
The scientists who attempted to replicate the results failed, completely or partially, to do so by following the provided instructions.

Even the papers published by Nobel laureates can sometimes be questioned due to an inability to reproduce the results. For example, in 2014, Science magazine retracted a paper published by Nobel Prize winner and immunologist Bruce Beutler. His paper was about the response to pathogens by virus-like organisms in the human genome. This paper was cited over 50 times before it was retracted.
When COVID-19 became a major topic of 2020, multiple papers were published on it. According to Retraction Watch, an online blog that tracks scientific papers that have been called off, more than 86 of them had been retracted as of March 2021.
In 2019, more than 1,400 science papers were retracted by multiple publishers. This number is huge and has been steadily growing, compared to only 50 papers in the early 2000s. This raises awareness of a so-called reproducibility crisis in science. While not every paper is retracted for that reason, oftentimes irreproducibility is the cause.
Data fishing, or data dredging, is a method of achieving a statistically significant result by running a computation multiple times until the desired result is achieved, and then reporting only those results while ignoring the inconvenient ones. Sometimes, scientists unintentionally dredge the data to achieve the result that they think is most probable and that confirms their hypothesis. A more sinister plan can take place too—a scientist might intentionally hack the result of the experiment to reach a predefined conclusion.
An example of such a misuse of data analysis would be if you decided to prove that there is a correlation between banana consumption and an increased IQ level in children aged 10 and older. This is a completely made-up example, but say you wanted to establish this connection. You would need to get information about the IQ level and banana consumption of a big enough sample of children – let's say 5,000.
Then, you would run tests such as the following: do kids who eat bananas and exercise have a higher IQ level than the ones who only exercise? Do kids who watch TV and eat bananas have a higher IQ level than the ones who do not? After conducting these tests enough times, you would most likely get some kind of correlation. However, this result would not be significant, and using the data dredging technique is considered extremely unethical by the scientific community. Similar problems are being seen in data science specifically.
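The statistical mechanism behind data dredging is easy to demonstrate in a few lines of Python. The sketch below (a hypothetical illustration, not taken from any real study) runs many comparisons on pure noise; with enough tests, some inevitably cross a "significance" threshold:

```python
import random

def dredge(n_tests=200, n_samples=50, threshold=2.0, seed=1):
    """Run many 'tests' on pure noise and count spurious 'significant' hits.

    Each test compares the means of two random groups drawn from the same
    distribution; |z| > threshold roughly mimics a p < 0.05 cutoff. With
    enough tests, noise alone produces 'discoveries' -- reporting only
    those is data dredging."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_tests):
        a = [rng.gauss(0, 1) for _ in range(n_samples)]
        b = [rng.gauss(0, 1) for _ in range(n_samples)]
        mean_diff = sum(a) / n_samples - sum(b) / n_samples
        se = (2 / n_samples) ** 0.5  # standard error of the difference (sigma = 1)
        if abs(mean_diff / se) > threshold:
            hits += 1
    return hits

# Both groups are identical noise, yet several tests typically "succeed"
print(dredge())
```

Reporting only the handful of "hits" and discarding the rest is exactly the practice described above; proper multiple-testing corrections and preregistration exist to prevent it.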
Without conducting a full investigation, detecting data dredging might be difficult. Possible factors to look for include the following:
Was the research conducted by a reputable institution or group of scientists?
What does other research in similar areas suggest?
Is financial interest involved?
Is the claim sensational?

Without a proper process, data dredging and unreliable research will continue to be published. Recently, Nature surveyed around 1,500 researchers from different areas of science, and more than 50% of respondents said that they had tried and failed to reproduce the results of research in the past. Even more shockingly, in many cases, they failed to reproduce the results of their own experiments.
Out of all respondents, only 24% were able to successfully publish their reproduction attempts and the majority were never contacted with a request to reproduce someone else's research.
Of course, increasing the reproducibility of experiments is costly and can double the time required to conduct an experiment, which many research laboratories might not be able to afford. But if reproducibility is accounted for in the originally planned time for the research, with a proper process, it should not be as difficult or burdensome as adding it midway through the research lifecycle.
Even worse, retracting a paper after it was published can be a tedious task. Some publishers even charge researchers a significant amount of money if a paper is retracted. Such practices are truly discouraging.
All of this negatively impacts research all over the world and results in growing mistrust in science. Organizations must take steps to improve processes in their scientific departments and scientific journals must raise the bar of publishing research.
Now that we have learned about data fishing, let's review better reproducibility guidelines.
The Center for Open Science (COS), a non-profit organization that focuses on supporting and promoting open-science initiatives, reproducibility, and the integrity of scientific research, has published the Guidelines for Transparency and Openness Promotion (TOP) in Journal Policies and Practices, or the TOP Guidelines. These guidelines emphasize the importance of transparency in published research papers. Researchers can use them to justify the necessity of sharing research artifacts publicly to avoid any possible inquiries regarding the integrity of their work.
The main principles of the TOP Guidelines include the following:
Proper citation and credit to original authors: All text, code, and data artifacts that belong to other authors must be outlined in the paper and credit given as needed.
Data, methodology, and research material transparency: The authors of the paper must share the written code, methodology, and research materials in a publicly accessible location with instructions on how to access and use them.
Design and analysis transparency: The authors should be as transparent about the methodology as possible, although this might vary by industry. At a minimum, they must disclose the standards that have been applied during the research.
Preregistration of the research and analysis plans: Even if research does not get published, preregistration makes it more discoverable.
Reproducibility of obtained results: The authors must include sufficient details on how to reproduce the original results.

The following levels are applied to all of these metrics:
Not implemented—information is not included in the report
Level 1—available upon request
Level 2—access before publication
Level 3—verification before publication

Level 3 is the highest level of transparency that a metric can achieve. Having this level of transparency justifies the quality of submitted research. COS applies the TOP factor to rate a journal's efforts to ensure transparency and, ultimately, the quality of the published research.
Apart from data and code reproducibility, the environment and software used during the research often play a big role. New technologies, such as containers and virtual and cloud environments, make it easier to achieve uniformity in conducted research. Of course, if we consider biochemistry or other fields that require more precise lab conditions, achieving uniformity might be even more complex.
Now let's learn about common practices that help improve reproducibility.
Thanks to the work of reproducibility advocates and the problem being widely discussed in scientific communities in recent years, some positive tendencies in increasing reproducibility seem to be emerging. These practices include the following:
Request a colleague to reproduce your work.
Develop extensive documentation.
Standardize the research methodology.
Preregister your research before publication to avoid data cherry-picking.

There are scientific groups that make it their mission to reproduce papers and notify researchers about mistakes in them. Their typical process is to try to reproduce the result of a paper and write a letter to the researchers or lab to request a correction or retraction. Some researchers willingly collaborate and correct the mistakes in their papers, but other cases are unclear and difficult. One such group identified the following problems in the 25 papers that they analyzed:
Lack of a process or point of contact regarding to whom feedback on a paper should be addressed. Scientific journals do not provide a clear statement on whether feedback can be addressed to the chief editor or whether there is a feedback submission form of some sort.
Scientific journal editors accept and act on submissions unwillingly. In some cases, it might take up to a year to publish a warning on a paper that has received critical feedback, even if it was provided by a reputable institution.
Some publishers expect you to pay if you want to publish a correction letter, and they delay retractions.
Raw data is not always publicly available. In many cases, publishers did not have a unified process around a shared location for the raw data used in the research. If you have to directly contact an author, you might not be able to get the requested information, and it might significantly delay the process. Moreover, they can simply deny such a request.

The lack of a standard for submitting corrections and retracting research papers contributes to the overall reproducibility crisis and hinders knowledge sharing. Papers that used data dredging and other techniques to manipulate results become a source of information for future researchers, contributing to the overall misinformation and chaos. Researchers, publishers, and editors should work together on establishing unified post-publication review guidelines that encourage other scientists to participate in testing and providing feedback.
We've learned how reproducibility affects the quality of research. Now, let's review how organizations can establish a process to ensure their data science experiments adhere to best industry practices to ensure high standards.
This section defines Machine Learning Operations (MLOps) and describes why it is crucial to establish a reliable MLOps process within your data science department.
In many organizations, data science departments have been created fairly recently, in the last few years. The profession of data scientist is fairly new as well. Therefore, many of these departments have to find a way to integrate into the existing corporate process and devise ways to ensure the reliability and scalability of data science deliverables.
In many cases, the burden of building a suitable infrastructure falls on the shoulders of the data scientists themselves, who are often not as familiar with the latest infrastructure trends. Another problem is how to make it all work for different languages, platforms, and environments. In the end, data scientists spend more time building the infrastructure than working on the model itself. This is where a new discipline has emerged to help bridge the gap between data science and infrastructure.
MLOps
