Rapidly build practical online machine learning solutions using River and other top key frameworks
Joos Korstanje
BIRMINGHAM—MUMBAI
Copyright © 2022 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Publishing Product Manager: Dinesh Chaudhary
Content Development Editor: Joseph Sunil
Technical Editor: Rahul Limbachiya
Copy Editor: Safis Editing
Project Coordinator: Farheen Fathima
Proofreader: Safis Editing
Indexer: Sejal Dsilva
Production Designer: Shankar Kalbhor
Marketing Coordinator: Shifa Ansari and Abeer Riyaz Dawe
First published: July 2022
Production reference: 1240622
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-80324-836-3
www.packt.com
Joos Korstanje holds master's degrees in both environmental sciences and data science and has been working in statistics and data science for almost 10 years. Through his work at companies including Disney and AXA, he has closely followed developments in data science and related fields. This experience in the business world has allowed him to write about data science from an applied point of view (through his books, Medium, Towards Data Science, LinkedIn, and more).
Olivia Petris is a big data engineer working as an IT consultant at a technology and advisory services firm based in Paris. On her professional journey, she is always looking for challenging and interesting assignments. Since earning her engineering diploma in computer science, she has chosen to work in the data science and big data field, and she continues to improve her skills and keep up to date with new IT and technology developments. In her free time, she enjoys traveling, practicing karate, and hanging out with her family and friends.
Streaming data is the new top technology to watch in the field of data science and machine learning. As business needs become more demanding, many use cases require real-time analysis as well as real-time machine learning. This book will get you up to speed with data analytics for streaming data, focusing strongly on adapting machine learning and other analytics to the case of streaming data.
You will first learn about the architecture for streaming and real-time machine learning. You will then look at the state-of-the-art frameworks for streaming data such as River.
You will learn about various industrial use cases for streaming data, such as online anomaly detection. You will then take a deep dive into the challenges of streaming data and how to mitigate them, and learn the best practices that will help you use streaming data to generate real-time insights.
Upon completion of the book, you will be confident about using streaming data in your machine learning models.
This book is for data scientists and machine learning engineers who have a background in machine learning, are practice- and technology-oriented, and want to learn how to apply machine learning to streaming data through practical examples with modern technologies. You will need to understand basic Python and machine learning concepts, but no prior knowledge of streaming is required.
Chapter 1, Introduction to Streaming Data, explains what streaming data is and why it is different from batch data. This chapter also explains the challenges that we should expect to encounter as well as the advantages of using streaming data.
Chapter 2, Architectures for Streaming and Real-Time Machine Learning, describes various architectures that can be used to set up streaming, and how they can be utilized.
Chapter 3, Data Analysis on Streaming Data, explores data analysis on streaming data, which includes real-time insights, real-time descriptive statistics, real-time visualizations, and basic alerting systems.
Chapter 4, Online Learning with River, covers the core concepts of online learning and introduces you to the River library, a fundamental tool for machine learning on streaming data in Python.
Chapter 5, Online Anomaly Detection, covers online anomaly detection, explains how it is useful, and also provides a use case that involves building a program for detecting anomalies in streaming data.
Chapter 6, Online Classification, covers online classification, explains how it is useful, and also provides a use case that involves building a program for classifying streaming data.
Chapter 7, Online Regression, covers online regression, explains how it is useful, and also provides a use case that involves building a program that performs regression on streaming data.
Chapter 8, Reinforcement Learning, introduces you to reinforcement learning. We will explore some of the key algorithms and also explore some use cases for it using Python.
Chapter 9, Drift and Drift Detection, focuses on understanding drift in online learning and on building solutions to detect it.
Chapter 10, Feature Transformation and Scaling, shows us how to build a feature transformation pipeline that works with real-time and streaming data.
Chapter 11, Catastrophic Forgetting, explores what catastrophic forgetting is, and shows us how we can deal with it using example use cases.
Chapter 12, Conclusion and Best Practices, acts as a review of the book and combines all the concepts explored throughout the book for us to revise and revisit as needed.
To follow along with this book, you can use online notebook environments such as Google Colab or Kaggle Notebooks, or your own local Jupyter Notebook environment with Python 3. A (free) AWS account is also needed for a small number of exercises.
If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book's GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.
You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Machine-Learning-for-Streaming-Data-with-Python. If there's an update to the code, it will be updated in the GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide a PDF file that has color images of the screenshots and diagrams used in this book. You can download it here: https://packt.link/6rZ0m.
There are a number of text conventions used throughout this book.
Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "There is no predict_many function here, so it is necessary to do a loop with predict_one repeatedly."
A block of code is set as follows:
def self_made_decision_tree(observation):
    if observation.can_speak:
        if not observation.has_feathers:
            return 'human'
    return 'not human'

for i, row in data.iterrows():
    print(self_made_decision_tree(row))

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
from sklearn.datasets import make_blobs

X, y = make_blobs(shuffle=True, centers=2, n_samples=2000)

Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: "Select System info from the Administration panel."
Tips or important notes
Appear like this.
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, email us at customercare@packtpub.com and mention the book title in the subject of your message.
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at copyright@packt.com with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Once you've read Machine Learning for Streaming Data with Python, we'd love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.
Your review is important to us and the tech community and will help us make sure we're delivering excellent quality content.
In this introductory part of the book, we will be introduced to the basic concepts and principles surrounding streaming data. We will explore the various architectures that can be used to implement streaming data for machine learning. Finally, we will learn how to do data analysis on streaming data, along with various other functions.
This section comprises the following chapters:
Chapter 1, An Introduction to Streaming Data
Chapter 2, Architectures for Streaming and Real-Time Machine Learning
Chapter 3, Data Analysis on Streaming Data

Streaming analytics is one of the new hot topics in data science. It proposes an alternative framework to the more standard batch processing: instead of dealing with datasets at fixed moments of treatment, we handle every individual data point directly upon reception.
This new paradigm has important consequences for data engineering, as it requires much more robust and, particularly, much faster data ingestion pipelines. It also imposes a big change in data analytics and machine learning.
Until recently, machine learning and data analytics methods and algorithms were mainly designed to work on entire datasets. Now that streaming has become a hot topic, it is becoming more and more common to see use cases in which entire datasets simply do not exist anymore. When a continuous stream of data is being ingested into a data store, there is no natural moment to relaunch an analytics batch job.
Streaming analytics and streaming machine learning models are designed to work specifically with streaming data sources. Part of the solution, for example, lies in the updating: streaming analytics and machine learning models need to update continuously as new data is received. When updating, you may also want to forget much older data.
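The update-and-forget idea can be sketched in a few lines of plain Python. The following is a minimal illustration (my own, not taken from the book or from River): an exponentially weighted running mean gives recent data points more weight, so older observations are gradually forgotten.

```python
class ForgetfulMean:
    """Running mean that down-weights older observations."""

    def __init__(self, alpha=0.1):
        self.alpha = alpha  # higher alpha forgets old data faster
        self.mean = None

    def update(self, x):
        # blend the new observation with the current estimate
        if self.mean is None:
            self.mean = float(x)
        else:
            self.mean = self.alpha * x + (1 - self.alpha) * self.mean
        return self.mean

# the data level shifts from 10 to 100 mid-stream
stream = [10, 10, 10, 100, 100, 100]
m = ForgetfulMean(alpha=0.5)
for x in stream:
    m.update(x)
print(round(m.mean, 2))  # → 88.75: the mean has moved quickly toward 100
```

With a plain (all-history) mean, the estimate after the same stream would be 55; the forgetting factor lets the model track the new level much faster.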
This and other problems that are introduced by moving from batch analytics to streaming analytics need a different approach to analytics and machine learning. This book will lay out the basis for getting you started with data analytics and machine learning on data that is received as a continuous stream.
In this first chapter, you'll get a more solid understanding of the differences between streaming and batch data. You'll see some example use cases that showcase the importance of working with streaming rather than converting back into batch. You'll also start working with a first Python example to get a feel for the type of work that you'll be doing throughout this book.
In later chapters, you'll see some more background notions on architecture and, then, you'll go into a number of data science and analytics use cases and how they can be adapted to the new streaming paradigm.
In this chapter, you will discover the following topics:
A short history of data science
Working with streaming data
Real-time data formats and importing an example dataset in Python

You can find all the code for this book on GitHub at the following link: https://github.com/PacktPublishing/Machine-Learning-for-Streaming-Data-with-Python. If you are not yet familiar with Git and GitHub, the easiest way to download the notebooks and code samples is the following:
1. Go to the link of the repository.
2. Go to the green Code button.
3. Select Download ZIP:

Figure 1.1 – GitHub interface example
Once you have downloaded the ZIP file, unzip it in your local environment, and you will be able to access the code through your preferred Python editor.
To follow along with this book, you can download the code in the repository and execute it using your preferred Python editor.
If you are not yet familiar with Python environments, I would advise you to check out Anaconda (https://www.anaconda.com/products/individual), which comes with the Jupyter Notebook and JupyterLab, which are both great for executing notebooks. It also comes with Spyder and VSCode for editing scripts and programs.
If you have difficulty installing Python or the associated programs on your machine, you can check out Google Colab (https://colab.research.google.com/) or Kaggle Notebooks (https://www.kaggle.com/code), which both allow you to run Python code in online notebooks for free, without any setup to do.
Note
The code in the book will generally use Colab and Kaggle Notebooks with Python version 3.7.13; you can set up your own environment to mimic this.
Over the last few years, new technology domains have quickly taken hold in many parts of the world. Machine learning, artificial intelligence, and data science are new fields that have entered our daily lives, both personal and professional.
The topics that data scientists work on today are not new. The absolute foundation of the field is in mathematics and statistics, two fields that have existed for centuries. As an example, least squares regression was first published in 1805. With time, mathematicians and statisticians have continued working on finding other methods and models.
In the following timeline, you can see how the recent boom in technology has been able to take place. In the 1600s and 1700s, very smart people were already laying the foundations for what we still do in statistics and mathematics today. However, it was not until the invention and popularization of computing power that the field began to boom.
Figure 1.2 – A timeline of the history of data
The accessibility of personal computers and the internet is an important reason for data science's popularity today. Almost everyone has a computer that is powerful enough for fairly complex machine learning. This strongly helps computer literacy, and the accessibility of online documentation is a big booster for learning.
The availability of big data tools such as Hadoop and Spark is also an important part of the popularization of data science, as they allow practitioners to work with datasets that are larger than anyone could ever imagine before.
Lastly, cloud computing is allowing data scientists from all over the world to access very powerful hardware at low prices. Especially for big data tools, the hardware needed is still priced in a way that most students would not be able to buy it for training purposes. Cloud computing gives access to those use cases for many.
In this book, you will learn how to work with streaming data. It is important to keep this short history of data science in mind, as streaming data is one of those technologies that has been held back by difficult hardware and setup requirements. Streaming data is currently gaining popularity quickly in many domains and has the potential to be a big hit in the coming years. Let's now have a deeper look into the definition of streaming data.
Streaming data is, quite simply, data that is streamed: it arrives continuously, record by record, rather than as a fixed dataset. You may know the term streaming from online video services, which stream video to you: the service continues sending the next parts of the video while you are already watching the first part.
The concept is the same when working with streaming data. The data format is not necessarily video and can be any data type that is useful for your use case. One of the most intuitive examples is that of an industrial production line, in which you have continuous measurements from sensors. As long as your production line doesn't pause, you will continue to generate measurements. We will check out the following overview of the data streaming process:
Figure 1.3 – The data streaming process
The important notion is that you have a continuous flow of data that you need to treat in real time. You cannot wait until the production line stops to do your analysis, as you would need to detect potential problems right away.
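To make this concrete, here is a minimal Python sketch (my own illustration, not from the book) of treating a continuous flow of measurements one at a time. The sensor_stream generator and the THRESHOLD value are made up for the example; a real sensor feed would be an open connection rather than a seeded random generator.

```python
import random

def sensor_stream(n_readings=100, seed=42):
    """Stand-in for a live sensor feed: yields one measurement at a time."""
    rng = random.Random(seed)
    for _ in range(n_readings):
        yield rng.gauss(20.0, 1.0)  # e.g. a temperature reading in Celsius

THRESHOLD = 21.5  # made-up alerting threshold

alerts = []
for reading in sensor_stream():
    # each reading is handled immediately upon "arrival",
    # rather than being collected into a batch for later analysis
    if reading > THRESHOLD:
        alerts.append(reading)

print(f"{len(alerts)} readings exceeded the threshold")
```

In a real production line, the for loop would never terminate: it would keep consuming measurements for as long as the line runs, raising an alert the moment a problematic reading arrives.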
Streaming data is generally not among the first use cases that new data scientists tend to start with. The type of problem that is usually introduced first is batch use cases. Batch data is the opposite of streaming data, as it works in phases: you collect a bunch of data, and then you treat a bunch of data.
If you see streaming data as streaming a video online, you could see batch data as downloading the entire video first and then watching it when the downloading is finished. For analytical purposes, this would mean that you get the analysis of a bunch of data when the data generating process is finished rather than whenever a problem occurs.
For some use cases, this is not a problem. Yet, you can understand that streaming can deliver great added value in those use cases where fast analytics can have an impact. It also has added value in use cases where data is ingested in a streaming method, which is becoming more and more common. In practice, many use cases that would get added value through streaming are still solved with batch treatment, just because these methods are better known and more widespread.
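The contrast can be shown with a toy example (an illustration of the general idea, not code from the book): both approaches reach the same final answer, but the streaming version has an up-to-date answer after every single record.

```python
readings = [21.0, 19.5, 22.3, 20.1, 23.4]  # made-up sensor values

# Batch: wait until the full dataset is available, then compute once.
batch_mean = sum(readings) / len(readings)

# Streaming: maintain a running result that is valid after every record.
count, total = 0, 0.0
running_means = []
for x in readings:
    count += 1
    total += x
    running_means.append(total / count)  # an answer is available right now

# Both approaches agree once all data has arrived,
# but the streaming version had an answer all along.
assert abs(running_means[-1] - batch_mean) < 1e-9
```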
The following overview shows the batch treatment process:
Figure 1.4 – The batch process
Let's now look at some advantages of using streaming analytics rather than other approaches in the following subsections.
The first advantage of building streaming data analytics rather than batch systems is that many data-generating processes actually run in real time. You will discover a number of use cases later, but in general, it is rare that data collection is done in batches.
Although most of us are used to building batch systems around real-time data generating systems, it often makes more sense to build streaming analytics directly.
Of course, batch analytics and streaming analytics can co-exist. Yet, adding a batch treatment to a streaming analytics service is often much easier than adding streaming functionality into a system that is designed for batches. It simply makes the most sense to start with streaming.
When designing data science solutions, streaming does not always come to mind first. However, when solutions or tools are built in real time, the real-time functionality is almost always appreciated.
Many analytical solutions of today are built in real time and the tools are available. In many problems, real-time information will be used at some point. Maybe it will not be used from the start, but the day that anomalies happen, you will find a great competitive advantage in having the analytics straight away, rather than waiting till the next hour or the next morning.
Let's talk about some examples of companies that have implemented real-time analytics successfully. The first example is Shell, which has implemented real-time analytics of the security camera feeds at its gas stations. An automated, real-time machine learning pipeline is able to detect whether people are smoking.
Another example is the use of sensor data in connected sports equipment. By measuring your heart rate and other KPIs in real time, such equipment is able to alert you when anything is wrong with your body.
Of course, the big players such as Facebook and Twitter also analyze a lot of data in real time, for example, when detecting fake news or bad content. There are many successful use cases of streaming analytics, yet at the same time, there are some common challenges that streaming data brings with it. Let's have a look at them now.
Streaming data analytics are currently less widespread than batch data analytics. Although this is slowly changing, it is good to understand where the challenges are when working with streaming data.
One simple reason for streaming analytics being less widespread is a question of knowledge and know-how. Setting up streaming analytics is often not taught in schools and is definitely not taught as the go-to method. There are also fewer resources available on the internet to get started with it. As there are many more resources on machine learning and analytics for batch treatment, and the batch methods do not apply to streaming data, people tend to start with batch applications for data science.
A second difficulty when working on streaming data is architecture. Although some data science practitioners have knowledge of architecture, data engineering, and DevOps, this is not always the case. To set up a streaming analytics proof of concept or a minimum viable product (MVP), all those skills are needed. For batch treatment, it is often enough to work with scripts.
Architectural difficulties are inherent to streaming, as it is necessary to work with real-time processes that send individually collected records to an analytical treatment process that will update in real time. If there is no architecture that can handle this, it does not make much sense to start with streaming analytics.
Another challenge when working with streaming data is the financial aspect. Although working with streaming is not necessarily more expensive in the long run, it can be more expensive to set up the infrastructure needed to get started. Working on a local developer PC for an MVP is unlikely to succeed as the data needs to be treated in real time.
Real-time processes also have a larger risk of runtime problems. When building software, bugs and failures happen. If you are on a daily batch process, you may be able to repair the process, rerun the failed batch, and solve the problem.
If a streaming tool is down, there are risks of losing data. As the data should be ingested in real time, the data that is generated during an outage of your process may not be recoverable. If your process is very important, you will need to set up extensive monitoring day and night and have more quality checks before pushing your solutions to production. Of course, this is also important in batch processes, but even more so in streaming.
The last challenge of streaming analytics is that the common methods are generally developed for batch data first. There are currently many solutions out there for analytics on real time and streaming data, but still not as many as for batch data.
Also, since streaming analysis has to be done very quickly to respect real-time delivery, streaming use cases tend to rely on less sophisticated analytical methodologies and often stay at the level of basic descriptive analyses.
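Basic does not have to mean slow or memory-hungry, though. As a sketch of what real-time descriptive statistics can look like (my own illustration using Welford's online algorithm, not code from the book), the following class updates a mean and standard deviation in constant time per record, without storing the stream:

```python
import math

class RunningStats:
    """Descriptive statistics updated in O(1) per record (Welford's
    online algorithm), so the full stream never has to be stored."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def std(self):
        # population standard deviation of everything seen so far
        return math.sqrt(self._m2 / self.n) if self.n > 1 else 0.0

stats = RunningStats()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(x)  # one cheap update per arriving record
print(stats.mean, stats.std)
```

Each update touches only three numbers, so the same object can keep summarizing a stream for days without growing in memory.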
For companies to get started with streaming data, the first step