Rapidly build practical online machine learning solutions using River and other top key frameworks
Joos Korstanje
BIRMINGHAM—MUMBAI
Copyright © 2022 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Publishing Product Manager: Dinesh Chaudhary
Content Development Editor: Joseph Sunil
Technical Editor: Rahul Limbachiya
Copy Editor: Safis Editing
Project Coordinator: Farheen Fathima
Proofreader: Safis Editing
Indexer: Sejal Dsilva
Production Designer: Shankar Kalbhor
Marketing Coordinator: Shifa Ansari and Abeer Riyaz Dawe
First published: July 2022
Production reference: 1240622
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-80324-836-3
www.packt.com
Joos Korstanje holds master's degrees in both environmental sciences and data science and has been working in statistics and data science for almost 10 years. Through his work at companies including Disney and AXA, he has closely followed developments in data science and related fields. This experience in the business world has allowed him to write about data science from an applied point of view (through his books, Medium, Towards Data Science, LinkedIn, and more).
Olivia Petris is a big data engineer working as an IT consultant at a technology and advisory services firm based in Paris. On her professional journey, she is always looking for challenging and interesting assignments. Since earning her engineering diploma in computer science, she has chosen to work in the data science and big data field, and she continues to improve her skills and keep up to date with new IT and technology developments. In her free time, she enjoys traveling, practicing karate, and hanging out with her family and friends.
Streaming data is the new top technology to watch in the field of data science and machine learning. As business needs become more demanding, many use cases require real-time analysis as well as real-time machine learning. This book will get you up to speed with data analytics for streaming data, focusing strongly on adapting machine learning and other analytics to the case of streaming data.
You will first learn about the architecture for streaming and real-time machine learning. You will then look at the state-of-the-art frameworks for streaming data such as River.
You will learn about various industrial use cases for streaming data, such as online anomaly detection. You will then take a deep dive into the challenges of streaming data and how to mitigate them, and learn the best practices that will help you use streaming data to generate real-time insights.
Upon completion of the book, you will be confident about using streaming data in your machine learning models.
This book is for data scientists and machine learning engineers who have a background in machine learning, are practice- and technology-oriented, and want to learn how to apply machine learning to streaming data through practical examples with modern technologies. You will need to understand basic Python and machine learning concepts, but no prior knowledge of streaming is required.
Chapter 1, Introduction to Streaming Data, explains what streaming data is and why it is different from batch data. This chapter also explains the challenges that we should expect to encounter as well as the advantages of using streaming data.
Chapter 2, Architectures for Streaming and Real-Time Machine Learning, describes various architectures that can be used to set up streaming, and how they can be utilized.
Chapter 3, Data Analysis on Streaming Data, explores data analysis on streaming data, which includes real-time insights, real-time descriptive statistics, real-time visualizations, and basic alerting systems.
Chapter 4, Online Learning with River, covers the core concepts of online learning and introduces you to the River library, a fundamental tool for machine learning on streaming data in Python.
Chapter 5, Online Anomaly Detection, covers online anomaly detection, explains how it is useful, and also provides a use case that involves building a program for detecting anomalies in streaming data.
Chapter 6, Online Classification, covers online classification, explains how it is useful, and also provides a use case that involves building a program for classifying streaming data.
Chapter 7, Online Regression, covers online regression, explains how it is useful, and also provides a use case that involves building a program that performs regression on streaming data.
Chapter 8, Reinforcement Learning, introduces you to reinforcement learning. We will explore some of the key algorithms and also explore some use cases for it using Python.
Chapter 9, Drift and Drift Detection, focuses on understanding drift in online learning and on building solutions to detect it.
Chapter 10, Feature Transformation and Scaling, shows us how to build a feature transformation pipeline that works with real-time and streaming data.
Chapter 11, Catastrophic Forgetting, explores what catastrophic forgetting is, and shows us how we can deal with it using example use cases.
Chapter 12, Conclusion and Best Practices, acts as a review of the book and combines all the concepts explored throughout the book for us to revise and revisit as needed.
To follow along with this book, you can use online notebook environments such as Google Colab or Kaggle Notebooks, or your own local Jupyter Notebook environment with Python 3. A (free) AWS account is also needed for a small number of exercises.
If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book's GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.
You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Machine-Learning-for-Streaming-Data-with-Python. If there's an update to the code, it will be updated in the GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide a PDF file that has color images of the screenshots and diagrams used in this book. You can download it here: https://packt.link/6rZ0m.
There are a number of text conventions used throughout this book.
Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "There is no predict_many function here, so it is necessary to do a loop with predict_one repeatedly."
A block of code is set as follows:
def self_made_decision_tree(observation):
    if observation.can_speak:
        if not observation.has_feathers:
            return 'human'
    return 'not human'

for i, row in data.iterrows():
    print(self_made_decision_tree(row))

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
from sklearn.datasets import make_blobs

X, y = make_blobs(shuffle=True, centers=2, n_samples=2000)

Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: "Select System info from the Administration panel."
Tips or important notes
Appear like this.
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, email us at customercare@packtpub.com and mention the book title in the subject of your message.
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at copyright@packt.com with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Once you've read Machine Learning for Streaming Data with Python, we'd love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.
Your review is important to us and the tech community and will help us make sure we're delivering excellent quality content.
In this introductory part of the book, we will be introduced to the basic concepts and principles surrounding streaming data. We will explore the various architectures that can be used to implement streaming data for machine learning. Finally, we will learn how to do data analysis on streaming data, along with various other functions.
This section comprises the following chapters:
Chapter 1, An Introduction to Streaming Data
Chapter 2, Architectures for Streaming and Real-Time Machine Learning
Chapter 3, Data Analysis on Streaming Data

Streaming analytics is one of the new hot topics in data science. It proposes an alternative framework to the more standard batch processing: instead of dealing with datasets at fixed moments of treatment, we handle every individual data point directly upon reception.
This new paradigm has important consequences for data engineering, as it requires much more robust and, particularly, much faster data ingestion pipelines. It also imposes a big change in data analytics and machine learning.
Until recently, machine learning and data analytics methods and algorithms were mainly designed to work on entire datasets. Now that streaming has become a hot topic, it is becoming more and more common to see use cases in which entire datasets simply do not exist anymore. When a continuous stream of data is being ingested into a data store, there is no natural moment to relaunch an analytics batch job.
Streaming analytics and streaming machine learning models are designed to work specifically with streaming data sources. Part of the solution, for example, lies in the updating: streaming analytics and machine learning models need to update continuously as new data is received. When updating, you may also want to forget much older data.
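The update-and-forget idea can be sketched in a few lines of plain Python. The following is a minimal illustration (my own, not taken from the book or from River): an exponentially weighted running mean gives recent data points more weight, so older observations are gradually forgotten.

```python
class ForgetfulMean:
    """Running mean that down-weights older observations."""

    def __init__(self, alpha=0.1):
        self.alpha = alpha  # higher alpha forgets old data faster
        self.mean = None

    def update(self, x):
        # blend the new observation with the current estimate
        if self.mean is None:
            self.mean = float(x)
        else:
            self.mean = self.alpha * x + (1 - self.alpha) * self.mean
        return self.mean

# the data level shifts from 10 to 100 mid-stream
stream = [10, 10, 10, 100, 100, 100]
m = ForgetfulMean(alpha=0.5)
for x in stream:
    m.update(x)
print(round(m.mean, 2))  # → 88.75: the mean has moved quickly toward 100
```

With a plain (all-history) mean, the estimate after the same stream would be 55; the forgetting factor lets the model track the new level much faster.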
This and other problems that are introduced by moving from batch analytics to streaming analytics need a different approach to analytics and machine learning. This book will lay out the basis for getting you started with data analytics and machine learning on data that is received as a continuous stream.
In this first chapter, you'll get a more solid understanding of the differences between streaming and batch data. You'll see some example use cases that showcase the importance of working with streaming rather than converting back into batch. You'll also start working with a first Python example to get a feel for the type of work that you'll be doing throughout this book.
In later chapters, you'll see some more background notions on architecture and, then, you'll go into a number of data science and analytics use cases and how they can be adapted to the new streaming paradigm.
In this chapter, you will discover the following topics:
A short history of data science
Working with streaming data
Real-time data formats and importing an example dataset in Python

You can find all the code for this book on GitHub at the following link: https://github.com/PacktPublishing/Machine-Learning-for-Streaming-Data-with-Python. If you are not yet familiar with Git and GitHub, the easiest way to download the notebooks and code samples is the following:
1. Go to the link of the repository.
2. Go to the green Code button.
3. Select Download ZIP:

Figure 1.1 – GitHub interface example
Once you have downloaded the ZIP file, unzip it in your local environment, and you will be able to access the code through your preferred Python editor.
To follow along with this book, you can download the code in the repository and execute it using your preferred Python editor.
If you are not yet familiar with Python environments, I would advise you to check out Anaconda (https://www.anaconda.com/products/individual), which comes with the Jupyter Notebook and JupyterLab, which are both great for executing notebooks. It also comes with Spyder and VSCode for editing scripts and programs.
If you have difficulty installing Python or the associated programs on your machine, you can check out Google Colab (https://colab.research.google.com/) or Kaggle Notebooks (https://www.kaggle.com/code), which both allow you to run Python code in online notebooks for free, without any setup to do.
Note
The code in the book will generally use Colab and Kaggle Notebooks with Python version 3.7.13; you can set up your own environment to mimic this.
Over the last few years, new technology domains have quickly taken hold in many parts of the world. Machine learning, artificial intelligence, and data science are new fields that have entered our daily lives, both personal and professional.
The topics that data scientists work on today are not new. The absolute foundation of the field is in mathematics and statistics, two fields that have existed for centuries. As an example, least squares regression was first published in 1805. With time, mathematicians and statisticians have continued working on finding other methods and models.
In the following timeline, you can see how the recent boom in technology has been able to take place. In the 1600s and 1700s, very smart people were already laying the foundations for what we still do in statistics and mathematics today. However, it was not until the invention and popularization of computing power that the field began to boom.
Figure 1.2 – A timeline of the history of data
The accessibility of personal computers and the internet is an important reason for data science's popularity today. Almost everyone has a computer that is powerful enough for fairly complex machine learning. This strongly helps computer literacy, and the accessibility of online documentation is a big booster for learning.
The availability of big data tools such as Hadoop and Spark is also an important part of the popularization of data science, as they allow practitioners to work with datasets that are larger than anyone could ever imagine before.
Lastly, cloud computing is allowing data scientists from all over the world to access very powerful hardware at low prices. Especially for big data tools, the hardware needed is still priced in a way that most students would not be able to buy it for training purposes. Cloud computing gives access to those use cases for many.
In this book, you will learn how to work with streaming data. It is important to keep this short history of data science in mind, as streaming data is one of those technologies that has been held back by difficult hardware and setup requirements. Streaming data is currently gaining popularity quickly in many domains and has the potential to be a big hit in the coming years. Let's now have a deeper look into the definition of streaming data.
Streaming data is, quite simply, data that is streamed: it arrives continuously, record by record, rather than as a fixed dataset. You may know the term streaming from online video services, which stream video to you: the service continues sending the next parts of the video while you are already watching the first part.
The concept is the same when working with streaming data. The data format is not necessarily video and can be any data type that is useful for your use case. One of the most intuitive examples is that of an industrial production line, in which you have continuous measurements from sensors. As long as your production line doesn't pause, you will continue to generate measurements. We will check out the following overview of the data streaming process:
Figure 1.3 – The data streaming process
The important notion is that you have a continuous flow of data that you need to treat in real time. You cannot wait until the production line stops to do your analysis, as you would need to detect potential problems right away.
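To make this concrete, here is a minimal Python sketch (my own illustration, not from the book) of treating a continuous flow of measurements one at a time. The sensor_stream generator and the THRESHOLD value are made up for the example; a real sensor feed would be an open connection rather than a seeded random generator.

```python
import random

def sensor_stream(n_readings=100, seed=42):
    """Stand-in for a live sensor feed: yields one measurement at a time."""
    rng = random.Random(seed)
    for _ in range(n_readings):
        yield rng.gauss(20.0, 1.0)  # e.g. a temperature reading in Celsius

THRESHOLD = 21.5  # made-up alerting threshold

alerts = []
for reading in sensor_stream():
    # each reading is handled immediately upon "arrival",
    # rather than being collected into a batch for later analysis
    if reading > THRESHOLD:
        alerts.append(reading)

print(f"{len(alerts)} readings exceeded the threshold")
```

In a real production line, the for loop would never terminate: it would keep consuming measurements for as long as the line runs, raising an alert the moment a problematic reading arrives.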
Streaming data is generally not among the first use cases that new data scientists tend to start with. The type of problem that is usually introduced first is batch use cases. Batch data is the opposite of streaming data, as it works in phases: you collect a bunch of data, and then you treat a bunch of data.
If you see streaming data as streaming a video online, you could see batch data as downloading the entire video first and then watching it when the downloading is finished. For analytical purposes, this would mean that you get the analysis of a bunch of data when the data generating process is finished rather than whenever a problem occurs.
For some use cases, this is not a problem. Yet, you can understand that streaming can deliver great added value in those use cases where fast analytics can have an impact. It also has added value in use cases where data is ingested in a streaming method, which is becoming more and more common. In practice, many use cases that would get added value through streaming are still solved with batch treatment, just because these methods are better known and more widespread.
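The contrast can be shown with a toy example (an illustration of the general idea, not code from the book): both approaches reach the same final answer, but the streaming version has an up-to-date answer after every single record.

```python
readings = [21.0, 19.5, 22.3, 20.1, 23.4]  # made-up sensor values

# Batch: wait until the full dataset is available, then compute once.
batch_mean = sum(readings) / len(readings)

# Streaming: maintain a running result that is valid after every record.
count, total = 0, 0.0
running_means = []
for x in readings:
    count += 1
    total += x
    running_means.append(total / count)  # an answer is available right now

# Both approaches agree once all data has arrived,
# but the streaming version had an answer all along.
assert abs(running_means[-1] - batch_mean) < 1e-9
```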
The following overview shows the batch treatment process:
Figure 1.4 – The batch process
Let's now look at some advantages of using streaming analytics rather than other approaches in the following subsections.
The first advantage of building streaming data analytics rather than batch systems is that many data-generating processes actually run in real time. You will discover a number of use cases later, but in general, it is rare that data collection is done in batches.
Although most of us are used to building batch systems around real-time data generating systems, it often makes more sense to build streaming analytics directly.
Of course, batch analytics and streaming analytics can co-exist. Yet, adding a batch treatment to a streaming analytics service is often much easier than adding streaming functionality into a system that is designed for batches. It simply makes the most sense to start with streaming.
When designing data science solutions, streaming does not always come to mind first. However, when solutions or tools are built in real time, the real-time functionality is almost always appreciated.
Many analytical solutions of today are built in real time and the tools are available. In many problems, real-time information will be used at some point. Maybe it will not be used from the start, but the day that anomalies happen, you will find a great competitive advantage in having the analytics straight away, rather than waiting till the next hour or the next morning.
Let's talk about some examples of companies that have implemented real-time analytics successfully. The first example is Shell, which has implemented real-time analytics of the security camera feeds at its gas stations. An automated, real-time machine learning pipeline is able to detect whether people are smoking.
Another example is the use of sensor data in connected sports equipment. By measuring your heart rate and other KPIs in real time, such equipment is able to alert you when anything is wrong with your body.
Of course, the big players such as Facebook and Twitter also analyze a lot of data in real time, for example, when detecting fake news or bad content. There are many successful use cases of streaming analytics, yet at the same time, there are some common challenges that streaming data brings with it. Let's have a look at them now.
Streaming data analytics are currently less widespread than batch data analytics. Although this is slowly changing, it is good to understand where the challenges are when working with streaming data.
One simple reason for streaming analytics being less widespread is a question of knowledge and know-how. Setting up streaming analytics is often not taught in schools and is definitely not taught as the go-to method. There are also fewer resources available on the internet to get started with it. As there are many more resources on machine learning and analytics for batch treatment, and the batch methods do not apply to streaming data, people tend to start with batch applications for data science.
A second difficulty when working on streaming data is architecture. Although some data science practitioners have knowledge of architecture, data engineering, and DevOps, this is not always the case. To set up a streaming analytics proof of concept or a minimum viable product (MVP), all those skills are needed. For batch treatment, it is often enough to work with scripts.
Architectural difficulties are inherent to streaming, as it is necessary to work with real-time processes that send individually collected records to an analytical treatment process that will update in real time. If there is no architecture that can handle this, it does not make much sense to start with streaming analytics.
Another challenge when working with streaming data is the financial aspect. Although working with streaming is not necessarily more expensive in the long run, it can be more expensive to set up the infrastructure needed to get started. Working on a local developer PC for an MVP is unlikely to succeed as the data needs to be treated in real time.
Real-time processes also have a larger risk of runtime problems. When building software, bugs and failures happen. If you are on a daily batch process, you may be able to repair the process, rerun the failed batch, and solve the problem.
If a streaming tool is down, there are risks of losing data. As the data should be ingested in real time, the data that is generated during an outage of your process may not be recoverable. If your process is very important, you will need to set up extensive monitoring day and night and have more quality checks before pushing your solutions to production. Of course, this is also important in batch processes, but even more so in streaming.
The last challenge of streaming analytics is that the common methods are generally developed for batch data first. There are currently many solutions out there for analytics on real time and streaming data, but still not as many as for batch data.
Also, since streaming analysis has to be done very quickly to respect real-time delivery, streaming use cases tend to rely on less sophisticated analytical methodologies and often stay at the level of basic descriptive analyses.
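Basic does not have to mean slow or memory-hungry, though. As a sketch of what real-time descriptive statistics can look like (my own illustration using Welford's online algorithm, not code from the book), the following class updates a mean and standard deviation in constant time per record, without storing the stream:

```python
import math

class RunningStats:
    """Descriptive statistics updated in O(1) per record (Welford's
    online algorithm), so the full stream never has to be stored."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def std(self):
        # population standard deviation of everything seen so far
        return math.sqrt(self._m2 / self.n) if self.n > 1 else 0.0

stats = RunningStats()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(x)  # one cheap update per arriving record
print(stats.mean, stats.std)
```

Each update touches only three numbers, so the same object can keep summarizing a stream for days without growing in memory.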
For companies to get started with streaming data, the first step