As a business analyst or data scientist, you have to use many algorithms and approaches to prepare and process time series data and to build ML-based applications on top of it, while facing common problems such as not knowing which algorithm to choose or how to combine and interpret the results. Amazon Web Services (AWS) provides numerous services to help you build applications fueled by artificial intelligence (AI) capabilities. This book helps you get to grips with three AWS AI/ML-managed services to enable you to deliver your desired business outcomes.
The book begins with Amazon Forecast, where you’ll discover how to use time series forecasting, leveraging sophisticated statistical and machine learning algorithms to deliver business outcomes accurately. You’ll then learn to use Amazon Lookout for Equipment to build multivariate time series anomaly detection models geared toward industrial equipment and understand how it provides valuable insights to reinforce teams focused on predictive maintenance and predictive quality use cases. In the last chapters, you’ll explore Amazon Lookout for Metrics and see how to automatically detect and diagnose outliers in your business and operational data.
By the end of this AWS book, you’ll understand how to use these three AWS AI services effectively to perform time series analysis.
Learn how to build forecasting models and detect anomalies in your time series data
Michaël Hoarau
BIRMINGHAM—MUMBAI
Copyright © 2022 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Publishing Product Manager: Reshma Raman
Senior Editor: Roshan Kumar, David Sugarman
Content Development Editor: Tazeen Shaikh
Technical Editor: Sonam Pandey
Copy Editor: Safis Editing
Project Coordinator: Aparna Ravikumar Nair
Proofreader: Safis Editing
Indexer: Subalakshmi Govindhan
Production Designer: Jyoti Chauhan
Marketing Coordinator: Priyanka Mhatre
First published: February 2022
Production reference: 1270122
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-80181-684-7
www.packt.com
To my parents, Roseline and Philippe, for their life-long dedication to offering the best learning environment I could hope for. To my wife, Nathalie, for her love, support, and inspiration. To my son, as he starts his journey to be a great creative thinker.
– Michaël Hoarau
Michaël Hoarau is an AI/ML specialist solutions architect (SA) working at Amazon Web Services (AWS). He is an AWS Certified Associate SA. He previously worked as an AI/ML specialist SA at AWS and as the EMEA head of data science at GE Digital. He has experience in building product quality prediction systems for multiple industries. He has used forecasting techniques to build virtual sensors for industrial production lines. He has also helped multiple customers build forecasting and anomaly detection systems to increase their business efficiency.
I want to thank the people who have been close to me and supported me through the many evenings and weekends spent writing this book, especially my wife, Nathalie, and my son, Valentin.
Subramanian M Suryanarayanan (Mani) is head of category marketing and analytics at BigBasket – India's largest online supermarket. He has 23+ years of experience in consulting and analytics. He studied at the University of Madras, IIM-Ahmedabad, and MIT. Mani is a frequent contributor to industry and academic forums. He invests in, advises, and mentors start-ups in the DeepTech space. He is a mentor with the DeepTechClub of NASSCOM. Mani is the co-author of the best-selling book on BigBasket, Saying No to Jugaad, which was ranked among the top 5 best business books in India in 2020.
I want to thank my wife, Deepa, and my son, Nandan, who showed immense patience when I spent many evenings reviewing this book at the expense of spending time with them.
Christy Bergman is an AI/ML specialist solutions architect at AWS. Her work involves helping AWS customers be successful using AI/ML services to solve real-world business problems. Prior to joining AWS, Christy built real-time fraud detection at a banking start-up and prior to that was a data scientist in the subscription software industry. In her spare time, she enjoys hiking and bird watching.
This section will see you become familiar with time series data, the challenges it poses, and how it can be used to solve real-life business problems. You will learn about the different forms that time series datasets can take and the key use cases they help to address.
In this section, you will also learn about Amazon Forecast and how this AI service can help you tackle forecasting challenges. By the end of this section, you will have trained a forecast model and you will know how to generate new forecasts based on new data. You will also learn how to get deep insights from your model results, whether to improve your forecasting strategy or to better understand why your model made a given prediction.
This section comprises the following chapters:
Chapter 1, An Overview of Time Series Analysis
Chapter 2, An Overview of Amazon Forecast
Chapter 3, Creating a Project and Ingesting Your Data
Chapter 4, Training a Predictor with AutoML
Chapter 5, Customizing Your Predictor Training
Chapter 6, Generating New Forecasts
Chapter 7, Improving and Scaling Your Forecast Strategy

Time series analysis is a technical domain with a very large choice of techniques that need to be carefully selected depending on the business problem you want to solve and the nature of your time series. In this chapter, we will discover the different families of time series and expose unique challenges you may encounter when dealing with this type of data.
By the end of this chapter, you will know how to recognize what type of time series data you have and how to select the best approaches to perform your time series analysis (depending on the insights you want to uncover). You will also understand which use cases Amazon Forecast, Amazon Lookout for Equipment, and Amazon Lookout for Metrics can help you solve, and which ones they are not suited for.
You will also have a sound toolbox of time series techniques, Amazon Web Services (AWS) services, and open source Python packages that you can leverage in addition to the three managed services described in detail in this book. Data scientists will also find these tools to be great additions to their time series exploratory data analysis (EDA) toolbox.
In this chapter, we are going to cover the following main topics:
- What is a time series dataset?
- Recognizing the different families of time series
- Adding context to time series data
- Learning about common time series challenges
- Selecting an analysis approach
- Typical time series use cases

No hands-on experience in a language such as Python or R is necessary to follow along with the content of this chapter. However, for the more technology-savvy readers, we will address some technical considerations (such as visualization, transformation, and preprocessing) and the packages you can leverage in the Python language to address these. In addition, you can also open the Jupyter notebook you will find in the following GitHub repository to follow along and experiment by yourself: https://github.com/PacktPublishing/Time-series-Analysis-on-AWS/blob/main/Chapter01/chapter1-time-series-analysis-overview.ipynb.
Although following along with the aforementioned notebook is optional, doing so will help you build your intuition about the insights that the time series techniques described in this introductory chapter can deliver.
From a mathematical point of view, a time series is a discrete sequence of data points that are listed chronologically. Let's take each of the highlighted terms of this definition, as follows:
- Sequence: A time series is a sequence of data points. Each data point can take a single value (usually a binary, an integer, or a real value).
- Discrete: Although the phenomena measured by a system can be continuous (the outside temperature is a continuous variable, as time can, of course, range over the entire real number line), time series data is generated by systems that capture data points at a certain interval (a temperature sensor can, for instance, take a new temperature reading every 5 minutes). The measurement interval is usually regular (the data points will be equally spaced in time), but it is not uncommon for them to be irregularly spaced.
- Chronologically: The sequence of data points is naturally time-ordered. If you were to perform an analysis of your time series data, you would generally observe that data points that are close together in time are more closely related than observations that are further apart. This natural ordering means that any time series model usually tries to learn from past values of a given sequence rather than from future values (this is obvious when you try to forecast future values of a time series sequence but is still very important for other use cases).

Machine learning (ML) and artificial intelligence (AI) have largely been successful at leveraging new technologies, including dedicated processing units (such as graphics processing units, or GPUs), fast network channels, and immense storage capacities. Paradigm shifts such as cloud computing have played a big role in democratizing access to such technologies, allowing tremendous leaps in novel architectures that are immediately leveraged in production workloads.
In this landscape, time series appear to have benefited a lot less from these advances: what makes time series data so specific? We might consider a time series dataset to be a mere tabular dataset that would be time-indexed: this would be a big mistake. Introducing time as a variable in your dataset means that you are now dealing with the notion of temporality: patterns can now emerge based on the timestamp of each row of your dataset. Let's now have a look at the different families of time series you may encounter.
In this section, you will become familiar with the different families of time series. For any ML practitioner, it is obvious that a single image should not be processed like a video stream and that detecting an anomaly on an image requires a high enough resolution to capture the said anomaly. Multiple images from a certain subject (for example, pictures of a cauliflower) would not be very useful to teach an ML system anything about the visual characteristics of a pumpkin—or an aircraft, for that matter. As eyesight is one of our human senses, this may be obvious. However, we will see in this section and the following one (dedicated to challenges specific to time series) that the same kinds of differences apply to different time series.
There are four different families involved in time series data, which are outlined here:
- Univariate time series data
- Continuous multivariate data
- Event-based multivariate data
- Multiple time series data

A univariate time series is a sequence of single time-dependent values.
Such a series could be the energy output in kilowatt-hours (kWh) of a power plant, the closing price of a single stock, or the daily average temperature measured in Paris, France.
The following screenshot shows an excerpt of the energy consumption of a household:
Figure 1.1 – First rows and line plot of a univariate time series capturing the energy consumption of a household
A univariate time series can be discrete: for instance, you may be limited to the single daily value of stock market closing prices. In this situation, if you wanted to have a higher resolution (say, hourly data), you would end up with the same value duplicated 24 times per day.
Temperature seems to be closer to a continuous variable, for that matter: you can get a reading as frequently as you would wish, and you can expect some level of variation whenever you have a data point. You are, however, limited to the frequency at which the temperature sensor takes its reading (every 5 minutes for a home meteorological station or every hour in main meteorological stations). For practical purposes, most time series are indeed discrete, hence the definition called out earlier in this chapter.
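To build intuition on this point, here is a minimal sketch (using pandas and entirely made-up closing prices) showing that upsampling a daily series to an hourly resolution does not create any new information: the daily value is simply repeated for every hour.

```python
import pandas as pd

# Hypothetical daily closing prices (a discrete univariate time series)
daily_close = pd.Series(
    [101.2, 102.5, 101.9],
    index=pd.date_range("2021-06-01", periods=3, freq="D"),
    name="close",
)

# Upsampling to an hourly resolution and forward filling simply
# duplicates each daily value 24 times: no new information is created
hourly_close = daily_close.resample("H").ffill()
print(hourly_close.head())
```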
The three services described in this book (Amazon Forecast, Amazon Lookout for Equipment, and Amazon Lookout for Metrics) can deal with univariate data to address various use cases.
A multivariate time series dataset is a sequence of vector-valued data points: at each timestamp, several values are emitted at the same time. In this type of dataset, each variable can be considered individually or in the context shaped by the other variables as a whole. This happens when complex relationships govern the way these variables evolve with time (think about several engineering variables linked through physics-based equations).
An industrial asset such as an arc furnace (used in steel manufacturing) is running 24/7 and emits time series data captured by sensors during its entire lifetime. Understanding these continuous time series is critical to prevent any risk of unplanned downtime by performing the appropriate maintenance activities (a domain widely known under the umbrella term of predictive maintenance). Operators of such assets sometimes have to deal with thousands of time series generated at a high frequency (it is not uncommon to collect data with a 10-millisecond sampling rate), and each sensor is measuring a physical quantity. The key reason why each time series should not be considered individually is that each of these physical quantities is usually linked to all the others by more or less complex physical equations.
Take the example of a centrifugal pump: such a pump transforms rotational energy provided by a motor into the displacement of a fluid. While going through such a pump, the fluid gains both additional speed and pressure. According to Euler's pump equation, the head pressure created by the impeller of the centrifugal pump is derived using the following expression:

$$H = \frac{u_2^2 - u_1^2}{2g} + \frac{w_1^2 - w_2^2}{2g} + \frac{c_2^2 - c_1^2}{2g}$$
In the preceding expression, the following applies:
- H is the head pressure.
- u denotes the peripheral circumferential velocity vector.
- w denotes the relative velocity vector.
- c denotes the absolute velocity vector.
- Subscript 1 denotes the input variables (also called inlet variables for such a pump). For instance, w1 is the inlet relative velocity.
- Subscript 2 denotes the output variables (also called peripheral variables when dealing with this kind of asset). For instance, w2 is the peripheral relative velocity.
- g is the gravitational acceleration and is a constant value depending on the latitude where the pump is located.

A multivariate time series dataset describing this centrifugal pump could include u1, u2, w1, w2, c1, c2, and H. All these variables are obviously linked together by the laws of physics that govern this particular asset and cannot be considered individually as univariate time series.
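As a purely illustrative sketch (the velocity readings below are made up, not taken from a real pump), here is how the head pressure could be computed from sensor values at a given timestamp:

```python
# Euler's pump equation applied to made-up velocity readings (illustrative only)
g = 9.81             # gravitational acceleration (m/s^2)
u1, u2 = 5.0, 20.0   # inlet and peripheral circumferential velocities (m/s)
w1, w2 = 12.0, 6.0   # inlet and peripheral relative velocities (m/s)
c1, c2 = 4.0, 18.0   # inlet and peripheral absolute velocities (m/s)

# Theoretical head created by the impeller (in meters of fluid column)
H = (u2**2 - u1**2) / (2 * g) + (w1**2 - w2**2) / (2 * g) + (c2**2 - c1**2) / (2 * g)
print(f"Theoretical head: {H:.1f} m")
```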
If you know when this particular pump is in good shape, has had a maintenance operation, or is running through an abnormal event, you can also have an additional column in your time series capturing this state: your multivariate time series can then be seen as a related dataset that might be useful to try and predict the condition of your pump. You will find more details and examples about labels and related time series data in the Adding context to time series data section.
Amazon Lookout for Equipment, one of the three services described in this book, is able to perform anomaly detection on this type of multivariate dataset.
There are situations where data is continuously recorded across several operating modes: an aircraft going through different sequences of maneuvers from the pilot, a production line producing successive batches of different products, or rotating equipment (such as a motor or a fan) operating at different speeds depending on the need.
A multivariate time series dataset can be collected across multiple episodes or events, as follows:
- Each aircraft flight can log a time series dataset from hundreds of sensors and can be matched to a certain sequence of actions executed by the aircraft pilot. Of course, a given aircraft will go through several overlapping maintenance cycles, each flight is different, and the aircraft components themselves go through a natural aging process that can generate additional stress and behavior changes due to the fatigue of going through hundreds of successive flights.
- A beauty-care plant produces multiple distinct products on the same production line (multiple types and brands of shampoos and shower gels), separated by a clean-in-place process to ensure there is no contamination of a given product by a previous one. Each batch is associated with a different recipe, with different raw materials and different physical characteristics. Although the equipment and process time series are recorded continuously and can be seen as a single flow of variables indexed by time, they can be segmented by the batch they are associated with.

In some cases, a multivariate dataset must be associated with additional context to understand which operating mode a given segment of a time series can be associated with. If the number of situations to consider is reasonably low, a service such as Amazon Lookout for Equipment can be used to perform anomaly detection on this type of dataset.
You might also encounter situations where you have multiple time series data that does not form a multivariate time series dataset. These are situations where you have multiple independent signals that can each be seen as a single univariate time series. Although full independence might be debatable depending on your situation, there are no additional insights to be gained by considering potential relationships between the different univariate series.
Here are some examples of such a situation:
- Closing price for multiple stocks: For any given company, the traded stock can be influenced by both exogenous factors (for example, a worldwide pandemic pushing entire countries into shelter-in-place situations) and endogenous decisions (board of directors' decisions; a strategic move from leadership; major innovation delivered by a research and development (R&D) team). Each stock price is not necessarily impacted by other companies' stock prices (competitors, partners, organizations operating on the same market).
- Sold items for multiple products on an online retail store: Although some products might be related (more summer clothes when temperatures are rising again in spring), they do not influence each other even though they happen to exhibit similar behavior.

Multiple time series are hard to analyze and process as true multivariate time series data, as the mechanics that trigger seemingly linked behaviors most of the time come from external factors (summer approaching and having a similar effect on many summer-related items). Modern neural networks are, however, able to train global models on all items at once: this allows them to uncover relationships and context that are not provided in the dataset and to reach a higher level of accuracy than traditional statistical models, which are local (univariate-focused) by nature.
We will see later on that Amazon Forecast (for forecasting with local and global models) and Amazon Lookout for Metrics (for anomaly detection on univariate business metrics) are good examples of services provided by Amazon that can deal with this type of dataset.
Simply speaking, there are three main ways an ML model can learn something new, as outlined here:
- Supervised learning (SL): Models are trained using input data and labels (or targets). The labels are provided as an instructor would provide directions to a student learning a new move. Training a model to approximate the relationship between input data and labels is a supervised approach.
- Unsupervised learning (UL): This approach is used when using ML to uncover and extract underlying relationships that may exist in a given dataset. In this case, we only operate on the input data and do not need to provide any labels or output data. We can, however, use labels to assess how good a given unsupervised model is at capturing reality.
- Reinforcement learning (RL): To train a model with RL, we build an environment that is able to send feedback to an agent. We then let the agent operate within this environment (using a set of actions) and react based on the feedback provided by the environment in response to each action. We do not have a fixed training dataset anymore, but an environment that sends an input sample (feedback) in reaction to an action from the agent.

Whether you are dealing with univariate, multiple, or multivariate time series datasets, you might need to provide extra context: location, unique identification (ID) number of a batch, components from the recipes used for a given batch, sequence of actions performed by a pilot during an aircraft flight test, and so on. The same sequence of values for univariate and multivariate time series could lead to a different interpretation in different contexts (for example, are we cruising or taking off; are we producing a batch of shampoo or shower gel?).
All this additional context can be provided in the form of labels, related time series, or metadata that will be used differently depending on the type of ML you leverage. Let's have a look at what these pieces of context can look like.
Labels can be used in SL settings where ML models are trained using input data (our time series dataset) and output data (the labels). In a supervised approach, training a model is the process of learning an approximation between the input data and the labels. Let's review a few examples of labels you can encounter along with your time series datasets, as follows:
- The National Aeronautics and Space Administration (NASA) has provided the community with a very widely used benchmark dataset that contains the remaining useful lifetime of a turbofan measured in cycles: each engine (identified by unit_number in the following table) has its health measured with multiple sensors, and readings are provided after each flight (or cycle). The multivariate dataset recorded for each engine can be labeled with the remaining useful lifetime (rul) known or estimated at the end of each cycle (this is the last column in the following table). Here, each individual timestamp is characterized by a label (the remaining lifetime measured in cycles):

Figure 1.2 – NASA turbofan remaining useful lifetime

- The ECG200 dataset is another widely used time series dataset as a benchmark for time series classification. The electrical activity recorded during human heartbeats can be labeled as Normal or Ischemia (myocardial infarction), as illustrated in the following screenshot. Each time series as a whole is characterized by a label:

Figure 1.3 – Heartbeat activity for 100 patients (ECG200 dataset)

- Kaggle also offers a few time series datasets of interest. One of them contains sensor data from a water pump with known time ranges where the pump is broken and when it is being repaired. In the following case, labels are available as time ranges:

Figure 1.4 – Water pump sensor data showcasing healthy and broken time ranges
As you can see, labels can be used to characterize individual timestamps of a time series, portions of a time series, or even whole time series.
Related time series are additional variables that evolve in parallel to the time series that is the target of your analysis. Let's have a look at a few examples, as follows:
- In the case of a manufacturing plant producing different batches of product, a critical signal to have is the unique batch ID that can be matched with the starting and ending timestamps of the time series data.
- The electricity consumption of multiple households from London can be matched with several pieces of weather data (temperature, wind speed, rainfall), as illustrated in the following screenshot:

Figure 1.5 – London household energy consumption versus outside temperature in the same period
- In the water pump dataset, the different sensors' data could be considered as related time series data for the pump health variable, which can either take a value of 0 (healthy pump) or 1 (broken pump).

When your dataset is multivariate or includes multiple time series, each of these can be associated with parameters that do not depend on time. Let's have a look at this in more detail here:

- In the example of a manufacturing plant mentioned before, each batch of products could be different, and the metadata associated with each batch ID could be the recipe used to manufacture this very batch.
- For London household energy consumption, each time series is associated with a household that could be further associated with its house size, the number of people, its type (house or flat), the construction time, the address, and so on. The following screenshot lists some of the metadata associated with a few households from this dataset: we can see, for instance, that 27 households fall into the ACORN-A category that has a house with 2 beds:

Figure 1.6 – London household metadata excerpt
Now that you have seen how time series can be further described with additional context such as labels, related time series, and metadata, let's dive into the common challenges you can encounter when analyzing time series data.
Time series data is a very compact way to encode multi-scale behaviors of the measured phenomenon: this is the key reason why fundamentally unique approaches are necessary compared to tabular datasets, acoustic data, images, or videos. Multivariate datasets add another layer of complexity due to the underlying implicit relationships that can exist between multiple signals.
This section will highlight key challenges that ML practitioners must learn to tackle to successfully uncover insights hidden in time series data.
These challenges can include the following:
- Technical challenges
- Visualization challenges
- Behavioral challenges
- Missing insights and context

In addition to time series data, contextual information can also be stored as separate files (or tables from a database): this includes labels, related time series, or metadata about the items being measured. Related time series will have the same considerations as your main time series dataset, whereas labels and metadata will usually be stored as a single file or a database table. We will not focus on these items as they do not pose any challenges different from any usual tabular dataset.
When you discover a new time series dataset, the first thing you have to do before you can apply your favorite ML approach is to understand the type of processing you need to apply to it. This dataset can actually come in several files that you will have to assemble to get a complete overview, structured in one of the following ways:
- By time range: With one file for each month and every sensor included in each file. In the following screenshot, the first file will cover the range in green (April 2018) and contains all the data for every sensor (from sensor_00 to sensor_09), the second file will cover the range in red (May 2018), and the third file will cover the range in purple (June 2018):

Figure 1.7 – File structure by time range (example: one file per month)

- By variable: With one file per sensor for the complete time range, as illustrated in the following screenshot:

Figure 1.8 – File structure by variable (for example, one sensor per file)

- Or, you could use both the time range and the variables, as follows:

Figure 1.9 – File structure by variable and by time range (for example, one file for each month and each sensor)
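Whichever layout you encounter, pandas makes it reasonably easy to reassemble the pieces. The following sketch assumes hypothetical file names and a timestamp column; adapt it to your own dataset:

```python
import glob
import pandas as pd

# Files split by time range (one file per month, all sensors in each file):
# concatenate them along the row axis
monthly_files = sorted(glob.glob("data/pump-2018-*.csv"))   # hypothetical paths
df_by_range = pd.concat(
    pd.read_csv(f, index_col="timestamp", parse_dates=True) for f in monthly_files
)

# Files split by variable (one file per sensor, full time range in each file):
# concatenate them along the column axis, aligning on the timestamp index
sensor_files = sorted(glob.glob("data/sensor_*.csv"))        # hypothetical paths
df_by_variable = pd.concat(
    (pd.read_csv(f, index_col="timestamp", parse_dates=True) for f in sensor_files),
    axis=1,
)
```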
When you deal with multiple time series (either independent or multivariate), you might want to assemble them in a single table (or DataFrame if you are a Python pandas practitioner). When each time series is stored in a distinct file, you may suffer from misaligned timestamps, as in the following case:
Figure 1.10 – Multivariate time series with misaligned timestamps
There are three approaches you can take, and this will depend on the actual processing and learning process you will set up further down the road. You could do one of the following:
- Leave each time series in its own file with its own timestamps. This leaves your dataset untouched but will force you to consider a more flexible data structure when you want to feed it into an ML system.
- Resample every time series to a common sampling rate and concatenate the different files by inserting each time series as a column in your table. This will be easier to manipulate and process, but you won't be dealing with your raw data anymore. In addition, if your contextual data also provides timestamps (to separate each batch of a manufactured product, for instance), you will have to take them into account (one approach could be to slice your data to have one time series per batch and resample and assemble your dataset as a second step).
- Merge all the time series and forward fill every missing value created at the merge stage (see all the NaN values in the following screenshot). This process is more compute-intensive, especially when your timestamps are irregular:

Figure 1.11 – Merging time series with misaligned timestamps
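The following sketch illustrates the last two options with two hypothetical sensors (the timestamps and values are made up):

```python
import pandas as pd

# Two hypothetical sensors with misaligned timestamps
sensor_0 = pd.Series(
    [20.1, 20.4, 20.3],
    index=pd.to_datetime(
        ["2021-06-01 00:00:03", "2021-06-01 00:05:07", "2021-06-01 00:10:02"]
    ),
    name="sensor_00",
)
sensor_1 = pd.Series(
    [1.02, 1.05],
    index=pd.to_datetime(["2021-06-01 00:01:00", "2021-06-01 00:06:30"]),
    name="sensor_01",
)

# Option 2: resample both series to a common 5-minute sampling rate,
# then concatenate them as the columns of a single DataFrame
aligned = pd.concat(
    [sensor_0.resample("5min").mean(), sensor_1.resample("5min").mean()],
    axis=1,
)

# Option 3: merge on the raw timestamps and forward fill the NaN values
# created wherever a sensor has no reading at that exact timestamp
merged = pd.concat([sensor_0, sensor_1], axis=1).sort_index().ffill()
```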
The format used to store the time series data itself can vary and will have its own benefits or challenges. The following list exposes common formats and the Python libraries that can help tackle them:
- Comma-separated values (CSV): One of the most common and—unfortunately—least efficient formats to deal with when it comes to storing time series data. If you need to read or write time series data multiple times (for instance, during EDA), it is highly recommended to transform your CSV file into another, more efficient format. In Python, you can read and write CSV files with pandas (read_csv and to_csv) and NumPy (genfromtxt and savetxt).
- Microsoft Excel (XLSX): In Python, you can read and write Excel files with pandas (read_excel and to_excel) or dedicated libraries such as openpyxl or xlsxwriter. At the time of this writing, Microsoft Excel is limited to 1,048,576 rows (2^20) in a single file. When your dataset covers several files and you need to combine them for further processing, sticking to Excel can generate errors that are difficult to pinpoint further down the road. As with CSV, it is highly recommended to transform your dataset into another format if you plan to open and write it multiple times during your dataset lifetime.
- Parquet: This is a very efficient column-oriented storage format. The Apache Arrow project hosts several libraries that offer great performance for dealing with very large files. Writing a 5 gigabyte (GB) CSV file can take up to 10 minutes, whereas the same data in Parquet will take up around 3.5 GB and be written in 30 seconds. In Python, Parquet files and datasets can be managed by the pyarrow.parquet module.
- Hierarchical Data Format 5 (HDF5): HDF5 is a binary data format dedicated to storing huge amounts of numerical data. With its ability to let you slice multi-terabyte datasets on disk to bring only what you need in memory, this is a great format for data exploration. In Python, you can read and write HDF5 files with pandas (read_hdf and to_hdf) or the h5py library.
- Databases: Your time series might also be stored in general-purpose databases (that you will query using standard Structured Query Language (SQL)) or in purpose-built time series databases such as Amazon Timestream or InfluxDB. Column-oriented databases or scalable key-value stores such as Cassandra or Amazon DynamoDB can also be used, provided you follow design patterns suited to storing and querying time series data.
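As an example of the recommendation above, here is a minimal sketch converting a CSV file into Parquet with pandas (the file name is hypothetical, and the pyarrow package is assumed to be installed):

```python
import pandas as pd

# Read the original CSV once... (hypothetical file and column names)
df = pd.read_csv("household_energy.csv", index_col="timestamp", parse_dates=True)

# ...and persist it as Parquet for faster subsequent reads during EDA
df.to_parquet("household_energy.parquet")

# Later reads are now much faster than re-parsing the CSV file
df = pd.read_parquet("household_energy.parquet")
```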
As with any other type of data, time series can be plagued by multiple data quality issues. Missing data (or Not-a-Number values) can be filled in with different techniques, including the following:

- Replacing missing data points with the mean or median of the whole time series: the fancyimpute (http://github.com/iskandr/fancyimpute) library includes a SimpleFill method that can tackle this task.
- Using a rolling window of a reasonable size and replacing missing data points with the average value over that window: the impyute module (https://impyute.readthedocs.io) includes several methods of interest for time series, such as moving_window, to perform exactly this.
- Forward filling missing data points with the last known value: This can be a useful technique when the data source uses some compression scheme (an industrial historian system such as OSIsoft PI can enable compression of the sensor data it collects, only recording a data point when the value changes). The pandas library includes functions such as Series.fillna whereby you can decide to backfill, forward fill, or replace a missing value with a constant value.
- Interpolating values between two known values: Combined with resampling to align every timestamp in multivariate situations, this yields a robust and complete dataset. You can use the Series.interpolate method from pandas to achieve this.

For all these situations, we highly recommend plotting and comparing the original and resulting time series to ensure that these techniques do not negatively impact the overall behavior of your time series: imputing data (especially interpolation) can make outliers a lot more difficult to spot, for instance.
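The following sketch illustrates a few of these imputation options with pandas on a small, made-up series, and plots the original and interpolated values for comparison:

```python
import numpy as np
import pandas as pd

# Hypothetical series with scattered missing values
series = pd.Series(
    [20.1, np.nan, 20.4, np.nan, np.nan, 21.0, 20.8],
    index=pd.date_range("2021-06-01", periods=7, freq="5min"),
)

filled_median  = series.fillna(series.median())                           # whole-series median
filled_rolling = series.fillna(series.rolling(3, min_periods=1).mean())   # rolling-window mean
filled_ffill   = series.ffill()                                           # last known value
filled_interp  = series.interpolate(method="time")                        # time-based interpolation

# Always compare the original and imputed series before settling on a strategy
ax = series.plot(style="o", figsize=(10, 4), label="original")
filled_interp.plot(ax=ax, alpha=0.5, label="interpolated")
ax.legend()
```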
Important note
Imputing scattered missing values is not mandatory, depending on the analysis you want to perform—for instance, scattered missing values may not impair your ability to understand a trend or forecast future values of a time series. As a matter of fact, wrongly imputing scattered missing values for forecasting may bias the model if you use a constant value (say, zero) to replace the holes in your time series. If you want to perform some anomaly detection, a missing value may actually be connected to the underlying reason of an anomaly (meaning that the probability of a value being missing is higher when an anomaly is around the corner). Imputing these missing values may hide the very phenomena you want to detect or predict.
Other quality issues can arise regarding timestamps: an easy one to solve is timestamps that are not monotonically increasing. When timestamps are not increasing along your time series, you can use a function such as pandas.DataFrame.sort_index or pandas.DataFrame.sort_values to reorder your dataset correctly.
Duplicated timestamps can also arise. When they are associated with duplicated values, using pandas.DataFrame.duplicated will help you pinpoint and remove these errors. When the sampling rate is one hour or finer, you might also see duplicate timestamps with different values: this can happen around daylight saving time changes. In some countries, time moves forward by an hour at the start of summer and back again in the middle of the fall—for example, Paris (France) time is usually Central European Time (CET); in the summer months, Paris falls into Central European Summer Time (CEST). Unfortunately, this usually means that you have to discard all the duplicated values altogether, unless you are able to replace the timestamps with equivalents that include the actual time zone they were referring to.
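A minimal sketch of these timestamp cleanups with pandas, assuming df is a time-indexed DataFrame whose index is named timestamp:

```python
# Restore a monotonically increasing time index
df = df.sort_index()

# Drop rows where both the timestamp and the values are duplicated
df = df.reset_index().drop_duplicates().set_index("timestamp")

# Remaining duplicate timestamps (same time, different values, typically around
# daylight saving time changes): discard them altogether unless you can map
# them back to their original time zone
df = df[~df.index.duplicated(keep=False)]
```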
Data quality at scale
In production systems where large volumes of data must be processed at scale, you may have to leverage distribution and parallelization frameworks such as Dask, Vaex, or Ray. Moreover, you may have to move away from Python altogether: in this case, services such as AWS Glue, AWS Glue DataBrew, and Amazon Elastic MapReduce (Amazon EMR) will provide you with a managed platform to run your data transformation pipeline with Apache Spark, Flink, or Hive, for instance.
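As an illustration, here is a minimal Dask sketch (the file paths and column names are hypothetical) that processes a collection of CSV files without loading everything into memory at once:

```python
import dask.dataframe as dd

# Lazily read a collection of CSV files matching a glob pattern (hypothetical paths)
ddf = dd.read_csv("data/sensor_*.csv", parse_dates=["timestamp"])

# Work is split across partitions and only executed when .compute() is called
summary = ddf.groupby("sensor_id")["value"].mean().compute()
print(summary)
```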
Taking a sneak peek at a time series by reading and displaying the first few records of a time series dataset can be useful to make sure the format is the one expected. However, more often than not, you will want to visualize your time series data, which will lead you into an active area of research: how to transform a time series dataset into a relevant visual representation.
Here are some key challenges you will likely encounter:
- Plotting a high number of data points
- Preventing key events from being smoothed out by any resampling
- Plotting several time series in parallel
- Getting visual cues from a massive amount of time series
- Uncovering multiple scales behavior across long time series
- Mapping labels and metadata on time series

Let's have a look at different techniques and approaches you can leverage to tackle these challenges.
Raw time series data is usually visualized with line plots: you can easily achieve this in Microsoft Excel or in a Jupyter notebook (thanks to the matplotlib library with Python, for instance). However, bringing long time series into memory to plot them can generate heavy files and images that are difficult or slow to render, even on powerful machines. In addition, the rendered plots might contain more data points than your screen has pixels to display. This means that the rendering engine of your favorite visualization library will smooth out your time series. How do you ensure, then, that you do not miss the important characteristics of your signal if they happen to be smoothed out?
On the other hand, you could slice a time series to a more reasonable time range. This may, however, lead you to inappropriate conclusions about the seasonality, the outliers to be processed, or potential missing values to be imputed on certain time ranges outside of your scope of analysis.
This is where interactive visualization comes in. Using such a tool will allow you to load a time series, zoom out to get the big picture, zoom in to focus on certain details, and pan a sliding window to visualize a movie of your time series while keeping full control of the traveling! For Python users, libraries such as plotly (http://plotly.com) or bokeh (http://bokeh.org) are great options.
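For example, here is a minimal plotly sketch (the DataFrame and column names are hypothetical) producing an interactive line plot that you can zoom and pan in a Jupyter notebook:

```python
import plotly.express as px

# df is assumed to be a time-indexed DataFrame with an "energy_kwh" column
fig = px.line(
    df.reset_index(),
    x="timestamp",            # hypothetical name of the timestamp column
    y="energy_kwh",           # hypothetical name of the value column
    title="Household energy consumption",
)
fig.show()
```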
When you need to plot several time series and understand how they evolve with regard to each other, you have different options depending on the number of signals you want to visualize and display at once. What is the best representation to plot several time series in parallel? Indeed, different time series will likely have different ranges of possible values, and we only have two axes on a line plot.
If you have just a couple of time series, any static or interactive line plot will work. If both time series have a different range of values, you can assign a secondary axis to one of them, as illustrated in the following screenshot:
Figure 1.12 – Visualizing a low number of plots
If you have more than two time series and fewer than 10 to 20 that have similar ranges, you can assign a line plot to each of them in the same context. It is not too crowded yet, and this will allow you to detect any level shifts (when all signals go through a sudden significant change). If the range of possible values each time series takes is widely different from one another, a solution is to normalize them by scaling them all to take values between 0.0 and 1.0 (for instance). The scikit-learn library includes preprocessing methods that are well known by ML practitioners for doing just this (check out sklearn.preprocessing.MinMaxScaler, which scales each variable to a given range, or sklearn.preprocessing.StandardScaler, which standardizes each variable to zero mean and unit variance).
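A minimal sketch of this scaling step, assuming df is a multivariate DataFrame with one column per signal:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Scale every column to the [0, 1] range so that signals with widely
# different magnitudes can be compared on the same axis
scaler = MinMaxScaler()
df_scaled = pd.DataFrame(
    scaler.fit_transform(df),
    index=df.index,
    columns=df.columns,
)
df_scaled.plot(figsize=(15, 5))
```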
The following screenshot shows a moderate number of plots being visualized:
Figure 1.13 – Visualizing a moderate number of plots
Even though this plot is a bit too crowded to focus on the details of each signal, we can already pinpoint some periods of interest.
Let's now say that you have hundreds of time series
