Codeless Time Series Analysis with KNIME
Description

This book will take you on a practical journey, teaching you how to implement solutions for many use cases involving time series analysis techniques.
This learning journey is organized in a crescendo of difficulty, starting with simple yet effective techniques applied to weather forecasting, then introducing ARIMA and its variations, moving on to machine learning for audio signal classification, training deep learning architectures to predict glucose levels and electrical energy demand, and ending with an approach to anomaly detection in IoT. No time series analysis book is complete without a solution for stock price prediction, and you'll find this use case at the end of the book, together with a few more demand prediction use cases that rely on the integration of KNIME Analytics Platform and other external tools.
By the end of this time series book, you’ll have learned about popular time series analysis techniques and algorithms, KNIME Analytics Platform, its time series extension, and how to apply both to common use cases.

You can read the e-book in Legimi apps or in any app that supports the following formats:

EPUB
MOBI

Page count: 431

Year of publication: 2022




Codeless Time Series Analysis with KNIME

A practical guide to implementing forecasting models for time series analysis applications

Corey Weisinger

Maarit Widmann

Daniele Tonini

BIRMINGHAM—MUMBAI

Codeless Time Series Analysis with KNIME

Copyright © 2022 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Group Product Manager: Reshma Raman

Publishing Product Manager: Reshma Raman

Senior Editor: Nithya Sadanandan

Technical Editor: Pradeep Sahu

Copy Editor: Safis Editing

Project Coordinator: Deeksha Thakkar

Proofreader: Safis Editing

Indexer: Manju Arasan

Production Designer: Prashant Ghare

Marketing Coordinator: Priyanka Mhatre

First published: July 2022

Production reference: 1220722

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK.

ISBN 978-1-80323-206-5

www.packt.com

Thanks to my colleagues at KNIME for the technical support and encouragement, especially to Andisa Dewi and Tobias Kötter for the Taxi Demand Prediction application, and Rosaria Silipo, Phil Winters, and Iris Adä for the Anomaly Detection application.

– Maarit Widmann

I would like to thank the KNIME team for including me in this great project. Especially thanks to Rosaria Silipo, for her trust and support, and to my co-authors Maarit and Corey, for taking this long journey with me.

– Daniele Tonini

Contributors

About the authors

Corey Weisinger is a data scientist with KNIME in Austin, Texas. He studied mathematics at Michigan State University focusing on actuarial techniques and functional analysis. Before coming to work for KNIME, he worked as an analytics consultant for the auto industry in Detroit, Michigan. He currently focuses on signal processing and numeric prediction techniques and is the author of the From Alteryx to KNIME guidebook.

Maarit Widmann is a data scientist and an educator at KNIME: the instructor behind the KNIME self-paced courses and a teacher of the KNIME courses. She is the author of the From Modeling to Model Evaluation eBook and she publishes regularly on the KNIME blog and on Medium. She holds a master’s degree in data science and a bachelor’s degree in sociology.

Daniele Tonini is an experienced advisor and educator in the field of advanced business analytics and machine learning. Over the last 15 years, he has designed and deployed predictive analytics systems, data quality management tools, and dynamic reporting tools, mainly for customer intelligence, risk management, and pricing applications. He is an Academic Fellow at Bocconi University (Department of Decision Science) and SDA Bocconi School of Management (Decision Sciences & Business Analytics Faculty). He's also an adjunct professor in data mining at Franklin University, Switzerland. He currently teaches statistics, predictive analytics for data-driven decision making, big data and databases, market research, and data mining.

About the reviewers

Miguel Infestas Maderuelo has a Ph.D. in applied economics and has developed his career around data analytics in different fields (digital marketing, data mining, academic research, and so on). His latest venture is founding a digital marketing agency that applies analytics to digital data to optimize digital communication.

Rosaria Silipo, Ph.D., now head of data science evangelism at KNIME, has spent 25+ years in applied AI, predictive analytics, and machine learning at Siemens, Viseca, Nuance Communications, and private consulting. Sharing her practical experience in a broad range of industries and deployments, including IoT, customer intelligence, financial services, social media, and cybersecurity, Rosaria has authored 50+ technical publications, including her recent books Guide to Intelligent Data Science (Springer) and Codeless Deep Learning with KNIME (Packt).

Table of Contents

Preface

Part 1: Time Series Basics and KNIME Analytics Platform

Chapter 1: Introducing Time Series Analysis

Understanding TSA

Exploring time series properties and examples

Continuous and discrete time series

Independence and serial correlation

Time series examples

TSA goals and applications

Goals of TSA

Domains of applications and use cases

Exploring time series forecasting techniques

Quantitative forecasting properties and techniques

Summary

Questions

Chapter 2: Introduction to KNIME Analytics Platform

Exploring the KNIME software

Introducing KNIME Analytics Platform for creating data science applications

Introducing KNIME Server for productionizing data science applications

Introducing nodes and workflows

Introducing nodes

Introducing workflows

Searching for and sharing resources on the KNIME Hub

Building your first workflow

Creating a new workflow (group)

Reading and transforming data

Filtering rows

Visualizing data

Building a custom interactive view

Documenting workflows

Configuring the time series integration

Introducing the time series components

Configuring Python in KNIME

Summary

Questions

Chapter 3: Preparing Data for Time Series Analysis

Introducing different sources of time series data

Time granularity and time aggregation

Defining time granularity

Finding the right time granularity

Aggregating time series data

Equal spacing and time alignment

Explaining the concept of equal spacing

Missing value imputation

Defining the different types of missing values

Introducing missing value imputation techniques

Summary

Questions

Chapter 4: Time Series Visualization

Technical requirements

Introducing an energy consumption time series

Describing raw energy consumption data

Clustering energy consumption data

Introducing line plots

Displaying simple dynamics with a line plot

Interpreting the dynamics of a time series based on a line plot

Building a line plot in KNIME

Introducing lag plots

Introducing insights derived from a lag plot

Building a lag plot in KNIME

Introducing seasonal plots

Comparing seasonal patterns in a seasonal plot

Building a seasonal plot in KNIME

Introducing box plots

Inspecting variability of data in a box plot

Building a box plot in KNIME

Summary

Questions

Chapter 5: Time Series Components and Statistical Properties

Technical requirements

Trend and seasonality components

Trend

Seasonality

Decomposition

Autocorrelation

Stationarity

Summary

Questions

Part 2: Building and Deploying a Forecasting Model

Chapter 6: Humidity Forecasting with Classical Methods

Technical requirements

The importance of predicting the weather

Other IoT sensors

The use case

Streaming humidity data from an Arduino sensor

What is an Arduino?

Moving data to KNIME

Storing the data to create a training set

Resampling and granularity

Aligning data timestamps

Missing values

Aggregation techniques

Training and deployment

Types of classic models available in KNIME

Training a model in KNIME

Available deployment options

Building the workflow

Writing model predictions to a database

Summary

Questions

Chapter 7: Forecasting the Temperature with ARIMA and SARIMA Models

Recapping regression

Defining a regression

Introducing the (S)ARIMA models

Requirements of the (S)ARIMA model

How to configure the ARIMA or SARIMA model

Fitting the model and generating forecasts

The data

Summary

Further reading

Questions

Chapter 8: Audio Signal Classification with an FFT and a Gradient-Boosted Forest

Technical requirements

Why do we want to classify a signal?

Windowing your data

Windowing your data in KNIME

What is a transform?

The Fourier transform

Discrete Fourier Transform (DFT)

Fast Fourier Transform (FFT)

Applying the Fourier transform in KNIME

Preparing data for modeling

Reducing dimensionality

Training a Gradient Boosted Forest

Applying the Gradient Boosted Trees Learner

Deploying a Gradient Boosted Forest

Summary

Questions

Chapter 9: Training and Deploying a Neural Network to Predict Glucose Levels

Technical requirements

Glucose prediction and the glucose dataset

Glucose prediction

The glucose dataset

A quick introduction to neural networks

Artificial neurons and artificial neural networks

The backpropagation algorithm

Other types of neural networks

Training a feedforward neural network to predict glucose levels

KNIME Deep Learning Keras Integration

Building the network

Training the network

Scoring the network and creating the output rule

Deploying an FFNN-based alarm system

Summary

Questions

Chapter 10: Predicting Energy Demand with an LSTM Model

Technical requirements

Introducing recurrent neural networks and LSTMs

Recapping recurrent neural networks

The architecture of the LSTM unit

Forget Gate

Input Gate

Output Gate

Encoding and tensors

Input data

Reshaping the data

Training an LSTM-based neural network

The Keras Network Learner node

Deploying an LSTM network for future prediction

Scoring the forecasts

Summary

Questions

Chapter 11: Anomaly Detection – Predicting Failure with No Failure Examples

Technical requirements

Introducing the problem of anomaly detection in predictive maintenance

Introducing the anomaly detection problem

IoT data preprocessing

Exploring anomalies visually

Detecting anomalies with a control chart

Introducing a control chart

Implementing a control chart

Predicting the next sample in a correctly working system with an auto-regressive model

Introducing an auto-regressive model

Training an auto-regressive model with the linear regression algorithm

Deploying an auto-regressive model

Summary

Questions

Part 3: Forecasting on Mixed Platforms

Chapter 12: Predicting Taxi Demand on the Spark Platform

Technical requirements

Predicting taxi demand in NYC

Connecting to the Spark platform and preparing the data

Introducing the Hadoop ecosystem

Accessing the data and loading it into Spark

Introducing the Spark compatible nodes

Training a random forest model on Spark

Exploring seasonalities via line plots and auto-correlation plot

Preprocessing the data

Training and testing the random forest model on Spark

Building the deployment application

Predicting the trip count in the next hour

Predicting the trip count in the next 24 hours

Summary

Questions

Chapter 13: GPU Accelerated Model for Multivariate Forecasting

Technical requirements

From univariate to multivariate – extending the prediction problem

Building and training the multivariate neural architecture

Enabling GPU execution for neural networks

Setting up a new GPU Python environment

Switching Python environments dynamically

Building the deployment application

Summary

Questions

Chapter 14: Combining KNIME and H2O to Predict Stock Prices

Technical requirements

Introducing the stock price prediction problem

Describing the KNIME H2O Machine Learning Integration

Starting a workflow running on the H2O platform

Introducing the H2O nodes for machine learning

Accessing and preparing data within KNIME

Accessing stock market data from Yahoo Finance

Preparing the data for modeling on H2O

Training an H2O model from within KNIME

Optimizing the number of predictor columns

Training, applying, and testing the optimized model

Consuming the H2O model in the deployment application

Summary

Questions

Final note

Answers

Chapter 1

Chapter 2

Chapter 3

Chapter 4

Chapter 5

Chapter 6

Chapter 7

Chapter 8

Chapter 9

Chapter 10

Chapter 11

Chapter 12

Chapter 13

Chapter 14

Other Books You May Enjoy

Preface

This book gives an overview of the basics of time series data and time series analysis and of KNIME Analytics Platform and its time series integration. It shows how to implement practical solutions for a wide range of use cases, from demand prediction to signal classification and signal forecasting, and from price prediction to anomaly detection. It also demonstrates how to integrate other tools in addition to KNIME Analytics Platform within the same application.

The book walks you through the common preprocessing steps for time series data and through statistics- and machine learning-based forecasting techniques, both of which you need to master the field of time series analysis. It also points you to examples implemented in KNIME Analytics Platform, a visual programming tool that is accessible and fast to learn, removing the common time and skill barrier of learning to code.

Who this book is for

This book is for data analysts and data scientists who want to develop forecasting applications on time series data. The first part of the book targets beginners in time series analysis by introducing the main concepts of time series analysis and visual exploration and preprocessing of time series data. The subsequent parts of the book challenge both beginners and advanced users by introducing real-world time series analysis applications.

What this book covers

Chapter 1, Introducing Time Series Analysis, explains what a time series is, states some classic time series problems, and introduces the two historical approaches: statistics and machine learning.

Chapter 2, Introduction to KNIME Analytics Platform, explains the basic concepts of KNIME Analytics Platform and its time series integration. This chapter covers installation, an introduction to the platform, and a first workflow example.

Chapter 3, Preparing Data for Time Series Analysis, introduces the common first steps in a time series analysis project. It explores different sources of time series data and shows time alignment, time aggregation, and missing value imputation as common preprocessing steps.

Chapter 4, Time Series Visualization, explores time series visualization. It provides an exploration of the most common visualization techniques to visually represent and display the time series data: from the classic line plot to the lag plot, and from the seasonal plot to the box plot.

Chapter 5, Time Series Components and Statistical Properties, introduces common concepts and measures for descriptive statistics of time series, including the decomposition of a time series, autocorrelation measures and plots, and the stationarity property.

Chapter 6, Humidity Forecasting with Classical Methods, completes a classic time series analysis use case: forecasting. It introduces some simple yet powerful classical methods, which often solve the time series analysis problem quickly without much computational expense.

Chapter 7, Forecasting the Temperature with ARIMA and SARIMA Models, delves into the ARIMA and SARIMA models. It aims at predicting tomorrow’s temperatures with the whole range of ARIMA models: AR, ARMA, ARIMA, and SARIMA.

Chapter 8, Audio Signal Classification with an FFT and a Gradient Boosted Forest, introduces a use case for signal classification. It classifies audio signals with a Gradient Boosted Forest model, using the FFT to transform the raw audio signals before modeling.

Chapter 9, Training and Deploying a Neural Network to Predict Glucose Levels, gives an example of a critical prediction problem: predicting the glucose level for a timely insulin intervention. This chapter also introduces neural networks.

Chapter 10, Predicting Energy Demand with an LSTM Model, introduces recurrent neural networks based on Long Short-Term Memory (LSTM) layers, which are advanced predictors when temporal context is involved. It tests whether the prediction accuracy improves considerably over an ARIMA model when using a recurrent LSTM-based neural network.

Chapter 11, Anomaly Detection – Predicting Failure with No Failure Examples, tackles the problem of anomaly detection in predictive maintenance by introducing approaches that work exclusively on the data from a correctly working system.

Chapter 12, Predicting Taxi Demand on the Spark Platform, implements a solution to the demand prediction problem via a Random Forest to run on a Spark platform in an attempt to make the solution more scalable.

Chapter 13, GPU Accelerated Model for Multivariate Forecasting, extends the demand prediction problem to a multivariate one by also taking exogenous time series into account, and makes it scalable by training the recurrent neural network on a GPU-enabled machine.

Chapter 14, Combining KNIME and H2O to Predict Stock Prices, describes the integration of KNIME Analytics Platform with H2O, another open source platform, to implement a solution for stock price prediction.

To get the most out of this book

This book will introduce the basics of the open source visual programming tool KNIME Analytics Platform and time series analysis. Basic knowledge of data transformations is assumed, while no coding skills are required thanks to the codeless implementation of the examples. Python installation is required for using the time series integration in KNIME.

The installation of use case-specific extensions and integrations will be explained in the respective chapters. We will introduce KNIME Server for enterprise features in Chapter 2, Introduction to KNIME Analytics Platform, but all practical examples are implemented in the open source KNIME Analytics Platform.

If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book’s GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

Download the example code files

You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Codeless-Time-Series-Analysis-with-KNIME and https://hub.knime.com/knime/spaces/Codeless%20Time%20Series%20Analysis%20with%20KNIME/latest/~GxjXX6WmLi-WjLNx/. If there’s an update to the code, it will be updated in the GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots and diagrams used in this book. You can download it here: https://packt.link/2RomT.

Conventions used

There are a number of text conventions used throughout this book.

Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: “For example, the ../sales.csv workflow relative path reads the sales.csv file located in the same workflow group as the executing workflow.”

Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: “If you want to do that, you will need to unlink it via the component’s context menu by selecting Component | Disconnect Link.”

Tips or Important Notes

Appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Share Your Thoughts

Once you’ve read Codeless Time Series Analysis with KNIME, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.

Part 1: Time Series Basics and KNIME Analytics Platform

By the end of this part, you will know what a time series is, how to preprocess, visualize, and explore it, and how to configure and use KNIME Analytics Platform for time series analysis. The following are the chapters included in this part:

Chapter 1, Introducing Time Series Analysis
Chapter 2, Introduction to KNIME Analytics Platform
Chapter 3, Preparing Data for Time Series Analysis
Chapter 4, Time Series Visualization
Chapter 5, Time Series Components and Statistical Properties

Chapter 1: Introducing Time Series Analysis

In this introductory chapter, we’ll examine the concept of time series, explore some examples and case studies, and then understand how Time Series Analysis (TSA) can be useful in different frameworks and applications. Finally, we’ll provide a brief overview of the forecasting models used over the years, highlighting their key features, which will be further explored in the following chapters.

In this chapter, we will cover the following topics:

Understanding TSA and its importance within data analytics
Time series properties and examples
TSA goals and applications
Overview of the main forecasting techniques used over the years

By the end of the chapter, you will have a good understanding of the key aspects of TSA, gaining the foundation to explore the subsequent chapters of the book with greater confidence.

Understanding TSA

When analyzing business data, it’s quite common to focus on what happened at a particular point in time: sales figures at the end of the month, customer characteristics at the end of the year, conversion results at the end of a marketing campaign, and more. Even in the development of the most sophisticated ML models, in most cases, we collect information that refers to different objects at a specific instant in time (or by taking a few snapshots of historical data). This approach, which is absolutely valid and correct for many applications, not only in business, uses cross-sectional data as the basis for analytics: data collected by observing many subjects (such as individuals, companies, shops, countries, equipment, and more) at one point or period of time.

Although disregarding the temporal factor in the analysis is widespread and rooted in common practice, there are several situations where the analysis of the temporal evolution of a phenomenon provides more complete and interesting results. In fact, it's only through the analysis of the temporal dynamics of the data that it is possible to identify the presence of some peculiar characteristics of the phenomenon we are analyzing, whether it is sales/consumption data, a physical parameter, or a macroeconomic index. These characteristics that act over time, such as trends, periodic fluctuations, level changes, anomalous observations, turning points, and more, can have an effect in the short or long term, and often, it is important to be able to measure them precisely. Furthermore, it is only by analyzing data over time that it is possible to provide a reliable quantitative estimate of what might occur in the future (whether immediate or not). Since economic conditions are constantly changing over time, data analysts must be able to assess and predict the effects of these changes in order to suggest the most appropriate actions to take for the future.

For these reasons, TSA can be a very useful tool in the hands of business analysts and data scientists when it comes to both describing the patterns of a phenomenon along the time axis and providing a reliable forecast for it. Through the use of the right tools, TSA can significantly expand the understanding of any variable of interest (typically numerical) such as sales, financial KPIs, logistic metrics, sensors’ measurements, and more. More accurate and less biased forecasts that have been obtained through quantitative TSA can be one of the most effective drivers of performance in many fields and industries.

In the next sections of this chapter, we will provide definitions, examples, and some additional elements to gain a further understanding of how to recognize some key features of time series and how to approach their analyses in a structured way.

Exploring time series properties and examples

A general definition of a time series is the following:

A Time Series is a collection of observations made sequentially through time, whose dynamics are often characterized by short/long period fluctuations and/or long period direction.

This definition highlights two fundamental aspects of a time series: the fact that observations are a function of time and that, as a consequence of this fact, some typical temporal features are often observed. The fluctuations and the long period direction of the series are just some of these features, as there might be other relevant aspects to take into consideration such as autocorrelation, stationarity, and the order of integration. We will explore these aspects in more detail in future chapters. In this section, we will focus on the distinction between discrete time series and continuous time series, on the concept of independence between observations, and finally, we will show some examples of real-world time series.

Continuous and discrete time series

A Time Series is defined as continuous when observations are collected continuously over time, that is, when there can be an infinite number of observations in a given time range. Typically, continuous time series data is sampled at irregular time intervals. Consider the measurement of a patient's blood pressure in a hospital, taken at varying, not equally spaced, time points during the day. This happens because, in some settings, regular monitoring at fixed intervals is not possible. For instance, Figure 1.1 shows four continuous medical time series, relative to the health parameters of four patients:

Mean blood pressure
Heart rate
Temperature
Glucose data

As evident from the graphs, there are some temporal ranges where the measures are not present, for example, the temperature and glucose between approximately 20 hours and 30 hours of the monitoring period. There are other time points where data is collected more frequently than in other periods. These time series features are due to the fact that the data has been collected manually by the physician or by the nurse, not at fixed moments of the day. Therefore, this type of time series is inherently irregularly sampled:

Figure 1.1 – Four continuous, irregularly sampled, medical time series

A time series is defined as discrete when observations are collected regularly at specific times, typically equally spaced (for example, hourly, daily, weekly, or yearly data points).

A time series of this type can be natively discrete, such as the annual budget data of a company, or it can be created through the aggregation or accumulation of a numerical variable over equal time intervals, for example, the monthly sales of a supermarket or the number of daily passengers at a train station. A continuous time series can also be discretized by binning/grouping the original data, thus obtaining a discrete time series.

Classical TSA focuses on discrete time series because they are more common in real-world applications and easier to analyze. Therefore, in this book, we mainly deal with discrete time series, where observations are collected at equal intervals. When we consider irregularly sampled time series, we will first try to transform them into regularly sampled data points.
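If you want to prototype this transformation in code (for example, via the KNIME Python integration), the following minimal pandas sketch shows the idea; the timestamps and values are hypothetical. It aggregates an irregularly sampled series onto a regular hourly grid and imputes the remaining gaps:

```python
import pandas as pd

# Hypothetical, irregularly sampled glucose readings, indexed by the
# timestamps at which they were taken
readings = pd.Series(
    [5.4, 6.1, 5.9, 7.2],
    index=pd.to_datetime([
        "2022-01-01 08:12", "2022-01-01 09:47",
        "2022-01-01 13:05", "2022-01-01 13:40",
    ]),
)

# Resample onto a regular hourly grid: average the values falling into
# each hour, leaving NaN where no observation exists
hourly = readings.resample("1h").mean()

# The remaining gaps can then be filled with a missing value imputation
# technique, for example, linear interpolation
hourly = hourly.interpolate(method="linear")
print(hourly)
```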

Independence and serial correlation

One of the most distinctive characteristics of a time series is the mutual dependence between the observations, generally called serial correlation or autocorrelation.

In many statistical models, observations are assumed to be generated by a random sampling process and to be independent of each other (consider the linear regression model). Typically, this assumption turns out to be inconsistent with time series data, where simply collecting the data sequentially, along the time axis, generally produces observations that are not independent of each other.

Think of the daily sales of an e-commerce company. It's reasonable to imagine that today's sales are somehow related to the previous day's sales: successive observations are dependent. While this dependence can create some problems when using classical statistical tools, it also makes it possible to exploit the temporal dependence of observations to improve the forecasting process. If today's sales are related to yesterday's, and we can consistently estimate this relationship, then we can improve the forecast of tomorrow's sales based on today's result.
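For readers who like to verify the concept numerically, this short pandas sketch (with made-up sales figures) measures the lag-1 serial correlation of a series:

```python
import pandas as pd

# Hypothetical daily sales of an e-commerce company
sales = pd.Series([120.0, 135.0, 128.0, 150.0, 160.0, 149.0, 170.0, 182.0])

# Lag-1 autocorrelation: how strongly today's sales correlate with
# yesterday's; values close to 1 indicate strong serial dependence
print(sales.autocorr(lag=1))

# Equivalent view: correlation between the series and a one-day shifted copy
print(sales.corr(sales.shift(1)))
```

A value close to 1 confirms that successive observations carry information about each other, which is exactly what forecasting models exploit.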

Time series examples

Interesting examples of time series can be collected in a multitude of information domains: business/economics, industrial production, social sciences, physics, and more. The time series obtained from these fields might be profoundly different in terms of statistical properties and the granularity of the available data, yet the methodologies of descriptive analysis and forecasting are essentially the same.

Here, we will explore a line chart (also called a time plot) of some representative discrete time series, with the aim of showing how it is possible to observe very different dynamics, depending on the type of data and the field of reference. Figure 1.2 shows the pattern of two annual time series, that is, the Number of PhDs awarded in the US, split between the subjects of engineering and education:

Figure 1.2 – Time series example 1: number of PhDs awarded in the US, showing the annual data for Engineering versus Education

In the preceding graph, we can see that neither time series shows periodic fluctuations, which is typical of annual data. The engineering doctorate series appears to be increasing over time, especially in the last 5 years presented, while the education doctorate series shows a flatter trend, with a level shift between 2010 and 2011.

Figure 1.3 – Time series example 2: monthly carbon dioxide concentration (globally averaged from marine surface sites)

Focusing on a different series, the Monthly carbon dioxide concentration in Figure 1.3 shows a completely different pattern than the previous series. In fact, the dynamics of this monthly time series are dominated by periodic fluctuations, which are repeated consistently every year. In addition, we observe the constant growth of the level of the carbon concentration, year after year. In summary, this series shows an increasing oscillatory pattern that appears to be quite stable and, therefore, easily predictable.

Figure 1.4 – Time series example 3: LinkedIn’s daily stock market closing price

In contrast, the evolution of the time series shown in Figure 1.4 seems to be much more unpredictable. In this case, we have daily data points of LinkedIn’s stock market closing price. The pattern during the 5 years of observation seems to be very irregular, without periodic fluctuations, with sudden changes of direction superimposed on an increasing trend in the long run.

Figure 1.5 – Time series example 4: number of photos uploaded onto Instagram every minute (regional sub-sample)

Considering another example in the social media theme, we can look at Figure 1.5, in which the plot shows the Number of photos uploaded onto Instagram every minute (regional sub-sample). In this case, the granularity of the data is very high (one observation every minute) and the dynamics of the time series show elements of regularity, such as the constant fluctuations and peaks observed in the early afternoon of each day, as well as discontinuities, such as the presence of some anomalous observations.

Figure 1.6 – Time series example 5: acceleration detected by smartphone sensors during a workout session (10 seconds)

Finally, the analysis of the three time series shown in Figure 1.6 highlights how, for the same phenomenon (a workout session), both regular and irregular dynamics can be observed, depending on the point of observation. In this case, the three accelerometers mounted on the wearable device show fairly constant peaks along one spatial dimension and greater irregularity along the others.

In conclusion, from the examples that we have shown in this section, we notice that time series might have characteristics that are very different from one another. Aspects such as the origin of the data and the reference industry, the granularity of the data, and the length of the observation period can drastically influence the dynamics of the time series, revealing truly heterogeneous patterns.

TSA goals and applications

When it comes to analyzing time series, depending on the industry and the type of project, different goals can be pursued, from the simplest to the most complex. Likewise, multiple analytical applications can be developed where TSA plays a crucial role. In this section, we will look at the main goals of time series analysis, followed by some examples of real-world applications.

Goals of TSA

In common practice, TSA is directly associated with forecasting, almost as if it were a synonym for this task. Although the objective of predicting the data for a future horizon is probably the most common (and challenging) goal, we should not assume TSA is only that. Often, the purpose of the analysis is to obtain a correct representation of data over time: think of the construction of a tool for data visualization and business intelligence or analyzing the data of a manufacturing process to detect possible anomalies.

Therefore, there are different objectives in the analysis of time series that can be listed in the following four points:

Exploratory analysis and visualization: This consists of the use of descriptive analytics tools dedicated to the summary of data points with respect to time. Through these analyses, it's possible to identify the presence of specific temporal dynamics (for example, trends, seasonality, or cycles), detect outliers/gaps in the data, or search for a specific pattern. In business intelligence, it is critical to correctly represent time series within enterprise dashboards in order to provide immediate insights to business users for the decision-making process.

Causal effect discovery and simulation: In many sectors, often, it is useful to verify how one or more exogenous variables impact a target variable. For example, how advertising investments on different channels (whether digital or not) impact the sales of a company or how some environmental conditions impact the quality of the industrial production of a particular product. These types of problems are very common and, in data analytics, are frequently addressed through the estimation of multiple regression models (adapted to work well with time series data). Once possible causal relationships are identified, it is possible to simulate the outcome of the objective variable as a function of the values assumed by the exogenous variables.

Anomaly detection and process control (Figure 1.7): We can use TSA to prevent negative events (such as failures, damage, or performance drops):

Figure 1.7 – Anomaly detection using time series

The main idea is to promptly detect an anomaly during the operation of a device or the behavior of a subject, even if the specific anomaly has never been observed before. For many companies, reducing anomalies and improving quality is a key factor for growth and success; for example, reducing fraud in the banking sector or preventing cyber attacks in IT security systems. In manufacturing, process engineers use control charts to monitor the stability of a production process or a measurement system. Typically, a control chart is obtained by plotting the data points of a time series related to a specific parameter of the manufacturing process (for example, wire pull strength, the concentration of a chemical, oxide thickness, and more) and adding control limits, which are useful for identifying possible process drifts or anomalies (see the sketch after this list).

Forecasting: This definitely constitutes the main objective of time series analysis and consists of predicting the future values of a time series observed in the past. The forecasting horizon can be short-term or long-term. There are many methods used to obtain the predicted values; we will discuss these aspects in more detail in the Exploring time series forecasting techniques section.
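To make the control chart mentioned under anomaly detection and process control concrete, here is a minimal Python sketch with hypothetical measurements; the limits are estimated from data of a correctly working process and then applied to new measurements:

```python
import pandas as pd

# Hypothetical measurements from a correctly working process, used to
# estimate the control limits
training = pd.Series([10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 10.1, 9.7, 10.0, 10.2])

center = training.mean()     # center line of the control chart
sigma = training.std()       # estimated process standard deviation
ucl = center + 3 * sigma     # upper control limit
lcl = center - 3 * sigma     # lower control limit

# New measurements to monitor; points outside the control limits signal
# possible process drifts or anomalies
new_values = pd.Series([10.0, 10.4, 9.9, 11.2, 10.1])
out_of_control = new_values[(new_values > ucl) | (new_values < lcl)]
print(f"center={center:.2f}, UCL={ucl:.2f}, LCL={lcl:.2f}")
print(out_of_control)
```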

Domains of applications and use cases

The fields of application of TSA are numerous. Demand Forecasting and Planning is one of the most common applications, as it’s an important process for many companies (especially retailers) to anticipate demand for products throughout the entire supply chain, especially under uncertain conditions. However, from industry to industry, there are many more interesting uses of TSA. Right now, it would be almost impossible to list all applications where the use of TSA plays an important role in creating business solutions and assets; therefore, we will limit ourselves to a few examples that might give you an idea of the heterogeneity of use cases in the field of TSA.

For instance, consider the following list of examples:

Workforce planning: For a company operating in the logistics and transportation industry, it is crucial to predict the workload so that the right number of staff/couriers are available to handle it properly. In a workforce planning context, correctly forecasting the volume of parcels to be handled can help to effectively allocate effort and resources, which means eventually improving the bottom line for companies with, typically, low-profit margins.

Forecasting of sales during promotions: E-commerce, supermarkets, and retailers increasingly use promotions, discount periods, and special sales to increase sales volume; however, stock-out problems are often generated, resulting in customer dissatisfaction and extra operative costs. Therefore, it is essential to use forecasting models that integrate the effects of promotions into sales forecasting in order to optimize warehouses and avoid losses, both economic and reputational.

Insurance claim reserving: For insurance companies, estimating the claims reserve plays an important role in maintaining capital, determining premiums, and being in line with requirements imposed by the policyholder. Therefore, it is necessary to estimate the future number and amount of claims as correctly as possible. In recent years, actuarial practitioners have used several time series-based approaches to obtain reliable forecasts of claims and estimate the degree of uncertainty of the predictions.

Predictive maintenance: In the context of the Internet of Things, the availability of real-time information generated by sensors mounted on devices and manufacturing equipment enables the development of analytics solutions that can prevent negative events (such as failures, damage, or drops in performance) in order to improve the quality of products or reduce operating costs. Anomaly detection based on TSA is one of the most widely used methods for creating effective predictive maintenance solutions. In Chapter 11, Anomaly Detection – Predicting Failure with No Failure Examples, we will provide a detailed use case in this area.

Energy load forecasting: In deregulated energy markets, forecasting the consumption and price of electricity is crucial for defining effective bidding strategies to maximize a company's profits. In this context, TSA is a widely used approach for day-ahead forecasting.

The applications just listed provide insight into how the application of TSA and forecasting techniques form the core of many processes and solutions developed in different industries.

Exploring time series forecasting techniques

Within the data science domain, doing time series forecasting first means extending a KPI (or any measure of interest) into the future in the most accurate and least biased way possible. While this remains the primary goal of forecasting, the activity often does not boil down to just that, as it's sometimes necessary to include an assessment of the uncertainty of the forecasted values and comparisons with previous forecasting benchmarks. There are essentially two approaches to time series forecasting, listed as follows:

Qualitative forecasting methods are adopted when historical data is not available (for example, when estimating the revenues of a new company that clearly doesn't have any data available). They are highly subjective methods. Among the most important qualitative forecasting techniques, it is possible to mention the Delphi method.

Quantitative forecasting techniques are based on historical quantitative data; the analyst/data scientist, starting from this data, tries to understand the underlying structure of the phenomenon of interest and then uses the same data for forecasting purposes. Therefore, the analyst's task is to identify, isolate, and measure these temporal dynamics behind a time series of past data in order to make optimal predictions and eventually support decisions, planning, and business control. The quantitative approach to forecasting is certainly the most widely used, as it generates results that are typically more robust and more easily deployed into business processes. Therefore, from now on (including in the next chapters), we will focus exclusively on it.

In the following section, we will explore the details of quantitative forecasting, focusing on the basic requirements for carrying it out properly and the main quantitative techniques used in recent years.

Quantitative forecasting properties and techniques

First and foremost, the development of a quantitative forecasting model depends on the available data, both in terms of the amount of data and the quality of historical information. In general, we can say that there are two basic requirements for effectively creating a reliable quantitative forecasting model:

Obtain an adequate number of observations, which means a sufficient depth of historical data, in order to correctly understand the phenomenon under analysis, estimate the models, and then apply the predictions. Probably one of the most common questions asked by those who are facing the development of a forecasting model for the first time is how long does the Time Series need to be to obtain a reliable model, which, in simple terms, means how much past do I need? The answer is not simple. It would be incorrect to say at least 50 observations are needed or that the depth should be at least 5 years. In fact, the amount of data points to consider depends on the following:

The complexity of the model to be developed and the number of parameters to be estimated.
The amount of randomness in the data.
The granularity of the data (such as monthly, daily, and hourly) and its characteristics. (Is it intermittent? Are there strong periods of discontinuity to consider?)
The presence of one or more seasonal components that need to be estimated in relation to the granularity of the data (for example, to include a weekly seasonality pattern of hourly data in the model, at least several hundred observations must be available).

Collect information about the "time dimension" of the time series in order to determine the starting/ending points of the data and a possible length for the seasonal components (if present).

Given a set of sufficient historical data, the basis for a quantitative forecasting model is the assumption that there are factors that influenced the dynamics of the series in the past and that these factors will continue to have similar effects in the future.

There are several criteria used to classify quantitative forecasting techniques. It is possible to consider the historical evolution of the methods (from the most classical to the most modern), how the methods use the information within the model, or even the domain of method development (purely statistical versus ML). Here, we present one possible classification of the techniques used for quantitative forecasting, which takes into account multiple relevant elements that characterize the different methods. We can consider these three main groups of methods as follows:

Classical univariate forecasting methods: In these statistical techniques, the formation of forecasts is only based on the same time series to be forecast, through the identification of structural components, such as trends and seasonality, and the study of the serial correlation. Some popular methods in this group are listed as follows:

Classical decomposition: This considers the observed series as the overlap of three elementary components (trend-cycle, seasonality, and residual), connected with different patterns that are typically present in many economic time series; classical decomposition (such as other types of decomposition) is a common way to explore and interpret the characteristics of a time series, but it can certainly be used to produce forecasts. In Chapter 5, Time Series Components and Statistical Properties, we will delve deeper into this method.

Exponential smoothing: Forecasts produced by exponential smoothing methods are based on weighted averages of past observations, with weights decaying exponentially as the observations get older; this decreasing weights method can also take into account the overlap of some components, such as trends and seasonality (see the sketch after this list).

AutoRegressive Integrated Moving Average (ARIMA): Essentially, this is a regression-like approach that aims to model, as effectively as possible, the serial correlation among the observations in a time series. To do this effectively, several parameters in the model can handle trends and seasonality, although less directly than decomposition or exponential smoothing.

Explanatory models: These techniques work in a multivariate fashion, so the forecasts are based on both past observations of the reference time series and external predictors, which helps to achieve better accuracy but also to obtain a more extensive interpretation of the model. The most popular example in this group is the ARIMAX model (or regression with ARIMA errors).

ML methods: These techniques can be either univariate or multivariate. However, their most distinctive feature is that they originated outside the statistical domain and were not specifically designed to analyze time series data; typically, they are artificial neural networks (such as multilayer perceptrons, long short-term memory networks, and dilated convolutional neural networks) or tree-based algorithms (such as random forest or gradient boosted trees) originally made for cross-sectional data that can be adapted for time series forecasting.
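As a concrete illustration of the exponentially decaying weights described above, here is a minimal pure-Python sketch of simple exponential smoothing; the alpha value and the data are illustrative, and the full Holt-Winters family additionally models trend and seasonal components:

```python
# A minimal sketch of simple exponential smoothing, written out to expose
# the exponentially decaying weights; alpha and the data are illustrative
def simple_exp_smoothing(series, alpha=0.3):
    level = [series[0]]                    # initialize with the first value
    for x in series[1:]:
        # new level = alpha * latest observation + (1 - alpha) * old level
        level.append(alpha * x + (1 - alpha) * level[-1])
    return level

data = [112, 118, 132, 129, 121, 135, 148, 148]
level = simple_exp_smoothing(data, alpha=0.3)

# The one-step-ahead forecast is simply the last smoothed level
print(round(level[-1], 2))
```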

A very common question asked by students and practitioners who are new to TSA is whether there is one forecasting method that is better than the others. The answer (for now) is no. All of the models have their own pros and cons. In general, exponential smoothing, ARIMA, and all the classical methodologies have been around the longest. They are quite easy to implement and typically very reliable, but they require the verification of some assumptions, and sometimes, they are not as flexible as you would like them to be. In contrast, ML algorithms are really flexible (they don't have assumptions to check), but commonly, you need a large amount of data to train them properly. Moreover, they can be more complicated (a lot of hyperparameters to tune), and to be effective, you need to create some extra temporal features to capture the time-related patterns within your data.
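As an illustration of such temporal feature engineering, the following sketch (with hypothetical data, assuming scikit-learn is available) turns a univariate series into a cross-sectional table of lagged predictors for a tree-based learner:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical univariate demand series
y = pd.Series([20, 22, 23, 25, 24, 27, 29, 28, 31, 33], name="demand")

# Build a cross-sectional table by adding lagged copies of the series as
# predictors -- the extra temporal features mentioned above
table = pd.DataFrame({"lag1": y.shift(1), "lag2": y.shift(2), "target": y}).dropna()

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(table[["lag1", "lag2"]], table["target"])

# One-step-ahead forecast from the two most recent observations
next_input = pd.DataFrame({"lag1": [y.iloc[-1]], "lag2": [y.iloc[-2]]})
print(model.predict(next_input))
```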

But what does the best forecasting model mean? Consider that it’s never just a matter of the pure performance of the model, as you need to consider other important items in the model selection procedure. For instance, consider the following list of items:

Forecast horizon in relation to TSA objectives: Are you going to predict the short term or the long term? For the same time series, you could have a model that is the best one for short-term forecasts, but you need to use another one for long-term forecasts.

The type/amount of available data: In general, for small datasets, a classical forecasting method could be better than an ML approach.

The required readability of the results: A classical model is more interpretable than an ML model.

The number of series to forecast: Using classical methods with thousands of time series can be inefficient, so in this case, an ML approach could be better.

Deployment-related issues: Also, consider the frequency of the delivery of the forecasts, the software environment, and the usage of the forecasts.

In summary, when facing the modeling part of your time series forecasting application, don’t just go with one algorithm. Try different approaches, considering your goals and the type/amount of data that you have.

Summary

In this chapter, we introduced TSA, starting by defining what a time series is and then providing some examples of series taken from various contexts and industries. Next, we focused on the goals that are typically related to TSA and also provided some examples of applications in real-world scenarios. Finally, we covered a brief review of the main forecasting methods, providing a taxonomy of methodologies and generally describing the characteristics of the main models, from the most classic to the most modern.

The basic concepts provided in this chapter are of great importance for approaching the subsequent chapters of the book in a structured way, with the concepts of time series and forecasting clear in your head.

In the next chapter, we’ll cover the basic concepts of KNIME Analytics Platform and its time series integration, introducing the software and showing a first workflow example.

Questions

The answers to the following questions can be found in the Assessment section at the end of the book:

1. What is a discrete Time Series?

a) A collection of observations made continuously over time.
b) A series where there can be an infinite number of observations in a given time range.
c) A collection of observations that are sampled regularly at specific times, typically equally spaced.
d) A series where observations follow a Bernoulli distribution.

2. Which of the following is not a typical goal pursued in Time Series Analysis?

a) Causal effect discovery and simulation.
b) Function approximation.
c) Anomaly detection and process control.
d) Forecasting.

3. Which is a basic requirement to develop a reliable quantitative forecasting model?

a) Obtain an adequate number of historical observations.
b) Collect time-independent observations.
c) Collect a time series that shows a trend.
d) Obtain a time series without gaps and outliers.

4. Which of the following is not a group of methods typically used in quantitative Time Series Forecasting?

a) Classical univariate methods.
b) Machine learning techniques.
c) Explanatory models.
d) Direct clustering algorithms.

Chapter 2: Introduction to KNIME Analytics Platform

In this chapter, we will introduce KNIME Analytics Platform—your tool for codeless time series analysis. We will explain different features of the KNIME software and the basic concepts of visual programming. We will also guide you through installing KNIME Analytics Platform, building your first workflow, and configuring the time series integration. These topics are covered in the following sections:

Exploring the KNIME software
Introducing nodes and workflows
Building your first workflow
Configuring the time series integration

You will learn basic visual programming skills and install the necessary software for time series analysis in KNIME.

Exploring the KNIME software

In this first section, we will introduce you to the features of the KNIME software, which covers two products: the open source KNIME Analytics Platform and the commercial KNIME Server. Together, these two products enable all operations in a data science application, from data access to modeling and from deployment to model monitoring.

We will first introduce you to KNIME Analytics Platform.

Introducing KNIME Analytics Platform for creating data science applications

KNIME Analytics Platform is an open source tool for creating data science applications. It is based on visual programming, making it fast to learn, accessible, and transparent. If needed, you can also integrate other tools—including scripts—into your visual workflows.

In visual programming, each individual task is indicated by a colored block, which in KNIME Analytics Platform is called a node. A node has an intuitive name describing its task and a graphical user interface (GUI). Individual nodes are connected into a pipeline of subsequent tasks, which we call a workflow. In a workflow, the connection lines from node to node emulate the flow of data. A node is the counterpart of a line of code in scripting languages, while a workflow is the counterpart of an entire script. We will tell you more about nodes and workflows in the Introducing nodes and workflows section.

The following diagram shows an example of a KNIME visual workflow:

Figure 2.1 – An example of a visual workflow

The workflow reads data in different file formats and from a database, blends them, and displays them in an interactive browser-based table. As you can already see from this simple workflow, KNIME Analytics Platform is open to different file formats and sources, platforms, and external tools. You can access data from any source—database software, cloud environments, big data platforms, and different file types. You can also use scripting languages such as Python, access open source machine learning (ML) libraries such as H2O, and connect to reporting tools such as Tableau via KNIME integrations.
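For readers coming from scripting languages, a rough Python analog of this kind of workflow might look as follows; the file names are hypothetical, and each line plays the role of a single node:

```python
# A rough script analog of the workflow above: each line corresponds to
# the task of one node (the file names are hypothetical)
import pandas as pd

csv_data = pd.read_csv("sales.csv")          # CSV Reader node
excel_data = pd.read_excel("sales.xlsx")     # Excel Reader node

blended = pd.concat([csv_data, excel_data])  # Concatenate node
print(blended.head())                        # interactive Table View node
```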

In the following subsection, we will show how to install KNIME Analytics Platform.

Installing KNIME Analytics Platform

KNIME Analytics Platform is open source, and you can install it right away by following these steps:

1. Go to https://www.knime.com/downloads and fill in your name, email address, and—if you want—some additional information. This step (signing up) is voluntary but recommended to get started quickly with just a few introductory emails and to keep up to date about new resources. After that, click Next.
2. Select KNIME Analytics Platform for Windows, Linux, or macOS according to your operating system.
3. Optionally, click Next, and learn more about KNIME via the beginners' guide, videos, and other learning material.
4. After downloading the installation package, start it, and follow the instructions on your screen.
5. Finally, start the application from the desktop link/application/a link in the Start