We live in a serendipitous era where the explosion in the quantity of data collected and a renewed interest in data-driven techniques such as machine learning (ML) have changed the landscape of analytics, and with it, time series forecasting. This book, filled with industry-tested tips and tricks, takes you beyond commonly used classical statistical methods such as ARIMA and introduces you to the latest techniques from the world of ML.
This is a comprehensive guide to analyzing, visualizing, and creating state-of-the-art forecasting systems, complete with common topics such as ML and deep learning (DL) as well as rarely touched-upon topics such as global forecasting models, cross-validation strategies, and forecast metrics. You’ll begin by exploring the basics of data handling, data visualization, and classical statistical methods before moving on to ML and DL models for time series forecasting. This book takes you on a hands-on journey in which you’ll develop state-of-the-art ML (from linear regression to gradient-boosted trees) and DL (feed-forward neural networks, LSTMs, and transformers) models on a real-world dataset, while also exploring practical topics such as interpretability.
By the end of this book, you’ll be able to build world-class time series forecasting systems and tackle problems in the real world.
Explore industry-ready time series forecasting using modern machine learning and deep learning
Manu Joseph
BIRMINGHAM—MUMBAI
Copyright © 2022 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Publishing Product Manager: Dhruv Kataria
Senior Editors: Roshan Ravikumar, Tazeen Shaikh
Content Development Editor: Shreya Moharir
Technical Editor: Devanshi Ayare
Copy Editor: Safis Editing
Project Coordinator: Farheen Fathima
Proofreader: Safis Editing
Indexer: Subalakshmi Govindhan
Production Designer: Alishon Mendonca
Marketing Coordinator: Shifa Ansari
First published: November 2022
Production reference: 1181122
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-80324-680-2
www.packt.com
For my son, Zane,
For his boundless curiosity,
For his endless questions,
And for his innocent love of learning.
(All great qualities for adults who read this book as well.)
Manu Joseph is a self-made data scientist with more than a decade of experience working with many Fortune 500 companies, enabling digital and AI transformations, specifically in machine learning-based demand forecasting. He is considered an expert, thought leader, and strong voice in the world of time series forecasting. Currently, Manu leads applied research at Thoucentric, where he advances research by bringing cutting-edge AI technologies to the industry. He is also an active open source contributor and has developed an open source library—PyTorch Tabular—which makes deep learning for tabular data easy and accessible. Originally from Thiruvananthapuram, India, Manu currently resides in Bengaluru, India, with his wife and son.
Dr. Julien Siebert is currently working as a researcher at the Fraunhofer Institute for Experimental Software Engineering (IESE), in Kaiserslautern, Germany. He studied engineering sciences and AI and obtained a PhD in computer science on the topic of modeling and simulation of complex systems. After several years of research both in computer science and theoretical physics, Dr. Julien Siebert worked as a data scientist for an e-commerce fashion company. Since 2018, he has been working at the intersection between software engineering and data science.
Gerzson David Boros is the owner and CEO of Data Science Europe and a senior data scientist who has been involved in data science for more than 10 years. He has an MSc and is a candidate for an MBA. In the last 5 years, he and his team have made business proposals for 100 different executives and worked on more than 30 different projects on the topic of data science and artificial intelligence. His motto is “Social responsibility is also achievable with the help of data.”
We dip our toes into time series forecasting by understanding what a time series is, how to process and manipulate time series data, and how to analyze and visualize time series data. This part also covers classical time series forecasting methods, such as ARIMA, to serve as strong baselines.
This part comprises the following chapters:
Chapter 1, Introducing Time Series
Chapter 2, Acquiring and Processing Time Series Data
Chapter 3, Analyzing and Visualizing Time Series Data
Chapter 4, Setting a Strong Baseline Forecast

Acquiring and Processing Time Series Data
In the previous chapter, we learned what a time series is and established a few standard notations and terminologies. Now, let’s switch tracks from theory to practice. Although we have said that time series data is everywhere, we have yet to get our hands dirty with a real dataset. In this chapter, we are going to start working on the dataset we have chosen for this book, process it in the right way, and learn a few techniques for dealing with missing values.
In this chapter, we will cover the following topics:
Understanding the time series dataset
pandas datetime operations, indexing, and slicing – a refresher
Handling missing data
Mapping additional information
Saving and loading files to disk
Handling longer periods of missing data

You will need to set up the Anaconda environment by following the instructions in the Preface of this book to get a working environment with all the packages and datasets required for the code in this book.
The code for this chapter can be found at https://github.com/PacktPublishing/Modern-Time-Series-Forecasting-with-Python-/tree/main/notebooks/Chapter02.
Handling time series data is like handling other tabular datasets, but with a focus on the temporal dimension. As with any tabular dataset, pandas is perfectly equipped to handle time series data as well.
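As a quick, minimal sketch of what this looks like in practice (the column names and values below are made up for illustration and are not from the book’s dataset), pandas lets us index by time, slice by date, and resample with just a few calls:

```python
import pandas as pd

# A toy half-hourly series; dates and values are made up for illustration
df = pd.DataFrame({
    "timestamp": pd.date_range("2013-01-01", periods=96, freq="30min"),
    "energy_kwh": range(96),
})

# A DatetimeIndex unlocks date-based indexing and slicing
df = df.set_index("timestamp")

# Partial-string indexing: slice out a single day
jan_first = df.loc["2013-01-01"]

# Resample half-hourly readings into daily totals
daily = df["energy_kwh"].resample("D").sum()
print(daily)
```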
Let’s start getting our hands dirty and work through a dataset from the beginning. We are going to use the London Smart Meters dataset throughout this book. If you have not downloaded the data already as part of the environment setup, go to the Preface and do that now.
This is the key first step with any new dataset you come across, even before Exploratory Data Analysis (EDA), which we will cover in Chapter 3, Analyzing and Visualizing Time Series Data. Knowing where the data comes from, the data-generating process behind it, and the source domain is essential to building a good understanding of the dataset.
The London Data Store, a free and open data-sharing portal, provided this dataset, which was collected and enriched by Jean-Michel D and uploaded to Kaggle.
The dataset contains energy consumption readings for a sample of 5,567 London households that took part in the UK Power Networks-led Low Carbon London project between November 2011 and February 2014. Readings were taken at half-hourly intervals. Some metadata about the households is also available as part of the dataset. Let’s look at what metadata is available as part of the dataset:
CACI UK segmented the UK’s population into demographic types, called Acorn. For each household in the data, we have the corresponding Acorn classification. The Acorn classes (Lavish Lifestyles, City Sophisticates, Student Life, and so on) are grouped into parent classes (Affluent Achievers, Rising Prosperity, Financially Stretched, and so on). A full list of Acorn classes can be found in Table 2.1. The complete documentation detailing each class is available at https://acorn.caci.co.uk/downloads/Acorn-User-guide.pdf.
The dataset contains two groups of customers – one group that was subjected to dynamic time-of-use (dToU) energy prices throughout 2013, and another group that was on flat-rate tariffs. The tariff prices for the dToU group were given a day ahead via the Smart Meter IHD (in-home display) or via text message.
Jean-Michel D also enriched the dataset with weather and UK bank holidays data.

The following table shows the Acorn classes:
Table 2.1 – ACORN classification
Important note
The Kaggle dataset also includes versions of the time series that have been preprocessed to a daily level, with all the separate files combined. Here, we will ignore those files and start with the raw files, which can be found in the hhblock_dataset folder. Learning to work with raw files is an integral part of working with real-world datasets in the industry.
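To make this concrete, here is a minimal sketch of loading one of the raw files. The path and the column layout (one row per household ID, LCLid, per day, with 48 half-hourly readings in columns hh_0 to hh_47) are assumptions based on the Kaggle dataset, so adjust them to match your local copy:

```python
from pathlib import Path

import pandas as pd

# Assumed location; point this at wherever you extracted the dataset
data_dir = Path("data/london_smart_meters/hhblock_dataset")

# Assumed layout: one row per household (LCLid) per day,
# with 48 half-hourly readings in columns hh_0 ... hh_47
block = pd.read_csv(data_dir / "block_0.csv", parse_dates=["day"])

# Melt the wide layout into a long format with one reading per row
long_df = block.melt(
    id_vars=["LCLid", "day"],
    var_name="half_hour",
    value_name="energy_kwh",
)
print(long_df.head())
```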
Once we understand where the data is coming from, we can look at the data, understand the information present in the different files, and build a mental model of how the different files relate to each other. You may call it old school, but Microsoft Excel is an excellent tool for gaining this first-level understanding. If a file is too big to open in Excel, we can read it in Python, save a sample of the data to an Excel file, and open that instead. However, keep in mind that Excel sometimes messes with the format of the data, especially dates, so we need to take care not to save the file and write back the formatting changes Excel made. If you are allergic to Excel, you can do all of this in Python as well, albeit with a few more keystrokes (see the sketch after the data model discussion below). The purpose of this exercise is to see what the different data files contain, explore the relationships between them, and so on. We can make this more formal and explicit by drawing a data model, similar to the one shown in the following diagram:
Figure 2.1 – Data model of the London Smart Meters dataset
The data model is more for us to understand the data than for any data engineering purpose. Therefore, it only contains bare-minimum information, such as the fields in each file and the fields that link the different files together.
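For those who prefer the Python route mentioned earlier, a minimal sketch of sampling a large file for inspection might look like this (the file names are placeholders, and writing to Excel assumes the openpyxl package is installed):

```python
import pandas as pd

# The file name is a placeholder; use any large file from the dataset
sample = pd.read_csv("some_data_file.csv", nrows=5000)

# Inspect the schema, dtypes, and a few rows
sample.info()
print(sample.head())

# Optionally, write the sample out for eyeballing in Excel
# (requires openpyxl); avoid writing Excel's format changes back
sample.to_excel("sample_for_inspection.xlsx", index=False)
```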
