Written by Databricks Senior Solutions Architect Yoni Ramaswami, whose expertise in Data and AI has shaped innovative digital transformations across industries, this comprehensive guide bridges foundational concepts of time series analysis with the Spark framework and Databricks, preparing you to tackle real-world challenges with confidence.
From preparing and processing large-scale time series datasets to building reliable models, this book offers practical techniques that scale effortlessly for big data environments. You’ll explore advanced topics such as scaling your analyses, deploying time series models into production, applying generative AI, and leveraging Spark's latest features for cutting-edge applications across industries. Packed with hands-on examples and industry-relevant use cases, this guide is perfect for data engineers, ML engineers, data scientists, and analysts looking to enhance their expertise in handling large-scale time series data.
By the end of this book, you’ll have mastered the skills to design and deploy robust, scalable time series models tailored to your unique project needs—qualifying you to excel in the rapidly evolving world of big data analytics.
Time Series Analysis with Spark
A practical guide to processing, modeling, and forecasting time series with Apache Spark
Yoni Ramaswami
Copyright © 2025 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Portfolio Director: Sunith Shetty
Relationship Lead: Vaideeshwari Muralikrishnan
Project Manager: Hemangi Lotlikar
Content Engineer: Shrishti Pandey
Technical Editor: Gaurav Gavas
Copy Editor: Safis Editing
Proofreader: Shrishti Pandey
Indexer: Pratik Shirodkar
Production Designer: Gokul Raj S.T
Growth Lead: Bhavesh Amin
DevRel Marketing Coordinator: Ankur Mulasi
First published: March 2025
Production reference: 1070325
Published by Packt Publishing Ltd.
Grosvenor House
11 St Paul’s Square
Birmingham
B3 1RB, UK
ISBN 978-1-80323-225-6
www.packtpub.com
This book has been a long journey in the making, and I dedicate it to those who have shaped and uplifted me along the way. To my loving wife, Mokshada, whose unwavering support and partnership have been my greatest strength throughout this endeavor and our shared journey. To my son, Aryan (aka Black Panther, aka ElectricMax), for inspiring me daily with his dedication to improve his skills in tennis and violin. He motivates me to be the best role model I can be. In memory of my late father, Vijay, and to my mother, Satya, for their sacrifices and love, and for bestowing on me the passion for learning. They shared with me invaluable lessons of perseverance and determination. Their unwavering belief in me has been a guiding light, pushing me toward greater heights, no matter the challenges.
– Yoni Ramaswami
In the rapidly evolving landscape of data science and analytics, time series analysis stands as a critical yet often underutilized tool. Traditionally, this field has been constrained by proprietary technologies and specialist knowledge, limiting its application across diverse business domains. These constraints have resulted in static models ill-equipped to handle the dynamic, real-time challenges faced by modern enterprises.
Yoni Ramaswami’s Time Series Analysis with Spark arrives at a pivotal moment, offering a fresh perspective on this vital discipline. By leveraging Apache Spark, Yoni democratizes time series analysis, transforming it from a niche skill into an accessible, scalable, and flexible approach applicable to a wide range of industries.
The impact of this democratization cannot be overstated. By blending time series analysis with other data types and enabling real-time applications, this book opens doors to more representative solutions for real-world challenges. Dynamic and complex domains such as supply chain management, logistics, and financial markets stand to benefit immensely from this approach, which breaks free from the limitations of proprietary technology and embraces evolving methodologies and AI-enabled techniques.
Yoni’s talent shines through in his first-principles approach to the subject. He has crafted this book not just as a technical guide, but as a catalyst for evolving a field that is increasingly necessary in our data-driven world. Readers will find themselves equipped with the tools and understanding to apply time series analysis in ways that were previously out of reach, fostering innovation and driving efficiency across their organizations.
As we navigate an era where data flows ceaselessly and business landscapes shift at unprecedented speeds, the ability to extract meaningful insights from temporal data is more crucial than ever. Time Series Analysis with Spark is not just a book; it’s a key that unlocks the potential of time series data for businesses of all sizes and sectors.
Yoni Ramaswami has delivered a book that is both timely and essential. It empowers data scientists, analysts, and business leaders to tackle real-world problems with cutting-edge tools and methodologies. In doing so, he has laid the groundwork for a new era of accessible, impactful time series analysis – one that promises to reshape how we understand and interact with the temporal dimensions of our data-rich world.
Dael Williamson
Field Chief Technology Officer, Databricks.
For centuries, humanity has been fascinated by the idea that even the simplest sequences - whether of numbers, events, or patterns - may hold the key to understanding a vast array of possibilities. This search for hidden meaning has fueled countless discoveries, from ancient astronomers charting the stars to the mysterious practice of reading the future in the flight patterns of birds. In fact, some of the earliest "data analysts" might have been soothsayers peering into the future through chicken bones or sheep entrails, convinced that the universe had secret messages waiting to be deciphered!
Today, in our digital age, we face an explosion of data unlike anything in history. Every moment, transaction, and action can be captured as a sequence of numbers. No longer do we need to rely on animal bones for our predictions - now, we look to massive datasets, believing that hidden patterns and trends might reveal the next big breakthrough. Across all scales, from the smallest dataset to the most complex, we are still driven by the same fundamental quest: to uncover the meaning that lies beneath the surface and gain insights into the world around us.
Since the early 2000s, companies have progressively realized that their data, when properly harnessed and structured (easier said than done… ask any Chief Data Officer!), represents a huge potential. Among all forms of data, those tied to time - sequences of events, measurements, or activities - are perhaps the most valuable. They tell stories, reveal trends, and, most interestingly, enable us to predict the future. Since the 1990s, factories have been equipped with sensors, monitoring systems, and databases, retailers have tried to standardize product identifiers across countries to consolidate their sales data, and more.
Imagine being able to predict when critical equipment is about to fail, saving a company from costly downtimes that would otherwise disrupt production. Or picture the ability to fine-tune an industrial process so that every batch of product meets the same high standard, day after day. Even in the most complex industrial systems, identifying anomalies before they spiral out of control can mean the difference between a minor issue and a major breakdown.
In the world of business, time series analysis can optimize inventory management, adjusting stock levels based on real-time consumption patterns, ensuring businesses are never overstocked or caught off guard by sudden demand. Similarly, the ability to detect fraudulent activity based on behavioral trends allows companies to respond quickly, minimizing financial loss and damage to their reputation.
On the healthcare front, time series data can play a crucial role in detecting early signs of heart issues from patient data, potentially saving lives before a condition becomes critical. And when it comes to market trends, analyzing historical data allows businesses to optimize strategic decisions, ensuring they stay one step ahead of their competition.
Yet, this wealth of data remains underutilized… and this is where Apache Spark comes in.
This tool is designed to process, analyze, and extract insights from massive data streams at speeds and scales once unimaginable. This book, Time Series Analysis with Spark, is an invitation to dive into this fascinating world. It will show you how to manipulate and analyze time series data to solve real-world problems in sectors as diverse as energy, finance, healthcare, and logistics.
This book is intended for data scientists, engineers, and technical decision-makers – those who understand that data is not only a source of power but also a responsibility. With clear explanations and concrete examples, this guide will equip you with the tools to transform raw data into actionable insights. You’ll learn to unlock the full potential of time series data by harnessing the power of Apache Spark to fuel your ambitions.
In an era flooded with data, the companies that will succeed tomorrow won’t simply be those that gather vast amounts of information. Rather, they will be the ones who understand it, share it, and use it to create lasting value.
But with this power comes responsibility. The goal is not merely to predict, but to foresee with awareness, acknowledging that every model - no matter how sophisticated - is just an imperfect projection of what we, as humans, choose to do with our future. Our ability to read the patterns and trends that emerge from time series data offers immense potential, but it also demands a careful approach. After all, in the end, the choices we make today shape the world we’ll live in tomorrow.
This book, like much of Yoni's work, is guided by a deep sense of humanity. Over the 20 years that I have known him, from those early days of shared curiosity to this moment of collective insight, Yoni has always been a guide, not just in knowledge, but in human connection. His commitment to sharing his knowledge reflects a belief that true progress comes not just from accumulating insights, but from passing them on to others, so that together we can build a better tomorrow. His passion for learning, teaching, and empowering others serves as a reminder that knowledge is not only power, but also responsibility.
So, as we stand on the brink of this data-driven era, the question isn’t just "What can you predict?" but also "What will you do with what you'll find?"
Jan Govaere
Chief Information Officer & IT leader.
Yoni Ramaswami is a Senior Solutions Architect at Databricks with two decades of experience in IT, data, and AI. Recognized for his contributions to projects spanning digitally innovative technologies across industries, Yoni combines thought leadership, architecture, and implementation expertise. Originally from Mauritius, Yoni earned his Diplôme d’Ingénieur from UTC in France and Chalmers in Sweden, grounding his global perspective in both technical rigor and cultural insight. When not devising practical, high-impact solutions, he can be found exploring the lush landscapes of Mauritius with his son.
I would like to extend my thanks to Shrishti and the team at Packt, to the Technical Reviewers (Guillaume, Lorin, Mohammad, Sonali) and Ryuta, to the foreword writers (Dael and Jan), to Gita and Greg, Seeram and Sadna, Ammam and Baam, Manoj, François and Erika, Devind and Savita, Abdul, Manish, Chris, Gérard, Rushdee, Danny and JP. Your sound advice and encouragements have made this book possible.
Guillaume Meister is an IT professional with over 25 years of experience in the tech industry, specializing in databases, big data, cloud architecture, and network infrastructures. Recognized for his leadership and problem-solving abilities, he has contributed to digital transformation and infrastructure migration projects for organizations such as Airbus, Amadeus, TSMC, and ANZ Bank. He holds a master’s degree in computer science and has certifications from AWS and Microsoft. Guillaume has also authored publications on open-source software and is passionate about leveraging technology to drive impactful solutions.
Lorin Dawson is a technology professional with expertise in cloud architecture, platform and data engineering. As a member of the Digital Native Business team within Databricks Field Engineering, Lorin designs and optimizes secure, high-performance data and AI systems for enterprise-level applications. Lorin contributes to the time series project Tempo in Databricks Labs, enhancing Apache Spark’s capabilities in advanced data analytics. When not working, Lorin enjoys mountaineering and exploring culinary arts. He resides in Denver, Colorado, with his wife.
Mohammad Shahedi is a Specialist Solutions Architect at Databricks, supporting data engineering and data warehousing use cases. He holds a Master’s in Economics and Quantitative Finance from the University of Milan, where his thesis explored clustering financial time series. His Bachelor’s in Civil Engineering provided a strong mathematical foundation, invaluable to his quantitative finance work.
Sonali Guleria is a recognized thought leader with over 12 years of professional experience in data, machine learning, and artificial intelligence. She helps organizations effectively scale their cloud data strategies with a strong focus on innovation. Sonali obtained her undergraduate degree in computer science from Amity University in India and later earned her master’s degree in data and machine learning from Carnegie Mellon University in Pittsburgh. Currently, she serves as a Lead Solutions Architect at Databricks, specializing in financial services, machine learning, and artificial intelligence.
Join our community’s Discord space for discussions with the authors and other readers:
https://packt.link/ds
Time series are everywhere, at every time, ever-growing. With the right tools that can be scaled up, you can unleash their temporal insights with ease, giving you the edge over time.
Time series analysis—extracting insights from time series—is crucial for businesses and organizations to make informed decisions. This is achieved by analyzing patterns, trends, and anomalies in data collected at intervals of time. Apache Spark is a powerful big data processing framework that enables the efficient processing of large-scale time series data, making it an ideal tool for handling the volume and complexity of such data.
There are three main pillars for any time series analysis engagement with Apache Spark:
Data Preparation and Processing: This involves collecting, cleaning, and transforming time series data into a format suitable for analysis.
Modeling and Forecasting: This includes applying statistical models or machine learning algorithms to uncover patterns and predict future trends.
Deployment and Maintenance: This involves integrating the models into operational systems and continuously monitoring and updating them to ensure accuracy and relevance.

This book, Time Series Analysis with Spark, aims to cover all these pillars. It will provide practical techniques for processing, modeling, and forecasting time series data using Apache Spark. The book is based on two main sources of information:
Practical Experience: Drawing from real-world projects and experiences in handling large-scale time series data with Apache Spark.
Industry Insights: Incorporating insights from experts and practitioners in the field of time series analysis and big data processing.

As the use of Apache Spark for time series analysis continues to grow, the demand for professionals skilled in this area is increasing rapidly. This book will guide you through the best practices and techniques necessary to leverage Apache Spark effectively for time series analysis, helping you to stay ahead in this rapidly evolving field.
Professionals in data and AI, especially with time-dependent datasets, will find Time Series Analysis with Spark beneficial for enhancing their skills in leveraging Apache Spark and Databricks for time series analysis. The book caters to a broad audience, from those new to time series analysis and Apache Spark to experienced practitioners seeking to leverage Spark for temporal data analysis.
More specifically, data engineers will enhance their abilities in utilizing Spark and Databricks for the large-scale preparation of time series data. Machine learning (ML) engineers will find it easier to expand the scope of their ML projects. Data scientists and analysts will acquire fresh time series analysis skills to broaden their range of tools.
Chapter 1, What Are Time Series?, introduces the concept of time series data and the unique challenges in its analysis. This foundation is required to effectively analyze and forecast time-dependent data.
Chapter 2, Why Time Series Analysis?, elaborates on the importance of analyzing time-dependent data in enabling predictive modeling, trend identification, and anomaly detection. This is illustrated with real-world applications across industries.
Chapter 3, Introduction to Apache Spark, dives into Apache Spark and its distributed computing capabilities for processing large-scale time series data.
Chapter 4, End-to-End View of a Time Series Analysis Project, guides us through the entire process of a time series analysis project. Starting with use cases, it covers key stages such as data processing, feature engineering, model selection, and evaluation.
Chapter 5, Data Preparation, delves into the critical steps of organizing, cleaning, and transforming time series data. It covers techniques for handling missing values, dealing with outliers, and structuring data, enhancing the reliability of subsequent analytical processes.
Chapter 6, Exploratory Data Analysis, goes through exploratory data analysis to uncover patterns and insights in time series data. These steps are crucial for identifying characteristics such as trends and seasonality, informing subsequent modeling decisions.
Chapter 7, Building and Testing Models, focuses on constructing predictive models for time series data, covering the diverse types of models, which one to choose, and how to train, tune, and evaluate models.
Chapter 8, Going at Scale, addresses the considerations for scaling time series analysis in large and distributed computing environments. It covers the different ways that Apache Spark can be used to scale feature engineering, hyperparameter tuning, and single- and multi-model training.
Chapter 9, Going to Production, explores the practical considerations and steps involved in deploying time series models into production, while ensuring the reliability and effectiveness of time series models in operational environments.
Chapter 10, Going Further with Apache Spark, provides answers to the challenges of setting up and managing the platform by using Databricks as a cloud-based, managed, platform-as-a-service solution to go further with Apache Spark.
Chapter 11, Recent Developments in Time Series Analysis, explores recent developments in the field of time series analysis, including an approach from the exciting field of generative AI applied to time series forecasting, as well as new approaches to making the outcomes of time series analysis accessible to non-technical users.
This book requires you to have a basic understanding of the Python programming language along with a fundamental knowledge of data science and machine learning concepts.
Chapters 1, 2, 5, 6, and 7 use the Databricks Community Edition.
Chapters 3, 4, and 9 use local containerized environments. The examples in this book were tested with Docker on macOS. They should work with Docker or Podman on Windows or Linux with adaptation. You can skip the hands-on part of these chapters if you do not intend to build your own environment locally and prefer to use a managed platform such as Databricks.
Chapters 8, 10, and 11 use the Databricks platform.

Additional installation instructions and information for getting set up are documented in the individual chapters.
Software/hardware covered in the book | Operating system requirements
Databricks Community Edition | Windows, macOS, or Linux
Databricks on Amazon Web Services (AWS) or Microsoft Azure | Windows, macOS, or Linux
Docker v4.48 or Podman v1.16 | Windows, macOS, or Linux
Additional software packages required for the code examples are installed automatically at code execution. As software packages and user interfaces are subject to change, refer to the corresponding package or product documentation for information on changes.
If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book’s GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.
If there is an update to instructions, it will be added to the README.md of the individual chapters on the GitHub repository to the extent possible.
You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Time-Series-Analysis-with-Spark. If there’s an update to the code, it will be updated in the GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
There are a number of text conventions used throughout this book.
Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: “Mount the downloaded WebStorm-10*.dmg disk image file as another disk in your system.”
A block of code is set as follows:
#### Summary Statistics
# Code in cell 10
df.summary().display()

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
sns.boxplot(x='dayOfWeek', y='Global_active_power', data=pdf)

Any command-line input or output is written as follows:
Test SMAPE: 41.193985580947896
Test WAPE: 0.35355667972102317

Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: “Other sections of the report cover Alerts, shown in Figure 6.8, with outcomes of tests run on the dataset, including time-series-specific ones, and a Reproduction section with details on the profiling run.”
Tips or important notes
Appear like this.
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Once you’ve read Time Series Analysis with Spark, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.
Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.
Thanks for purchasing this book!
Do you like to read on the go but are unable to carry your print books everywhere?
Is your eBook purchase not compatible with the device of your choice?
Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.
Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.
The perks don’t stop there; you can get exclusive access to discounts, newsletters, and great free content in your inbox daily.
Follow these simple steps to get the benefits:
Scan the QR code or visit the link below:
https://packt.link/free-ebook/978-1-80323-225-6
Submit your proof of purchase.
That’s it! We’ll send your free PDF and other benefits to your email directly.

In this part, you will be introduced to time series analysis and Apache Spark. Starting with the foundational concepts of time series data, we will dive into the practical significance of time series analysis and use cases across industries with some hands-on examples. You will then be introduced to Apache Spark to understand how it is used, its architecture, and how it works, and conclude by installing it in your own environment.
This part has the following chapters:
Chapter 1, What Are Time Series?
Chapter 2, Why Time Series Analysis?
Chapter 3, Introduction to Apache Spark

“Time is the wisest counselor of all.” – Pericles
History is fascinating. It offers a profound narrative of our origins, the journey we are on, and the destination we strive toward. History equips us with learnings from the past to better face the future.
Let’s take, for example, the impact of meteorological data on history. Disruptions in weather patterns, starting in the Middle Ages and worsened by the Laki volcanic eruption in 1783, caused widespread hardship in France. This climatic upheaval contributed to the social unrest that ultimately led to the French Revolution in 1789. (Find out more about this in the Further reading section.)
Time series embody this narrative with numbers echoing our past. They are history quantified, a numerical narrative of our collective past, with lessons for the future.
This book takes you on a comprehensive journey with time series, starting with foundational concepts, guiding you through practical data preparation and model building techniques, and culminating in advanced topics such as scaling, and deploying to production, while staying abreast of recent developments for cutting-edge applications across industries. By the end of this book, you will be equipped to build robust time series models, in combination with Apache Spark, to meet the requirements of the use cases in your industry.
As a start on this journey, this chapter introduces the fundamental concepts of time series data, exploring its sequential nature and the unique challenges it poses. The content covers key components such as trend and seasonality, providing a foundation to embark on time series analysis at scale using the Spark framework. This knowledge is crucial for data scientists and analysts as it forms the basis for leveraging Spark’s distributed computing capabilities in effectively analyzing and forecasting time-dependent data and making informed decisions in various domains such as finance, healthcare, and marketing.
We will cover the following topics in this chapter:
Introduction to time series
Breaking time series into their components
Additional considerations with time series analysis

In the first part of the book, which sets the foundations, you can follow along without participating in hands-on examples (although it’s recommended). The latter part of the book will be more practice-driven. If you want to get hands-on from the beginning, the code for this chapter can be found in the GitHub repository of this book at:
https://github.com/PacktPublishing/Time-Series-Analysis-with-Spark/tree/main/ch1
Note
Refer to this GitHub repository for the latest revisions of the code; any updates made post-publication will be noted in code comments. The updated code (if any) might differ from what is presented in the book's code sections.
The following hands-on sections will give you further details to get started with time series analysis.
In this section, we will develop an understanding of what time series are and some related terms. This will be illustrated by hands-on examples to visualize time series. We will look at different types of time series and what characterizes them. This knowledge of the nature of time series is necessary for us to choose the appropriate time series analysis approach in the upcoming chapters.
Let’s start with an example of a time series with the average temperature in Mauritius every year since 1950. A short sample of the data is shown in Table 1.1.
Year | Average temperature (°C)
1950 | 22.66
1951 | 22.35
1952 | 22.50
1953 | 22.71
1954 | 22.61
1955 | 22.40
1956 | 22.22
1957 | 22.53
1958 | 22.71
1959 | 22.49
Table 1.1: Sample time series data – average temperature
While visualizing and explaining this example, we will be introduced to some terms related to time series. The code to visualize this dataset is covered in the hands-on section of this chapter.
In the following figure, we see the change in temperature over the years since 1950. If we focus on the period after 1980, we can observe the variations more closely, with temperatures similarly increasing over the years (the trend, shown with a dashed line in both figures) up to the current temperature.
Figure 1.1: Average temperature in Mauritius since 1950
If the temperature continues to increase in the same way, we are heading to a warmer future, a manifestation of what is now widely accepted as global warming. At the same time as the temperature has been increasing over the years, it also goes up every summer and down during the winter months (seasonality). We will visualize this and other components of temperature time series in the hands-on section of this chapter.
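To make these components concrete before the hands-on section, here is a minimal sketch, illustrative only and using synthetic monthly data rather than the chapter's dataset, that separates trend and seasonality with statsmodels:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic monthly temperatures: a slow warming trend plus a yearly cycle.
idx = pd.date_range("1980-01", periods=240, freq="MS")
temps = pd.Series(
    22                                               # base level
    + 0.005 * np.arange(240)                         # warming trend
    + 2 * np.sin(2 * np.pi * np.arange(240) / 12)    # seasonality (12-month cycle)
    + np.random.default_rng(0).normal(0, 0.3, 240),  # noise
    index=idx,
)

# Decompose the series into trend, seasonal, and residual components.
result = seasonal_decompose(temps, model="additive", period=12)
result.plot()
```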
With the temperatures getting warmer over the years (trend), global warming has an impact (causality) on our planet and its inhabitants. This impact can also be represented with time series – for example, sea level or rainfall measurements. The consequences of global warming can be dramatic and irreversible, which further highlights the importance of understanding this trend.
These time-over-time readings of temperature form what we call a time series. Analysis and understanding of such a time series is critical for our future.
So, what is a time series in more general terms? It is simply a chronological series of measurements, each paired with the specific time at which it was generated by a source system. In the example of temperature, the source system is the thermometer at a specific geographical location.
Time series can also be represented in an aggregated form, such as the average temperature every year, as shown in Table 1.1.
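As an illustration of such aggregation, here is a minimal PySpark sketch; the raw readings and column names are hypothetical stand-ins, not the chapter's dataset:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical raw readings: individual temperature measurements with timestamps.
raw = spark.createDataFrame(
    [("1950-03-14 10:00:00", 23.1),
     ("1950-07-02 10:00:00", 21.9),
     ("1951-01-20 10:00:00", 23.4)],
    ["timestamp", "temperature"],
)

# Aggregate the readings into one average per year, as in Table 1.1.
yearly = (
    raw.withColumn("year", F.year(F.to_timestamp("timestamp")))
       .groupBy("year")
       .agg(F.avg("temperature").alias("avg_temperature"))
       .orderBy("year")
)
yearly.show()
```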
From this definition, illustrated with an example, let’s now probe further into the nature of time series. We will also cover in further detail in the rest of this book the terms introduced here, such as trend, seasonality, and causality.
At the beginning of the chapter, we mentioned chronological order while defining time series. This is because order is a major factor that differentiates the approach to working with time series data compared to other datasets. One of the main reasons why order matters is potential auto-correlation within a time series, where the measurement at time t is related to the measurement n time steps earlier (the lag). Ignoring this order will make our analysis incomplete and even incorrect. We will look at the method to identify auto-correlation later, in Chapter 6 on exploratory data analysis.
It is worth noting that, in many cases with time series, auto-correlation means that measurements closer together in time tend to be closer in value than measurements further apart in time.
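A quick way to see this effect is to compute auto-correlation coefficients at increasing lags. The following is a minimal sketch on a synthetic series (not the book's code) using statsmodels:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import acf

# Synthetic series where each value depends on the previous one.
rng = np.random.default_rng(42)
values = [20.0]
for _ in range(199):
    values.append(0.8 * values[-1] + 4.0 + rng.normal(0, 0.5))
series = pd.Series(values)

# Auto-correlation at lags 0 to 10: nearby values are more alike than distant ones.
for lag, coeff in enumerate(acf(series, nlags=10)):
    print(f"lag {lag}: {coeff:+.3f}")
```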
Another reason to respect chronological order is to avoid data leakage during model training. In some of the analysis and forecasting methods, we will be training models on past data to predict values at a future target date. We need to ensure that all data points used for training are prior to the target date. Data leakage during training, often tricky to spot with time series data, will invalidate the integrity of the approach and create models that perform misleadingly well during development and then poorly when faced with new, unseen data.
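A simple guard against this kind of leakage is a strictly chronological train/test split. Here is a minimal sketch, assuming a time-indexed pandas DataFrame; the data is a hypothetical placeholder:

```python
import pandas as pd

# Hypothetical daily series indexed by date.
df = pd.DataFrame(
    {"value": range(100)},
    index=pd.date_range("2024-01-01", periods=100, freq="D"),
)

# Sort by time, then split chronologically: training data strictly precedes test data.
df = df.sort_index()
split_point = int(len(df) * 0.8)
train = df.iloc[:split_point]   # earliest 80% of observations
test = df.iloc[split_point:]    # most recent 20%, never seen during training
print(train.index.max(), "<", test.index.min())
```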
Terms introduced here, such as auto-correlation, lags, and data leakage, will be further explained in the rest of the book.
Chronological order, discussed here, is one defining characteristic of time series. In the next section, we will highlight regularity or the lack of it, which is another characteristic.
Time series can be regular or irregular with regard to the interval of their measurements.
Regular time series have values expected at regular intervals in time, say every minute, hour, month, and so on. This is usually the case for source systems generating a continuous value, which is then measured at a regular interval. This regularity is expected, but not guaranteed: these time series can have gaps or zero values, due to missing data points or the measurement itself being zero. In such cases, they are still considered regular in nature.
Irregular time series are those whose measurements are not generated at regular intervals at the source. This is usually the case for events occurring at irregular points in time, with some value measured for each event. These irregularly spaced values can be resampled to a regular interval at a lower frequency, effectively turning them into a regular time series. For example, an event that does not occur every minute may reliably occur every hour, so the series can be considered regular in nature at the hourly rate.
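As an illustration, resampling with pandas could look like the following minimal sketch; the event log and interval are hypothetical:

```python
import pandas as pd

# Hypothetical event log with irregularly spaced timestamps.
events = pd.DataFrame(
    {"value": [1.0, 2.5, 0.7, 3.1]},
    index=pd.to_datetime([
        "2025-01-01 00:03", "2025-01-01 00:47",
        "2025-01-01 01:15", "2025-01-01 03:02",
    ]),
)

# Resample to a regular hourly series: hours without events become NaN for the
# mean (or 0 for the count), giving a regular time series at a lower frequency.
hourly_mean = events["value"].resample("1h").mean()
hourly_count = events["value"].resample("1h").count()
print(hourly_count)
```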
This book will primarily focus on regular time series. After the regularity of time series, another characteristic we will consider in the next section is stationarity.
Considering the statistical properties of time series over time, they can be further categorized as stationary or non-stationary.
Stationary time series are those for which statistical properties such as mean and variance do not vary over time.
Non-stationary time series have changing statistical properties. These time series can be converted to stationary ones by a combination of methods: for example, one or more orders of differencing to stabilize the mean, and a log transform to stabilize the variance. This distinction is important as it determines which analysis methods can be used. For instance, if an analysis method assumes a stationary series, the above conversion can be applied to non-stationary data first. You will learn about the method to identify stationarity in Chapter 6 on exploratory data analysis.
Note
Converting a non-stationary time series to a stationary one removes the trend and seasonal components, which may not be what we want if we want to analyze these components.
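Putting this together, here is a minimal sketch on synthetic data; the log-plus-differencing combination and the Augmented Dickey-Fuller test shown are common choices for this conversion and check, not the book's prescribed method:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

# Synthetic non-stationary series: upward trend with growing variance.
rng = np.random.default_rng(7)
idx = pd.date_range("2000-01", periods=120, freq="MS")
series = pd.Series(
    np.exp(0.01 * np.arange(120)) * (10 + rng.normal(0, 1, 120)),
    index=idx,
)

# Log transform to stabilize the variance, then first-order differencing
# to stabilize the mean.
stationary = np.log(series).diff().dropna()

# Augmented Dickey-Fuller test: a small p-value suggests the result is stationary.
pvalue = adfuller(stationary)[1]
print(f"ADF p-value: {pvalue:.4f}")
```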
This section was important for understanding the underlying nature of time series, which is a prerequisite to identifying the right analysis method to use in the later part of this book. Figure 1.2 summarizes the types of time series and the conversion operations that can be used.
Figure 1.2: Types of time series
This concludes the theoretical part of this chapter. In the next section, we will have our first hands-on experience, setting up the coding environment along the way. We will start with visualizing and decomposing time series in this chapter. We will get into different types of time series analysis and when they are used in the next chapter.
Let’s go through the hands-on exercise to load a time series dataset and visualize it. We will try to create the visual representation we’ve already seen in Figure 1.1.
In order to run the code, you will need a Python development environment where you can install Apache Spark and other required libraries. Specific libraries will be detailed, together with installation instructions, in the corresponding chapters when required.
An easy way to get going with these requirements is by using Databricks Community Edition, which is free. This comes with a notebook-based development interface, as well as compute with pre-installed Spark and some other libraries.
The instructions to sign up for Databricks Community Edition can be found here:
https://docs.databricks.com/en/getting-started/community-edition.html
Community Edition’s compute size is limited as it is a free cloud-based PaaS. You can also sign up for a 14-day free trial of Databricks, which, depending on the signup option you choose, may require you to first have an account with a cloud provider. Some cloud providers may have promotions with some free credits at the start. This will give you access to more resources than on Community Edition, for a limited time.
Sign up for the free trial to Databricks at the following URL: https://www.databricks.com/try-databricks
The folks at Databricks are the original creators of Apache Spark, so you will be in a good place there.
The examples in the early chapters will use Community Edition and the open source version of Apache Spark. We will use the full Databricks platform in Chapter 8 and Chapter 10.
Alternatively, you can build your own environment, setting up the full stack, for instance, in a Docker container. This will be covered in Chapter 3, Introduction to Apache Spark.
The code for this section is in the following notebook file titled ts-spark_ch1_1.dbc in the ch1 folder of this book’s GitHub repository, as per the Technical requirements section.
The location URL is as follows: https://github.com/PacktPublishing/Time-Series-Analysis-with-Spark/raw/main/ch1/ts-spark_ch1_1.dbc
Once the development and runtime environment is chosen, the other consideration is the dataset. The one we will be using is the observed annual average mean surface air temperature of Mauritius, available on the Climate Change Knowledge Portal at https://climateknowledgeportal.worldbank.org/country/mauritius.
A copy of the dataset (in the file titled ts-spark_ch1_ds1.csv) is available in the ch1 GitHub folder. It can be downloaded using the code mentioned earlier.
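For instance, a minimal sketch to load this copy directly with pandas; the raw-file URL is assembled from the folder and filename above, and the dataset's exact column names may differ:

```python
import pandas as pd

# Load the chapter's dataset copy straight from the book's GitHub repository.
url = (
    "https://github.com/PacktPublishing/Time-Series-Analysis-with-Spark/"
    "raw/main/ch1/ts-spark_ch1_ds1.csv"
)
df = pd.read_csv(url)
print(df.head())
```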
Next, you will be working on the Databricks Community Edition workspace, which will be your own self-contained environment.
Now that we have everything set up, let’s get our hands on the first coding exercise. First, log in to Databricks Community Edition to import the code, create a cluster, and finally run