Build efficient forecasting models using traditional time series models and machine learning algorithms.
Key Features
Book Description
Time series analysis is the art of extracting meaningful insights from, and revealing patterns in, time series data using statistical and data visualization approaches. These insights and patterns can then be utilized to explore past events and forecast future values in the series.
This book explores the basics of time series analysis with R and lays the foundations you need to build forecasting models. You will learn how to preprocess raw time series data and clean and manipulate data with packages such as stats, lubridate, xts, and zoo. You will analyze data and extract meaningful information from it using both descriptive statistics and rich data visualization tools in R such as the TSstudio, plotly, and ggplot2 packages. The latter part of the book delves into traditional forecasting models such as time series linear regression, exponential smoothing (Holt, Holt-Winters, and more) and Auto-Regressive Integrated Moving Average (ARIMA) models with the stats and forecast packages. You'll also cover advanced time series regression models with machine learning algorithms such as Random Forest and Gradient Boosting Machine using the h2o package.
By the end of this book, you will have the skills needed to explore your data, identify patterns, and build a forecasting model using various traditional and machine learning methods.
What you will learn
Who this book is for
Hands-On Time Series Analysis with R is ideal for data analysts, data scientists, and all R developers who are looking to perform time series analysis to predict outcomes effectively. A basic knowledge of statistics is required; some knowledge of R is expected, but not mandatory.
Page count: 470
Year of publication: 2019
Copyright © 2019 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Commissioning Editor: Sunith Shetty
Acquisition Editor: Aditi Gour
Content Development Editor: Pratik Andrade
Technical Editor: Nilesh Sawakhande
Copy Editor: Safis Editing
Language Support Editor: Storm Mann
Project Coordinator: Namrata Swetta
Proofreader: Safis Editing
Indexer: Rekha Nair
Graphics: Jisha Chirayil
Production Coordinator: Aparna Bhagat
First published: May 2019
Production reference: 1310519
Published by Packt Publishing Ltd., Livery Place, 35 Livery Street, Birmingham B3 2PB, UK.
ISBN 978-1-78862-915-7
www.packtpub.com
Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
Get a free eBook or video every month
Mapt is fully searchable
Copy and paste, print, and bookmark content
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packt.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
Rami Krispin is a data scientist at a major Silicon Valley company, where he focuses on time series analysis and forecasting. In his free time, he also develops open source tools and is the author of several R packages, including the TSstudio package for time series analysis and forecasting applications. Rami holds an MA in applied economics and an MS in actuarial mathematics from the University of Michigan—Ann Arbor.
Fernando C. Barbi (@fcbarbi) is a product manager at Analyx Labs in Switzerland, developing data analysis and risk management tools for the financial industry. He runs the Private Equity Lab, where he researches and teaches investment modeling. He has authored some R packages and, as a Python and R enthusiast, is often an instructor at tech conferences and online courses. He holds a PhD in economics from the São Paulo School of Economics (EESP) FGV.
Fiqry Revadiansyah is a data scientist at Bukalapak, where he provides insights and analytical strategies to enhance product quality by utilizing machine learning and any statistical experiment. He graduated from Universitas Padjadjaran, Bandung, with a BS in statistics. He is a statistician working in data science as a statistics researcher and as an academic consultant. His primary interests are research related to time series analysis and regression modeling, artificial intelligence, immersive computing, and gamification. He uses several programming languages, including R, Python, and C#.
Dr. Naftali Cohen is a research scientist at AI Research, JP Morgan. He has over 10 years of R&D work experience in numerical modeling, predictive analytics, machine learning, and AI in both academic and industrial settings.
Before joining JP Morgan, Dr. Cohen worked as an academic researcher at Yale University and Columbia University.
He holds a Ph.D. in applied mathematics from the Courant Institute of Mathematical Sciences—New York University. His academic research focused on climate science and storm formation. Dr. Cohen is a MacCracken fellow and an elected member of the International Space Science Institute.
If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
Title Page
Copyright and Credits
Hands-On Time Series Analysis with R
Dedication
About Packt
Why subscribe?
Packt.com
Contributors
About the author
About the reviewers
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Conventions used
Get in touch
Reviews
Introduction to Time Series Analysis and R
Technical requirements
Time series data
Historical background of time series analysis
Time series analysis
Learning with real-life examples
Getting started with R
Installing R
A brief introduction to R
R operators
Assignment operators
Arithmetic operators
Logical operators
Relational operators
The R package
Installation and maintenance of a package
Loading a package in the R working environment
The key packages
Variables
Importing and loading data to R
Flat files
Web API
R datasets
Working and manipulating data
Querying the data
Help and additional resources
Summary
Working with Date and Time Objects
Technical requirements
The date and time formats
Date and time objects in R
Creating date and time objects
Importing date and time objects
Reformatting and converting date objects
Handling numeric date objects
Reformatting and conversion of time objects
Time zone setting
Creating a date or time index
Manipulation of date and time with the lubridate package
Reformatting date and time objects – the lubridate way
Utility functions for date and time objects
Summary
The Time Series Object
Technical requirement
The Natural Gas Consumption dataset
The attributes of the ts class
Multivariate time series objects
Creating a ts object
Creating an mts object
Setting the series frequency
Data manipulation of ts objects
The window function
Aggregating ts objects
Creating lags and leads for ts objects
Visualizing ts and mts objects
The plot.ts function
The dygraphs package
The TSstudio package
Summary
Working with zoo and xts Objects
Technical requirement
The zoo class
The zoo class attributes
The index of the zoo object
Working with date and time objects
Creating a zoo object
Working with multiple time series objects
The xts class
The xts class attributes
The xts functionality
The periodicity function
Manipulating the object index
Subsetting an xts object based on the index properties
Manipulating the zoo and xts objects
Merging time series objects
Rolling windows
Creating lags
Aggregating the zoo and xts objects
Plotting zoo and xts objects
The plot.zoo function
The plot.xts function
xts, zoo, or ts – which one to use?
Summary
Decomposition of Time Series Data
Technical requirement
The moving average function
The rolling window structure
The average method 
The MA attributes
The simple moving average
Two-sided MA
A simple MA versus a two-sided MA
The time series components
The cycle component
The trend component
The seasonal component
The seasonal component versus the cycle component
White noise
The irregular component
The additive versus the multiplicative model
Handling multiplicative series
The decomposition of time series
Classical seasonal decomposition
Seasonal adjustment
Summary
Seasonality Analysis
Technical requirement
Seasonality types
Seasonal analysis with descriptive statistics
Summary statistics tables
Seasonal analysis with density plots
Structural tools for seasonal analysis
Seasonal analysis with the forecast package
Seasonal analysis with the TSstudio package
Summary
Correlation Analysis
Technical requirement
Correlation between two variables
Lags analysis
The autocorrelation function
The partial autocorrelation function
Lag plots
Causality analysis
Causality versus correlation
The cross-correlation function
Summary
Forecasting Strategies
Technical requirement
The forecasting workflow
Training approaches
Training with single training and testing partitions
Forecasting with backtesting
Forecast evaluation
Residual analysis
Scoring the forecast
Forecast benchmark
Finalizing the forecast
Handling forecast uncertainty
Confidence interval
Simulation
Horse race approach
Summary
Forecasting with Linear Regression
Technical requirement
The linear regression
Coefficients estimation with the OLS method
The OLS assumptions
Forecasting with linear regression
Forecasting the trend and seasonal components
Features engineering of the series components
Modeling the series trend and seasonal components
The tslm function
Modeling single events and non-seasonal events
Forecasting a series with multiseasonality components – a case study
The UKgrid series
Preprocessing and feature engineering of the UKdaily series
Training and testing the forecasting model 
Model selection 
Residuals analysis
Finalizing the forecast 
Summary
Forecasting with Exponential Smoothing Models
Technical requirement
Forecasting with moving average models
The simple moving average
Weighted moving average
Forecasting with exponential smoothing
Simple exponential smoothing model
Forecasting with the ses function
Model optimization with grid search
Holt method
Forecasting with the holt function 
Holt-Winters model
Summary
Forecasting with ARIMA Models
Technical requirement
The stationary process
Transforming a non-stationary series into a stationary series
Differencing time series
Log transformation
The random walk process
The AR process
Identifying the AR process and its characteristics
The moving average process
Identifying the MA process and its characteristics
The ARMA model
Identifying an ARMA process 
Manual tuning of the ARMA model
Forecasting AR, MA, and ARMA models
The ARIMA model
Identifying an ARIMA process 
Identifying the model degree of differencing
The seasonal ARIMA model
Tuning the SARIMA model
Tuning the non-seasonal parameters
Tuning the seasonal parameters 
Forecasting US monthly natural gas consumption with the SARIMA model – a case study
The auto.arima function
Linear regression with ARIMA errors
Violation of white noise assumption
Modeling the residuals with the ARIMA model
Summary
Forecasting with Machine Learning Models
Technical requirement
Why and when should we use machine learning?
Why h2o?
Forecasting monthly vehicle sales in the US – a case study
Exploratory analysis of the USVSales series
The series structure
The series components
Seasonal analysis
Correlation analysis
Exploratory analysis – key findings
Feature engineering 
Training, testing, and model evaluation
Model benchmark
Starting a h2o cluster
Training an ML model 
Forecasting with the Random Forest model
Forecasting with the GBM model
Forecasting with the AutoML model
Selecting the final model
Summary
Other Books You May Enjoy
Leave a review - let other readers know what you think
Time series analysis is the art of extracting meaningful insights and revealing patterns from time series data using statistical and data visualization approaches. These insights and patterns can then be utilized to explore past events and forecast future values in the series.
This book goes through all the steps of the time series analysis process, from getting the raw data, to building a forecasting model using R. You will learn how to use tools from packages such as stats, lubridate, xts, and zoo to clean and reformat your raw data into structural time series data. As you make your way through Hands-On Time Series Analysis with R, you will analyze data and extract meaningful information from it using both descriptive statistics and rich data visualization tools in R, such as the TSstudio, plotly, and ggplot2 packages. The latter part of the book delves into traditional forecasting models such as time series regression models, exponential smoothing, and autoregressive integrated moving average (ARIMA) models using the forecast package. Last but not least, you will learn how to utilize machine learning models such as Random Forest and Gradient Boosting Machine to forecast time series data with the h2o package.
This book is ideal for the following groups of people:
Data scientists who wish to learn how to perform time series analysis and forecasting with R.
Data analysts who perform Excel-based time series analysis and forecasting and wish to take their forecasting skills to the next level.
Basic knowledge of statistics (for example, regression analysis and hypothesis testing) is required, and some knowledge of R is expected but is not mandatory (for those who have never used R, Chapter 1, Introduction to Time Series Analysis and R, provides a brief introduction).
Chapter 1, Introduction to Time Series Analysis and R, provides a brief introduction to the time series analysis process and defines the attributes and characteristics of time series data. In addition, the chapter provides a brief introduction to R for readers with no prior knowledge of R. This includes the mathematical and logical operators, loading data from multiple sources (such as flat files and APIs), installing packages, and so on.
Chapter 2, Working with Date and Time Objects, focuses on the main date and time classes in R—the Date and POSIXct/lt classes—and their attributes. This includes ways to reformat date and time objects with the base and lubridate packages.
Chapter 3, The Time Series Object, focuses on the ts class, an R core class for time series data. This chapter dives deep into the attributes of the ts class, methods for creating and manipulating ts objects using tools from the stats package, and data visualization applications with the TSstudio and dygraphs packages.
Chapter 4, Working with zoo and xts Objects, covers the applications of the zoo and xts classes, an advanced format for time series data. This chapter focuses on the attributes of the zoo and xts classes and the preprocessing and data visualization tools from the zoo and xts packages.
Chapter 5, Decomposition of Time Series Data, focuses on decomposing time series data into its structural patterns: the trend, seasonal, cycle, and random components. Starting with the moving average function, the chapter explains how to use the function for smoothing, and then focuses on decomposing a time series down to its components with the moving average.
Chapter 6, Seasonality Analysis, explains approaches and methods for exploring and revealing seasonal patterns in time series data. This includes the use of summary statistics, along with data visualization tools from the forecast, TSstudio, and ggplot2 packages.
Chapter 7, Correlation Analysis, focuses on methods and techniques for analyzing the relationship between time series data and its lags or other series. This chapter provides a general background for correlation analysis, and introduces statistical methods and data visualization tools for measuring the correlation between time series and its lags or between multiple time series.
Chapter 8, Forecasting Strategies, introduces approaches, strategies, and tools for building time series forecasting models. This chapter covers different training strategies, different error metrics, benchmarking, and evaluation methods for forecasting models.
Chapter 9, Forecasting with Linear Regression, dives into the forecasting applications of the linear regression model. This chapter explains how to model the different components of a series with linear regression by creating new features from the series. In addition, this chapter covers the advanced modeling of structural breaks, outliers, holidays, and time series with multiple seasonality.
Chapter 10, Forecasting with Exponential Smoothing Models, focuses on forecasting time series data with exponential smoothing functions. This chapter explains the usage of smoothing parameters to forecast time series data. The models range from simple exponential smoothing, which is for time series with neither trend nor seasonal components, to advanced smoothing models such as Holt-Winters forecasting, which is for time series with both trend and seasonal components.
Chapter 11, Forecasting with ARIMA Models, covers the ARIMA family of forecasting models. This chapter introduces the different types of ARIMA models: the autoregressive (AR), moving average (MA), ARMA, ARIMA, and seasonal ARIMA (SARIMA) models. In addition, the chapter focuses on methods and approaches to identify, tune, and optimize ARIMA models with both autocorrelation and partial autocorrelation functions using applications from the stats and forecast packages.
Chapter 12, Forecasting with Machine Learning Models, focuses on methods and approaches for forecasting time series data with machine learning models with the h2o package. This chapter explains the different steps of modeling time series data with machine learning models. This includes feature engineering, training and tuning approaches, evaluation, and benchmarking a forecasting model's performance.
This book was written under the assumption that its readers have the following knowledge and skills:
Basic knowledge of statistics or econometrics, which includes topics such as regression modeling, hypothesis testing, normal distribution, and so on
Experience with R, or another programming language
You will need to have R installed (https://cran.r-project.org/) and it is recommended to install the RStudio IDE (https://www.rstudio.com/products/rstudio/).
You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:
Log in or register at www.packt.com.
Select the SUPPORT tab.
Click on Code Downloads & Errata.
Enter the name of the book in the Search box and follow the onscreen instructions.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR/7-Zip for Windows
Zipeg/iZip/UnRarX for Mac
7-Zip/PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Hands-On-Time-Series-Analysis-with-R. In case there's an update to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://www.packtpub.com/sites/default/files/downloads/9781788629157_ColorImages.pdf.
There are a number of text conventions used throughout this book.
CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "We will use the Sys.Date and Sys.time functions to pull date and time objects respectively."
A block of code is set as follows:
library(TSstudio)
data(USgas)
The output of the R code is prefixed by the ## sign:
ts_info(USgas)
## The USgas series is a ts object with 1 variable and 227 observations
## Frequency: 12
## Start time: 2000 1
## End time: 2018 11
Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "Select System info from the Administration panel."
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packt.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packt.com.
Time series analysis is the art of extracting meaningful insights from time series data by exploring the series' structure and characteristics and identifying patterns that can then be utilized to forecast future events of the series. In this chapter, we will discuss the foundations, definitions, and historical background of time series analysis, as well as the motivation for using it. We will also present the advantages of using R for time series analysis and provide a brief introduction to the R programming language.
In this chapter, we will cover the following topics:
Time series data
Time series analysis
Key R packages in this book
R and time series analysis
To execute the R code in this book, you need the following:
You need R programming language version 3.2 or above; however, it is recommended to install one of the most recent versions (3.5 or 3.6). More information about the hardware requirements per operating system (for example, macOS, Windows, and Linux) is available on the CRAN website: https://cran.r-project.org/.
The following packages will be used in this book:
forecast: Version 8.5 and above
h2o: Version 3.22.1.1 and above, and Java version 7 and above
TSstudio: Version 0.1.4 and above
plotly: Version 4.8 and above
ggplot2: Version 3.1.1 and above
dplyr: Version 0.8.1 and above
lubridate: Version 1.7.4 and above
xts: Version 0.11-2 and above
zoo: Version 1.8-5 and above
UKgrid: Version 0.1.1 and above
You can access the code for this book from the following link:
https://github.com/PacktPublishing/Hands-On-Time-Series-Analysis-with-R
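As a rough way to check a local setup against the minimum versions listed above, something like the following sketch could be used. It relies only on base R; check_package() is a hypothetical helper written for this illustration and is not part of any package. The Java requirement for h2o is not checked here.

```r
# Minimum versions copied from the requirements listed above
required <- c(
  forecast = "8.5", h2o = "3.22.1.1", TSstudio = "0.1.4", plotly = "4.8",
  ggplot2 = "3.1.1", dplyr = "0.8.1", lubridate = "1.7.4",
  xts = "0.11-2", zoo = "1.8-5", UKgrid = "0.1.1"
)

# check_package() is a hypothetical helper written for this sketch
check_package <- function(pkg, minimum) {
  # system.file() returns "" when the package is not installed
  if (!nzchar(system.file(package = pkg))) {
    return("missing")
  }
  if (packageVersion(pkg) >= package_version(minimum)) "ok" else "outdated"
}

status <- vapply(names(required),
                 function(pkg) check_package(pkg, required[[pkg]]),
                 character(1))

# The book asks for R 3.2 or above
r_ok <- getRversion() >= "3.2.0"

print(data.frame(minimum = required, status = status))
cat("R version requirement met:", r_ok, "\n")
```

Packages reported as missing or outdated can then be installed or updated with install.packages().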
Time series data is one of the most common formats of data, and it is used to describe an event or phenomenon that occurs over time. Time series data has a simple requirement: its values need to be captured at equally spaced time intervals, such as seconds, minutes, hours, days, months, and so on. This important characteristic is one of the main attributes of the series and is known as the frequency of the series. We usually add the frequency along with the name of the series. For example, the following are four time series from different domains (power and utilities, finance, economics, and science):
The UK hourly demand for electricity
The S&P 500 daily closing values
The US monthly unemployment rate
The annual number of sunspots
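To make the idea of frequency concrete, here is a minimal sketch using the base ts class (covered in detail in Chapter 3). The monthly values are synthetic and serve only as an illustration; sunspot.year is an annual series that ships with R. Note that in a ts object, frequency is the number of observations per unit of time, so the value for an hourly or daily series depends on which cycle we choose as the unit.

```r
# Synthetic monthly values, used only for illustration
set.seed(1)
monthly <- ts(rnorm(36), start = c(2016, 1), frequency = 12)

frequency(monthly)  # observations per unit of time (12 for monthly data)
deltat(monthly)     # spacing between observations, as a fraction of a year
start(monthly)      # first time point: year and month
end(monthly)        # last time point: year and month

# sunspot.year ships with R: an annual series, so its frequency is 1
frequency(sunspot.year)
```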
The following diagram shows the (1) UK hourly demand for electricity, (2) S&P 500 daily closing values, (3) US monthly unemployment rate, and (4) annual number of sunspots:
Taking a quick look at the four series, we can identify common characteristics of time series data:
Seasonality: If we look at graph 1, there is high demand during the day and low demand during the night time.
Trend: A clear upward trend can be seen in graph 2 between 2013 and 2017.
Cycles: We can see cyclic patterns in both graph 3 and graph 4.
Correlation: Although the S&P 500 and the US unemployment rate are presented with different frequencies, you can see that the unemployment rate has decreased since 2013 (negative trend). On the other hand, the S&P 500 increased during the same period (positive trend). We can make a hypothesis that there is a negative correlation between the two series and then test it.
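One possible way to test such a hypothesis is sketched below with base R's cor.test function. The two series are synthetic stand-ins (one trending up, one trending down) and are not the actual S&P 500 or unemployment data; both are assumed to share a monthly frequency already. Keep in mind that two trending series will often appear strongly correlated simply because of their trends, which is why the sketch also looks at the month-to-month changes.

```r
set.seed(42)
n <- 60  # five years of synthetic monthly observations

# Hypothetical stand-ins: one series trending up, one trending down
index_level <- ts(1500 + 15 * seq_len(n) + rnorm(n, sd = 40),
                  start = c(2013, 1), frequency = 12)
unemployment <- ts(8 - 0.06 * seq_len(n) + rnorm(n, sd = 0.2),
                   start = c(2013, 1), frequency = 12)

# Pearson correlation with a test of the null hypothesis of zero correlation
result <- cor.test(as.numeric(index_level), as.numeric(unemployment))
print(result$estimate)
print(result$p.value)

# Correlation of the month-to-month changes, which removes the trends
print(cor(as.numeric(diff(index_level)), as.numeric(diff(unemployment))))
```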
Don't worry if you are not familiar with these terms at the moment. In Chapter 5, Decomposition of Time Series Data, we will dive into the details of the series' structural components: seasonality, trend, and cycle. Chapter 6, Seasonality Analysis, is dedicated to the analysis of seasonal patterns of time series data, and Chapter 7, Correlation Analysis, is dedicated to methods and techniques for analyzing and identifying correlation in time series data.
Until recently, the use of time series data was mainly related to fields of science, such as economics, finance, physics, engineering, and astronomy. However, in recent years, as the ability to collect data improved with the use of digital devices such as computers, mobiles, sensors, or satellites, time series data is now everywhere. The enormous amount of data that's collected every day probably goes beyond our ability to observe, analyze, and understand it.
The development of time series analysis and forecasting did not start with the introduction of the stochastic process during the previous century. Ancient civilizations such as the Greeks, Romans, or Mayans researched and learned how to utilize cyclical events such as weather, agriculture, and astronomy over time to forecast future events. For example, during the classic period of the Mayan civilization (between 250 AD and 900 AD), the Maya priesthood assumed that there are cycles in astronomical events, and therefore they patiently observed, recorded, and studied those events. This allowed them to create a detailed time series table of past events, which eventually allowed them to forecast future events, such as the phases of the moon, eclipses of the moon and the sun, and the movement of planets such as Venus, Jupiter, Saturn, and Mars. The Maya priesthood collected data and analyzed it to identify patterns and cycles. This analysis was then utilized to predict future events. We can find a similarity between the ancient Maya analytical process and the time series analysis process we use now. However, the modern time series analysis process is based on statistical modeling and heavy calculations that are possible with today's computers and software, such as R.
Now that we have defined the main characteristics of time series data, we can move forward and start to discuss the main characteristics of time series analysis.
Time series analysis is the process of extracting meaningful insights from time series data with the use of data visualization tools, statistical applications, and mathematical models. Those insights can be used to learn and explore past events and to forecast future events. The analysis process can be divided into the following steps:
Data collection: This step includes extracting data from different data sources, such as flat files (such as CSV, TXT, and XLSX), databases (for example, SQL Server and Teradata), or other internet sources (such as academic resources and the Bureau of Statistics datasets). Later on in this chapter, we will learn how to load data to R from different sources.
Data preparation: In most cases, raw data is unstructured and may require cleaning, transformation, aggregation, and reformatting. In Chapter 2, Working with Date and Time Objects; Chapter 3, The Time Series Object; and Chapter 4, Working with zoo and xts Objects, we will focus on the core data preparation methods of time series data with R.
Descriptive analysis: This step uses summary statistics and data visualization tools to extract insights from the data, such as patterns, distributions, cycles, and relationships with other drivers, to learn more about past events. In Chapter 5, Decomposition of Time Series Data; Chapter 6, Seasonality Analysis; and Chapter 7, Correlation Analysis, we will focus on descriptive analysis methods of time series data.
Predictive analysis: This step applies statistical methods in order to forecast future events. In Chapter 8, Forecasting Strategies; Chapter 9, Forecasting with Linear Regression; Chapter 10, Forecasting with Exponential Smoothing Models; Chapter 11, Forecasting with ARIMA Models; and Chapter 12, Forecasting with Machine Learning Models, we will focus on traditional forecasting approaches (such as linear regression, exponential smoothing, and ARIMA models), as well as advanced forecasting approaches with machine learning models.
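The four steps above can be sketched end to end with base R alone. This is only an illustrative outline, not the workflow used later in the book: the built-in AirPassengers dataset stands in for an external data source, and the Holt-Winters model from the stats package is just one of many possible forecasting choices.

```r
# 1. Data collection: a built-in dataset stands in for an external source
raw_values <- as.numeric(AirPassengers)

# 2. Data preparation: check for gaps and build a monthly ts object
stopifnot(!anyNA(raw_values))
series <- ts(raw_values, start = c(1949, 1), frequency = 12)

# 3. Descriptive analysis: summary statistics and a classical decomposition
print(summary(series))
components <- decompose(series, type = "multiplicative")
print(head(components$seasonal, 12))  # estimated seasonal factor per month

# 4. Predictive analysis: Holt-Winters smoothing, then a 12-month forecast
fit <- HoltWinters(series, seasonal = "multiplicative")
fc <- predict(fit, n.ahead = 12)
print(fc)
```

In practice, the first two steps usually involve far more work than shown here, since real raw data rarely arrives as a clean, regularly spaced vector.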
It may be surprising but, in reality, the first two steps may take most of the process time and effort, which is mainly due to data challenges and complexity. For instance, companies tend to restructure their business units (BU) and IT systems every couple of years, and therefore it is hard to identify and track the historical contribution (production, revenues, unit sales, and so on) of a specific BU before the changes.
In other cases, additional effort is required to clean the raw data and handle missing values and outliers. This sadly leaves less time for the analysis itself. Fortunately, R has a variety of wonderful applications for data preparations, visualizations, and time series modeling. This helps to reduce the time that's spent on the preparation steps and lets you allocate more time to the analysis itself. Throughout the rest of this chapter, we will provide background information on R and its applications for time series analysis.
Throughout the learning journey in this book, we will use real-life examples of time series data in order to apply the methods and techniques of the analysis. All of the datasets that we will use are available in the TSstudio and UKgrid packages (unless stated otherwise).
The first time series data we will look at is the monthly natural gas consumption in the US. This data is collected by the US Energy Information Administration (EIA) and measures the monthly natural gas consumption from January 2000 until November 2018. The unit of measurement is billions of cubic feet (not seasonally adjusted). The following graph shows the monthly natural gas consumption in the US:
The following series describes the total vehicle sales in the US from January 1976 until January 2019. The series is measured in thousands of units (not seasonally adjusted). The data is sourced from the US Bureau of Economic Analysis. The following graph shows the total monthly vehicle sales in the US:
Another monthly series that we will use is the monthly US unemployment rate, which represents the number of unemployed as a percentage of the labor force. The series started in January 1948 and ended in January 2019. The data is sourced from the US Bureau of Labor Statistics. The following graph shows the monthly unemployment rate in the US:
Last but not least, we will use the national demand for electricity in the UK (as measured on the grid systems) between 2011 and 2018, since it provides an example of high-frequency time series data with half-hourly intervals. The data source is the UK National Grid website, and the information is shown in the following graph:
Let's start by installing R.
R is an open source and free programming language for statistical computing and graphics. With more than 13,500 indexed packages (as of May 2019, as you can see in the following graph) and a large number of applications for statistics, machine learning, data mining, and data visualization, R is one of the most popular statistical programming languages. One of the main reasons for the fast growth of R in recent years is its open source structure, where users are also the main package developers. Among the package developers, you can find individuals like us, as well as giant companies such as Microsoft, Google, and Facebook. This significantly reduces the users' dependency on any specific company (as opposed to traditional statistical software), allowing for fast knowledge sharing and a diverse portfolio of solutions.
The following graph shows the number of packages that have been shared on CRAN over time:
As you can see, whenever we come across a statistical problem, it is likely that someone has already faced the same problem and developed a package with a solution (and if not, you should create one!). Furthermore, there is a vast number of packages for time series analysis, from tools for data preparation and visualization to advanced statistical modeling applications. Packages such as forecast, stats, zoo, xts, and lubridate have made R the leading software for time series analysis. In the A brief introduction to R section in this chapter, we will discuss the key packages we will use throughout this book in more detail.
Now, we will learn how to install R.
To install R on Windows, Mac, or Linux, go to the Comprehensive R Archive Network (CRAN) main page at https://cran.r-project.org/, where you can select the relevant operating system.
For Windows users, the installation file includes both the 32-bit and the 64-bit versions. You can either install one of the versions or the hybrid version, which includes both the 32-bit and 64-bit versions. Technically, after the installation, you can start working with R using the built-in Integrated Development Environment (IDE).
However, it is highly recommended to install the RStudio IDE and set it as your working environment for R. RStudio makes writing and debugging code, as well as using visualization tools and other applications, easier and simpler.
RStudio offers a free version of its IDE, which is available at https://www.rstudio.com/products/rstudio/download/.
Throughout the learning process in this book, we will use R intensively to introduce methods, techniques, and approaches for time series analysis. If you have never used R before, this section provides a brief introduction, which includes the basic foundations of R, the operators, the packages, different data structures, and loading data. This won't make you an R expert, but it will provide you with the basic R skills you will require to start the learning journey of this book.
Like any other programming language, the operators are one of the main elements of programming in R. The operators are a collection of functions that are represented by one or more symbols and can be categorized into four groups, as follows:
Assignment operators
Arithmetic operators
Logical operators
Relational operators
Assignment operators are probably the family of operators that you will use the most while working with R. As the name of this group implies, they are used to assign objects such as numeric values, strings, vectors, models, and plots to a name (variable). This includes operators such as the back arrow (<-) or the equals sign (=):
# Assigning values to new variables
str <- "Hello World!" # String
int <- 10 # Numeric (written as 10L, it would be stored as an integer)
vec <- c(1,2,3,4) # Vector
We can use the print function to view the values of the objects:
print(c(str, int))
## [1] "Hello World!" "10"
This is one more example of the print function:
print(vec)
## [1] 1 2 3 4
While both of the operators can be used to assign values to a variable, it is not common to use the = symbol to assign values other than within functions (for reasons that are out of the scope of this book; more information about the assignment operators is available in the assignOps documentation, which you can open with ?assignOps).
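As a minimal sketch of how the two operators differ inside a function call (the square function and the input_value name are hypothetical and used only for illustration): the = symbol names an argument, whereas <- also creates a variable in the calling environment.

```r
# Hypothetical helper used only to illustrate argument passing
square <- function(input_value) input_value^2

# `=` inside a call names the argument; nothing is assigned in the caller
res_eq <- square(input_value = 3)
created_by_eq <- exists("input_value")

# `<-` inside a call assigns in the caller, then passes the value on
res_arrow <- square(input_value <- 3)
created_by_arrow <- exists("input_value")

print(c(created_by_eq, created_by_arrow))
```

This side effect is one reason why <- is usually kept for assignments and = for naming arguments.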
This family of operators includes basic arithmetic operations, such as addition, division, exponentiation, and remainder. As you can see, it is straightforward to apply these operators. We will start by assigning the values 10 and 2 to the x and y variables, respectively:
x <- 10
y <- 2
The following code shows the usage of the addition operator:
# Addition
x + y
## [1] 12
The following code shows the usage of the division operator:
x / 2.5
## [1] 4
The following code shows the usage of the exponentiation operator:
y ^ 3
## [1] 8
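The remainder operator mentioned above can be sketched in the same way. The following example, which is an illustration rather than part of the original listing, also shows integer division and how parentheses change the order of evaluation:

```r
x <- 10
y <- 2

remainder <- x %% 3       # remainder of 10 divided by 3
quotient  <- x %/% 3      # integer (floor) division of 10 by 3
mixed     <- x + y * 3    # multiplication is evaluated before addition
grouped   <- (x + y) * 3  # parentheses change the order of evaluation

print(c(remainder, quotient, mixed, grouped))
## [1]  1  3 16 36
```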
Now, let's look at the logical operators.
Logical operators in R can be applied to numeric or complex vectors or Boolean objects, that is, TRUE or FALSE, where zero is treated as FALSE and any nonzero number is treated as TRUE. It is common to use those operators to test single or multiple conditions under the if…else statement:
# TRUE and FALSE are reserved names in R for Boolean objects.
# T and F are shortcuts for them, but they are ordinary variables
# (not reserved), so they can be overwritten
a <- TRUE
b <- FALSE
# We can also test if a Boolean object is TRUE or FALSE
isTRUE(a)
## [1] TRUE
isTRUE(b)
## [1] FALSE
The following code shows the usage of the AND operator:
# The AND operator
a & b
## [1] FALSE
The following code shows the usage of the OR operator:
# The OR operator
a | b
## [1] TRUE
The following code shows the usage of the NOT operator:
# The NOT operator
!a
## [1] FALSE
We can see the applications of those operators by using an if...else statement:
# The AND operator will return TRUE only if both a and b are TRUE
if (a & b) {
  print("a AND b is true")
} else {
  print("a AND b is false")
}
## [1] "a AND b is false"
The following code shows an example of the OR operator, along with the if...else statement:
# The OR operator will return FALSE only if both a and b are FALSE
if (a | b) {
  print("a OR b is true")
} else {
  print("a OR b is false")
}
## [1] "a OR b is true"
Likewise, we can check whether the Boolean object is TRUE or FALSE with the isTRUE function:
isTRUE(a)
## [1] TRUE
Here, the condition is FALSE:
isTRUE(b)
## [1] FALSE
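The logical operators also work element by element on vectors, and numeric values are coerced to Boolean objects before the operator is applied. The following is a minimal sketch; the vector names are arbitrary and chosen for illustration only:

```r
# Numeric values are coerced: zero becomes FALSE, any nonzero value TRUE
nums <- c(0, 0.5, 2, -1)
as_bool <- as.logical(nums)

# & and | compare vectors element by element
u <- c(TRUE, TRUE, FALSE)
v <- c(TRUE, FALSE, FALSE)
elementwise_and <- u & v
elementwise_or  <- u | v

# && compares single values and is the usual choice inside if conditions
single_and <- u[1] && v[1]

print(as_bool)
print(elementwise_and)
print(elementwise_or)
print(single_and)
```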
Now, let's look at relational operators.
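Relational operators compare two values and return a Boolean object. The following is a minimal sketch that reuses the x and y values from the arithmetic examples; the names of the vector elements are arbitrary labels added for readability:

```r
x <- 10
y <- 2

comparisons <- c(
  greater       = x > y,
  less          = x < y,
  greater_equal = x >= 10,
  less_equal    = y <= 1,
  equal         = x == 10,
  not_equal     = x != y
)
print(comparisons)

# Like the logical operators, comparisons work element by element on vectors
above_y <- c(1, 5, 12) > y
print(above_y)
```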
The naked version of R (without any installed packages) comes with seven core packages that contain the built-in applications and functions of the software. This includes applications for statistics, visualization, data processing, and a variety of datasets. Unlike other packages, the core packages are inherent in R, and therefore they load automatically. Although the core packages provide many applications, the vast majority of R applications are based on add-on packages that are stored on CRAN or in GitHub repositories.
As of May 2019, there are more than 13,500 packages with applications for statistical modeling, data wrangling, and data visualization for a variety of domains (statistics, economics, finance, astronomy, and so on). A typical package may contain a collection of R functions, as well as compiled code (utilizing other languages, such as C, Java, and FORTRAN). Moreover, some packages include datasets that, in most cases, are related to the package's main application. For example, the forecast package comes with a time series dataset, which is used to demonstrate the forecasting models that are available in the package.
There are a few methods that you can use to install a package, the most common of which is by using the install.packages function:
# Installing the forecast package:
install.packages("forecast")
You can use this function to install more than one package at once by using a vector type of input:
install.packages(c("TSstudio", "xts", "zoo"))
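To avoid reinstalling packages that are already available, you can first check which ones are missing. The following is a minimal sketch rather than a recommended workflow; the find_missing_packages helper is a hypothetical name, and restricting the installation to interactive sessions is an assumption made here so that scripts do not download packages unexpectedly:

```r
# Hypothetical helper: return the names in pkgs that are not installed
find_missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

required_pkgs <- c("TSstudio", "xts", "zoo")
missing_pkgs <- find_missing_packages(required_pkgs)

# Install only what is missing, and only when running interactively
if (length(missing_pkgs) > 0 && interactive()) {
  install.packages(missing_pkgs)
}
```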
Most packages are updated frequently. This includes new features, improvements, and bug fixes. R provides functions for checking and updating your installed packages. The packageVersion function returns the version details of the input package:
packageVersion("forecast")
## [1] '8.5'
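The object that packageVersion returns can be compared directly with a version string, which is handy when code depends on a minimum version. The following is a minimal sketch that uses the stats package because it ships with R; the 3.0.0 threshold is an arbitrary value chosen for illustration:

```r
installed_version <- packageVersion("stats")

# Version objects can be compared with a version written as a string
meets_minimum <- installed_version >= "3.0.0"

print(installed_version)
print(meets_minimum)
```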
The old.packages function identifies whether updates are available for any of the installed packages, and the update.packages function is used to update all of the installed packages automatically. You can update a specific package using the install.packages function, with the package name as input. For instance, if we wish to update the lubridate package, we can use the following code:
install.packages("lubridate")
Last but not least, removing a package can be done with the remove.packages function:
remove.packages("forecast")
The R working environment defines the working space where the functions, objects, and data that are loaded are kept and are available to use. By default, when opening R, the global environment is loaded, and the built-in packages of R are loaded.
An installed package becomes available for use in the R global environment once it is loaded. The search function provides an overview of the loaded packages within your environment. For example, if we execute the search function right after opening R, this is the output you can expect to see:
search()
## [1] ".GlobalEnv" "package:stats" "package:graphics"
## [4] "package:grDevices" "package:utils" "package:datasets"
## [7] "package:methods" "Autoloads" "package:base"
As you can see from the preceding output, currently, only the seven core packages of R are loaded. Loading a package into the environment can be done with either the library or the require function. While both of these functions will load an installed package and its attached functions, the require function is usually used within a function, as it returns FALSE (with a warning) upon failure, whereas the library function throws an error. Let's load the TSstudio package and see the change in the environment:
library(TSstudio)
Now, we will check the global environment again and review the changes:
search()
We get the following output:
## [1] ".GlobalEnv" "package:TSstudio" "package:stats"
## [4] "package:graphics" "package:grDevices" "package:utils"
## [7] "package:datasets" "package:methods" "Autoloads"
## [10] "package:base"
Similarly, you can unload a package from the environment by using the detach function:
detach("package:TSstudio", unload=TRUE)
Let's check the working environment after detaching the package:
search()
## [1] ".GlobalEnv" "package:stats" "package:graphics"
## [4] "package:grDevices" "package:utils" "package:datasets"
## [7] "package:methods" "Autoloads" "package:base"
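The difference between how require and library fail can be sketched as follows. The package name used here is hypothetical and assumed not to exist, so that both calls fail:

```r
pkg <- "aPackageThatDoesNotExist"  # hypothetical package name

# require() returns FALSE (with a warning) when the package cannot be loaded
loaded <- suppressWarnings(
  require(pkg, character.only = TRUE, quietly = TRUE)
)

# library() throws an error instead, which is caught here with tryCatch
lib_status <- tryCatch({
  library(pkg, character.only = TRUE)
  "loaded"
}, error = function(e) "error")

print(loaded)
print(lib_status)
```

Because require returns a value rather than stopping, it is convenient inside functions that need to handle a missing package gracefully.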
Here is a short list of the key packages that we will use throughout this book by topic:
Data preparation and utility functions. These include the following:
stats: One of the base packages of R, this provides a set of statistical tools, including applications for time series, such as time series objects (ts) and the window function.
zoo and xts: With applications for data manipulation, aggregation, and visualization, these packages are some of the main tools that you can use to handle time series data in an efficient manner.
lubridate: This provides a set of tools for handling a variety of date objects and time formats.
dplyr: This is one of the main packages in R for data manipulation. It provides powerful tools for data transformation and aggregation.
Data visualization and descriptive analysis. These include the following:
TSstudio: This package focuses on both descriptive and predictive analysis of time series data. It provides a set of interactive data visualization tools, utility functions, and training methods for forecasting models. In addition, together with the UKgrid package, it contains the datasets that are used throughout this book.
ggplot2 and plotly: Packages for data visualization applications.
Predictive analysis, statistical modeling, and forecasting. These include the following:
forecast: This is one of the main packages for time series analysis in R and has a variety of applications for analyzing and forecasting time series data. This includes statistical models such as ARIMA, exponential smoothing, and neural network time series models, as well as automation tools.
h2o: This is one of the main packages in R for machine learning modeling. It provides machine learning algorithms such as Random Forest, gradient boosting machine, deep learning, and so on.
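To make the ts object and the window function of the stats package more concrete, here is a minimal sketch. The series is made up for illustration (the numbers 1 to 24 standing in for two years of monthly observations), and the start date is an arbitrary assumption:

```r
# A toy monthly series: 24 observations starting in January 2018
monthly <- ts(1:24, start = c(2018, 1), frequency = 12)

# window() subsets a ts object by time; here we keep 2019 onward
second_year <- window(monthly, start = c(2019, 1))

print(second_year)
print(start(second_year))
```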
Variables in R have a broader definition and more capabilities than in most typical programming languages. Without the need to declare the type or the attribute, any R object can be assigned to a variable. This includes objects such as numbers, strings, vectors, tables, plots, functions, and models. The main features of these variables are as follows:
Flexibility
: Any R object can be assigned to a variable, without any pre-step (such as declaring the variable type). Furthermore, when assigning the object to a new variable, all the attributes of the object transfer, along with its content, to the new variable.
Attribute: Neither the variable nor its attributes need to be defined prior to the assignment of the object. The object's attributes pass to the variable upon assignment (this simplicity is one of the strengths of R). For example, we will assign the Hello World! string to the a variable:
a <- "Hello World!"
Let's look at the attributes of the a variable:
class(a)
We get the following output:
## [1] "character"
Now, let's assign the a variable to the b variable and check out the characteristics of the new variable:
b <- a
b
We get the following output:
## [1] "Hello World!"
Now, let's check the class of the new variable:
class(b)
We get the following output:
## [1] "character"
As you can see, the b variable inherited both the value and attribute of the a variable.
Name: A valid variable name can consist of letters, numbers, and the dot or underscore characters. However, it must start with either a letter or a dot that is not followed by a number (that is, var_1, var.1, var1, and .var1 are examples of valid names, while 1var and .1var are examples of invalid names). In addition, there is a set of reserved words that R uses for its key operations, such as if, TRUE, and FALSE, and these cannot be used as variable names. Last but not least, variable names are case-sensitive. For example, Var_1 and var_1 refer to two different variables.
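The naming rules above can be checked with the base make.names function, which rewrites an invalid name into a valid one and leaves a valid name unchanged. The following is a minimal sketch; comparing each candidate with its make.names result is just one convenient way to test validity:

```r
candidates <- c("var_1", "var.1", "var1", ".var1", "1var", ".1var", "if")

# A name is syntactically valid if make.names() leaves it unchanged
is_valid <- candidates == make.names(candidates)
names(is_valid) <- candidates
print(is_valid)

# Variable names are case-sensitive: these are two different variables
Var_1 <- "upper"
var_1 <- "lower"
print(c(Var_1, var_1))
```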
Now that we have discussed operators, packages, and variables, it is time to jump into the water and start working with real data!
Importing or loading data is one of the key elements of the workflow in any analysis. R provides a variety of methods that you can use to import or load data into the environment, and it supports multiple types of data formats. This includes importing data from flat files (for example, CSV and TXT), web APIs, or databases (SQL Server, Teradata, Oracle, and so on), and loading datasets from R packages. Here, we will focus on the main methods that we will use in this book, that is, importing data from flat files or a web API and loading data from R packages.
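As a minimal sketch of importing a flat file, the following example writes a small CSV file to a temporary location and reads it back with read.csv. The file, its column names, and its values are made up for illustration and do not correspond to any dataset used in this book:

```r
# Create a small CSV file in a temporary location (illustrative data only)
csv_path <- tempfile(fileext = ".csv")
sample_data <- data.frame(
  date  = c("2018-01-01", "2018-02-01", "2018-03-01"),
  value = c(10.5, 11.2, 9.8)
)
write.csv(sample_data, csv_path, row.names = FALSE)

# Import the file and convert the date column from character to Date
imported <- read.csv(csv_path, stringsAsFactors = FALSE)
imported$date <- as.Date(imported$date)

str(imported)
```

Dates are read as plain text by default, so converting them explicitly after the import is a common first step with time series data.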
