Data Forecasting and Segmentation Using Microsoft Excel guides you through basic statistics to test whether your data can be used to perform regression predictions and time series forecasts. The exercises covered in this book use real-life data from Kaggle, such as demand for seasonal air tickets and credit card fraud detection.
You’ll learn how to apply the K-means clustering algorithm, which helps you find segments of your data that are impossible to see with other analyses, such as business intelligence (BI) and pivot analysis. By analyzing the groups returned by K-means, you’ll be able to detect outliers that could indicate possible fraud or malfunctioning behavior in network packets.
By the end of this Microsoft Excel book, you’ll be able to use the classification algorithm to group data with different variables. You’ll also be able to train linear and time series models to perform predictions and forecasts based on past data.
Page count: 276
Year of publication: 2022
Perform data grouping, linear predictions, and time series machine learning statistics without using code
Fernando Roque
BIRMINGHAM—MUMBAI
Copyright © 2022 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Publishing Product Manager: Heramb Bhavsar
Senior Editor: David Sugarman
Content Development Editor: Sean Lobo
Technical Editor: Rahul Limbachiya
Copy Editor: Safis Editing
Language Support Editor: Safis Editing
Project Coordinator: Aparna Ravikumar Nair
Proofreader: Safis Editing
Indexer: Hemangini Bari
Production Designer: Sinhayna Bais
Marketing Coordinator: Priyanka Mhatre
First published: June 2022
Production reference: 1130522
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-80324-773-1
www.packt.com
To Sesi: I hope that someday you help to protect our natural biodiversity by looking at the North star, the pyramids, and the numbers.
Thanks to Philly, Peter de la Union, and Mercedes Thaddeus for your continuous assistance.
Fernando Roque has 24 years of experience working with statistics for quality control and the financial risk assessment of projects across planning, budgeting, and execution. In his work, Fernando applies Python K-means and time series machine learning algorithms, using Normalized Difference Vegetation Index (NDVI) drone images to find crop regions with more resilience to droughts. He also applies time series and K-means algorithms to supply chain management (logistics) and inventory planning for seasonal demand.
Ashwini Badgujar is a machine learning engineer at Impulselogic Inc. She has been involved in the machine learning domain for the last 3 years. She has research experience in natural language processing and computer vision, specifically neural networks. She has also worked in the earth science domain, applying machine learning to a number of earth science projects. She has worked at NASA on data processing and fire analysis projects, calculating mixing heights, and at Comcast, where she worked on optimizing machine learning models.
Antonio L. Amadeu is a data science consultant who is passionate about artificial intelligence and neural networks. He researches and applies machine learning and deep learning algorithms in his daily challenges, solving all types of issues in various industries. He has worked for big companies such as Unilever, Lloyds Bank, TE Connectivity, Microsoft, and Samsung. As an aspiring astrophysicist, he does research on astronomical object classification using machine and deep learning techniques and International Virtual Observatory Alliance (IVOA) resources. He also participates in research at the Institute of Astronomy, Geophysics, and Atmospheric Sciences at Universidade de São Paulo.
This book gives you the basic statistical knowledge to work with machine learning algorithms that classify data, such as the K-means method. You will use an add-in for Excel, included with the book, to practice the concepts of grouping statistics without needing a deep programming background in the R language or Python.
The book covers three topics of machine learning:
- Data segmentation
- Linear regression
- Forecasts with time series

Data segmentation has many practical applications because it lets you apply different strategies depending on each segment's data ranges. It has applications in marketing and inventory rotation, where you can act according to the location and season of the sales.
The linear regression statistical concepts in this book will help you to explore whether the variables that we are using are useful to build a predictive model.
The time series model helps to produce forecasts that depend on the different seasons of the year. It has applications in inventory planning, allocating the correct quantities of products and avoiding cash being tied up in warehouse stock. Time series models rely on statistical tests of whether the present values depend on the past values, and are therefore useful for forecasting the future.
This book is for any professional who needs to analyze data generated in industry or academia using machine learning principles and algorithms. It can help you better understand the different groups within your data so that you can apply a different approach to each one. You can then use the statistical tests in this book to find the most relevant variables that affect your performance, using projections with linear regression. You will also be able to link these variables with time and season, and use time series analysis to build forecasts that could improve planning in your professional field.
Chapter 1, Understanding Data Segmentation, looks at how classifying the data of similar values is an approach for planning a strategy depending on the characteristics of the range of values of the groups. This strategy is more important when you deal with a problem with several variables, for example, finding the different groups of revenues for each season of the year and the quantities delivered for logistics demand planning.
Chapter 2, Applying Linear Regression, shows that the target of linear regression is to use related variables to predict the behavior of the values and build scenarios of what could happen in different situations, using the regression model as a framework for foreseeing the situations.
Chapter 3, What is Time Series?, examines how a time series model can forecast data, taking into account seasonal trends based on past values.
Chapter 4, An Introduction to Data Grouping, delves into the importance of finding a different approach for each group. In complex multivariable problems, we need the assistance of machine learning algorithms such as K-means to find the optimal number of segments and the group's values range.
Chapter 5, Finding the Optimal Number of Single Variable Groups, shows how running an add-in for Excel that uses the K-means algorithm can help to get the optimal number of groups for the data that we are researching. In this case, we will start with a problem of just one variable to explain the concepts.
Chapter 6, Finding the Optimal Number of Multi-Variable Groups, demonstrates how to use the Excel add-in to do the grouping of problems of several variables, for example, the classification of quantity, revenue, and season of the inventory rotation.
Chapter 7, Analyzing Outliers for Data Anomalies, delves into another approach to data segmentation: researching what happens with the values that are separated by a long distance from all the groups. These values are anomalies, such as very low-value expenses occurring outside business hours, which could be evidence of possible fraud attempts.
Chapter 8, Finding the Relationship between Variables, shows how we have to do statistical tests of the relationship of the variables to check whether they are useful to design a predictive model before building a linear model.
Chapter 9, Building, Training, and Validating a Linear Model, talks about what happens after the relationship between the variables is statistically tested as useful for building a predictive model; we will use a portion of the data (typically 20%) to test the model and see whether it gives results consistent with the known data.
Chapter 10, Building, Training, and Validating a Multiple Regression Model, discusses multiple regression, which involves three or more variables. We will see how to apply the statistical tests to see the most useful variables to build the predictive model. Then, we will test the regression with 20% of the data and see whether it makes sense to use the model to build new scenarios with unknown data.
Chapter 11, Testing Data for Time Series Compliance, shows how the time series forecast relies on the relationship of the present values to the past values. We will apply statistical methods to find whether the data is useful for a forecast model.
Chapter 12, Working with Time Series Using the Centered Moving Average and a Trending Component, explores the forecast model's dependence on two components: the centered moving average (which gives the seasonal ups and downs variations) and the linear regression (which gives the positive or negative orientation of the trend). Once we have these calculations, we will be able to test and use the model.
Chapter 13, Training, Validating, and Running the Model, covers the statistical tests for the time series and then trains the model with 80% of the data. Then, we will test the time series with the remaining 20% and see whether the model returns results that make sense based on our experience. Finally, we will use the model to do forecasts.
To better understand this book, you must have a basic knowledge of statistical concepts such as the average and standard deviation. You must also be able to use statistical functions in Excel, selecting cell ranges as input for calculations.
If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book's GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.
You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Data-Forecasting-and-Segmentation-Using-Microsoft-Excel. If there's an update to the code, it will be updated in the GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide a PDF file that has color images of the screenshots and diagrams used in this book. You can download it here: https://static.packt-cdn.com/downloads/9781803247731_ColorImages.pdf.
There are a number of text conventions used throughout this book.
Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Mount the downloaded WebStorm-10*.dmg disk image file as another disk in your system."
A block of code is set as follows:
html, body, #map {
height: 100%;
margin: 0;
padding: 0
}
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
[default]
exten => s,1,Dial(Zap/1|30)
exten => s,2,Voicemail(u100)
exten => s,102,Voicemail(b100)
exten => i,1,Voicemail(s0)
Any command-line input or output is written as follows:
$ mkdir css
$ cd css
Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: "Select System info from the Administration panel."
Tips or Important Notes
Appear like this.
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Once you've read Data Forecasting and Segmentation Using Microsoft Excel, we'd love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.
Your review is important to us and the tech community and will help us make sure we're delivering excellent quality content.
This part teaches the basic concepts of the statistics and machine learning topics in this book, with practical applications in market segmentation, sales, and inventory.
This part includes the following chapters:
- Chapter 1, Understanding Data Segmentation
- Chapter 2, Applying Linear Regression
- Chapter 3, What is Time Series?

Machine learning has two types of algorithms, depending on the level of adjustment that you require to give a response:
- Supervised
- Unsupervised

Supervised algorithms need continuous improvement through the data used to train them. For example, a supervised machine learning function such as a linear model needs a starter group of data to train on and generate the initial conditions. Then, we have to test the model and use it. We need continuous surveillance of the results to interpret whether they make sense or not. If the model fails, we probably need to train it again.
Unsupervised algorithms do not require any previous knowledge of the data. The unsupervised machine learning process takes data and starts analyzing it until it reaches a result. Contrary to supervised linear regression and time series, this data does not need a test to see whether it is useful to build a model. That is the case with the K-means algorithm, which takes unknown and untested data to classify the values of the variables and returns the classification segments.
In this book, we will cover three different topics of machine learning:
- Grouping statistics to find data segments
- Linear regression
- Time series

For grouping statistics, we will use an add-in for Excel that will do the classification automatically for us. This add-in is included with the book, and we will learn how to use it throughout. For linear regression, we will use Excel formulas to find out whether the data can be used to make predictions with regression models and forecasts with time series.
We need a machine learning algorithm to classify and group data for the following reasons:
- A large amount of data is difficult to classify manually.
- Segmentation by observing a 2D or 3D chart is not accurate.
- Segmenting multiple variables visually is impossible because we cannot chart more than three dimensions.

Before we do group segmentation using K-means clustering, we need to find the optimal number of groups for our data. The reason for this is that we want compact groups, with points close to the average value of the group. It is not good practice to have scattered points that do not belong to any group; these could be outliers that do not behave like the rest of the data, and they could be anomalies that deserve further research.
The K-means function will also help to get the optimal number of groups for our data. The best-case scenario is to have compact groups with points near their center.
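Although this book does the clustering in Excel, the idea behind choosing the optimal number of groups can be sketched in a few lines of Python for readers curious about the underlying computation. The following sketch (with made-up one-variable data, not the book's dataset) implements a minimal one-dimensional K-means and compares the within-cluster sum of squares (WCSS) for several group counts; the value of k where the WCSS stops dropping sharply marks the optimal number of groups (the so-called elbow method):

```python
import random

def kmeans_1d(points, k, iterations=50, seed=1):
    """Minimal 1D K-means (Lloyd's algorithm), for illustration only."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

def wcss(centroids, clusters):
    """Within-cluster sum of squares: lower means more compact groups."""
    return sum((p - c) ** 2 for c, cl in zip(centroids, clusters) for p in cl)

# Hypothetical revenue values with a few visible groups.
data = [3, 8, 21, 33, 190, 200, 204, 210, 216, 62, 74, 86]

for k in (1, 2, 3, 4):
    cents, cls = kmeans_1d(data, k)
    print(k, round(wcss(cents, cls), 1))
# The WCSS drops sharply while k is below the natural number of
# groups, then flattens -- the "elbow" suggests the optimal k.
```

The Excel add-in used later in the book automates this search; the sketch only shows why more, smaller groups always reduce the WCSS, which is why we look for the elbow rather than the minimum.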
We will review the basic statistical concepts to work with data grouping. These concepts are as follows:
- Mean
- Standard deviation

In a data grouping segment, the mean is the center, or centroid, of the group. The best case is that the values are compact and close to the segment's centroid.
The level of separation of the values within a group from its centroid is measured by the standard deviation. The best case is to have compact groups with values close to the group's mean point with a low standard deviation for each group.
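As a concrete illustration, the following Python sketch (using a made-up segment, not the book's dataset) computes what Excel's AVERAGE and STDEV.S functions return: a segment's centroid and its one-standard-deviation limits:

```python
import statistics

# Hypothetical segment values (not the book's dataset).
segment = [192, 198, 204, 204, 210, 216]

mean = statistics.mean(segment)    # centroid of the segment
std = statistics.stdev(segment)    # sample standard deviation (Excel's STDEV.S)
lower = mean - std                 # lower limit
upper = mean + std                 # upper limit

print(mean, round(std, 2), round(lower, 2), round(upper, 2))
# -> 204 8.49 195.51 212.49
# A compact segment has a small standard deviation relative to its
# mean, so the limits sit close to the centroid.
```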
When we have values or segments that are scattered, with a large standard deviation, that means they contain outliers. Outliers are data points that behave differently from the majority of the data. They are a special kind of data because they require further research. Outliers could indicate an anomaly that could grow and cause a problem in the future. Practical examples of outliers that require attention are as follows:
- Values that differ from the normal transaction amounts in sales and purchases. These could indicate a system test that could lead to a bigger issue in the future.
- A timeline of suspicious system performance. This could indicate hacking attempts.

In this chapter, we will cover the following topics:
- Segmenting data concepts
- Grouping data in segments of two and three variables

Before explaining data segments, we have to review basic statistical concepts such as the mean and standard deviation. The reason is that each segment has a mean, or central, value, and each point is separated from that central point. The best case is that this separation of points from the mean point is as small as possible for each segment of data.
For the group of data in Figure 1.1, we will explain the mean and the separation of each point from the center measured by the standard deviation:
Figure 1.1 – Average, standard deviation, and limits. The data on the left is represented in the chart
The mean of the data on the left of the chart is 204. The group's centroid is represented by the middle line in Figure 1.1.
The standard deviation for this data is 12.49. So, the data upper limit is 216.49 and the lower limit is 191.51.
The standard deviation is the average separation of all the points from the centroid of the segment. It affects the grouping segments, as we want compact groups with a small separation between the group's data points. A small standard deviation means a smaller distance from the group's points to the centroid. The best case for the data segments is that these data points are as close as possible to the centroid. So, the standard deviation of the segment must be a small value.
Now, we will explore four segments of a group of data. We will find out whether all the segments are optimal, and whether the points are close to their respective centroids.
In Figure 1.2, the left column is sales revenue data. The right column is the data segments:
Figure 1.2 – Segments, mean, and standard deviation
We have four segments, and we will analyze the mean and the standard deviation to see whether the points have an optimal separation from the centroid. The separation is given by the standard deviation.
Figure 1.3 is the chart for all the data points in Figure 1.2. We can identify four possible segments by simple visual analysis:
Figure 1.3 – Data segments
We will analyze the centroid and the separation of the points for each segment in Figure 1.3. We can see that the group between 0 and 60 on the y axis is probably an outlier because the revenue is very low compared with the rest of the segments. The other groups appear to be compact around their respective centroid. We will confirm this in the charts of each segment.
The mean for the first segment is 18.775. The standard deviation is 15.09. That means there is a lot of variation around the centroid. This segment is not very compact, as we can see in Figure 1.4. The data is scattered and not close to the centroid value of 18.775:
Figure 1.4 – Segment 1, mean and standard deviation
The centroid of this segment is 18.775. The separation of the points, measured by the standard deviation, is 15.09. The points fall in the range of 3 to 33. That means the separation is wide and the segment is not compact. One explanation for this type of segment is that the points are outliers: points that do not behave normally and deserve special analysis. When we have points outside the normal operating values, for example, transactions with smaller amounts than normal at places and times that do not correspond to the rest of the data, we have to do deeper research because they could be indicators of fraud. Or maybe they are sales that occur only at specific times of the month or year.
Figure 1.5 – Segment 2, mean and standard deviation
The second segment is more compact than the first one. The mean is 204 and there's a small standard deviation of 12.49. The upper limit is 216 and the lower limit is 192. This is an example of a good segmentation group. The distance from the data points to the centroid is small.
Next is segment number three:
Figure 1.6 – Segment 3, mean and standard deviation
The mean is 204, the upper limit is 216, and the lower limit is 192. By the standard deviation of the points, we also conclude that the segment is compact enough to give reliable information.
The points are close to the centroid, so the behavior of the members of the group or segment is very similar.
Segment number four is the smallest of all. It is shown in Figure 1.7:
Figure 1.7 – Segment 4, mean and standard deviation
The limits are 62 and 86 and the mean is 74. Figure 1.3 shows that segment four is the group with the second-lowest revenue after segment one. But segment one is scattered, with a large standard deviation, so it is not a compact group, and its information is not reliable.
After reviewing the four segments, we conclude that segment number one is the lowest revenue group. It also has the highest separation of points from its centroid. It is probably an outlier and represents the non-regular behavior of sales.
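A check like this, flagging the points that fall outside a segment's limits as candidate outliers, can be sketched as follows (the sales figures are invented for illustration; the one-standard-deviation limits mirror the limits drawn in the figures above):

```python
import statistics

# Hypothetical daily sales amounts; most cluster around 200.
sales = [196, 201, 204, 199, 207, 210, 18, 202, 205, 33]

mean = statistics.mean(sales)
std = statistics.stdev(sales)

# Flag points outside mean +/- one standard deviation,
# matching the limit lines used in the segment charts.
outliers = [x for x in sales if abs(x - mean) > std]
print(outliers)  # -> [18, 33]
```

The two flagged values stand far below the rest of the sales, exactly the kind of non-regular behavior that deserves further research before drawing conclusions from the segment.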
In this section, we reviewed the basic statistical concepts and how they relate to segmentation. We learned that the best-case scenario is to have compact groups with a small standard deviation from the group's mean. It is important to follow up on the points that are outside the groups. These outliers (with very different behavior compared with the rest of the values) could be indicators of fraud. In the next section, we will apply these concepts to multi-variable analysis. We will have groups with two or more variables.
Now, we are going to segment data with two variables. Several real-world problems need to group two or more variables to classify data where one variable influences the other. For example, we can use the month number and the sales revenue dataset to find out the time of the year with higher and lower sales. We will use online marketing and sales revenue. Figure 1.8 shows the four segments of the data and the relationship between online marketing investment and revenue. We can see that segments 1, 2, and 4 are relatively compact. The exception is segment 3 because it has a point that appears to be an outlier. This outlier will affect the average and the standard deviation of the segment:
Figure 1.8 – Grouping with two variables
Segment 4 appears to have the smallest standard deviation. This group looks compact. Segment 2 also appears to be compact and it has a high value of revenue.
In Figure 1.9, we will find out the mean and the standard deviation of segment 2:
Figure 1.9 – Segment 2, mean and standard deviation
As we are analyzing two variables, the centroid of the segment has two coordinates: the online marketing spend and the revenue.
The mean has the following coordinates:
- Online marketing: 5.04
- Revenue: 204.11

In Figure 1.9, the centroid is at these coordinates.
The standard deviation of online marketing is 1.53, and for revenue, it is 76.63.
The limits of the revenue are the black lines. They are 160 and 280. So, segment two is not compact because the majority of points are between 160 and 210 with an outlier close to 280.
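With two variables, the centroid and the spread are computed per coordinate. A minimal Python sketch (with made-up marketing/revenue pairs, not the book's data) shows the idea:

```python
import statistics

# Hypothetical (online marketing spend, revenue) pairs for one segment.
segment = [(4.0, 160), (5.0, 190), (5.5, 205), (4.5, 200), (6.0, 210)]

marketing = [m for m, r in segment]
revenue = [r for m, r in segment]

# The centroid has one coordinate per variable.
centroid = (statistics.mean(marketing), statistics.mean(revenue))

# Likewise, the spread is measured per variable.
spread = (statistics.stdev(marketing), statistics.stdev(revenue))

print(centroid)  # one mean per variable
print(spread)    # one standard deviation per variable
```

The same pattern extends to three or more variables: one mean and one standard deviation per coordinate, even when the data can no longer be charted.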
When we analyze data with three variables, the mean and the standard deviation are represented by three coordinates. Figure 1.10 shows data with three variables and the segment that each of them belongs to:
Figure 1.10 – Segments with three variables
The mean and standard deviation have three coordinates. For example, for segment three, these are the coordinates:
Figure 1.11 – Mean and standard deviation coordinates with three variables
The standard deviation of revenue is large, 13.73. This means the points are widely scattered from the centroid, 15.8. This segment probably does not give accurate information because the points are not compact.
In this chapter, we learned why it's important to find the optimal number of groups before we conduct K-means clustering. Once we have the groups, we analyze whether they comply with the best-case scenario for segments: having a small standard deviation. We also research outliers to find out whether their behavior warrants further investigation, such as fraud detection.
We need a machine learning function such as K-means clustering to segment data because classifying by simple inspection using a 2D or 3D chart is not practical and is sometimes impossible. Segmentation with three or more variables is more complicated because it is not possible to plot them.
K-means clustering helps us to find the optimal number of segments or groups for our data. The best case is to have segments that are as compact as possible.
Each segment has a mean, or centroid, and its values are supposed to be as close as possible to the centroid. This means that the standard deviation of each segment must be as small as possible.
You need to pay attention to segments with large standard deviations because they could contain outliers. This type of value in our dataset could be an early warning of future problems because such values behave randomly and irregularly, outside the normal pattern of the rest of the data.
In the next chapter, we will get an introduction to the linear regression supervised machine learning algorithm. Linear regression needs statistical tests for the data to measure its level of relationship and to check whether it is useful for the model. Otherwise, it is not worth building the model.
Here are a few questions to assess your learning from this chapter:
- Why is it necessary to know the optimal number of groups for the data before running the K-means classification algorithm?
- Is it possible to use K-means clustering for data with four or more variables?
- What are outliers, and how do we process them?

Here are the answers to the previous questions:
- Having the optimal number of groups helps to get more compact groups and prevents us from having a large number of outliers.
- Yes, it is possible. It is more difficult to visualize the potential groups with a chart, but we can use K-means clustering to get the optimal number of groups and then do the classification.
- Outliers are points that do not have the same behavior as the rest of the groups. It is necessary to do further research on them because it could lead to finding potential fraud or system performance degradation.

To further understand the concepts of this chapter, you can refer to the following sources:
- 8 databases supporting in-database machine learning: https://www.infoworld.com/article/3607762/8-databases-supporting-in-database-machine-learning.html
- Creating a K-means model to cluster the London bicycle hires dataset with Google BigQuery: https://cloud.google.com/bigquery-ml/docs/kmeans-tutorial