Data Science for Marketing Analytics

second edition

A practical guide to forming a killer marketing strategy through data analysis with Python

Mirza Rahim Baig, Gururajan Govindan, and Vishwesh Ravi Shrimali

Data Science for Marketing Analytics

second edition

Copyright © 2021 Packt Publishing

All rights reserved. No part of this course may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this course to ensure the accuracy of the information presented. However, the information contained in this course is sold without warranty, either express or implied. Neither the authors nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this course.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this course by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Authors: Mirza Rahim Baig, Gururajan Govindan, and Vishwesh Ravi Shrimali

Reviewers: Cara Davies and Subhranil Roy

Managing Editors: Prachi Jain and Abhishek Rane

Acquisitions Editors: Royluis Rodrigues, Kunal Sawant, and Sneha Shinde

Production Editor: Salma Patel

Editorial Board: Megan Carlisle, Mahesh Dhyani, Heather Gopsill, Manasa Kumar, Alex Mazonowicz, Monesh Mirpuri, Bridget Neale, Abhishek Rane, Brendan Rodrigues, Ankita Thakur, Nitesh Thakur, and Jonathan Wray

First published: March 2019

First edition authors: Tommy Blanchard, Debasish Behera, and Pranshu Bhatnagar

Second edition: September 2021

Production reference: 1060921

ISBN: 978-1-80056-047-5

Published by Packt Publishing Ltd.

Livery Place, 35 Livery Street

Birmingham B3 2PB, UK

Table of Contents

Preface

1. Data Preparation and Cleaning

Introduction

Data Models and Structured Data

pandas

Importing and Exporting Data with pandas DataFrames

Viewing and Inspecting Data in DataFrames

Exercise 1.01: Loading Data Stored in a JSON File

Exercise 1.02: Loading Data from Multiple Sources

Structure of a pandas DataFrame and Series

Data Manipulation

Selecting and Filtering in pandas

Creating DataFrames in Python

Adding and Removing Attributes and Observations

Combining Data

Handling Missing Data

Exercise 1.03: Combining DataFrames and Handling Missing Values

Applying Functions and Operations on DataFrames

Grouping Data

Exercise 1.04: Applying Data Transformations

Activity 1.01: Addressing Data Spilling

Summary

2. Data Exploration and Visualization

Introduction

Identifying and Focusing on the Right Attributes

The groupby() Function

The unique() function

The value_counts() function

Exercise 2.01: Exploring the Attributes in Sales Data

Fine Tuning Generated Insights

Selecting and Renaming Attributes

Reshaping the Data

Exercise 2.02: Calculating Conversion Ratios for Website Ads

Pivot Tables

Visualizing Data

Exercise 2.03: Visualizing Data With pandas

Visualization through Seaborn

Visualization with Matplotlib

Activity 2.01: Analyzing Advertisements

Summary

3. Unsupervised Learning and Customer Segmentation

Introduction

Segmentation

Exercise 3.01: Mall Customer Segmentation – Understanding the Data

Approaches to Segmentation

Traditional Segmentation Methods

Exercise 3.02: Traditional Segmentation of Mall Customers

Unsupervised Learning (Clustering) for Customer Segmentation

Choosing Relevant Attributes (Segmentation Criteria)

Standardizing Data

Exercise 3.03: Standardizing Customer Data

Calculating Distance

Exercise 3.04: Calculating the Distance between Customers

K-Means Clustering

Exercise 3.05: K-Means Clustering on Mall Customers

Understanding and Describing the Clusters

Activity 3.01: Bank Customer Segmentation for Loan Campaign

Clustering with High-Dimensional Data

Exercise 3.06: Dealing with High-Dimensional Data

Activity 3.02: Bank Customer Segmentation with Multiple Features

Summary

4. Evaluating and Choosing the Best Segmentation Approach

Introduction

Choosing the Number of Clusters

Exercise 4.01: Data Staging and Visualization

Simple Visual Inspection to Choose the Optimal Number of Clusters

Exercise 4.02: Choosing the Number of Clusters Based on Visual Inspection

The Elbow Method with Sum of Squared Errors

Exercise 4.03: Determining the Number of Clusters Using the Elbow Method

Activity 4.01: Optimizing a Luxury Clothing Brand's Marketing Campaign Using Clustering

More Clustering Techniques

Mean-Shift Clustering

Exercise 4.04: Mean-Shift Clustering on Mall Customers

Benefits and Drawbacks of the Mean-Shift Technique

k-modes and k-prototypes Clustering

Exercise 4.05: Clustering Data Using the k-prototypes Method

Evaluating Clustering

Silhouette Score

Exercise 4.06: Using Silhouette Score to Pick Optimal Number of Clusters

Train and Test Split

Exercise 4.07: Using a Train-Test Split to Evaluate Clustering Performance

Activity 4.02: Evaluating Clustering on Customer Data

The Role of Business in Cluster Evaluation

Summary

5. Predicting Customer Revenue Using Linear Regression

Introduction

Regression Problems

Exercise 5.01: Predicting Sales from Advertising Spend Using Linear Regression

Feature Engineering for Regression

Feature Creation

Data Cleaning

Exercise 5.02: Creating Features for Customer Revenue Prediction

Assessing Features Using Visualizations and Correlations

Exercise 5.03: Examining Relationships between Predictors and the Outcome

Activity 5.01: Examining the Relationship between Store Location and Revenue

Performing and Interpreting Linear Regression

Exercise 5.04: Building a Linear Model Predicting Customer Spend

Activity 5.02: Predicting Store Revenue Using Linear Regression

Summary

6. More Tools and Techniques for Evaluating Regression Models

Introduction

Evaluating the Accuracy of a Regression Model

Residuals and Errors

Mean Absolute Error

Root Mean Squared Error

Exercise 6.01: Evaluating Regression Models of Location Revenue Using the MAE and RMSE

Activity 6.01: Finding Important Variables for Predicting Responses to a Marketing Offer

Using Recursive Feature Selection for Feature Elimination

Exercise 6.02: Using RFE for Feature Selection

Activity 6.02: Using RFE to Choose Features for Predicting Customer Spend

Tree-Based Regression Models

Random Forests

Exercise 6.03: Using Tree-Based Regression Models to Capture Non-Linear Trends

Activity 6.03: Building the Best Regression Model for Customer Spend Based on Demographic Data

Summary

7. Supervised Learning: Predicting Customer Churn

Introduction

Classification Problems

Understanding Logistic Regression

Revisiting Linear Regression

Logistic Regression

Cost Function for Logistic Regression

Assumptions of Logistic Regression

Exercise 7.01: Comparing Predictions by Linear and Logistic Regression on the Shill Bidding Dataset

Creating a Data Science Pipeline

Churn Prediction Case Study

Obtaining the Data

Exercise 7.02: Obtaining the Data

Scrubbing the Data

Exercise 7.03: Imputing Missing Values

Exercise 7.04: Renaming Columns and Changing the Data Type

Exploring the Data

Exercise 7.05: Obtaining the Statistical Overview and Correlation Plot

Visualizing the Data

Exercise 7.06: Performing Exploratory Data Analysis (EDA)

Activity 7.01: Performing the OSE technique from OSEMN

Modeling the Data

Feature Selection

Exercise 7.07: Performing Feature Selection

Model Building

Exercise 7.08: Building a Logistic Regression Model

Interpreting the Data

Activity 7.02: Performing the MN technique from OSEMN

Summary

8. Fine-Tuning Classification Algorithms

Introduction

Support Vector Machines

Intuition behind Maximum Margin

Linearly Inseparable Cases

Linearly Inseparable Cases Using the Kernel

Exercise 8.01: Training an SVM Algorithm Over a Dataset

Decision Trees

Exercise 8.02: Implementing a Decision Tree Algorithm over a Dataset

Important Terminology for Decision Trees

Decision Tree Algorithm Formulation

Random Forest

Exercise 8.03: Implementing a Random Forest Model over a Dataset

Classical Algorithms – Accuracy Compared

Activity 8.01: Implementing Different Classification Algorithms

Preprocessing Data for Machine Learning Models

Standardization

Exercise 8.04: Standardizing Data

Scaling

Exercise 8.05: Scaling Data After Feature Selection

Normalization

Exercise 8.06: Performing Normalization on Data

Model Evaluation

Exercise 8.07: Stratified K-fold

Fine-Tuning of the Model

Exercise 8.08: Fine-Tuning a Model

Activity 8.02: Tuning and Optimizing the Model

Performance Metrics

Precision

Recall

F1 Score

Exercise 8.09: Evaluating the Performance Metrics for a Model

ROC Curve

Exercise 8.10: Plotting the ROC Curve

Activity 8.03: Comparison of the Models

Summary

9. Multiclass Classification Algorithms

Introduction

Understanding Multiclass Classification

Classifiers in Multiclass Classification

Exercise 9.01: Implementing a Multiclass Classification Algorithm on a Dataset

Performance Metrics

Exercise 9.02: Evaluating Performance Using Multiclass Performance Metrics

Activity 9.01: Performing Multiclass Classification and Evaluating Performance

Class-Imbalanced Data

Exercise 9.03: Performing Classification on Imbalanced Data

Dealing with Class-Imbalanced Data

Exercise 9.04: Fixing the Imbalance of a Dataset Using SMOTE

Activity 9.02: Dealing with Imbalanced Data Using scikit-learn

Summary

Appendix

Preface

About the Book

Unleash the power of data to reach your marketing goals with this practical guide to data science for business.

This book will help you get started on your journey to becoming a master of marketing analytics with Python. You'll work with relevant datasets and build your practical skills by tackling engaging exercises and activities that simulate real-world market analysis projects.

You'll learn to think like a data scientist, build your problem-solving skills, and discover how to look at data in new ways to deliver business insights and make intelligent data-driven decisions.

As well as learning how to clean, explore, and visualize data, you'll implement machine learning algorithms and build models to make predictions. As you work through the book, you'll use Python tools to analyze sales, visualize advertising data, predict revenue, address customer churn, and implement customer segmentation to understand behavior.

This second edition has been updated to include new case studies that bring a more application-oriented approach to your marketing analytics journey. The code has also been updated to support the latest versions of Python and the popular data science libraries that have been used in the book. The practical exercises and activities have been revamped to prepare you for the real-world problems that marketing analysts need to solve. This will show you how to create a measurable impact on businesses large and small.

By the end of this book, you'll have the knowledge, skills, and confidence to implement data science and machine learning techniques to better understand your marketing data and improve your decision-making.

About the Authors

Mirza Rahim Baig is an avid problem solver who uses deep learning and artificial intelligence to solve complex business problems. He has more than a decade of experience in creating value from data, harnessing the power of the latest in machine learning and AI, with proficiency in using unstructured and structured data across areas like marketing, customer experience, catalog, supply chain, and other e-commerce sub-domains. Rahim is also a teacher, designing, creating, and teaching data science courses for various learning platforms. He loves making the complex easy to understand. He is also an author of The Deep Learning Workshop, a hands-on guide to starting your deep learning journey and building your own next-generation deep learning models.

Gururajan Govindan is a data scientist, intrapreneur, and trainer with more than seven years of experience working across domains such as finance and insurance. He is also an author of The Data Analysis Workshop, a book focusing on data analytics. He is well known for his expertise in data-driven decision making and machine learning with Python.

Vishwesh Ravi Shrimali graduated from BITS Pilani, where he studied mechanical engineering. He has a keen interest in programming and AI and has applied that interest in mechanical engineering projects. He has also written multiple blogs on OpenCV, deep learning, and computer vision. When he is not writing blogs or working on projects, he likes to go on long walks or play his acoustic guitar. He is also an author of The Computer Vision Workshop, a book focusing on OpenCV and its applications in real-world scenarios, as well as Machine Learning for OpenCV (2nd Edition), which introduces how to use OpenCV for machine learning applications.

Who This Book Is For

This marketing book is for anyone who wants to learn how to use Python for cutting-edge marketing analytics. Whether you're a developer who wants to move into marketing, or a marketing analyst who wants to learn more sophisticated tools and techniques, this book will get you on the right path. Basic prior knowledge of Python is required to work through the exercises and activities provided in this book.

About the Chapters

Chapter 1, Data Preparation and Cleaning, teaches you skills related to data cleaning along with various data preprocessing techniques using real-world examples.

Chapter 2, Data Exploration and Visualization, teaches you how to explore and analyze data with the help of various aggregation techniques and visualizations using Matplotlib and Seaborn.

Chapter 3, Unsupervised Learning and Customer Segmentation, teaches you customer segmentation, one of the most important skills for a data science professional in marketing. You will learn how to use machine learning to perform customer segmentation with the help of scikit-learn. You will also learn to evaluate segments from a business perspective.

Chapter 4, Evaluating and Choosing the Best Segmentation Approach, expands your repertoire to various advanced clustering techniques and teaches principled numerical methods of evaluating clustering performance.

Chapter 5, Predicting Customer Revenue using Linear Regression, gets you started on predictive modeling of quantities by introducing you to regression and teaching simple linear regression in a hands-on manner using scikit-learn.

Chapter 6, More Tools and Techniques for Evaluating Regression Models, goes into more details of regression techniques, along with different regularization methods available to prevent overfitting. You will also discover the various evaluation metrics available to identify model performance.

Chapter 7, Supervised Learning: Predicting Customer Churn, uses a churn prediction problem as the central problem statement throughout the chapter to cover different classification algorithms and their implementation using scikit-learn.

Chapter 8, Fine-Tuning Classification Algorithms, introduces support vector machines and tree-based classifiers along with the evaluation metrics for classification algorithms. You will also learn about the process of hyperparameter tuning which will help you obtain better results using these algorithms.

Chapter 9, Multiclass Classification Algorithms, introduces a multiclass classification problem statement and the classifiers that can be used to solve such problems. You will learn about imbalanced datasets and their treatment in detail. You will also discover the micro- and macro-evaluation metrics available in scikit-learn for these classifiers.

Conventions

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, and user input are shown as follows:

"df.head(n) will return the first n rows of the DataFrame. If no n is passed, the function considers n to be 5 by default."

Words that you see on the screen, for example, in menus or dialog boxes, also appear in the same format.

A block of code is set as follows:

sales.head()

New important words are shown like this: "a box plot is used to depict the distribution of numerical data and is primarily used for comparisons".

Key parts of code snippets are emboldened as follows:

Code Presentation

Lines of code that span multiple lines are split using a backslash (\). When the code is executed, Python will ignore the backslash, and treat the code on the next line as a direct continuation of the current line.

For example,

df = pd.DataFrame({'ValueInINR': \
                   pd.Series([70, 89, 99])})

df.head()

Comments are added into code to help explain specific bits of logic. Single-line comments are denoted using the # symbol, as follows:

# Importing the matplotlib library

import matplotlib.pyplot as plt

#Declaring the color of the plot as gray

plt.bar(sales['Product line'], sales['Revenue'], color='gray')

Multi-line comments are used as follows:

"""

Importing classification_report and precision_recall_fscore_support from sklearn.metrics

"""

from sklearn.metrics import classification_report

from sklearn.metrics import precision_recall_fscore_support

Minimum Hardware Requirements

For an optimal experience, we recommend the following hardware configuration:

Processor: Dual Core or better

Memory: 4 GB RAM

Storage: 10 GB available space

Downloading the Code Bundle

Download the code files from GitHub at https://packt.link/59F3X. Refer to these code files for the complete code bundle. The files here contain the exercises, activities, and some intermediate code for each chapter. This can be a useful reference when you become stuck.

On the GitHub repo's page, you can click the green Code button and then click the Download ZIP option to download the complete code as a ZIP file to your disk (refer to Figure 0.1). You can then extract these code files to a folder of your choice, for example, C:\Code.

Figure 0.1: Download ZIP option on GitHub

On your system, the extracted ZIP file should contain all the files present in the GitHub repository:

Figure 0.2: GitHub code directory structure (Windows Explorer)

Setting Up Your Environment

Before you explore the book in detail, you need to set up specific software and tools. In the following section, you shall see how to do that.

Installing Anaconda on Your System

The code for all the exercises and activities in this book can be executed using Jupyter Notebooks. You'll first need to install the Anaconda Navigator, which is an interface through which you can access your Jupyter Notebooks. Anaconda Navigator will be installed as a part of Anaconda Individual Edition, which is an open-source Python distribution platform available for Windows, macOS, and Linux. Installing Anaconda will also install Python. Head to https://www.anaconda.com/distribution/.

From the page that opens, click the Download button (annotated by 1). Make sure you are downloading the Individual Edition.

Figure 0.3: Anaconda homepage

The installer should start downloading immediately. The website will, by default, choose an installer based on your system configuration. If you prefer downloading Anaconda for a different operating system (Windows, macOS, or Linux) and system configuration (32- or 64-bit), click the Get Additional Installers link at the bottom of the box (refer to Figure 0.3). The page should scroll down to a section (refer to Figure 0.4) that lets you choose from various options based on the operating system and configuration you desire. For this book, it is recommended that you use the latest version of Python (3.8 or higher).

Figure 0.4: Downloading Anaconda Installers based on the OS

Follow the installation steps presented on the screen.

Figure 0.5: Anaconda setup

On Windows, if you've never installed Python on your system before, you can select the checkbox that prompts you to add Anaconda to your PATH. This will let you run Anaconda-specific commands (like conda) from the default command prompt. If you have Python installed or had installed an earlier version of Anaconda in the past, it is recommended that you leave it unchecked (you may run Anaconda commands from the Anaconda Prompt application instead). The installation may take a while depending on your system configuration.

Figure 0.6: Anaconda installation steps

For more detailed instructions, you may refer to the official documentation for Linux by clicking this link (https://docs.anaconda.com/anaconda/install/linux/), macOS using this link (https://docs.anaconda.com/anaconda/install/mac-os/), and Windows using this link (https://docs.anaconda.com/anaconda/install/windows/).

To check if Anaconda Navigator is correctly installed, look for Anaconda Navigator in your applications. Look for an application that has the following icon. Depending on your operating system, the icon's aesthetics may vary slightly.

Figure 0.7: Anaconda Navigator icon

You can also search for the application using your operating system's search functionality. For example, on Windows 10, you can use the Windows Key + S combination and type in Anaconda Navigator. On macOS, you can use Spotlight search. On Linux, you can open the terminal and type the anaconda-navigator command and press the return key.

Figure 0.8: Searching for Anaconda Navigator on Windows 10

For detailed steps on how to verify if Anaconda Navigator is installed, refer to the following link: https://docs.anaconda.com/anaconda/install/verify-install/.

Click the icon to open Anaconda Navigator. It may take a while to load for the first time, but upon successful installation, you should see a similar screen:

Figure 0.9: Anaconda Navigator screen

If you have more questions about the installation process, you may refer to the list of frequently asked questions from the Anaconda documentation: https://docs.anaconda.com/anaconda/user-guide/faq/.

Launching Jupyter Notebook

Once the Anaconda Navigator is open, you can launch the Jupyter Notebook interface from this screen. The following steps will show you how to do that:

Open Anaconda Navigator. You should see the following screen:

Figure 0.10: Anaconda Navigator screen

Now, click Launch under the Jupyter Notebook panel to start the notebook interface on your local system.

Figure 0.11: Jupyter notebook launch option

On clicking the Launch button, you'll notice that even though nothing changes in the window shown in the preceding screenshot, a new tab opens up in your default browser. This is known as the Notebook Dashboard. It will, by default, open to your root folder. For Windows users, this path would be something similar to C:\Users\<username>. On macOS and Linux, it will be /home/<username>/.

Figure 0.12: Notebook dashboard

Note that you can also open a Jupyter Notebook by simply running the command jupyter notebook in the terminal or command prompt. Or you can search for Jupyter Notebook in your applications just like you did in Figure 0.8.

You can use this Dashboard as a file explorer to navigate to the directory where you have downloaded or stored the code files for the book (refer to the Downloading the Code Bundle section on how to download the files from GitHub). Once you have navigated to your desired directory, you can start by creating a new Notebook. Alternatively, if you've downloaded the code from our repository, you can open an existing Notebook as well (Notebook files will have a .ipynb extension). The menus here are quite simple to use:

Figure 0.13: Jupyter notebook navigator menu options walkthrough

If you make any changes to the directory using your operating system's file explorer and the changed file isn't showing up in the Jupyter Notebook Navigator, click the Refresh Notebook List button (annotated as 1). To quit, click the Quit button (annotated as 2). To create a new file (a new Jupyter Notebook), you can click the New button (annotated as 3).

Clicking the New button will open a dropdown menu as follows:

Figure 0.14: Creating a new Jupyter notebook

Note

A detailed tutorial on the interface and the keyboard shortcuts for Jupyter Notebooks can be found here: https://jupyter-notebook.readthedocs.io/en/stable/notebook.html.

You can get started and create your first notebook by selecting Python 3; however, it is recommended that you also set up the virtual environment we've provided. Installing the environment will also install all the packages required for running the code in this book. The following section will show you how to do that.

Installing the ds-marketing Virtual Environment

As you run the code for the exercises and activities, you'll notice that even after installing Anaconda, there are certain libraries like kmodes which you'll need to install separately as you progress in the book. Then again, you may already have these libraries installed, but their versions may be different from the ones we've used, which may lead to varying results. That's why we've provided an environment.yml file with this book that will:

Install all the packages and libraries required for this book at once.

Make sure that the version numbers of your libraries match the ones we've used to write the code for this book.

Make sure that the code you write based on this book remains separate from any other coding environment you may have.

You can download the environment.yml file by clicking the following link: http://packt.link/dBv1k.

Save this file, ideally in the same folder where you'll be running the code for this book. If you've downloaded the code from GitHub as detailed in the Downloading the Code Bundle section, this file should already be present in the parent directory, and you won't need to download it separately.

To set up the environment, follow these steps:

On macOS, open Terminal from the Launchpad (you can find more information about Terminal here: https://support.apple.com/en-in/guide/terminal/apd5265185d-f365-44cb-8b09-71a064a42125/mac). On Linux, open the Terminal application that's native to your distribution. On Windows, you can open the Anaconda Prompt instead by simply searching for the application. You can do this by opening the Start menu and searching for Anaconda Prompt.

Figure 0.15: Searching for Anaconda Prompt on Windows

A new terminal like the following should open. By default, it will start in your home directory:

Figure 0.16: Anaconda terminal prompt

In the case of Linux, it would look like the following:

Figure 0.17: Terminal in Linux

In the terminal, navigate to the directory where you've saved the environment.yml file on your computer using the cd command. Say you've saved the file in Documents\Data-Science-for-Marketing-Analytics-Second-Edition. In that case, you'll type the following command in the prompt and press Enter:

cd Documents\Data-Science-for-Marketing-Analytics-Second-Edition

Note that the command may vary slightly based on your directory structure and your operating system.

Now that you've navigated to the correct folder, create a new conda environment by typing or pasting the following command in the terminal. Press Enter to run the command.

conda env create -f environment.yml

This will install the ds-marketing virtual environment along with the libraries that are required to run the code in this book. In case you see a prompt asking you to confirm before proceeding, type y and press Enter to continue creating the environment. Depending on your system configuration, it may take a while for the process to complete.

Note

For a complete list of conda commands, visit the following link: https://conda.io/projects/conda/en/latest/index.html. For a detailed guide on how to manage conda environments, please visit the following link: https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html.

Once complete, type or paste the following command in the shell to activate the newly installed environment, ds-marketing.

conda activate ds-marketing

If the installation is successful, you'll see the environment name in brackets change from base to ds-marketing:

Figure 0.18: Environment name showing up in the shell

Run the following command to install ipykernel in the newly activated conda environment:

pip install ipykernel

Note

On macOS and Linux, you'll need to specify pip3 instead of pip.

In the same environment, run the following command to add ipykernel as a Jupyter kernel:

python -m ipykernel install --user --name=ds-marketing

Windows only:

If you're on Windows, type or paste the following command. Otherwise, you may skip this step and exit the terminal.

conda install pywin32

Select the created kernel ds-marketing when you start your Jupyter notebook.

Figure 0.19: Selecting the ds-marketing kernel

A new tab will open with a fresh untitled Jupyter notebook where you can start writing your code:

Figure 0.20: A new Jupyter notebook

Running the Code Online Using Binder

You can also try running the code files for this book in a completely online environment through an interactive Jupyter Notebook interface called Binder. Along with the individual code files that can be downloaded locally, we have provided a link that will help you quickly access the Binder version of the GitHub repository for the book. Using this link, you can run any of the .ipynb code files for this book in a cloud-based online interactive environment. Click the following link to open the online Binder version of the book's repository to give it a try: https://packt.link/GdQOp. It is recommended that you save the link in your browser bookmarks for future reference (you may also use the launch binder link provided in the README section of the book's GitHub page).

Depending on your internet connection, it may take a while to load, but once loaded, you'll get the same interface as you would when running the code in a local Jupyter Notebook (all your shortcuts should work as well):

Figure 0.21: Binder lets you run Jupyter Notebooks in a browser

Binder is an online service that helps you read and execute Jupyter Notebook files (.ipynb) present in any public GitHub repository in a cloud-based environment. However, please note that there are certain memory constraints associated with Binder. This means that running multiple Jupyter Notebook instances at the same time or running processes that consume a lot of memory (like model training) can result in a kernel crash or kernel reset. Moreover, any changes you make in these online Notebooks would not be stored, and the Notebooks will reset to the latest version present in the repository whenever you close and re-open the Binder link. A stable internet connection is required to use Binder. You can find out more about the Binder Project here: https://jupyter.org/binder.

This is a recommended option for readers who want to have a quick look at the code and experiment with it without downloading the entire repository on their local machine.

Get in Touch

Feedback from our readers is always welcome.

General feedback: If you have any questions about this book, please mention the book title in the subject of your message and email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you could report this to us. Please visit www.packtpub.com/support/errata and complete the form.

Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you could provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit https://authors.packtpub.com/.

Please Leave a Review

Let us know what you think by leaving a detailed, impartial review on Amazon. We appreciate all feedback – it helps us continue to make great products and help aspiring developers build their skills. Please spare a few minutes to give your thoughts – it makes a big difference to us. You can leave a review by clicking the following link: https://packt.link/r/1800560478.

To Azra, Aiza, Duha and Aidama - you inspire courage, strength, and grace.

- Mirza Rahim Baig

To Appa, Amma, Vindhya, Madhu, and Ishan - The Five Pillars of my life.

- Gururajan Govindan

To Nanaji, Dadaji, and Appa - for their wisdom, inspiration, and unconditional love.

- Vishwesh Ravi Shrimali

1. Data Preparation and Cleaning

Overview

In this chapter, you'll learn the skills required to process and clean data to effectively ready it for further analysis. Using the pandas library in Python, you will learn how to read and import data from various file formats, including JSON and CSV, into a DataFrame. You'll then learn how to perform slicing, aggregation, and filtering on DataFrames. By the end of the chapter, you will consolidate your data cleaning skills by learning how to join DataFrames, handle missing values, and even combine data from various sources.

Introduction

"Since you liked this artist, you'll also like their new album," "Customers who bought bread also bought butter," and "1,000 people near you have also ordered this item." Every day, recommendations like these influence customers' shopping decisions, helping them discover new products. Such recommendations are possible thanks to data science techniques that leverage data to create complex models, perform sophisticated tasks, and derive valuable customer insights with great precision. While the use of data science principles in marketing analytics is a proven, cost-effective, and efficient strategy, many companies are still not using these techniques to their full potential. There is a wide gap between the possible and actual usage of these techniques.

This book is designed to teach you skills that will help you contribute toward bridging that gap. It covers a wide range of useful techniques that will allow you to leverage everything data science can do in terms of strategies and decision-making in the marketing domain. By the end of the book, you should be able to successfully create and manage an end-to-end marketing analytics solution in Python, segment customers based on the data provided, predict their lifetime value, and model their decision-making behavior using data science techniques.

You will start your journey by first learning how to clean and prepare data. Raw data from external sources cannot be used directly; it needs to be analyzed, structured, and filtered before it can be used any further. In this chapter, you will learn how to manipulate rows and columns and apply transformations to data to ensure you have the right data with the right attributes. This is an essential skill in a data analyst's arsenal because, otherwise, the outcome of your analysis will be based on incorrect data, thereby making it a classic example of garbage in, garbage out. But before you start working with the data, it is important to understand its nature - in other words, the different types of data you'll be working with.

Data Models and Structured Data

When you build an analytical solution, the first thing that you need to do is to build a data model. A data model is an overview of the data sources that you will be using, their relationships with other data sources, where exactly the data from a specific source is going to be fetched, and in what form (such as an Excel file, a database, or a JSON from an internet source).

Note

Keep in mind that the data model evolves as data sources and processes change.

A data model can contain data of the following three types:

Structured Data: Also known as completely structured or well-structured data, this is the simplest way to manage information. The data is arranged in a flat tabular form with the correct value corresponding to the correct attribute. There is a unique column, known as an index, for easy and quick access to the data, and there are no duplicate columns. For example, in Figure 1.1, employee_id is the unique column. Using the data in this column, you can run SQL queries and quickly access data at a specific row and column in the dataset. Furthermore, there are no empty rows, missing entries, or duplicate columns, thereby making this dataset quite easy to work with. What makes structured data most ubiquitous and easy to analyze is that it is stored in a standardized tabular format that makes adding, updating, and deleting entries easy and programmable. With structured data, you may not have to put in much effort during the data preparation and cleaning stage.

Data stored in relational databases such as MySQL, Amazon Redshift, and more are examples of structured data:

Figure 1.1: Data in a MySQL table

Semi-structured data: You will not find semi-structured data stored in a strict, tabular hierarchy as you saw in Figure 1.1. However, it will still have its own hierarchies that group its elements and establish a relationship between them. For example, the metadata of a song may include information about the cover art, the artist, the song length, and even the lyrics. You can search for the artist's name and find the song you want. Such data does not have a fixed hierarchy mapping the unique column with rows in an expected format, and yet you can find the information you need.

Another example of semi-structured data is a JSON file. JSON files are self-describing and can be understood easily. In Figure 1.2, you can see a JSON file that contains personally identifiable information of Jack Jones.

Semi-structured data can be stored accurately in NoSQL databases.

Figure 1.2: Data in a JSON file

Unstructured data: Unstructured data may not be tabular, and even if it is tabular, the number of attributes or columns per observation may be completely arbitrary. The same data could be represented in different ways, and the attributes might not match each other, with values leaking into other parts.

For example, think of reviews of various products stored in rows of an Excel sheet or a dump of the latest tweets of a company's Twitter profile. We can only search for specific keywords in that data, but we cannot store it in a relational database, nor will we be able to establish a concrete hierarchy between different elements or rows. Unstructured data can be stored as text files, CSV files, Excel files, images, and audio clips.

Marketing data, traditionally, comprises all three aforementioned data types. Initially, most data points originate from different data sources. This has several implications: the values of a field could be of different lengths, the values of one field might not line up with those of other fields because of differing field names, and some rows might have missing values for some of the fields.

You'll soon learn how to effectively tackle such problems with your data using Python. The following diagram illustrates what a data model for marketing analytics looks like. The data model comprises all kinds of data: structured data such as databases (top), semi-structured data such as JSON (middle), and unstructured data such as Excel files (bottom):

Figure 1.3: Data model for marketing analytics

As the data model becomes complex, the probability of having bad data increases. For example, a marketing analyst working with the demographic details of a customer can mistakenly read the age of the customer as a text string instead of a number (integer). In such situations, the analysis would go haywire as the analyst cannot perform any aggregation functions, such as finding the average age of a customer. These types of situations can be overcome by having a proper data quality check to ensure that the data chosen for further analysis is of the correct data type.

This is where programming languages such as Python come into play. Python is an all-purpose general programming language that integrates with almost every platform and helps automate data production and analysis.
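For instance, a minimal data-quality check in plain Python for the age example above (the values here are purely illustrative) could look like this:

# Ages collected from a form arrive as text; one entry is clearly bad
ages_raw = ['34', '29', 'forty', '51']

ages = []
for value in ages_raw:
    try:
        ages.append(int(value))       # keep values that convert cleanly to integers
    except ValueError:
        print('Bad value skipped:', value)

print(sum(ages) / len(ages))          # the average age can now be computed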

Apart from understanding patterns and giving at least a basic structure to data, Python forces the data model to accept the right value for the attribute. The following diagram illustrates how most marketing analytics today structure different kinds of data by passing it through scripts to make it at least semi-structured:

Figure 1.4: Data model of most marketing analytics that use Python

By making use of such structure-enforcing scripts, you will have a data model of semi-structured data coming in with expected values in the right fields; however, the data is not yet in the best possible format to perform analytics. If you can completely structure your data (that is, arrange it in flat tables, with the right value pointing to the right attribute with no nesting), it will be easy to see how every data point individually compares to other points with the help of common fields. You can easily get a feel of the data—that is, see in what range most values lie, identify the clear outliers, and so on—by simply scrolling through it.

While there are a lot of tools that can be used to convert data from an unstructured or semi-structured format to a fully structured format (for example, Spark, STATA, and SAS), the tool most widely used for data science is pandas: it can be integrated with practically any framework, has rich functionalities and minimal costs, and is easy to use for our use case.

pandas

pandas is a software library written in Python and is the basic building block for data manipulation and analysis. It offers a collection of high-performance, easy-to-use, and intuitive data structures and analysis tools that are of great use to marketing analysts and data scientists alike. The library comes as a default package when you install Anaconda (refer to the Preface for detailed instructions).

Note

Before you run the code in this book, it is recommended that you install and set up the virtual environment using the environment.yml file we have provided in the GitHub repository of this book.

You can find the environment.yml file at the following link: https://packt.link/dBv1k.

It will install all the required libraries and ensure that the version numbers of the libraries on your system match with ours. Refer to the Preface for more instructions on how to set this up.

However, if you're using any other distribution where pandas is not pre-installed, you can run the following command in your terminal app or command prompt to install the library:

pip install pandas

Note

On macOS or Linux, you will need to modify the preceding command to use pip3 instead of pip.

The following diagram illustrates how different kinds of data are converted to a structured format with the help of pandas:

Figure 1.5: Data model to structure the different kinds of data

When working with pandas, you'll be dealing with its two primary object types: DataFrames and Series. What follows is a brief explanation of what those object types are. Don't worry if you are not able to understand things such as their structure and how they work; you'll be learning more about these in detail later in the chapter.

DataFrame: This is the fundamental tabular structure that stores data in rows and columns (like a spreadsheet). When performing data analysis, you can directly apply functions and operations to DataFrames.

Series: This refers to a single column of the DataFrame. Multiple series add up to form a DataFrame. The values can be accessed through its index, which is assigned automatically while defining a DataFrame.

In the following diagram, the users column annotated by 2 is a series, and the viewers, views, users, and cost columns, along with the index, form a DataFrame (annotated by 1):

Figure 1.6: A sample pandas DataFrame and series

Now that you have a brief understanding of what pandas objects are, let's take a look at some of the functions you can use to import and export data in pandas.

Importing and Exporting Data with pandas DataFrames

Every team in a marketing group can have its own preferred data source for its specific use case. Teams that handle a lot of customer data, such as demographic details and purchase history, would prefer a database such as MySQL or Oracle, whereas teams that handle a lot of text might prefer JSON, CSV, or XML. Due to the use of multiple data sources, we end up having a wide variety of files. In such cases, the pandas library comes to our rescue as it provides a variety of APIs (Application Programming Interfaces) that can be used to read multiple different types of data into a pandas DataFrame. Some of the most commonly used APIs are shown here:

Figure 1.7: Ways to import and export different types of data with pandas DataFrames

So, let's say you wanted to read a CSV file. You'll first need to import the pandas library as follows:

import pandas as pd

Then, you will run the following code to store the CSV file in a DataFrame named df (df is a variable):
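df = pd.read_csv("sales.csv")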

In the preceding line, we have sales.csv, which is the file to be imported. This command should work if your Jupyter notebook (or Python process) is run from the same directory where the file is stored. If the file was stored in any other path, you'll have to specify the exact path. On Windows, for example, you'll specify the path as follows:
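df = pd.read_csv(r"C:\Users\<username>\Documents\sales.csv")   # illustrative path; use the actual location of your file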

Note that we've added r before the path to take care of any special characters in the path. As you work with and import various data files in the exercises and activities in this book, we'll often remind you to pay attention to the path of the CSV file.

When loading data, pandas also provides additional parameters that you can pass to the read function so that you can load the data the way you want. Some of these parameters are provided here. Please note that most of these parameters are optional. Also worth noting is the fact that the default index of a DataFrame starts at 0.

For example, if you want to import a CSV file into a DataFrame, df, with the following conditions:

The first row of the file must be the header.

You need to import only the first 100 rows of the file.

You need to import only the first 3 columns.

The code corresponding to the preceding conditions would be as follows:

df = pd.read_csv("sales.csv", header=0, nrows=100, usecols=[0, 1, 2])

Note

There are similar specific parameters for almost every inbuilt function in pandas. You can find details about them with the documentation for pandas available at the following link: https://pandas.pydata.org/pandas-docs/stable/.

Once the data is imported, you need to verify whether it has been imported correctly. Let's understand how to do that in the following section.

Viewing and Inspecting Data in DataFrames

Once you've successfully read a DataFrame using the pandas library, you need to inspect the data to check whether the right attribute has received the right value. You can use several built-in pandas functions to do that.

The most commonly used way to inspect loaded data is using the head() command. By default, this command will display the first five rows of the DataFrame. Here's an example of the command used on a DataFrame called df:

df.head()

The output should be as follows:

Figure 1.8: Output of the df.head() command

Similarly, to display the last five rows, you can use the df.tail() command. Instead of the default five rows, you can even specify the number of rows you want to be displayed. For example, the df.head(11) command will display the first 11 rows.

Here's the complete usage of these two commands, along with a few other commands that can be useful while examining data (a short example follows the list). Again, it is assumed that you have stored the DataFrame in a variable called df:

df.head(n) will return the first n rows of the DataFrame. If no n is passed, the function considers n to be 5 by default.

df.tail(n) will return the last n rows of the DataFrame. If no n is passed, the function considers n to be 5 by default.

df.shape will return the dimensions of a DataFrame (number of rows and number of columns).

df.dtypes will return the type of data in each column of the pandas DataFrame (such as float, object, int64, and so on).

df.info() will summarize the DataFrame and print its size, type of values, and the count of non-null values.
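Assuming the sales data from the earlier example has been loaded into df, a quick inspection pass using these commands might look like this:

import pandas as pd

df = pd.read_csv("sales.csv")

print(df.shape)     # (number of rows, number of columns)
print(df.dtypes)    # data type of each column
df.info()           # prints size, column types, and count of non-null values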

So far, you've learned about the different functions that can be used on DataFrames. In the first exercise, you will practice using these functions to import a JSON file into a DataFrame and, later, to inspect the data.

Exercise 1.01: Loading Data Stored in a JSON File

The tech team in your company has been testing a web version of its flagship shopping app. A few loyal users who volunteered to test the website were asked to submit their details via an online form. The form captured some useful details (such as age, income, and more) along with some not-so-useful ones (such as eye color). The tech team then tested their new profile page module, using which a few additional details were captured. All this data was stored in a JSON file called user_info.json, which the tech team sent to you for validation.

Note

You can find the user_info.json file at the following link: https://packt.link/Gi2O7.

Your goal is to import this JSON file into pandas and let the tech team know the answers to the following questions so that they can add more modules to the website:

Is the data loading correctly?

Are there any missing values in any of the columns?

What are the data types of all the columns?

How many rows and columns are present in the dataset?

Note

All the exercises and activities in this chapter can be performed in both the Jupyter notebook and Python shell. While you can do them in the shell for now, it is highly recommended to use the Jupyter notebook. To learn how to install Jupyter and set up the Jupyter notebook, refer to the Preface. It will be assumed that you are using a Jupyter notebook from this point on.
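A minimal sketch of these checks, assuming user_info.json has been downloaded to your working directory (the variable name here is illustrative), might look like this:

import pandas as pd

# Load the JSON file into a DataFrame
user_info = pd.read_json("user_info.json")

# Is the data loading correctly? Inspect the first few rows
print(user_info.head())

# Are there any missing values in any of the columns?
print(user_info.isnull().sum())

# What are the data types of all the columns?
print(user_info.dtypes)

# How many rows and columns are present in the dataset?
print(user_info.shape)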

In this exercise, you loaded the data, checked whether it had been loaded correctly, and gathered some more information about the entries contained therein. All this was done by loading data stored in a single source, which was the JSON file. As a marketing analyst, you will come across situations where you'll need to load and process data from different sources. Let's practice that in the exercise that follows.

Exercise 1.02: Loading Data from Multiple Sources

You work for a company that uses Facebook for its marketing campaigns. The data.csv file contains the views and likes of 100 different posts on Facebook used for a marketing campaign. The team also uses historical sales data to derive insights. The sales.csv file contains some historical sales data recorded in a CSV file relating to different customer purchases in stores in the past few years.

Your goal is to read the files into pandas DataFrames and check the following:

Whether either of the datasets contains null or missing values

Whether the data is stored in the correct columns and the corresponding column names make sense (in other words, the names of the columns correctly convey what type of information is stored in the rows)

Note

You can find the data.csv file at https://packt.link/NmBJT, and the sales.csv file at https://packt.link/ER7fz.

Let's first work with the data.csv file:
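A minimal sketch of the loading and inspection steps, assuming both files are in your working directory (the campaign_data variable name is illustrative), is shown below:

import pandas as pd

# Load the Facebook campaign data and summarize it
campaign_data = pd.read_csv("data.csv")
campaign_data.info()

# Load the historical sales data and run the same check
sales = pd.read_csv("sales.csv")
sales.info()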

Figure 1.19: Output of sales.info()

From the preceding output, you can see that the country column has missing values (since all the other columns have 100 entries). You'll need to dig deeper and find out the exact cause of this problem. By the end of this chapter, you'll learn how to address such problems effectively.

Now that you have loaded the data and looked at the result, you can observe that the data collected by the marketing campaigns team (data.csv) looks good and it has no missing values. The data collected by the sales team, on the other hand (stored in sales.csv), has quite a few missing values and incorrect column names.

Based on what you've learned about pandas so far, you won't be able to standardize the data. Before you learn how to do that, you'll first have to dive deep into the internal structure of pandas objects and understand how data is stored in pandas.

Structure of a pandas DataFrame and Series

You are undecided as to which data structure to use to store some of the information that comes in from different marketing teams. From your experience, you know that a few elements in your data will have missing values. You are also expecting two different teams to collect the same data but categorize it differently. That is, instead of numerical indices (0-10), they might use custom labels to access specific values. pandas provides data structures that help store and work with such data. One such data structure is called a pandas series.

A pandas series is nothing more than an indexed NumPy array. To create a pandas series, all you need to do is create an array and give it an index. If you create a series without an index, it will create a default numeric index that starts from 0 and goes on for the length of the series, as shown in the following diagram:

Figure 1.20: Sample pandas series

Note

As a series is still a NumPy array, all functions that work on a NumPy array work the same way on a pandas series, too. To learn more about the functions, please refer to the following link: https://pandas.pydata.org/pandas-docs/stable/reference/series.html.
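For instance, a small sketch (with purely illustrative values) of creating a series first with the default numeric index and then with a custom index:

import pandas as pd

# Without an index, a default numeric index (0, 1, 2, ...) is created
views = pd.Series([1200, 950, 3000])
print(views)

# A custom index lets you access values by label instead of position
views_labeled = pd.Series([1200, 950, 3000],
                          index=['post_a', 'post_b', 'post_c'])
print(views_labeled['post_b'])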

As your campaign grows, so does the number of series. With that, new requirements arise. Now, you want to be able to perform operations such as concatenation on specific entries in several series at once. However, to access the values, these different series must share the same index. And that's exactly where DataFrames come into the picture. A pandas DataFrame is just a dictionary with the column names as keys and values as different pandas series, joined together by the index.

A DataFrame is created when different columns (which are nothing but series) such as these are joined together by the index:

Figure 1.21: Series joined together by the same index create a pandas DataFrame

In the preceding screenshot, you'll see numbers 0-4 to the left of the age column. These are the indices. The age, balance, _id, about, and address columns, along with others, are series, and together they form a DataFrame.
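As an illustration (borrowing the age and balance column names from the figure above; the values are made up), series sharing an index can be assembled into a DataFrame:

import pandas as pd

# Two series that share the same default index
age = pd.Series([25, 32, 47, 51, 38])
balance = pd.Series([1500.0, 230.5, 0.0, 8900.0, 410.2])

# Joining the series by their index produces a DataFrame
df = pd.DataFrame({'age': age, 'balance': balance})
print(df.head())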

This way of storing data makes it very easy to perform the operations you need on the data you want. You can easily choose the series you want to modify by picking a column and directly slicing off indices based on the value in that column. You can also group indices with similar values in one column together and see how the values change in other columns.

pandas also allows operations to be applied to both rows and columns of a DataFrame. You can choose which one to apply by specifying the axis, 0 referring to rows, and 1 referring to columns.

For example, if you wanted to apply the sum function to all the rows in the balance column of the DataFrame, you would use the following code:

df['balance'].sum(axis=0)

In the following screenshot, by specifying axis=0, you can apply a function (such as sum) on all the rows in a particular column:

By specifying