A practical guide to forming a killer marketing strategy through data analysis with Python
Mirza Rahim Baig, Gururajan Govindan, and Vishwesh Ravi Shrimali
Copyright © 2021 Packt Publishing
All rights reserved. No part of this course may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this course to ensure the accuracy of the information presented. However, the information contained in this course is sold without warranty, either express or implied. Neither the authors nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this course.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this course by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Authors: Mirza Rahim Baig, Gururajan Govindan, and Vishwesh Ravi Shrimali
Reviewers: Cara Davies and Subhranil Roy
Managing Editors: Prachi Jain and Abhishek Rane
Acquisitions Editors: Royluis Rodrigues, Kunal Sawant, and Sneha Shinde
Production Editor: Salma Patel
Editorial Board: Megan Carlisle, Mahesh Dhyani, Heather Gopsill, Manasa Kumar, Alex Mazonowicz, Monesh Mirpuri, Bridget Neale, Abhishek Rane, Brendan Rodrigues, Ankita Thakur, Nitesh Thakur, and Jonathan Wray
First published: March 2019
First edition authors: Tommy Blanchard, Debasish Behera, and Pranshu Bhatnagar
Second edition: September 2021
Production reference: 1060921
ISBN: 978-1-80056-047-5
Published by Packt Publishing Ltd.
Livery Place, 35 Livery Street
Birmingham B3 2PB, UK
Unleash the power of data to reach your marketing goals with this practical guide to data science for business.
This book will help you get started on your journey to becoming a master of marketing analytics with Python. You'll work with relevant datasets and build your practical skills by tackling engaging exercises and activities that simulate real-world market analysis projects.
You'll learn to think like a data scientist, build your problem-solving skills, and discover how to look at data in new ways to deliver business insights and make intelligent data-driven decisions.
As well as learning how to clean, explore, and visualize data, you'll implement machine learning algorithms and build models to make predictions. As you work through the book, you'll use Python tools to analyze sales, visualize advertising data, predict revenue, address customer churn, and implement customer segmentation to understand behavior.
This second edition has been updated to include new case studies that bring a more application-oriented approach to your marketing analytics journey. The code has also been updated to support the latest versions of Python and the popular data science libraries that have been used in the book. The practical exercises and activities have been revamped to prepare you for the real-world problems that marketing analysts need to solve. This will show you how to create a measurable impact on businesses large and small.
By the end of this book, you'll have the knowledge, skills, and confidence to implement data science and machine learning techniques to better understand your marketing data and improve your decision-making.
Mirza Rahim Baig is an avid problem solver who uses deep learning and artificial intelligence to solve complex business problems. He has more than a decade of experience in creating value from data, harnessing the power of the latest in machine learning and AI, with proficiency in using unstructured and structured data across areas like marketing, customer experience, catalog, supply chain, and other e-commerce sub-domains. Rahim is also a teacher, designing, creating, and teaching data science for various learning platforms. He loves making the complex easy to understand. He is also an author of The Deep Learning Workshop, a hands-on guide to starting your deep learning journey and building your own next-generation deep learning models.
Gururajan Govindan is a data scientist, intrapreneur, and trainer with more than seven years of experience working across domains such as finance and insurance. He is also an author of The Data Analysis Workshop, a book focusing on data analytics. He is well known for his expertise in data-driven decision making and machine learning with Python.
Vishwesh Ravi Shrimali graduated from BITS Pilani, where he studied mechanical engineering. He has a keen interest in programming and AI and has applied that interest in mechanical engineering projects. He has also written multiple blogs on OpenCV, deep learning, and computer vision. When he is not writing blogs or working on projects, he likes to go on long walks or play his acoustic guitar. He is also an author of The Computer Vision Workshop, a book focusing on OpenCV and its applications in real-world scenarios, as well as Machine Learning for OpenCV (2nd Edition), which introduces how to use OpenCV for machine learning applications.
This marketing book is for anyone who wants to learn how to use Python for cutting-edge marketing analytics. Whether you're a developer who wants to move into marketing, or a marketing analyst who wants to learn more sophisticated tools and techniques, this book will get you on the right path. Basic prior knowledge of Python is required to work through the exercises and activities provided in this book.
Chapter 1, Data Preparation and Cleaning, teaches you skills related to data cleaning along with various data preprocessing techniques using real-world examples.
Chapter 2, Data Exploration and Visualization, teaches you how to explore and analyze data with the help of various aggregation techniques and visualizations using Matplotlib and Seaborn.
Chapter 3, Unsupervised Learning and Customer Segmentation, teaches you customer segmentation, one of the most important skills for a data science professional in marketing. You will learn how to use machine learning to perform customer segmentation with the help of scikit-learn. You will also learn to evaluate segments from a business perspective.
Chapter 4, Evaluating and Choosing the Best Segmentation Approach, expands your repertoire to various advanced clustering techniques and teaches principled numerical methods of evaluating clustering performance.
Chapter 5, Predicting Customer Revenue using Linear Regression, gets you started on predictive modeling of quantities by introducing you to regression and teaching simple linear regression in a hands-on manner using scikit-learn.
Chapter 6, More Tools and Techniques for Evaluating Regression Models, goes into more details of regression techniques, along with different regularization methods available to prevent overfitting. You will also discover the various evaluation metrics available to identify model performance.
Chapter 7, Supervised Learning: Predicting Customer Churn, uses a churn prediction problem as the central problem statement throughout the chapter to cover different classification algorithms and their implementation using scikit-learn.
Chapter 8, Fine-Tuning Classification Algorithms, introduces support vector machines and tree-based classifiers along with the evaluation metrics for classification algorithms. You will also learn about the process of hyperparameter tuning which will help you obtain better results using these algorithms.
Chapter 9, Multiclass Classification Algorithms, introduces a multiclass classification problem statement and the classifiers that can be used to solve such problems. You will learn about imbalanced datasets and their treatment in detail. You will also discover the micro- and macro-evaluation metrics available in scikit-learn for these classifiers.
Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, and user input are shown as follows:
"df.head(n) will return the first n rows of the DataFrame. If no n is passed, the function considers n to be 5 by default."
Words that you see on the screen, for example, in menus or dialog boxes, also appear in the same format.
A block of code is set as follows:
sales.head()
New important words are shown like this: "a box plot is used to depict the distribution of numerical data and is primarily used for comparisons".
Key parts of code snippets are emboldened as follows:
Lines of code that span multiple lines are split using a backslash (\). When the code is executed, Python will ignore the backslash and treat the code on the next line as a direct continuation of the current line. For example:
# The first column name below is illustrative; note the backslash continuing the line
df = pd.DataFrame({'Quantity': pd.Series([1, 2, 3]), \
                   'ValueInINR': pd.Series([70, 89, 99])})
df.head()
Comments are added into code to help explain specific bits of logic. Single-line comments are denoted using the # symbol, as follows:
# Importing the matplotlib library
import matplotlib.pyplot as plt
#Declaring the color of the plot as gray
plt.bar(sales['Product line'], sales['Revenue'], color='gray')
Multi-line comments are used as follows:
"""
Importing classification report and confusion matrix from sklearn metrics
"""
from sklearn.metrics import classification_report
from sklearn.metrics import precision_recall_fscore_support
For an optimal experience, we recommend the following hardware configuration:
Processor: Dual Core or better
Memory: 4 GB RAM
Storage: 10 GB available space
Download the code files from GitHub at https://packt.link/59F3X. Refer to these code files for the complete code bundle. The files here contain the exercises, activities, and some intermediate code for each chapter. This can be a useful reference when you become stuck.
On the GitHub repo's page, you can click the green Code button and then click the Download ZIP option to download the complete code as a ZIP file to your disk (refer to Figure 0.1). You can then extract these code files to a folder of your choice, for example, C:\Code.
Figure 0.1: Download ZIP option on GitHub
On your system, the extracted ZIP file should contain all the files present in the GitHub repository:
Figure 0.2: GitHub code directory structure (Windows Explorer)
Before you explore the book in detail, you need to set up specific software and tools. In the following section, you shall see how to do that.
The code for all the exercises and activities in this book can be executed using Jupyter Notebooks. You'll first need to install the Anaconda Navigator, which is an interface through which you can access your Jupyter Notebooks. Anaconda Navigator will be installed as a part of Anaconda Individual Edition, which is an open-source Python distribution platform available for Windows, macOS, and Linux. Installing Anaconda will also install Python. Head to https://www.anaconda.com/distribution/.
From the page that opens, click the Download button (annotated by 1). Make sure you are downloading the Individual Edition.
Figure 0.3: Anaconda homepage
The installer should start downloading immediately. The website will, by default, choose an installer based on your system configuration. If you prefer downloading Anaconda for a different operating system (Windows, macOS, or Linux) and system configuration (32- or 64-bit), click the Get Additional Installers link at the bottom of the box (refer to Figure 0.3). The page should scroll down to a section (refer to Figure 0.4) that lets you choose from various options based on the operating system and configuration you desire. For this book, it is recommended that you use the latest version of Python (3.8 or higher).
Figure 0.4: Downloading Anaconda Installers based on the OS
Follow the installation steps presented on the screen.
Figure 0.5: Anaconda setup
On Windows, if you've never installed Python on your system before, you can select the checkbox that prompts you to add Anaconda to your PATH. This will let you run Anaconda-specific commands (like conda) from the default command prompt. If you have Python installed or had installed an earlier version of Anaconda in the past, it is recommended that you leave it unchecked (you may run Anaconda commands from the Anaconda Prompt application instead). The installation may take a while depending on your system configuration.
Figure 0.6: Anaconda installation steps
For more detailed instructions, you may refer to the official documentation for Linux by clicking this link (https://docs.anaconda.com/anaconda/install/linux/), macOS using this link (https://docs.anaconda.com/anaconda/install/mac-os/), and Windows using this link (https://docs.anaconda.com/anaconda/install/windows/).
To check if Anaconda Navigator is correctly installed, look for Anaconda Navigator in your applications. Look for an application that has the following icon. Depending on your operating system, the icon's aesthetics may vary slightly.
Figure 0.7: Anaconda Navigator icon
You can also search for the application using your operating system's search functionality. For example, on Windows 10, you can use the Windows Key + S combination and type in Anaconda Navigator. On macOS, you can use Spotlight search. On Linux, you can open the terminal and type the anaconda-navigator command and press the return key.
Figure 0.8: Searching for Anaconda Navigator on Windows 10
For detailed steps on how to verify if Anaconda Navigator is installed, refer to the following link: https://docs.anaconda.com/anaconda/install/verify-install/.
Click the icon to open Anaconda Navigator. It may take a while to load for the first time, but upon successful installation, you should see a similar screen:
Figure 0.9: Anaconda Navigator screen
If you have more questions about the installation process, you may refer to the list of frequently asked questions from the Anaconda documentation: https://docs.anaconda.com/anaconda/user-guide/faq/.
Once the Anaconda Navigator is open, you can launch the Jupyter Notebook interface from this screen. The following steps will show you how to do that:
Open Anaconda Navigator. You should see the following screen:
Figure 0.10: Anaconda Navigator screen
Now, click Launch under the Jupyter Notebook panel to start the notebook interface on your local system.
Figure 0.11: Jupyter notebook launch option
On clicking the Launch button, you'll notice that even though nothing changes in the window shown in the preceding screenshot, a new tab opens up in your default browser. This is known as the Notebook Dashboard. It will, by default, open to your root folder. For Windows users, this path would be something similar to C:\Users\<username>. On macOS and Linux, it will be /home/<username>/.
Figure 0.12: Notebook dashboard
Note that you can also open a Jupyter Notebook by simply running the command jupyter notebook in the terminal or command prompt. Or you can search for Jupyter Notebook in your applications just like you did in Figure 0.8.
You can use this Dashboard as a file explorer to navigate to the directory where you have downloaded or stored the code files for the book (refer to the Downloading the Code Bundle section on how to download the files from GitHub). Once you have navigated to your desired directory, you can start by creating a new Notebook. Alternatively, if you've downloaded the code from our repository, you can open an existing Notebook as well (Notebook files will have a .ipynb extension). The menus here are quite simple to use:
Figure 0.13: Jupyter notebook navigator menu options walkthrough
If you make any changes to the directory using your operating system's file explorer and the changed file isn't showing up in the Jupyter Notebook Navigator, click the Refresh Notebook List button (annotated as 1). To quit, click the Quit button (annotated as 2). To create a new file (a new Jupyter Notebook), you can click the New button (annotated as 3).
Clicking the New button will open a dropdown menu as follows:
Figure 0.14: Creating a new Jupyter notebook
Note
A detailed tutorial on the interface and the keyboard shortcuts for Jupyter Notebooks can be found here: https://jupyter-notebook.readthedocs.io/en/stable/notebook.html.
You can get started and create your first notebook by selecting Python 3; however, it is recommended that you also set up the virtual environment we've provided. Installing the environment will also install all the packages required for running the code in this book. The following section will show you how to do that.
As you run the code for the exercises and activities, you'll notice that even after installing Anaconda, there are certain libraries like kmodes which you'll need to install separately as you progress in the book. Then again, you may already have these libraries installed, but their versions may be different from the ones we've used, which may lead to varying results. That's why we've provided an environment.yml file with this book that will:
Install all the packages and libraries required for this book at once.
Make sure that the version numbers of your libraries match the ones we've used to write the code for this book.
Make sure that the code you write based on this book remains separate from any other coding environment you may have.
You can download the environment.yml file by clicking the following link: http://packt.link/dBv1k.
Save this file, ideally in the same folder where you'll be running the code for this book. If you've downloaded the code from GitHub as detailed in the Downloading the Code Bundle section, this file should already be present in the parent directory, and you won't need to download it separately.
To set up the environment, follow these steps:
On macOS, open Terminal from the Launchpad (you can find more information about Terminal here: https://support.apple.com/en-in/guide/terminal/apd5265185d-f365-44cb-8b09-71a064a42125/mac). On Linux, open the Terminal application that's native to your distribution. On Windows, you can open the Anaconda Prompt instead by simply searching for the application. You can do this by opening the Start menu and searching for Anaconda Prompt.
Figure 0.15: Searching for Anaconda Prompt on Windows
A new terminal like the following should open. By default, it will start in your home directory:
Figure 0.16: Anaconda terminal prompt
In the case of Linux, it would look like the following:
Figure 0.17: Terminal in Linux
In the terminal, navigate to the directory where you've saved the environment.yml file on your computer using the cd command. Say you've saved the file in Documents\Data-Science-for-Marketing-Analytics-Second-Edition. In that case, you'll type the following command in the prompt and press Enter:
cd Documents\Data-Science-for-Marketing-Analytics-Second-Edition
Note that the command may vary slightly based on your directory structure and your operating system.
Now that you've navigated to the correct folder, create a new conda environment by typing or pasting the following command in the terminal. Press Enter to run the command.
conda env create -f environment.yml
This will install the ds-marketing virtual environment along with the libraries that are required to run the code in this book. In case you see a prompt asking you to confirm before proceeding, type y and press Enter to continue creating the environment. Depending on your system configuration, it may take a while for the process to complete.
Note
For a complete list of conda commands, visit the following link: https://conda.io/projects/conda/en/latest/index.html. For a detailed guide on how to manage conda environments, please visit the following link: https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html.
Once complete, type or paste the following command in the shell to activate the newly installed environment, ds-marketing.
conda activate ds-marketing
If the installation is successful, you'll see the environment name in brackets change from base to ds-marketing:
Figure 0.18: Environment name showing up in the shell
Run the following command to install ipykernel in the newly activated conda environment:
pip install ipykernel
Note
On macOS and Linux, you'll need to specify pip3 instead of pip.
In the same environment, run the following command to add ipykernel as a Jupyter kernel:
python -m ipykernel install --user --name=ds-marketing
Windows only:
If you're on Windows, type or paste the following command. Otherwise, you may skip this step and exit the terminal.
conda install pywin32
Select the created kernel ds-marketing when you start your Jupyter notebook.
Figure 0.19: Selecting the ds-marketing kernel
A new tab will open with a fresh untitled Jupyter notebook where you can start writing your code:
Figure 0.20: A new Jupyter notebook
You can also try running the code files for this book in a completely online environment through an interactive Jupyter Notebook interface called Binder. Along with the individual code files that can be downloaded locally, we have provided a link that will help you quickly access the Binder version of the GitHub repository for the book. Using this link, you can run any of the .ipynb code files for this book in a cloud-based online interactive environment. Click the following link to open the online Binder version of the book's repository to give it a try: https://packt.link/GdQOp. It is recommended that you save the link in your browser bookmarks for future reference (you may also use the launch binder link provided in the README section of the book's GitHub page).
Depending on your internet connection, it may take a while to load, but once loaded, you'll get the same interface as you would when running the code in a local Jupyter Notebook (all your shortcuts should work as well):
Figure 0.21: Binder lets you run Jupyter Notebooks in a browser
Binder is an online service that helps you read and execute Jupyter Notebook files (.ipynb) present in any public GitHub repository in a cloud-based environment. However, please note that there are certain memory constraints associated with Binder. This means that running multiple Jupyter Notebook instances at the same time or running processes that consume a lot of memory (like model training) can result in a kernel crash or kernel reset. Moreover, any changes you make in these online Notebooks would not be stored, and the Notebooks will reset to the latest version present in the repository whenever you close and re-open the Binder link. A stable internet connection is required to use Binder. You can find out more about the Binder Project here: https://jupyter.org/binder.
This is a recommended option for readers who want to have a quick look at the code and experiment with it without downloading the entire repository on their local machine.
Feedback from our readers is always welcome.
General feedback: If you have any questions about this book, please mention the book title in the subject of your message and email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you could report this to us. Please visit www.packtpub.com/support/errata and complete the form.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you could provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit https://authors.packtpub.com/.
Let us know what you think by leaving a detailed, impartial review on Amazon. We appreciate all feedback – it helps us continue to make great products and help aspiring developers build their skills. Please spare a few minutes to give your thoughts – it makes a big difference to us. You can leave a review by clicking the following link: https://packt.link/r/1800560478.
To Azra, Aiza, Duha and Aidama - you inspire courage, strength, and grace.
- Mirza Rahim Baig
To Appa, Amma, Vindhya, Madhu, and Ishan - The Five Pillars of my life.
- Gururajan Govindan
To Nanaji, Dadaji, and Appa - for their wisdom, inspiration, and unconditional love.
- Vishwesh Ravi Shrimali
Overview
In this chapter, you'll learn the skills required to process and clean data to effectively ready it for further analysis. Using the pandas library in Python, you will learn how to read and import data from various file formats, including JSON and CSV, into a DataFrame. You'll then learn how to perform slicing, aggregation, and filtering on DataFrames. By the end of the chapter, you will consolidate your data cleaning skills by learning how to join DataFrames, handle missing values, and even combine data from various sources.
"Since you liked this artist, you'll also like their new album," "Customers who bought bread also bought butter," and "1,000 people near you have also ordered this item." Every day, recommendations like these influence customers' shopping decisions, helping them discover new products. Such recommendations are possible thanks to data science techniques that leverage data to create complex models, perform sophisticated tasks, and derive valuable customer insights with great precision. While the use of data science principles in marketing analytics is a proven, cost-effective, and efficient strategy, many companies are still not using these techniques to their full potential. There is a wide gap between the possible and actual usage of these techniques.
This book is designed to teach you skills that will help you contribute toward bridging that gap. It covers a wide range of useful techniques that will allow you to leverage everything data science can do in terms of strategies and decision-making in the marketing domain. By the end of the book, you should be able to successfully create and manage an end-to-end marketing analytics solution in Python, segment customers based on the data provided, predict their lifetime value, and model their decision-making behavior using data science techniques.
You will start your journey by first learning how to clean and prepare data. Raw data from external sources cannot be used directly; it needs to be analyzed, structured, and filtered before it can be used any further. In this chapter, you will learn how to manipulate rows and columns and apply transformations to data to ensure you have the right data with the right attributes. This is an essential skill in a data analyst's arsenal because, otherwise, the outcome of your analysis will be based on incorrect data, thereby making it a classic example of garbage in, garbage out. But before you start working with the data, it is important to understand its nature - in other words, the different types of data you'll be working with.
When you build an analytical solution, the first thing that you need to do is to build a data model. A data model is an overview of the data sources that you will be using, their relationships with other data sources, where exactly the data from a specific source is going to be fetched, and in what form (such as an Excel file, a database, or a JSON from an internet source).
Note
Keep in mind that the data model evolves as data sources and processes change.
A data model can contain data of the following three types:
Structured Data: Also known as completely structured or well-structured data, this is the simplest way to manage information. The data is arranged in a flat tabular form with the correct value corresponding to the correct attribute. There is a unique column, known as an index, for easy and quick access to the data, and there are no duplicate columns. For example, in Figure 1.1, employee_id is the unique column. Using the data in this column, you can run SQL queries and quickly access data at a specific row and column in the dataset. Furthermore, there are no empty rows, missing entries, or duplicate columns, thereby making this dataset quite easy to work with. What makes structured data most ubiquitous and easy to analyze is that it is stored in a standardized tabular format that makes adding, updating, and deleting entries easy and programmable. With structured data, you may not have to put in much effort during the data preparation and cleaning stage.
Data stored in relational databases such as MySQL, Amazon Redshift, and more are examples of structured data:
Figure 1.1: Data in a MySQL table
Semi-structured data: You will not find semi-structured data stored in a strict, tabular hierarchy as you saw in Figure 1.1. However, it will still have its own hierarchies that group its elements and establish a relationship between them. For example, the metadata of a song may include information about the cover art, the artist, the song length, and even the lyrics. You can search for the artist's name and find the song you want. Such data does not have a fixed hierarchy mapping the unique column with rows in an expected format, and yet you can find the information you need.
Another example of semi-structured data is a JSON file. JSON files are self-describing and can be understood easily. In Figure 1.2, you can see a JSON file that contains personally identifiable information of Jack Jones.
Semi-structured data can be stored accurately in NoSQL databases.
Figure 1.2: Data in a JSON file
Unstructured data: Unstructured data may not be tabular, and even if it is tabular, the number of attributes or columns per observation may be completely arbitrary. The same data could be represented in different ways, and the attributes might not match each other, with values leaking into other parts.
For example, think of reviews of various products stored in rows of an Excel sheet or a dump of the latest tweets of a company's Twitter profile. We can only search for specific keywords in that data, but we cannot store it in a relational database, nor will we be able to establish a concrete hierarchy between different elements or rows. Unstructured data can be stored as text files, CSV files, Excel files, images, and audio clips.
Marketing data, traditionally, comprises all three aforementioned data types. Because most data points originate from different data sources, the data arrives with several inconsistencies: the values of a field could be of different lengths, values intended for one field might not line up with those of other sources because of different field names, and some rows might have missing values for some of the fields.
You'll soon learn how to effectively tackle such problems with your data using Python. The following diagram illustrates what a data model for marketing analytics looks like. The data model comprises all kinds of data: structured data such as databases (top), semi-structured data such as JSON (middle), and unstructured data such as Excel files (bottom):
Figure 1.3: Data model for marketing analytics
As the data model becomes complex, the probability of having bad data increases. For example, a marketing analyst working with the demographic details of a customer can mistakenly read the age of the customer as a text string instead of a number (integer). In such situations, the analysis would go haywire as the analyst cannot perform any aggregation functions, such as finding the average age of a customer. These types of situations can be overcome by having a proper data quality check to ensure that the data chosen for further analysis is of the correct data type.
This is where programming languages such as Python come into play. Python is a general-purpose programming language that integrates with almost every platform and helps automate data production and analysis.
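For instance, a simple quality check of the kind described above might look like the following sketch (the table and column names here are hypothetical, chosen only for illustration):
import pandas as pd

# Hypothetical customer data where age was accidentally read in as text
customers = pd.DataFrame({'customer_id': [101, 102, 103],
                          'age': ['25', '34', '29']})

print(customers['age'].dtype)        # object (text), not a number
customers['age'] = pd.to_numeric(customers['age'])
print(customers['age'].mean())       # the average age can now be computed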
Apart from understanding patterns and giving at least a basic structure to data, Python forces the data model to accept the right value for the attribute. The following diagram illustrates how most marketing analytics today structure different kinds of data by passing it through scripts to make it at least semi-structured:
Figure 1.4: Data model of most marketing analytics that use Python
By making use of such structure-enforcing scripts, you will have a data model of semi-structured data coming in with expected values in the right fields; however, the data is not yet in the best possible format to perform analytics. If you can completely structure your data (that is, arrange it in flat tables, with the right value pointing to the right attribute with no nesting), it will be easy to see how every data point individually compares to other points with the help of common fields. You can easily get a feel of the data—that is, see in what range most values lie, identify the clear outliers, and so on—by simply scrolling through it.
While many tools can be used to convert data from an unstructured or semi-structured format to a fully structured format (for example, Spark, STATA, and SAS), the tool most widely used for data science is pandas: it can be integrated with practically any framework, has rich functionality, involves minimal cost, and is easy to use for our use case.
pandas is a software library written in Python and is the basic building block for data manipulation and analysis. It offers a collection of high-performance, easy-to-use, and intuitive data structures and analysis tools that are of great use to marketing analysts and data scientists alike. The library comes as a default package when you install Anaconda (refer to the Preface for detailed instructions).
Note
Before you run the code in this book, it is recommended that you install and set up the virtual environment using the environment.yml file we have provided in the GitHub repository of this book.
You can find the environment.yml file at the following link: https://packt.link/dBv1k.
It will install all the required libraries and ensure that the version numbers of the libraries on your system match with ours. Refer to the Preface for more instructions on how to set this up.
However, if you're using any other distribution where pandas is not pre-installed, you can run the following command in your terminal app or command prompt to install the library:
pip install pandas
Note
On macOS or Linux, you will need to modify the preceding command to use pip3 instead of pip.
The following diagram illustrates how different kinds of data are converted to a structured format with the help of pandas:
Figure 1.5: Data model to structure the different kinds of data
When working with pandas, you'll be dealing with its two primary object types: DataFrames and Series. What follows is a brief explanation of what those object types are. Don't worry if you are not able to understand things such as their structure and how they work; you'll be learning more about these in detail later in the chapter.
DataFrame: This is the fundamental tabular structure that stores data in rows and columns (like a spreadsheet). When performing data analysis, you can directly apply functions and operations to DataFrames.
Series: This refers to a single column of the DataFrame. Multiple Series combine to form a DataFrame. The values in a Series can be accessed through its index, which is assigned automatically when the DataFrame is defined.
In the following diagram, the users column annotated by 2 is a series, and the viewers, views, users, and cost columns, along with the index, form a DataFrame (annotated by 1):
Figure 1.6: A sample pandas DataFrame and series
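As a quick illustration of the same idea in code (the values below are made up), the columns shown in Figure 1.6 could be built and accessed like this:
import pandas as pd

# Columns from Figure 1.6, with made-up values
df = pd.DataFrame({'viewers': [1000, 1500, 1200],
                   'views': [2500, 3200, 2800],
                   'users': [150, 210, 180],
                   'cost': [120.0, 150.0, 135.0]})

users_series = df['users']            # selecting a single column returns a Series
print(type(df), type(users_series))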
Now that you have a brief understanding of what pandas objects are, let's take a look at some of the functions you can use to import and export data in pandas.
Every team in a marketing group can have its own preferred data source for its specific use case. Teams that handle a lot of customer data, such as demographic details and purchase history, would prefer a database such as MySQL or Oracle, whereas teams that handle a lot of text might prefer JSON, CSV, or XML. Due to the use of multiple data sources, we end up having a wide variety of files. In such cases, the pandas library comes to our rescue as it provides a variety of APIs (Application Program Interfaces) that can be used to read multiple different types of data into a pandas DataFrame. Some of the most commonly used APIs are shown here:
Figure 1.7: Ways to import and export different types of data with pandas DataFrames
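As a rough sketch of a few of those read and write functions (the filenames here are placeholders), the calls look like this:
import pandas as pd

df_csv = pd.read_csv('sales.csv')           # comma-separated values
df_json = pd.read_json('user_info.json')    # JSON
df_excel = pd.read_excel('sales.xlsx')      # Excel (needs the openpyxl package)

df_csv.to_csv('sales_copy.csv', index=False)   # write a DataFrame back out to CSV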
So, let's say you wanted to read a CSV file. You'll first need to import the pandas library as follows:
import pandas as pd
Then, you will run the following code to store the CSV file in a DataFrame named df (df is a variable):
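df = pd.read_csv("sales.csv")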
In the preceding line, we have sales.csv, which is the file to be imported. This command should work if your Jupyter notebook (or Python process) is run from the same directory where the file is stored. If the file was stored in any other path, you'll have to specify the exact path. On Windows, for example, you'll specify the path as follows:
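# Replace this illustrative path with the actual location of sales.csv on your system
df = pd.read_csv(r"C:\Users\<username>\Documents\sales.csv")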
Note that we've added r before the path to take care of any special characters in the path. As you work with and import various data files in the exercises and activities in this book, we'll often remind you to pay attention to the path of the CSV file.
When loading data, pandas also provides additional parameters that you can pass to the read function, so that you can load the data the way you want. Some of these parameters are provided here. Please note that most of these parameters are optional. Also worth noting is the fact that the default value of the index in a DataFrame starts with 0:
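header: the row number (0-indexed) to use as the column names; header=0 uses the first row, and header=None tells pandas that the file has no header row.
nrows: the number of rows to read from the file.
usecols: the subset of columns to load, specified by position or by name.
sep: the delimiter used in the file (a comma by default).
index_col: the column to use as the row index of the DataFrame instead of the default numeric index.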
For example, if you want to import a CSV file into a DataFrame, df, with the following conditions:
The first row of the file must be the header.
You need to import only the first 100 rows of the file.
You need to import only the first 3 columns.
The code corresponding to the preceding conditions would be as follows:
df = pd.read_csv("sales.csv", header=0, nrows=100, usecols=[0,1,2])
Note
There are similar specific parameters for almost every inbuilt function in pandas. You can find details about them with the documentation for pandas available at the following link: https://pandas.pydata.org/pandas-docs/stable/.
Once the data is imported, you need to verify whether it has been imported correctly. Let's understand how to do that in the following section.
Once you've successfully read a DataFrame using the pandas library, you need to inspect the data to check whether the right attribute has received the right value. You can use several built-in pandas functions to do that.
The most commonly used way to inspect loaded data is using the head() command. By default, this command will display the first five rows of the DataFrame. Here's an example of the command used on a DataFrame called df:
df.head()
The output should be as follows:
Figure 1.8: Output of the df.head() command
Similarly, to display the last five rows, you can use the df.tail() command. Instead of the default five rows, you can even specify the number of rows you want to be displayed. For example, the df.head(11) command will display the first 11 rows.
Here's the complete usage of these two commands, along with a few other commands that can be useful while examining data. Again, it is assumed that you have stored the DataFrame in a variable called df (a short example of these commands in action follows the list):
df.head(n) will return the first n rows of the DataFrame. If no n is passed, the function considers n to be 5 by default.
df.tail(n) will return the last n rows of the DataFrame. If no n is passed, the function considers n to be 5 by default.
df.shape will return the dimensions of a DataFrame (the number of rows and the number of columns).
df.dtypes will return the type of data in each column of the pandas DataFrame (such as float, object, int64, and so on).
df.info() will summarize the DataFrame and print its size, type of values, and the count of non-null values.
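For instance, assuming the sales data from earlier has been read into df, a quick inspection could look like this:
import pandas as pd

df = pd.read_csv("sales.csv")

print(df.shape)    # (number of rows, number of columns)
print(df.dtypes)   # data type of each column
df.info()          # size, column types, and count of non-null values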
So far, you've learned about the different functions that can be used on DataFrames. In the first exercise, you will practice using these functions to import a JSON file into a DataFrame and later, to inspect the data.
The tech team in your company has been testing a web version of its flagship shopping app. A few loyal users who volunteered to test the website were asked to submit their details via an online form. The form captured some useful details (such as age, income, and more) along with some not-so-useful ones (such as eye color). The tech team then tested their new profile page module, using which a few additional details were captured. All this data was stored in a JSON file called user_info.json, which the tech team sent to you for validation.
Note
You can find the user_info.json file at the following link: https://packt.link/Gi2O7.
Your goal is to import this JSON file into pandas and let the tech team know the answers to the following questions so that they can add more modules to the website:
Is the data loading correctly?
Are there any missing values in any of the columns?
What are the data types of all the columns?
How many rows and columns are present in the dataset?
Note
All the exercises and activities in this chapter can be performed in both the Jupyter notebook and Python shell. While you can do them in the shell for now, it is highly recommended to use the Jupyter notebook. To learn how to install Jupyter and set up the Jupyter notebook, refer to the Preface. It will be assumed that you are using a Jupyter notebook from this point on.
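One possible way to approach these questions, assuming user_info.json has been downloaded into your working directory, is sketched below; your own solution may differ:
import pandas as pd

user_info = pd.read_json('user_info.json')

user_info.head()                   # is the data loading correctly?
print(user_info.isnull().sum())    # missing values in each column
print(user_info.dtypes)            # data types of all the columns
print(user_info.shape)             # number of rows and columns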
In this exercise, you loaded the data, checked whether it had been loaded correctly, and gathered some more information about the entries contained therein. All this was done by loading data stored in a single source, which was the JSON file. As a marketing analyst, you will come across situations where you'll need to load and process data from different sources. Let's practice that in the exercise that follows.
You work for a company that uses Facebook for its marketing campaigns. The data.csv file contains the views and likes of 100 different posts on Facebook used for a marketing campaign. The team also uses historical sales data to derive insights. The sales.csv file contains some historical sales data recorded in a CSV file relating to different customer purchases in stores in the past few years.
Your goal is to read the files into pandas DataFrames and check the following:
Whether either of the datasets contains null or missing values
Whether the data is stored in the correct columns and the corresponding column names make sense (in other words, the names of the columns correctly convey what type of information is stored in the rows)
Note
You can find the data.csv file at https://packt.link/NmBJT, and the sales.csv file at https://packt.link/ER7fz.
Let's first work with the data.csv file:
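A minimal sketch of the loading and inspection steps, assuming both files are in your working directory (the variable names are arbitrary), is shown here; the figure that follows shows the kind of output sales.info() produces:
import pandas as pd

campaign = pd.read_csv('data.csv')
campaign.head()                    # verify the column names and values
print(campaign.isnull().sum())     # check for missing values

sales = pd.read_csv('sales.csv')
sales.head()
sales.info()                       # column names, types, and non-null counts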
Figure 1.19: Output of sales.info()
From the preceding output, you can see that the country column has missing values (since all the other columns have 100 entries). You'll need to dig deeper and find out the exact cause of this problem. By the end of this chapter, you'll learn how to address such problems effectively.
Now that you have loaded the data and looked at the result, you can observe that the data collected by the marketing campaigns team (data.csv) looks good and it has no missing values. The data collected by the sales team, on the other hand (stored in sales.csv), has quite a few missing values and incorrect column names.
Based on what you've learned about pandas so far, you won't be able to standardize the data. Before you learn how to do that, you'll first have to dive deep into the internal structure of pandas objects and understand how data is stored in pandas.
You are undecided as to which data structure to use to store some of the information that comes in from different marketing teams. From your experience, you know that a few elements in your data will have missing values. You are also expecting two different teams to collect the same data but categorize it differently. That is, instead of numerical indices (0-10), they might use custom labels to access specific values. pandas provides data structures that help store and work with such data. One such data structure is called a pandas series.
A pandas series is nothing more than an indexed NumPy array. To create a pandas series, all you need to do is create an array and give it an index. If you create a series without an index, it will create a default numeric index that starts from 0 and goes on for the length of the series, as shown in the following diagram:
Figure 1.20: Sample pandas series
Note
As a series is still a NumPy array, all functions that work on a NumPy array work the same way on a pandas series, too. To learn more about the functions, please refer to the following link: https://pandas.pydata.org/pandas-docs/stable/reference/series.html.
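A short sketch of both cases, with made-up values, looks like this:
import pandas as pd

# Series with the default numeric index (0, 1, 2, ...)
revenue = pd.Series([1200, 980, 1430])

# Series with custom labels as the index
revenue_labeled = pd.Series([1200, 980, 1430],
                            index=['store_a', 'store_b', 'store_c'])

print(revenue_labeled['store_b'])   # values can be accessed by their label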
As your campaign grows, so does the number of series. With that, new requirements arise. Now, you want to be able to perform operations such as concatenation on specific entries in several series at once. However, to access the values, these different series must share the same index. And that's exactly where DataFrames come into the picture. A pandas DataFrame is just a dictionary with the column names as keys and values as different pandas series, joined together by the index.
A DataFrame is created when different columns (which are nothing but series) such as these are joined together by the index:
Figure 1.21: Series joined together by the same index create a pandas DataFrame
In the preceding screenshot, you'll see numbers 0-4 to the left of the age column. These are the indices. The age, balance, _id, about, and address columns, along with others, are series, and together they form a DataFrame.
This way of storing data makes it very easy to perform the operations you need on the data you want. You can easily choose the series you want to modify by picking a column and directly slicing off indices based on the value in that column. You can also group indices with similar values in one column together and see how the values change in other columns.
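As a small sketch of these operations (reusing the age and balance columns mentioned above, with illustrative values):
import pandas as pd

# Two Series joined by the same index form a DataFrame
age = pd.Series([25, 34, 29, 34], index=[0, 1, 2, 3])
balance = pd.Series([1500.0, 2300.0, 800.0, 4100.0], index=[0, 1, 2, 3])
df = pd.DataFrame({'age': age, 'balance': balance})

print(df[df['age'] > 28])                     # slice rows based on values in one column
print(df.groupby('age')['balance'].mean())    # group similar values and compare another column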
pandas also allows operations to be applied to both rows and columns of a DataFrame. You can choose which one to apply by specifying the axis, 0 referring to rows, and 1 referring to columns.
For example, if you wanted to apply the sum function to all the rows in the balance column of the DataFrame, you would use the following code:
df['balance'].sum(axis=0)
In the following screenshot, by specifying axis=0, you can apply a function (such as sum) on all the rows in a particular column:
By specifying
