Hands-On Exploratory Data Analysis with Python E-Book

Suresh Kumar Mukhiya

36,59 €

Publisher: Packt Publishing
Category: Science and New Technologies
Language: English

Description

Discover techniques to summarize the characteristics of your data using PyPlot, NumPy, SciPy, and pandas

Key Features

Understand the fundamental concepts of exploratory data analysis using Python

Find missing values in your data and identify the correlation between different variables

Practice graphical exploratory analysis techniques using Matplotlib and the Seaborn Python package

Book Description

Exploratory Data Analysis (EDA) is an approach to data analysis that involves the application of diverse techniques to gain insights into a dataset. This book will help you gain practical knowledge of the main pillars of EDA - data cleaning, data preparation, data exploration, and data visualization.

You'll start by performing EDA using open source datasets and perform simple to advanced analyses to turn data into meaningful insights. You'll then learn various descriptive statistical techniques to describe the basic characteristics of data and progress to performing EDA on time-series data. As you advance, you'll learn how to implement EDA techniques for model development and evaluation and build predictive models to visualize results. Using Python for data analysis, you'll work with real-world datasets, understand data, summarize its characteristics, and visualize it for business intelligence.

By the end of this EDA book, you'll have developed the skills required to carry out a preliminary investigation on any dataset, yield insights into data, present your results with visual aids, and build a model that correctly predicts future outcomes.

What you will learn

Import, clean, and explore data to perform preliminary analysis using powerful Python packages

Identify and transform erroneous data using different data wrangling techniques

Explore the use of multiple regression to describe non-linear relationships

Discover hypothesis testing and explore techniques of time-series analysis

Understand and interpret results obtained from graphical analysis

Build, train, and optimize predictive models to estimate results

Perform complex EDA techniques on open source datasets

Who this book is for

This EDA book is for anyone interested in data analysis, especially students, statisticians, data analysts, and data scientists. The practical concepts presented in this book can be applied in various disciplines to enhance decision-making processes with data analysis and synthesis. Fundamental knowledge of Python programming and statistical concepts is all you need to get started with this book.

Details

You can read this e-book in the Legimi apps or in any app that supports the following format:

EPUB

Page count: 308

Publication year: 2020

Sample

Hands-On Exploratory Data Analysis with Python

Perform EDA techniques to understand, summarize, and investigate your data

Suresh Kumar Mukhiya

Usman Ahmed

BIRMINGHAM - MUMBAI

Hands-On Exploratory Data Analysis with Python

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author(s), nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Commissioning Editor: Pravin Dhandre
Acquisition Editor: Ali Abidi
Content Development Editor: Nathanya Dias
Senior Editor: Ayaan Hoda
Technical Editor: Manikandan Kurup
Copy Editor: Safis Editing
Project Coordinator: Aishwarya Mohan
Proofreader: Safis Editing
Indexer: Rekha Nair
Production Designer: Deepika Naik

First published: March 2020

Production reference: 1270320

Published by Packt Publishing Ltd., Livery Place, 35 Livery Street, Birmingham B3 2PB, UK.

ISBN 978-1-78953-725-3

www.packt.com

Packt.com

Subscribe to our online digital library for full access to over 7,000 books and videos, as well as industry leading tools to help you plan your personal development and advance your career. For more information, please visit our website.

Why subscribe?

Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals

Improve your learning with Skill Plans built especially for you

Get a free eBook or video every month

Fully searchable for easy access to vital information

Copy and paste, print, and bookmark content

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packt.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

Contributors

About the authors

Suresh Kumar Mukhiya is a Ph.D. candidate currently affiliated with the Western Norway University of Applied Sciences (HVL). He is a big data enthusiast, specializing in information systems, model-driven software engineering, big data analysis, artificial intelligence, and frontend development. He completed his Master's degree in information systems at the Norwegian University of Science and Technology (NTNU, Norway), with a thesis in process mining. He also holds a Bachelor's degree in computer science and information technology (BSc.CSIT) from Tribhuvan University, Nepal, where he was decorated with the Vice-Chancellor's Award for obtaining the highest score. He is a passionate photographer and a resilient traveler.

Special thanks go to the people who have helped in the creation of this book. We want to acknowledge the following contributors whose constructive feedback and ideas made this book possible: Asha Gaire ([email protected]), Bachelor in Computer Science and Information Technology, Nepal, who proofread the final draft and contributed to major sections of the book, especially the Data Transformation, Grouping Datasets, and Correlation chapters; Anju Mukhiya ([email protected]), for reading an early draft and making many corrections and suggestions; and Lilash Sah ([email protected]), Master in Information Technology, King’s Own Institute, Sydney, for reading and validating the code used in this book.

Usman Ahmed is a data scientist and Ph.D. candidate at the Western Norway University of Applied Sciences (HVL). He has rich experience in building and scaling high-performance systems based on data mining, natural language processing, and machine learning. Usman's research interests are sequential data mining, heterogeneous computing, natural language processing, recommendation systems, and machine learning. He has completed the Master of Science degree in computer science at Capital University of Science and Technology, Islamabad, Pakistan. Usman Ahmed was awarded a gold medal for his bachelor of computer science degree from Heavy Industries Taxila Education City.

About the reviewer

Jamshaid Sohail is passionate about data science, machine learning, computer vision, natural language processing, and big data, and has completed over 65 online courses in related fields. He has worked in a Silicon Valley-based start-up named Funnelbeam as a data scientist. He worked with the founders of Funnelbeam, who came from Stanford University, and he generated a lot of revenue by completing several projects and products. Currently, he is working as a data scientist at Fiverivers Technologies. He authored the course Data Wrangling with Python 3.X for Packt and has reviewed a number of books and courses.

Packt is searching for authors like you

If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.

Title Page

Hands-On Exploratory Data Analysis with Python

About Packt

Why subscribe?

Contributors

About the authors

About the reviewer

Packt is searching for authors like you

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Download the color images

Conventions used

Get in touch

Reviews

Section 1: The Fundamentals of EDA

Exploratory Data Analysis Fundamentals

Understanding data science

The significance of EDA

Steps in EDA

Making sense of data

Numerical data

Discrete data

Continuous data

Categorical data

Measurement scales

Nominal

Ordinal 

Interval

Ratio

Comparing EDA with classical and Bayesian analysis

Software tools available for EDA

Getting started with EDA

NumPy

Pandas

SciPy

Matplotlib

Summary

Further reading

Visual Aids for EDA

Technical requirements

Line chart

Steps involved

Bar charts

Scatter plot

Bubble chart

Scatter plot using seaborn

Area plot and stacked plot

Pie chart

Table chart

Polar chart

Histogram

Lollipop chart

Choosing the best chart

Other libraries to explore

Summary

Further reading

EDA with Personal Email

Technical requirements

Loading the dataset

Data transformation

Data cleansing

Loading the CSV file

Converting the date

Removing NaN values

Applying descriptive statistics

Data refactoring

Dropping columns

Refactoring timezones

Data analysis

Number of emails

Time of day

Average emails per day and hour

Number of emails per day

Most frequently used words

Summary

Further reading

Data Transformation

Technical requirements

Background

Merging database-style dataframes

Concatenating along with an axis

Using df.merge with an inner join

Using the pd.merge() method with a left join

Using the pd.merge() method with a right join

Using pd.merge() methods with outer join

Merging on index

Reshaping and pivoting

Transformation techniques

Performing data deduplication

Replacing values

Handling missing data

NaN values in pandas objects

Dropping missing values

Dropping by rows

Dropping by columns

Mathematical operations with NaN

Filling missing values

Backward and forward filling

Interpolating missing values

Renaming axis indexes

Discretization and binning

Outlier detection and filtering

Permutation and random sampling

Random sampling without replacement

Random sampling with replacement

Computing indicators/dummy variables

String manipulation

Benefits of data transformation

Challenges

Summary

Further reading

Section 2: Descriptive Statistics

Descriptive Statistics

Technical requirements

Understanding statistics

Distribution function

Uniform distribution

Normal distribution

Exponential distribution

Binomial distribution

Cumulative distribution function

Descriptive statistics

Measures of central tendency

Mean/average

Median

Mode

Measures of dispersion

Standard deviation

Variance

Skewness

Kurtosis

Types of kurtosis

Calculating percentiles

Quartiles

Visualizing quartiles

Summary

Further reading

Grouping Datasets

Technical requirements

Understanding groupby() 

Groupby mechanics

Selecting a subset of columns

Max and min

Mean

Data aggregation

Group-wise operations

Renaming grouped aggregation columns

Group-wise transformations

Pivot tables and cross-tabulations

Pivot tables

Cross-tabulations

Summary

Further reading

Correlation

Technical requirements

Introducing correlation

Types of analysis

Understanding univariate analysis

Understanding bivariate analysis

Understanding multivariate analysis

Discussing multivariate analysis using the Titanic dataset

Outlining Simpson's paradox

Correlation does not imply causation

Summary

Further reading

Time Series Analysis

Technical requirements

Understanding the time series dataset

Fundamentals of TSA

Univariate time series

Characteristics of time series data

TSA with Open Power System Data

Data cleaning

Time-based indexing

Visualizing time series

Grouping time series data

Resampling time series data

Summary

Further reading

Section 3: Model Development and Evaluation

Hypothesis Testing and Regression

Technical requirements

Hypothesis testing

Hypothesis testing principle

statsmodels library

Average reading time 

Types of hypothesis testing

T-test

p-hacking

Understanding regression

Types of regression

Simple linear regression

Multiple linear regression

Nonlinear regression

Model development and evaluation

Constructing a linear regression model

Model evaluation

Computing accuracy

Understanding accuracy

Implementing a multiple linear regression model

Summary

Further reading

Model Development and Evaluation

Technical requirements

Types of machine learning

Understanding supervised learning

Regression

Classification

Understanding unsupervised learning

Applications of unsupervised learning 

Clustering using MiniBatch K-means clustering 

Extracting keywords

Plotting clusters

Word cloud

Understanding reinforcement learning

Difference between supervised and reinforcement learning

Applications of reinforcement learning

Unified machine learning workflow 

Data preprocessing

Data collection

Data analysis

Data cleaning, normalization, and transformation

Data preparation

Training sets and corpus creation

Model creation and training

Model evaluation

Best model selection and evaluation

Model deployment

Summary

Further reading

EDA on Wine Quality Data Analysis

Technical requirements

Disclosing the wine quality dataset

Loading the dataset

Descriptive statistics

Data wrangling

Analyzing red wine

Finding correlated columns

Alcohol versus quality

Alcohol versus pH

Analyzing white wine

Red wine versus white wine 

Adding a new attribute

Converting into a categorical column

Concatenating dataframes

Grouping columns

Univariate analysis

Multivariate analysis on the combined dataframe

Discrete categorical attributes

3-D visualization

Model development and evaluation

Summary

Further reading

Appendix

String manipulation

Creating strings

Accessing characters in Python 

String slicing

Deleting/updating from a string

Escape sequencing in Python

Formatting strings

Using pandas vectorized string functions

Using string functions with a pandas DataFrame

Using regular expressions

Further reading

Other Books You May Enjoy

Leave a review - let other readers know what you think

Preface

Data is a collection of discrete objects, events, and facts in the form of numbers, text, pictures, videos, objects, audio, and other entities. Processing data provides a great deal of information. But the million-dollar question is—how do we get meaningful information from data? The answer to this question is Exploratory Data Analysis (EDA), which is the process of investigating datasets, elucidating subjects, and visualizing outcomes. EDA is an approach to data analysis that applies a variety of techniques to maximize specific insights into a dataset, reveal an underlying structure, extract significant variables, detect outliers and anomalies, test assumptions, develop models, and determine best parameters for future estimations. This book, Hands-On Exploratory Data Analysis with Python, aims to provide practical knowledge about the main pillars of EDA, including data cleansing, data preparation, data exploration, and data visualization. Why visualization? Well, several research studies have shown that portraying data in graphical form makes complex statistical data analyses and business intelligence more marketable.

You will get the opportunity to explore open source datasets including healthcare datasets, demographics datasets, a Titanic dataset, a wine quality dataset, automobile datasets, a Boston housing pricing dataset, and many others. Using these real-life datasets, you will get hands-on practice in understanding data, summarizing its characteristics, and visualizing it for business intelligence purposes. This book expects you to use pandas, a powerful library for working with data, and other core Python libraries including NumPy, scikit-learn, SciPy, StatsModels for regression, and Matplotlib for visualization.

Who this book is for

This book is for anyone who intends to analyze data, including students, teachers, managers, engineers, statisticians, data analysts, and data scientists. The practical concepts presented in this hands-on book are applicable to applications in various disciplines, including linguistics, sociology, astronomy, marketing, business, management, quality control, education, economics, medicine, psychology, engineering, biology, physics, computer science, geosciences, chemistry, and any other fields where data analysis and synthesis is required in order to improve knowledge and help in decision-making processes. Fundamental understanding of Python programming and some statistical concepts is all you need to get started with this book.

What this book covers

Chapter 1, Exploratory Data Analysis Fundamentals, will help us learn and revise the fundamental aspects of EDA. We will dig into the importance of EDA and the main data analysis tasks, and try to make sense out of data. In addition to that, we will use Python to explore different types of data, including numerical data, time-series data, geospatial data, categorical data, and others.

Chapter 2, Visual Aids for EDA, will help us gain proficiency with different tools for visualizing the information that we get from investigation and make analysis much clearer. We will figure out how to use data visualization tools such as box plots, histograms, multivariate charts, and more. In addition, we will get our hands dirty plotting informative charts using real datasets. Finally, we will investigate the intuitive forms of these plots.

Chapter 3, EDA with Personal Email, will help us figure out how to import a dataset from your personal Gmail account and work on analyzing the extracted dataset. We will perform basic EDA techniques, including data loading, data cleansing, data preparation, data visualization, and data analysis, on the extracted dataset.

Chapter 4, Data Transformation, is where you will take your first steps in data wrangling. We will see how to merge database-style DataFrames, merge on the index, concatenate along an axis, combine data with overlaps, reshape with hierarchical indexing, and pivot from long to wide format. We will look at what needs to be done with a dataset before analysis takes place, such as removing duplicates, replacing values, renaming axis indexes, discretization and binning, and detecting and filtering outliers. We will work on transforming data using a function or mapping, permutation, and random sampling and computing indicators/dummy variables.

Chapter 5, Descriptive Statistics, will teach you about essential statistical measures for gaining insights about data that are not noticeable at the surface level. We will become familiar with the equations for computing the variance and standard deviation of datasets as well as figuring out percentiles and quartiles. Furthermore, we will visualize those statistical measures. We will use tools such as box plots to gain knowledge from statistics.

Chapter 6, Grouping Datasets, will cover the rudiments of grouping and how it can change our datasets in order to help us to analyze them better. We will look at different group-by mechanics that split our dataset into groups on which we can perform aggregate operations. We will also figure out how to analyze categorical data with visualizations, utilizing pivot tables and cross-tabulations.

Chapter 7, Correlation, will help us to understand the correlation between different factors and to identify to what degree different factors are relevant. We will learn about the different kinds of examinations that we can carry out to discover the relationships between data, including univariate analysis, bivariate analysis, and multivariate analysis on the Titanic dataset, as well as looking at Simpson's paradox. We will observe how correlation does not always equal causation.

Chapter 8, Time Series Analysis, will help us to understand time-series data and how to perform EDA on it. We will use the open power system data for time series analysis.

Chapter 9, Hypothesis Testing and Regression, will help us learn about hypothesis testing and linear, non-linear, and multiple linear regression. We will build a basis for model development and evaluation. We will be using polynomial regression and pipelines for model evaluation.

Chapter 10, Model Development and Evaluation, will help us learn about a unified machine learning approach and discuss different types of machine learning algorithms and evaluation techniques. Moreover, in this chapter, we are going to perform the unsupervised learning task of clustering with text data. Furthermore, we will discuss model selection and model deployment techniques.

Chapter 11, EDA on Wine Quality Data, will teach us how to use all the techniques learned throughout the book to perform advanced EDA on a wine quality dataset. We will import the dataset, research the variables, slice the data based on different points of interest, and perform data analysis.

To get the most out of this book

All the EDA activities in this book are based on Python 3.x. So, the first and foremost requirement to run any code from this book is for you to have Python 3.x installed on your computer irrespective of the operating system. Python can be installed on your system by following the documentation on its official website: https://www.python.org/downloads/.

Here is the software that needs to be installed in order to execute the code:

Software/hardware covered in the book | OS requirements
Python 3.x | Windows, macOS, Linux, or any other OS
Python notebooks | There are several options: Local: Jupyter: https://jupyter.org/; Local: https://www.anaconda.com/distribution/; Online: https://colab.research.google.com/
Python libraries | NumPy, pandas, scikit-learn, Matplotlib, Seaborn, StatsModels

We primarily used Python notebooks to execute our code. One of the reasons for that is, with them, it is relatively easy to break code into a clear structure and see the output on the fly. It is always safer to install a notebook locally. The official website holds great information on how they can be installed. However, if you do not want the hassle and simply want to start learning immediately, then Google Colab provides a great platform where you can code and execute code using both Python 2.x and Python 3.x with support for Graphics Processing Units (GPUs) and Tensor Processing Units (TPUs).

If you are using the digital version of this book, we advise you to type the code yourself or access the code via the GitHub repository (link available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.
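Before working through the chapters, it can help to confirm that the interpreter and the libraries listed above are available. The following is a minimal sketch using only the standard library; the `REQUIRED` mapping and the `missing_packages` helper are hypothetical names introduced here for illustration, and the import names (for example, `sklearn` for scikit-learn) are the usual ones rather than anything specified by the book.

```python
import importlib.util
import sys

# Hypothetical mapping: install name -> import name for the libraries listed above.
REQUIRED = {
    "numpy": "numpy",
    "pandas": "pandas",
    "scikit-learn": "sklearn",
    "matplotlib": "matplotlib",
    "seaborn": "seaborn",
    "statsmodels": "statsmodels",
}


def missing_packages(names=REQUIRED):
    """Return the install names of packages that cannot be found for import."""
    return [
        install_name
        for install_name, import_name in names.items()
        if importlib.util.find_spec(import_name) is None
    ]


# The book's code assumes Python 3.x.
if sys.version_info < (3,):
    print("Python 3.x is required")

missing = missing_packages()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All listed libraries are available")
```

Any package reported as missing can then be installed with pip or through the Anaconda distribution mentioned above.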

Download the example code files

You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packtpub.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

Log in to your account at www.packt.com.

Select the Support tab.

Click on Code Downloads.

Enter the name of the book in the search box and follow the onscreen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR/7-Zip for Windows

Zipeg/iZip/UnRarX for Mac

7-Zip/PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/hands-on-exploratory-data-analysis-with-python. In case there's an update to the code, it will be updated on the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://static.packt-cdn.com/downloads/9781789537253_ColorImages.pdf.

Conventions used

There are a number of text conventions used throughout this book.

CodeInText: Indicates code words in the text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "we visualized a time series dataset using the matplotlib and seaborn libraries."

A block of code is set as follows:

import os
import numpy as np
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns

Any command-line input or output is written as follows:

> pip install virtualenv

virtualenv Local_Version_Directory -p Python_System_Directory

Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "Time series data may contain a notable amount of outliers."

Warnings or important notes appear like this.

Tips and tricks appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata, select your book, click on the Errata Submission Form link, and enter the details.

Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packt.com.

Section 1: The Fundamentals of EDA

The main objective of this section is to cover the fundamentals of Exploratory Data Analysis (EDA) and understand different stages of the EDA process. We will also look at the key concepts of profiling, quality assessment, the main aspects of EDA, and the challenges and opportunities in EDA. In addition to this, we will be discovering different useful visualization techniques. Finally, we will be discussing essential data transformation techniques, including database-style dataframe merges, transformation techniques, and benefits of data transformation.

This section contains the following chapters:

Chapter 1, Exploratory Data Analysis Fundamentals

Chapter 2, Visual Aids for EDA

Chapter 3, EDA with Personal Email

Chapter 4, Data Transformation

Exploratory Data Analysis Fundamentals

The main objective of this introductory chapter is to revise the fundamentals of Exploratory Data Analysis (EDA), what it is, the key concepts of profiling and quality assessment, the main dimensions of EDA, and the main challenges and opportunities in EDA.

Data encompasses a collection of discrete objects, numbers, words, events, facts, measurements, observations, or even descriptions of things. Such data is collected and stored by every event or process occurring in several disciplines, including biology, economics, engineering, marketing, and others. Processing such data elicits useful information and processing such information generates useful knowledge. But an important question is: how can we generate meaningful and useful information from such data? An answer to this question is EDA. EDA is a process of examining the available dataset to discover patterns, spot anomalies, test hypotheses, and check assumptions using statistical measures. In this chapter, we are going to discuss the steps involved in performing top-notch exploratory data analysis and get our hands dirty using some open source databases.

As mentioned here and in several studies, the primary aim of EDA is to examine what data can tell us before actually going through formal modeling or hypothesis formulation. John Tukey promoted EDA to statisticians as a way to examine and explore data and to create new hypotheses that could be used to develop new approaches to data collection and experimentation.

In this chapter, we are going to learn and revise the following topics:

Understanding data science

The significance of EDA

Making sense of data

Comparing EDA with classical and Bayesian analysis

Software tools available for EDA

Getting started with EDA

The significance of EDA

Different fields of science, economics, engineering, and marketing accumulate and store data primarily in electronic databases. Appropriate and well-established decisions should be made using the data collected. It is practically impossible to make sense of datasets containing more than a handful of data points without the help of computer programs. To be certain of the insights that the collected data provides and to make further decisions, data mining is performed where we go through distinctive analysis processes. Exploratory data analysis is key, and usually the first exercise in data mining. It allows us to visualize data to understand it as well as to create hypotheses for further analysis. The exploratory analysis centers around creating a synopsis of data or insights for the next steps in a data mining project.

EDA reveals the ground truth about the content of a dataset without making any underlying assumptions. This is why data scientists use this process to understand what type of modeling and hypotheses can be created. Key components of exploratory data analysis include summarizing data, statistical analysis, and visualization of data. Python provides expert tools for exploratory analysis, with pandas for summarizing; scipy, along with others, for statistical analysis; and matplotlib and plotly for visualizations.

That makes sense, right? Of course it does. That is one of the reasons why you are going through this book. Now that we understand the significance of EDA, let's look at the most generic steps involved in EDA in the next section.
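The three key components named above (summarizing, statistical analysis, and visualization) can be sketched in a few lines. This is only an illustrative sketch, not the book's own example: the column names and the randomly generated values are assumptions invented here, and `plot_histogram` is a hypothetical helper that is defined but not called.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical toy data: the columns and values are invented for illustration.
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({
    "height_cm": rng.normal(170, 10, size=200),
    "weight_kg": rng.normal(70, 12, size=200),
})

# 1. Summarizing data with pandas: count, mean, std, min, quartiles, max.
summary = df.describe()
print(summary)

# 2. Statistical analysis with SciPy: Pearson correlation between two columns.
r, p_value = stats.pearsonr(df["height_cm"], df["weight_kg"])
print(f"Pearson r = {r:.3f}, p-value = {p_value:.3f}")


# 3. Visualization with Matplotlib: a histogram of one column saved to a file.
def plot_histogram(frame, column, path="histogram.png"):
    """Hypothetical helper: save a histogram of frame[column] and return the path."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.hist(frame[column], bins=20)
    ax.set_xlabel(column)
    ax.set_ylabel("count")
    fig.savefig(path)
    plt.close(fig)
    return path
```

Calling `plot_histogram(df, "height_cm")` would write the chart to `histogram.png`; in a notebook, the figure could be displayed inline instead.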

Steps in EDA

Having understood what EDA is, and its significance, let's understand the various steps involved in data analysis. Basically, it involves four different steps. Let's go through each of them to get a brief understanding of each step:

Problem definition: Before trying to extract useful insight from the data, it is essential to define the business problem to be solved. The problem definition works as the driving force for a data analysis plan execution. The main tasks involved in problem definition are defining the main objective of the analysis, defining the main deliverables, outlining the main roles and responsibilities, obtaining the current status of the data, defining the timetable, and performing cost/benefit analysis. Based on such a problem definition, an execution plan can be created.

Data preparation: This step involves methods for preparing the dataset before actual analysis. In this step, we define the sources of data, define data schemas and tables, understand the main characteristics of the data, clean the dataset, delete non-relevant datasets, transform the data, and divide the data into required chunks for analysis.

Data analysis: This is one of the most crucial steps that deals with descriptive statistics and analysis of the data. The main tasks involve summarizing the data, finding the hidden correlation and relationships among the data, developing predictive models, evaluating the models, and calculating the accuracies. Some of the techniques used for data summarization are summary tables, graphs, descriptive statistics, inferential statistics, correlation statistics, searching, grouping, and mathematical models.

Development and representation of the results: This step involves presenting the dataset to the target audience in the form of graphs, summary tables, maps, and diagrams. This is also an essential step as the result analyzed from the dataset should be interpretable by the business stakeholders, which is one of the major goals of EDA. Most of the graphical analysis techniques include scatter plots, character plots, histograms, box plots, residual plots, mean plots, and others. We will explore several types of graphical representation in Chapter 2, Visual Aids for EDA.
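The data preparation and data analysis steps described above can be illustrated with a small pandas sketch. Everything in it is an assumption made for illustration: the `raw` table, its column names, and the `prepare` and `analyze` helpers are hypothetical and do not come from the book's datasets.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data, invented for illustration only.
raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4, 5],
    "age": [34, 45, 45, np.nan, 29, 52],
    "monthly_spend": [120.0, 80.5, 80.5, 200.0, np.nan, 150.0],
    "notes": ["a", "b", "b", "c", "d", "e"],
})


def prepare(frame):
    """Data preparation: remove duplicates, irrelevant columns, and missing values."""
    cleaned = frame.drop_duplicates()
    cleaned = cleaned.drop(columns=["notes"])  # treated as non-relevant here
    cleaned = cleaned.dropna(subset=["age", "monthly_spend"])
    return cleaned.reset_index(drop=True)


def analyze(frame):
    """Data analysis: descriptive statistics and a simple correlation."""
    summary = frame[["age", "monthly_spend"]].describe()
    correlation = frame["age"].corr(frame["monthly_spend"])
    return summary, correlation


prepared = prepare(raw)
summary, correlation = analyze(prepared)

# Representation of the results: a summary table for the target audience.
print(summary.round(2))
print(f"Correlation between age and monthly spend: {correlation:.2f}")
```

With so few rows, the correlation here carries no real meaning; the point is only the order of the steps, with cleaning done before any statistics are computed.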