Data Literacy With Python

Mercury Learning and Information
Description

This book ushers readers into the world of data, emphasizing its importance in modern industries and how its management leads to insightful decision-making. Using Python 3, the book introduces foundational data tasks and progresses to advanced model training concepts. Detailed, step-by-step Python examples help readers master training models, starting with the kNN algorithm and moving to other classifiers with minimal code adjustments. Tools like Sweetviz, Skimpy, Matplotlib, and Seaborn are introduced for hands-on chart and graph rendering.
The course begins with working with data, detecting outliers and anomalies, and cleaning datasets. It then introduces statistics and progresses to using Matplotlib and Seaborn for data visualization. Each chapter builds on the previous one, ensuring a comprehensive understanding of data management and analysis.
These concepts are crucial for making data-driven decisions. This book transitions readers from basic data handling to advanced model training, blending theoretical knowledge with practical skills. Companion files with source code and data sets enhance the learning experience, making this book an invaluable resource for mastering data science with Python.




LICENSE, DISCLAIMER OF LIABILITY, AND LIMITED WARRANTY

By purchasing or using this book and companion files (the “Work”), you agree that this license grants permission to use the contents contained herein, including the disc, but does not give you the right of ownership to any of the textual content in the book / disc or ownership to any of the information or products contained in it. This license does not permit uploading of the Work onto the Internet or on a network (of any kind) without the written consent of the Publisher. Duplication or dissemination of any text, code, simulations, images, etc. contained herein is limited to and subject to licensing terms for the respective products, and permission must be obtained from the Publisher or the owner of the content, etc., in order to reproduce or network any portion of the textual material (in any media) that is contained in the Work.

MERCURY LEARNING AND INFORMATION (“MLI” or “the Publisher”) and anyone involved in the creation, writing, or production of the companion disc, accompanying algorithms, code, or computer programs (“the software”), and any accompanying Web site or software of the Work, cannot and do not warrant the performance or results that might be obtained by using the contents of the Work. The author, developers, and the Publisher have used their best efforts to ensure the accuracy and functionality of the textual material and/or programs contained in this package; we, however, make no warranty of any kind, express or implied, regarding the performance of these contents or programs. The Work is sold “as is” without warranty (except for defective materials used in manufacturing the book or due to faulty workmanship).

The author, developers, and the publisher of any accompanying content, and anyone involved in the composition, production, and manufacturing of this work will not be liable for damages of any kind arising out of the use of (or the inability to use) the algorithms, source code, computer programs, or textual material contained in this publication. This includes, but is not limited to, loss of revenue or profit, or other incidental, physical, or consequential damages arising out of the use of this Work.

The sole remedy in the event of a claim of any kind is expressly limited to replacement of the book and/or disc, and only at the discretion of the Publisher. The use of “implied warranty” and certain “exclusions” vary from state to state, and might not apply to the purchaser of this product.

Companion files for this title are available by writing to the publisher at [email protected].

Copyright ©2024 by MERCURY LEARNING AND INFORMATION. An Imprint of DeGruyter, Inc. All rights reserved.

This publication, portions of it, or any accompanying software may not be reproduced in any way, stored in a retrieval system of any type, or transmitted by any means, media, electronic display or mechanical display, including, but not limited to, photocopy, recording, Internet postings, or scanning, without prior permission in writing from the publisher.

Publisher: David Pallai

MERCURY LEARNING AND INFORMATION

121 High Street, 3rd Floor

Boston, MA 02110

[email protected]

www.merclearning.com

800-232-0223

O. Campesato. Data Literacy with Python.

ISBN 978-1-50152-199-7

The publisher recognizes and respects all marks used by companies, manufacturers, and developers as a means to distinguish their products. All brand names and product names mentioned in this book are trademarks or service marks of their respective companies. Any omission or misuse (of any kind) of service marks or trademarks, etc. is not an attempt to infringe on the property of others.

Library of Congress Control Number: 2023945518

23 24 25 3 2 1

This book is printed on acid-free paper in the United States of America.

Our titles are available for adoption, license, or bulk purchase by institutions, corporations, etc. For additional information, please contact the Customer Service Dept. at 800-232-0223 (toll free).

All of our titles are available in digital format at academiccourseware.com and other digital vendors. Companion files (code listings) for this title are available by contacting [email protected]. The sole obligation of MERCURY LEARNING AND INFORMATION to the purchaser is to replace the disc, based on defective materials or faulty workmanship, but not based on the operation or functionality of the product.

I’d like to dedicate this book to my parents

– may this bring joy and happiness into their lives.

CONTENTS

Preface

Chapter 1: Working With Data

What Is Data Literacy?

Exploratory Data Analysis (EDA)

Where Do We Find Data?

Dealing With Data: What Can Go Wrong?

Explanation of Data Types

Working With Data Types

What Is Drift?

Discrete Data Versus Continuous Data

“Binning” Data Values

Correlation

Working With Synthetic Data

Summary

References

Chapter 2: Outlier and Anomaly Detection

Working With Outliers

Finding Outliers With NumPy

Finding Outliers With Pandas

Fraud Detection

Techniques for Anomaly Detection

Working With Imbalanced Datasets

Summary

Chapter 3: Cleaning Datasets

Analyzing Missing Data

Pandas, CSV Files, and Missing Data

Missing Data and Imputation

Data Normalization

Handling Categorical Data

Data Wrangling

Summary

Chapter 4: Introduction to Statistics

Basic Concepts in Statistics

Random Variables

Multiple Random Variables

Basic Concepts in Statistics

The Variance and Standard Deviation

Sampling Techniques for a Population

The Confusion Matrix

Calculating Expected Values

Summary

References

Chapter 5: Matplotlib and Seaborn

What Is Data Visualization?

What Is Matplotlib?

Matplotlib Styles

Display Attribute Values

Color Values in Matplotlib

Cubed Numbers in Matplotlib

Horizontal Lines in Matplotlib

Slanted Lines in Matplotlib

Parallel Slanted Lines in Matplotlib

Lines and Labeled Vertices in Matplotlib

A Dotted Grid in Matplotlib

Lines in a Grid in Matplotlib

Two Lines and a Legend in Matplotlib

Loading Images in Matplotlib

A Set of Line Segments in Matplotlib

Plotting Multiple Lines in Matplotlib

A Histogram in Matplotlib

Plot Bar Charts

Plot a Pie Chart

Heat Maps

Save Plot as a PNG File

Working With SweetViz

Working With Skimpy

Working With Seaborn

Seaborn Dataset Names

Seaborn Built-In Datasets

The Iris Dataset in Seaborn

The Titanic Dataset in Seaborn

Extracting Data From Titanic Dataset in Seaborn

Visualizing a Pandas DataFrame in Seaborn

Seaborn Heat Maps

Seaborn Pair Plots

Summary

Appendix A: Introduction to Python

Tools for Python

Python Installation

Setting the PATH Environment Variable (Windows Only)

Launching Python on Your Machine

Python Identifiers

Lines, Indentation, and Multilines

Quotation and Comments in Python

Saving Your Code in a Module

Some Standard Modules in Python

The help() and dir() Functions

Compile Time and Runtime Code Checking

Simple Data Types in Python

Working With Numbers

Working With Fractions

Unicode and UTF-8

Working With Unicode

Working With Strings

Uninitialized Variables and the Value None in Python

Slicing and Splicing Strings

Search and Replace a String in Other Strings

Remove Leading and Trailing Characters

Printing Text Without NewLine Characters

Text Alignment

Working With Dates

Exception Handling in Python

Handling User Input

Python and Emojis (Optional)

Command-Line Arguments

Summary

Appendix B: Introduction to Pandas

What Is Pandas?

A Pandas DataFrame With a NumPy Example

Describing a Pandas DataFrame

Pandas Boolean DataFrames

Pandas DataFrames and Random Numbers

Reading CSV Files in Pandas

The loc() and iloc() Methods in Pandas

Converting Categorical Data to Numeric Data

Matching and Splitting Strings in Pandas

Converting Strings to Dates in Pandas

Working With Date Ranges in Pandas

Detecting Missing Dates in Pandas

Interpolating Missing Dates in Pandas

Other Operations With Dates in Pandas

Merging and Splitting Columns in Pandas

Reading HTML Web Pages in Pandas

Saving a Pandas DataFrame as an HTML Web Page

Summary

Index

PREFACE

The purpose of this book is to usher readers into the world of data, ensuring a comprehensive understanding of its nuances, intricacies, and complexities. With Python 3 as the primary medium, the book underscores the pivotal role of data in modern industries, and how its adept management can lead to insightful decision-making.

THE CORE PROPOSITION

At its heart, the book provides a swift introduction to foundational data-related tasks, priming readers for the more advanced concepts of model training introduced later on. Through detailed, step-by-step Python code examples, readers will traverse the journey of training models, beginning with the kNN algorithm and then smoothly transitioning to other classifiers by tweaking just a few lines of code.

FROM BASICS TO VISUALIZATION

The narrative commences with a dive into datasets and potential issues, gradually segueing into more intricate topics like anomaly detection and data cleaning. As one progresses, the guide unfolds the intricacies of classification algorithms, followed by a deep dive into data visualization. Here, tools like Sweetviz, Skimpy, Matplotlib, and Seaborn are introduced, offering readers a hands-on experience in rendering charts and graphs.

TECHNICAL PREREQUISITES

To derive the maximum value from this book, a foundational grasp of Python 3.x is requisite. While some sections might necessitate a preliminary understanding of the ‘awk’ utility, the majority of the content is dedicated to Python’s prowess. Familiarity with Pandas, especially its data frames, will further enhance the reader’s journey.

CODE VARIETIES

Appreciating the diversity in learning styles, the book encapsulates a blend of short, detailed, and progressive code samples. This variety ensures that whether one is a hands-on coder, who jumps straight into execution, or a contemplative reader, who ponders over logic, there's something for everyone.

GLOBAL AUDIENCE, GLOBAL LANGUAGE

Designed for individuals beginning their foray into machine learning, the language caters to a global audience. By intentionally steering clear of colloquialisms, and adopting a standard English approach, it ensures content clarity for readers, irrespective of their linguistic backgrounds.

THE ESSENCE OF THE CODE

While the enclosed code samples are comprehensive, their essence lies in their clarity. They are meticulously designed to elucidate the underlying concepts rather than emphasize efficiency or brevity. However, readers are encouraged to optimize, experiment, and improvise, making the code their own.

Companion files with source code and data sets are available by writing to the publisher at [email protected].

BEYOND THE CODE

While “Data Literacy with Python” is predominantly a technical guide, it champions the idea that the most potent tool is a curious mind. A genuine intrigue for data, complemented by the determination to decipher code samples, is what will make this journey truly transformative.

O. Campesato

October 2023

CHAPTER 1

WORKING WITH DATA

This chapter shows you how to analyze data types that you will encounter in datasets, such as currency and dates, as well as scaling data values in order to ensure that a dataset has “clean” data.

The first part of this chapter briefly discusses some aspects of EDA (exploratory data analysis), such as data quality, data-centric AI versus model-centric AI, as well as some of the steps involved in data cleaning and data wrangling. You will also see an EDA code sample involving the Titanic dataset.

The second part of this chapter describes common types of data, such as binary, nominal, ordinal, and categorical data. In addition, you will learn about continuous versus discrete data, qualitative versus quantitative data, and types of statistical data.

The third part introduces the notion of data drift and data leakage, followed by model selection. This section also describes how to process categorical data and how to map categorical data to numeric data.

Keep in mind that the code samples in this chapter utilize NumPy and Pandas, both of which are discussed in a corresponding appendix.

WHAT IS DATA LITERACY?

There are various definitions of data literacy that involve concepts such as data, meaningful information, decision-making, drawing conclusions, chart reading, and so forth. According to Wikipedia, which we’ll use as a starting point, data literacy is defined as follows:

Data literacy is the ability to read, understand, create, and communicate data as information. Much like literacy as a general concept, data literacy focuses on the competencies involved in working with data. It is, however, not similar to the ability to read text since it requires certain skills involving reading and understanding data. (Wikipedia, 2023)

Data literacy encompasses many topics, starting with analyzing data that is often in the form of a CSV (comma-separated values) file. The quality of the data in a dataset is of paramount importance: high data quality enables you to make more reliable inferences regarding the nature of the data. Indeed, high data quality is a requirement for fields such as machine learning, scientific experiments, and so forth. However, keep in mind that you might face various challenges regarding robust data, such as:

• a limited amount of available data

• costly acquisition of relevant data

• difficulty in generating valid synthetic data

• availability of domain experts

Depending on the domain, the cost of data cleaning can involve months of work at a cost of millions of dollars. For instance, identifying images of cats and dogs is essentially trivial, whereas identifying potential tumors in x-rays is much more costly and requires highly skilled individuals.

With all the preceding points in mind, let’s take a look at EDA (exploratory data analysis), which is the topic of the next section.

EXPLORATORY DATA ANALYSIS (EDA)

According to Wikipedia, EDA involves analyzing datasets to summarize their main characteristics, often with visual methods. EDA also involves searching through data to detect patterns (if there are any) and anomalies, and in some cases, testing hypotheses regarding the distribution of the data.

EDA represents the initial phase of data analysis, whereby data is explored in order to determine its primary characteristics. Moreover, this phase involves detecting patterns (if any) and any outstanding issues pertaining to the data. The purpose of EDA is to obtain an understanding of the semantics of the data without performing a deep assessment of its nature. The analysis is often performed through data visualization in order to produce a summary of the data's most important characteristics. The four types of EDA are listed here:

• univariate nongraphical

• multivariate nongraphical

• univariate graphical

• multivariate graphical

In brief, the two primary methods for data analysis are qualitative data analysis techniques and quantitative data analysis techniques.

As an example of exploratory data analysis, consider the plethora of cell phones that customers can purchase for various needs (work, home, minors, and so forth). Visualizing the data in an associated dataset can reveal the top ten (or top three) most popular cell phones, potentially broken down by state (or province) and country.

An example of quantitative data analysis involves measuring (quantifying) data, which can be gathered from physical devices, surveys, or activities such as downloading applications from a Web page.

Common visualization techniques used in EDA include histograms, line graphs, bar charts, box plots, and multivariate charts.
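As a concrete (if minimal) illustration of these ideas, the following sketch performs a few basic EDA steps with Pandas, Matplotlib, and Seaborn, assuming the Titanic dataset that ships with Seaborn; this listing is illustrative and not part of the book's companion files:

import seaborn as sns
import matplotlib.pyplot as plt

# load the Titanic dataset that is bundled with Seaborn
df = sns.load_dataset("titanic")

# univariate nongraphical EDA: summary statistics and missing-value counts
print(df.describe())
print(df.isnull().sum())

# univariate graphical EDA: a histogram of passenger ages
df["age"].hist(bins=20)
plt.xlabel("age")
plt.ylabel("count")
plt.title("Distribution of Passenger Ages")
plt.show()

# multivariate graphical EDA: a box plot of fare by passenger class
sns.boxplot(data=df, x="pclass", y="fare")
plt.show()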

What Is Data Quality?

According to Wikipedia, data quality refers to “the state of qualitative or quantitative pieces of information” (Wikipedia, 2022). Furthermore, high data quality refers to data whose quality meets the various needs of an organization. In particular, performing data cleaning tasks helps to achieve high data quality.

When companies label their data, they obviously strive for a high quality of labeled data, and yet the quality can be adversely affected in various ways, some of which are as follows:

• inaccurate methodology for labeling data

• insufficient data accuracy

• insufficient attention to data management

The cumulative effect of the preceding (and other) types of errors can be significant, to the extent that models underperform in a production environment. In addition to the technical aspects, underperforming models can have an adverse effect on business revenue.

Related to data quality is data quality assurance, which typically involves the data cleaning tasks that are discussed later in this chapter, after which data is analyzed to detect potential inconsistencies and to determine how to resolve them. Another aspect to consider: the aggregation of additional data sources, especially heterogeneous sources of data, can introduce challenges with respect to ensuring data quality. Other concepts related to data quality include data stewardship and data governance, both of which are discussed in multiple online articles.

Data-Centric AI or Model-Centric AI?

A model-centric approach focuses primarily on enhancing the performance of a given model, with data considered secondary in importance. In fact, during the past ten years or so, the emphasis of AI has been on a model-centric approach. Note that during this time span some very powerful models and architectures have been developed, such as the CNN model for image classification in 2012 and the enormously influential (especially in NLP) models based on the transformer architecture that was introduced in 2017.

By contrast, a data-centric approach concentrates on improving data, which relies on several factors, such as the quality of labels for the data as well as obtaining accurate data for training a model.

Given the importance of high-quality data with respect to training a model, it stands to reason that using a data-centric approach instead of a model-centric approach can result in higher-quality models in AI. While data quality and model effectiveness are both important, keep in mind that the data-centric approach is becoming increasingly strategic in the machine learning world. More information can be found on the AI Multiple site: https://research.aimultiple.com/data-centric-ai/

The Data Cleaning and Data Wrangling Steps

After data has been collected, the next step often involves data cleaning in order to find and correct errors in the dataset, such as missing data, duplicate data, or invalid data. This task also involves data consistency, which pertains to updating different representations of the same value in a consistent manner. As a simple example, suppose that a Web page contains a form with an input field whose valid input is either Y or N, but users are able to enter Yes, Ys, or ys as text input. Obviously, these values correspond to the value Y, and they must all be converted to the same value in order to achieve data consistency.
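As a minimal sketch of that consistency step (the column name answer and the sample values are hypothetical), the variant spellings can be mapped to a single canonical value with Pandas:

import pandas as pd

# hypothetical survey responses with inconsistent spellings of "yes" and "no"
df = pd.DataFrame({"answer": ["Y", "Yes", "Ys", "ys", "N", "n"]})

# normalize: strip whitespace, lowercase, keep the first letter, uppercase it
df["answer"] = df["answer"].str.strip().str.lower().str[0].str.upper()

print(df["answer"].unique())   # ['Y' 'N']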

Finally, data wrangling can be performed after the data cleaning task is completed. Although interpretations of data wrangling do vary, in this book the term refers to transforming datasets into different formats as well as combining two or more datasets. Hence, data wrangling does not examine the individual data values to determine whether or not they are valid: this step is performed during data cleaning.

Keep in mind that sometimes it’s worthwhile to perform another data cleaning step after the data wrangling step. For example, suppose that two CSV files contain employee-related data, and you merge these CSV files into a third CSV file. The newly created CSV file might contain duplicate values: it’s certainly possible to have two people with the same name (such as John Smith), which obviously needs to be resolved.
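The following sketch illustrates that scenario, assuming two hypothetical CSV files (employees1.csv and employees2.csv) with the same columns, including a unique employee_id column:

import pandas as pd

# data wrangling step: combine two datasets that share the same structure
df1 = pd.read_csv("employees1.csv")
df2 = pd.read_csv("employees2.csv")
merged = pd.concat([df1, df2], ignore_index=True)

# follow-up data cleaning step: two different people can share a name,
# so deduplicate on a column that is actually unique (employee_id)
deduped = merged.drop_duplicates(subset=["employee_id"])
deduped.to_csv("employees_all.csv", index=False)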

ELT and ETL

ELT is an acronym for extract, load, and transform, which is a pipeline-based approach for managing data. Another pipeline-based approach is called ETL (extract, transform, load), which is actually more popular than ELT. However, ELT has the following advantages over ETL:

• ELT requires less computational time.

• ELT is well-suited for processing large datasets.

• ELT is more cost effective than ETL.

ELT involves (1) extracting data from one or more sources, (2) loading the raw data into a data warehouse, and (3) transforming the data inside the warehouse into a suitable format. The data in the warehouse then becomes available for additional analysis.
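The following sketch mimics an ELT flow on a small scale, using a local SQLite database as a stand-in for the data warehouse; the file name sales.csv and the columns sale_date and amount are hypothetical:

import sqlite3
import pandas as pd

# extract: read raw data from a source (a hypothetical CSV file)
raw = pd.read_csv("sales.csv")

# load: write the raw, untransformed rows into the "warehouse"
conn = sqlite3.connect("warehouse.db")
raw.to_sql("raw_sales", conn, if_exists="replace", index=False)

# transform: perform the transformation inside the warehouse with SQL
conn.execute("""
    CREATE TABLE IF NOT EXISTS monthly_sales AS
    SELECT strftime('%Y-%m', sale_date) AS month, SUM(amount) AS total
    FROM raw_sales
    GROUP BY month
""")
conn.commit()
conn.close()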

WHERE DO WE FIND DATA?

Data resides in many locations, with different formats, languages, and currencies. An important task involves finding the sources of relevant data and then aggregating that data in a meaningful fashion. Some examples of sources of data are as follows:

• CSV/TSV files

• RDBMS tables

• NoSQL tables

• Web Services

The following subsections briefly describe some of the details that are involved with each of the items in the preceding bullet list.

Working With CSV Files

A CSV file (comma-separated values) or TSV file (tab-separated values) is a common source of data, and other delimiters (semi-colons, “#” symbols, and so forth) can also appear in a text file with data. Moreover, you might need to combine multiple CSV files into a single file that contains the data to perform an accurate analysis.

As a simple example, the following snippet displays a portion of the titanic.csv dataset:

survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True

As you can see, there are many columns (also called “features”) in the preceding set of data. When you perform machine learning, you need to determine which of those columns provide meaningful data. Notice the survived attribute: this is known as the target feature, which contains the values that you are trying to predict correctly. The prediction of who survives is based on identifying the columns (features) that are relevant in making such a prediction.

For example, the sex, age, and class features are most likely relevant for determining whether or not a passenger survived the fate of the Titanic. How do you know if you have selected all the relevant features, and only the relevant features?

There are two main techniques for doing so. In some datasets it’s possible to visually inspect the features of a dataset in order to determine the most important features. Loosely speaking, when you “eyeball” the data to determine the set of relevant features, that’s called feature selection. This approach can be viable when there is a relatively small number of features in the dataset (i.e., ten or fewer features).

On the other hand, it’s very difficult to visually determine the relevant features in a dataset that contains 5,000 columns. Fortunately, you can use an algorithm such as PCA (Principal Component Analysis) to determine which features are significant. The use of such an algorithm (and there are others as well) is called feature extraction.
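As a rough sketch of feature extraction (this is not one of the book's listings), scikit-learn's PCA can reduce a wide numeric dataset to a handful of components; the synthetic data below simply stands in for a real dataset:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for a wide numeric dataset: 200 rows, 50 features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))

# PCA is sensitive to scale, so standardize the features first
X_scaled = StandardScaler().fit_transform(X)

# keep enough components to explain 90% of the variance
pca = PCA(n_components=0.90)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                    # (200, k) for some k <= 50
print(pca.explained_variance_ratio_[:5])  # variance explained per component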

Moreover, it's important to enlist the aid of a so-called domain expert (which might be you) who can assist in determining the most important features of a dataset and who can also determine whether any important features are missing.

Working With RDBMS Data

An RDBMS (relational database management system) stores data in a structured manner by utilizing database tables whose structure is defined by you. For example, suppose that you have an online store that sells products, and you want to keep track of customers, purchase orders, and inventory.

One approach involves defining a customers table, which has the following (simplified) structure:

CREATE TABLE customers (
  cust_id      INTEGER,
  first_name   VARCHAR(20),
  last_name    VARCHAR(20),
  home_address VARCHAR(20),
  city         VARCHAR(20),
  state        VARCHAR(20),
  zip_code     VARCHAR(10)
);

Next, you can use SQL (structured query language) statements in order to insert data into the customers table, as shown here:

INSERT INTO customers VALUES
  (1000,'John','Smith','123 Main St','Fremont','CA','94123');

INSERT INTO customers VALUES
  (2000,'Jane','Jones','456 Front St','Fremont','CA','95015');

In a real application you obviously need real data, which you can gather from a Web registration page that enables users to register for your Web application (we’ll skip those details).

If you use an RDBMS such as MySQL, you can define a database and database tables, such as the customers table described previously. The following SQL statement displays the structure of the customers table that was defined previously:

mysql> DESCRIBE customers;
+---------+----------+------+-----+---------+-------+
| Field   | Type     | Null | Key | Default | Extra |
+---------+----------+------+-----+---------+-------+
| cust_id | int      | YES  |     | NULL    |       |
| name    | char(30) | YES  |     | NULL    |       |
| address | char(30) | YES  |     | NULL    |       |
| email   | char(30) | YES  |     | NULL    |       |
+---------+----------+------+-----+---------+-------+
4 rows in set (0.03 sec)

After manually inserting data with a SQL INSERT statement, you can retrieve the data from the customers table via a SQL SELECT statement, such as SELECT * FROM customers;

In simplified terms, an RDBMS involves the following tasks:

• Define the relevant tables

• Insert meaningful data into the tables

• Select useful data from the tables

One way to insert data involves programmatically loading data from CSV files into the database tables. An RDBMS also provides many useful features, including the ability to export the data from all the tables in a database; the export file can be a single SQL file that contains all the SQL statements required to re-create the relevant tables and to insert the existing data (i.e., data that you already inserted) into those tables.
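One hedged way to perform such a programmatic load from Python, using SQLite for illustration rather than MySQL and a hypothetical customers.csv file, is shown here:

import sqlite3
import pandas as pd

# read the (hypothetical) CSV file and load its rows into a database table
df = pd.read_csv("customers.csv")

conn = sqlite3.connect("store.db")
df.to_sql("customers", conn, if_exists="append", index=False)

# verify the load with a simple query
for row in conn.execute("SELECT * FROM customers LIMIT 3"):
    print(row)
conn.close()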

You also need a purchase orders table to keep track of which customers have made purchases from your store. An example of the structure of a purchase orders table is shown here:

mysql> DESCRIBE purch_orders;
+-----------+------+------+-----+---------+-------+
| Field     | Type | Null | Key | Default | Extra |
+-----------+------+------+-----+---------+-------+
| cust_id   | int  | YES  |     | NULL    |       |
| purch_id  | int  | YES  |     | NULL    |       |
| line_item | int  | YES  |     | NULL    |       |
+-----------+------+------+-----+---------+-------+
3 rows in set (0.01 sec)

Notice that each row in the purch_orders table contains a cust_id and a purch_id column: that’s because a purchase order is associated with a customer, and a customer can place one or more purchase orders. In database parlance, the customers table has a one-to-many relationship with the purchase orders table, and every row in the latter table must have an associated row in the customers table (and those that do not are called “orphans”).

In fact, there is also a one-to-many relationship between the purchase orders table and the item_desc table, where the latter contains information about each product that was purchased in a given purchase order. Note that each row in a purchase order is called a line item.

Working With NoSQL Data

A NoSQL database is useful when the data that you manage does not have a fixed structure. Examples of popular NoSQL databases are MongoDB and Cassandra.

Instead of defining a fixed structure for tables, you can populate a NoSQL database dynamically with documents, where documents belong to a collection instead of a table. Obviously, documents can have different lengths and contain different text, which can be conveniently stored and accessed in a collection in a NoSQL database.
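A minimal sketch with the pymongo driver follows, assuming a MongoDB server running locally; the database name, collection name, and documents are hypothetical:

from pymongo import MongoClient

# connect to a local MongoDB instance (assumes mongod is running on port 27017)
client = MongoClient("mongodb://localhost:27017/")
db = client["store"]
reviews = db["product_reviews"]   # a collection, not a fixed-schema table

# documents in the same collection can contain different fields
reviews.insert_one({"product": "laptop", "rating": 5, "text": "Great battery"})
reviews.insert_one({"product": "phone", "rating": 4})

# query the collection
for doc in reviews.find({"rating": {"$gte": 4}}):
    print(doc)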

DEALING WITH DATA: WHAT CAN GO WRONG?

In a perfect world, all datasets are in pristine condition, with no extreme values, no missing values, and no erroneous values. Every feature value is captured correctly, with no chance for any confusion. Moreover, no conversion is required between date formats, currency values, or languages because of the one universal standard that defines the correct formats and acceptable values for every possible set of data values.

However, you cannot rely on the scenarios in the previous paragraph, which is the reason for the techniques that are discussed in this chapter. Even after you manage to create a wonderfully clean and robust dataset, other issues can arise, such as data drift, which is described later in this chapter.

In fact, the task of cleaning data is not necessarily complete even after a machine learning model is deployed to a production environment. For instance, an online system that gathers terabytes or petabytes of data on a daily basis can contain skewed values that in turn adversely affect the performance of the model. Such adverse effects can be revealed through the changes in the metrics that are associated with the production model.

Datasets

In simple terms, a dataset is a source of data (such as a text file) that contains rows and columns of data. Each row is typically called a “data point,” and each column is called a “feature.” A dataset can be a CSV (comma-separated values) file, a TSV (tab-separated values) file, an Excel spreadsheet, a table in an RDBMS, a document in a NoSQL database, the output from a Web service, and so forth.

Note that a static dataset consists of fixed data. For example, a CSV file that contains the states of the United States is a static dataset. A slightly different example involves a product table that contains information about the products that customers can buy from a company. Such a table is static if no new products are added to the table. Discontinued products are probably maintained as historical data that can appear in product-related reports.

By contrast, a dynamic dataset consists of data that changes over a period of time. Simple examples include housing prices, stock prices, and time-based data from IoT devices.

A dataset can vary from very small (perhaps a few features and 100 rows) to very large (more than 1,000 features and more than one million rows). If you are unfamiliar with the problem domain for a particular dataset, then you might struggle to determine its most important features. In this situation, you consult a “domain expert” who understands the importance of the features, their interdependencies (if any), and whether or not the data values for the features are valid. In addition, there are algorithms (called dimensionality reduction algorithms) that can help you determine the most important features, such as PCA (Principal Component Analysis).

Before delving into topics such as data preprocessing, data types, and so forth, let's take a brief detour to introduce the concept of feature importance.

As you will see, someone needs to analyze the dataset to determine which features are the most important and which features can be safely ignored in order to train a model with the given dataset. A dataset can contain various data types, such as:

• audio data

• image data

• numeric data

• text-based data

• video data

• combinations of the above

In this book, we’ll only consider datasets that contain columns with numeric or text-based data types, which can be further classified as follows:

• nominal (string-based or numeric)

• ordinal (ordered values)

• categorical (enumeration)

• interval (positive/negative values)

• ratio (nonnegative values)

The next section contains brief descriptions of the data types that are in the preceding bullet list.
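Before those descriptions, here is a small sketch (with hypothetical values) of how nominal and ordinal data can be represented in Pandas by means of categorical types:

import pandas as pd

# nominal data: categories with no inherent order
colors = pd.Series(["red", "green", "blue", "green"], dtype="category")

# ordinal data: categories with a meaningful order
sizes = pd.Categorical(
    ["small", "large", "medium", "small"],
    categories=["small", "medium", "large"],
    ordered=True,
)

print(colors.cat.categories)     # the distinct nominal values
print(sizes.min(), sizes.max())  # small large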

EXPLANATION OF DATA TYPES

This section contains subsections that provide brief descriptions about the following data types:

• binary data

• nominal data

• ordinal data

• categorical data

• interval data

• ratio data

Later you will learn about the difference between continuous data versus discrete data, as well as the difference between qualitative data versus quantitative data. In addition, the Pandas