This book is for developers seeking an overview of basic concepts in Natural Language Processing (NLP). It caters to those with varied technical backgrounds, offering numerous code samples and listings to illustrate the wide range of topics covered. The journey begins with managing data relevant to NLP, followed by two chapters on fundamental NLP concepts. This foundation is reinforced with Python code samples that bring these concepts to life.
The book then delves into practical NLP applications, such as sentiment analysis, recommender systems, COVID-19 analysis, spam detection, and chatbots. These examples provide real-world context and demonstrate how NLP techniques can be applied to solve common problems. The final chapter introduces advanced topics, including the Transformer architecture, BERT-based models, and the GPT family, highlighting the latest state-of-the-art developments in the field.
Appendices offer additional resources, including Python code samples on regular expressions and probability/statistical concepts, ensuring a well-rounded understanding. Companion files with source code and figures enhance the learning experience, making this book a comprehensive guide for mastering NLP techniques and applications.
NATURAL LANGUAGE PROCESSING FUNDAMENTALS FOR DEVELOPERS
LICENSE, DISCLAIMER OF LIABILITY, AND LIMITED WARRANTY
By purchasing or using this book and its companion files (the “Work”), you agree that this license grants permission to use the contents contained herein, but does not give you the right of ownership to any of the textual content in the book or ownership to any of the information, files, or products contained in it. This license does not permit uploading of the Work onto the Internet or on a network (of any kind) without the written consent of the Publisher. Duplication or dissemination of any text, code, simulations, images, etc. contained herein is limited to and subject to licensing terms for the respective products, and permission must be obtained from the Publisher or the owner of the content, etc., in order to reproduce or network any portion of the textual material (in any media) that is contained in the Work.
MERCURY LEARNING AND INFORMATION (“MLI” or “the Publisher”) and anyone involved in the creation, writing, production, accompanying algorithms, code, or computer programs (“the software”), and any accompanying Web site or software of the Work, cannot and do not warrant the performance or results that might be obtained by using the contents of the Work. The author, developers, and the Publisher have used their best efforts to insure the accuracy and functionality of the textual material and/or programs contained in this package; we, however, make no warranty of any kind, express or implied, regarding the performance of these contents or programs. The Work is sold “as is” without warranty (except for defective materials used in manufacturing the book or due to faulty workmanship).
The author, developers, and the publisher of any accompanying content, and anyone involved in the composition, production, and manufacturing of this work will not be liable for damages of any kind arising out of the use of (or the inability to use) the algorithms, source code, computer programs, or textual material contained in this publication. This includes, but is not limited to, loss of revenue or profit, or other incidental, physical, or consequential damages arising out of the use of this Work.
The sole remedy in the event of a claim of any kind is expressly limited to replacement of the book and only at the discretion of the Publisher. The use of “implied warranty” and certain “exclusions” vary from state to state, and might not apply to the purchaser of this product.
Companion files are also available for download from the publisher by writing to [email protected].
NATURAL LANGUAGE PROCESSING FUNDAMENTALS FOR DEVELOPERS
OSWALD CAMPESATO
MERCURY LEARNING AND INFORMATION
Dulles, Virginia
Boston, Massachusetts
New Delhi
Copyright ©2021 by MERCURY LEARNING AND INFORMATION LLC. All rights reserved.
This publication, portions of it, or any accompanying software may not be reproduced in any way, stored in a retrieval system of any type, or transmitted by any means, media, electronic display or mechanical display, including, but not limited to, photocopy, recording, Internet postings, or scanning, without prior permission in writing from the publisher.
Publisher: David Pallai
MERCURY LEARNING AND INFORMATION
22841 Quicksilver Drive
Dulles, VA
[email protected]
O. Campesato. Natural Language Processing Fundamentals for Developers.
ISBN: 978-1-68392-657-3
The publisher recognizes and respects all marks used by companies, manufacturers, and developers as a means to distinguish their products. All brand names and product names mentioned in this book are trademarks or service marks of their respective companies. Any omission or misuse (of any kind) of service marks or trademarks, etc. is not an attempt to infringe on the property of others.
Library of Congress Control Number: 2021939603
21 22 23 321    Printed on acid-free paper in the United States of America.
Our titles are available for adoption, license, or bulk purchase by institutions, corporations, etc. For additional information, please contact the Customer Service Dept. at 800-232-0223 (toll free).
All of our titles are available in digital format at academiccourseware.com and other digital vendors. Companion files for this title are available by writing to the publisher at [email protected]. The sole obligation of MERCURY LEARNING AND INFORMATION to the purchaser is to replace the book, based on defective materials or faulty workmanship, but not based on the operation or functionality of the product.
I’d like to dedicate this book to my parents – may this bring joy and happiness into their lives.
CONTENTS
Preface
Chapter 1: Working with Data
What are Datasets?
Data Types
Preparing Datasets
Missing Data, Anomalies, and Outliers
What is Imbalanced Classification?
What is SMOTE?
Analyzing Classifiers (Optional)
The Bias-Variance Trade-Off
Summary
Chapter 2: NLP Concepts (I)
The Origin of Languages
The Complexity of Natural Languages
Japanese Grammar
Phonetic Languages
Multiple Ways to Pronounce Consonants
English Pronouns and Prepositions
What is NLP?
A Wide-Angle View of NLP
Information Extraction and Retrieval
Word Sense Disambiguation
NLP Techniques in ML
Text Normalization and Tokenization
Handling Stop Words
What is Stemming?
What is Lemmatization?
Working with Text: POS
Working with Text: NER
What is Topic Modeling?
Keyword Extraction, Sentiment Analysis, and Text Summarization
Summary
Chapter 3: NLP Concepts (II)
What is Word Relevance?
What is Text Similarity?
Sentence Similarity
Working with Documents
Techniques for Text Similarity
What is Text Encoding?
Text Encoding Techniques
The BoW Algorithm
What are N-Grams?
Calculating tf, idf, and tf-idf
The Context of Words in a Document
What is Cosine Similarity?
Text Vectorization (A.K.A. Word Embeddings)
Overview of Word Embeddings and Algorithms
What is Word2vec?
The CBoW Architecture
What are Skip-grams?
What is GloVe?
Working with GloVe
What is FastText?
Comparison of Word Embeddings
What is Topic Modeling?
Language Models and NLP
Vector Space Models
NLP and Text Mining
Relation Extraction and Information Extraction
What is a BLEU Score?
Summary
Chapter 4: Algorithms and Toolkits (I)
What is NLTK?
NLTK and BoW
NLTK and Stemmers
NLTK and Lemmatization
NLTK and Stop Words
What is WordNet?
NLTK, lxml, and XPath
NLTK and N-Grams
NLTK and POS (I)
NLTK and POS (II)
NLTK and Tokenizers
NLTK and Context-Free Grammars (Optional)
What is Gensim?
An Example of Topic Modeling
A Brief Comparison of Popular Python-Based NLP Libraries
Miscellaneous Libraries
Summary
Chapter 5: Algorithms and Toolkits (II)
Cleaning Data with Regular Expressions
Handling Contracted Words
Python Code Samples of BoW
One-Hot Encoding Examples
Sklearn and Word Embedding Examples
What is BeautifulSoup?
Web Scraping with Pure Regular Expressions
What is Scrapy?
What is SpaCy?
SpaCy and Stop Words
SpaCy and Tokenization
SpaCy and Lemmatization
SpaCy and NER
SpaCy Pipelines
SpaCy and Word Vectors
The ScispaCy Library (Optional)
Summary
Chapter 6: NLP Applications
What is Text Summarization?
Text Summarization with Gensim and SpaCy
What are Recommender Systems?
Content-Based Recommendation Systems
Collaborative Filtering Algorithm
Recommender Systems and Reinforcement Learning (Optional)
What is Sentiment Analysis?
Sentiment Analysis with Naïve Bayes
Sentiment Analysis with VADER and NLTK
Sentiment Analysis with Textblob
Sentiment Analysis with Flair
Detecting Spam
Logistic Regression and Sentiment Analysis
Working with COVID-19
What are Chatbots?
Summary
Chapter 7: Transformer, BERT, and GPT
What is Attention?
An Overview of the Transformer Architecture
What is T5?
What is BERT?
The Inner Workings of BERT
Subword Tokenization
Sentence Similarity in BERT
Generating BERT Tokens (1)
Generating BERT Tokens (2)
The BERT Family
Introduction to GPT
Working with GPT-2
What is GPT-3?
The Switch Transformer: One Trillion Parameters
Looking Ahead
Summary
Appendix A: Introduction to Regular Expressions
Appendix B: Introduction to Probability and Statistics
Index
PREFACE
WHAT IS THE PRIMARY VALUE PROPOSITION FOR THIS BOOK?
This book contains a fast-paced introduction to as much relevant information about NLP as possible that can be reasonably included in a book of this size. Some chapters contain topics that are discussed in great detail (such as the first half of Chapter 2), and other chapters contain advanced statistical concepts that you can safely omit during your first pass through this book. This book casts a wide net to help developers who have a wide range of technical backgrounds, which is the rationale for the inclusion of a plethora of topics. Regardless of your background, please keep in mind the following point: you will probably need to read some of the content in this book multiple times.
However, you will be exposed to many NLP topics, and many topics are presented in a cursory manner for two reasons. First, it’s important that you be exposed to these concepts. In some cases, you will find topics that might pique your interest, and hence motivate you to learn more about them through self-study; in other cases, you will probably be satisfied with a brief introduction. Hence, you will decide whether or not to delve into more detail regarding the topics in this book.
Second, a full treatment of all the topics covered in this book would probably quadruple its size, and few people are interested in reading 1,000-page technical books. Hence, this book provides a broad view of the NLP landscape, based on the belief that this approach will be more beneficial for experienced developers who want to learn about NLP.
However, it’s important for you to decide if this approach is suitable for your needs and learning style: if not, you can select one or more of the plethora of NLP books that are available.
THE TARGET AUDIENCE
This book is intended primarily for people who have a solid background as software developers. Specifically, it is for developers who are accustomed to searching online for more detailed information about technical topics. If you are a beginner, there are other books that are more suitable for you, and you can find them by performing an online search.
This book is also intended to reach an international audience of readers with highly diverse backgrounds in various age groups. While many readers know how to read English, their native spoken language is not English. Consequently, this book uses standard English rather than colloquial expressions that might be confusing to those readers. As you know, many people learn by different types of imitation, which includes reading, writing, or hearing new material. This book takes these points into consideration in order to provide a comfortable and meaningful learning experience for the intended readers.
WHY SUCH A MASSIVE NUMBER OF TOPICS IN THIS BOOK?
As mentioned in the response to the previous question, this book is intended for developers who want to learn NLP concepts. Because this encompasses people with vastly different technical backgrounds, there are readers who “don’t know what they don’t know” regarding NLP. Therefore, this book exposes people to a plethora of NLP-related concepts, after which they can decide which topics to select for further study. Consequently, the book does not have a “zero-to-hero” approach, nor is it necessary to master all the topics that are discussed in the chapters and the appendices; rather, they are a go-to source of information to help you decide where you want to invest your time and effort.
As you might already know, learning often takes place through an iterative and repetitive approach whereby the cumulative exposure leads to a greater level of comfort and understanding of technical concepts. For some readers, this will be the first step in their journey toward mastering NLP.
HOW IS THE BOOK ORGANIZED AND WHAT WILL I LEARN?
The first chapter shows you various details of managing data that are relevant for NLP. The next pair of chapters contain NLP concepts, followed by another pair of chapters that contain Python code samples which illustrate the NLP concepts.
Chapter 6 explores sentiment analysis, recommender systems, COVID-19 analysis, spam detection, and a short discussion regarding chatbots. The final chapter presents the Transformer architecture, BERT-based models, and the GPT family of models, all of which were developed during the past three years and are considered SOTA (“state of the art”) to varying degrees.
The appendices contain introductory material (including Python code samples) for various topics, including Regular Expressions and statistical concepts.
WHY ARE THE CODE SAMPLES PRIMARILY IN PYTHON?
Most of the code samples are short (usually less than one page and sometimes less than half a page), and if need be, you can easily and quickly copy/paste the code into a new Jupyter notebook.
If you do decide to use Google Colaboratory, you can easily copy/paste the Python code into a notebook, and also use the upload feature to upload existing Jupyter notebooks. Keep in mind the following point: if the Python code references a CSV file, make sure that you include the appropriate code snippet (as explained in Chapter 1) to access the CSV file in the corresponding Jupyter notebook in Google Colaboratory.
HOW WERE THE CODE SAMPLES CREATED?
The code samples in this book were created and tested using Python 3 on a MacBook Pro with OS X 10.15 (macOS Catalina). Regarding their content: the code samples are derived primarily from the author’s Natural Language Processing graduate course. In some cases, there are code samples that incorporate short sections of code from discussions in online forums. The key point to remember is that the code samples follow the “Four Cs”: they must be Clear, Concise, Complete, and Correct to the extent that it’s possible to do so, given the size of this book.
GETTING THE MOST FROM THIS BOOK
Some programmers learn well from prose, others learn well from sample code (and lots of it), which means that there’s no single style that can be used for everyone.
Moreover, some programmers want to run the code first, see what it does, and then return to the code to delve into the details (and others use the opposite approach).
Consequently, there are various types of code samples in this book: some are short, some are long, and other code samples “build” from earlier code samples.
WHAT DO I NEED TO KNOW FOR THIS BOOK?
Current knowledge of Python 3.x is the most helpful skill. Knowledge of other programming languages (such as Java) can also be helpful because of the exposure to programming concepts and constructs. The less technical knowledge that you have, the more diligence will be required in order to understand the various topics that are covered.
If you want to be sure that you can grasp the material in this book, glance through some of the code samples to get an idea of how much is familiar to you and how much is new for you.
DOES THIS BOOK CONTAIN PRODUCTION-LEVEL CODE SAMPLES?
The primary purpose of the code samples in this book is to show you Python-based libraries for solving a variety of NLP-related tasks. Clarity has higher priority than writing more compact code that is more difficult to understand (and possibly more prone to bugs). If you decide to use any of the code in this book in a production Website, you ought to subject that code to the same rigorous analysis as the other parts of your code base.
WHAT ARE THE NON-TECHNICAL PREREQUISITES FOR THIS BOOK?
Although the answer to this question is more difficult to quantify, it’s important to have a strong desire to learn about NLP, along with the motivation and discipline to read and understand the code samples.
Even simple APIs can be challenging to understand the first time you encounter them, so be prepared to read the code samples several times.
HOW DO I SET UP A COMMAND SHELL?
If you are a Mac user, there are three ways to do so. The first method is to use Finder to navigate to Applications > Utilities and then double click on the Terminal application. Second, if you already have a command shell available, you can launch a new command shell by typing the following command:
open /Applications/Utilities/Terminal.app
A third method for Mac users is to open a new command shell on a MacBook from a command shell that is already visible simply by pressing command+n in that command shell, and your Mac will launch another command shell.
If you are a PC user, you can install Cygwin (an open source toolkit available at https://cygwin.com/), which simulates bash commands, or use another toolkit such as MKS (a commercial product). Please read the online documentation that describes the download and installation process. Note that custom aliases are not automatically set if they are defined in a file other than the main start-up file (such as .bash_login).
COMPANION FILES
All the code samples and figures in this book may be obtained by writing to the publisher at [email protected].
WHAT ARE THE “NEXT STEPS” AFTER FINISHING THIS BOOK?
The answer to this question varies widely, mainly because the answer depends heavily on your objectives. If you are interested primarily in NLP, then you can learn more advanced concepts, such as attention, transformers, and the BERT-related models.
If you are primarily interested in machine learning, there are some subfields of machine learning, such as deep learning and reinforcement learning (and deep reinforcement learning) that might appeal to you. Fortunately, there are many resources available, and you can perform an Internet search for those resources. One other point: the aspects of machine learning for you to learn depend on who you are: the needs of a machine learning engineer, data scientist, manager, student, or software developer are all different.
O. CampesatoMay 2021
CHAPTER 1
WORKING WITH DATA
This chapter introduces you to the data types (along with their differences), how to scale data values, and various techniques for handling missing data values. If most of the material in this chapter is new to you, be assured that it’s not necessary to understand everything in this chapter. It’s still a good idea to read as much material as you can, and perhaps return to this chapter again after you have completed some of the other chapters in this book.
The first part of this chapter contains an overview of different types of data and an explanation of how to normalize and standardize a set of numeric values by calculating the mean and standard deviation of a set of numbers. You will see how to map categorical data to a set of integers and how to perform one-hot encoding.
The second part of this chapter discusses missing data, outliers, and anomalies, and also some techniques for handling these scenarios. The third section discusses imbalanced data and the use of SMOTE (Synthetic Minority Oversampling Technique) to deal with imbalanced classes in a dataset.
The fourth section discusses ways to evaluate classifiers such as LIME and ANOVA. This section also contains details regarding the bias-variance trade-off and various types of statistical bias.
WHAT ARE DATASETS?
In simple terms, a dataset is a source of data (such as a text file) that contains rows and columns of data. Each row is typically called a “data point,” and each column is called a “feature.” A dataset can be in any form: CSV (comma separated values), TSV (tab separated values), Excel spreadsheet, a table in an RDBMS (Relational Database Management System), a document in a NoSQL database, or the output from a Web service. Someone needs to analyze the dataset to determine which features are the most important and which features can be safely ignored in order to train a model with the given dataset.
A dataset can vary from very small (a couple of features and 100 rows) to very large (more than 1,000 features and more than one million rows). If you are unfamiliar with the problem domain, then you might struggle to determine the most important features in a large dataset. In this situation, you might need a domain expert who understands the importance of the features, their interdependencies (if any), and whether the data values for the features are valid. In addition, there are algorithms (called dimensionality reduction algorithms) that can help you determine the most important features. For example, PCA (Principal Component Analysis) is one such algorithm, which is discussed in more detail later in this chapter.
Data Preprocessing
Data preprocessing is the initial step of validating the contents of a dataset, which includes making decisions about missing and incorrect data values, such as
•dealing with missing data values
•cleaning “noisy” text-based data
•removing HTML tags
•removing emoticons
•dealing with emojis/emoticons
•filtering data
•grouping data
•handling currency and date formats (i18n)
Cleaning data is an important initial task that involves removing unwanted data as well as handling missing data. In the case of text-based data, you might need to remove HTML tags, punctuation, and so forth. In the case of numeric data, it’s less likely (though still possible) that alphabetic characters are mixed together with numeric data. However, a dataset with numeric features might have incorrect values or missing values (discussed later). In addition, calculating the minimum, maximum, mean, median, and standard deviation of the values of a feature obviously pertains only to numeric values.
After the preprocessing step is completed, data wrangling is performed, which refers to transforming data into a new format. You might have to combine data from multiple sources into a single dataset. For example, you might need to convert between different units of measurement (such as date formats or currency values) so that the data values can be represented in a consistent manner in a dataset.
Currency and date values are part of i18n (internationalization), whereas l10n (localization) targets a specific nationality, language, or region. Hard-coded values (such as text strings) can be stored as resource strings in a file that’s often called a resource bundle, where each string is referenced via a code. Each language has its own resource bundle.
DATA TYPES
Explicit data types exist in many programming languages such as C, C++, Java, and TypeScript. Some programming languages, such as JavaScript and awk, do not require initializing variables with an explicit type: the type of a variable is inferred dynamically via an implicit type system (i.e., one that is not directly exposed to a developer).
In machine learning, datasets can contain features that have different types of data, such as a combination of one or more of the following:
•numeric data (integer/floating point and discrete/continuous)
•character/categorical data (different languages)
•date-related data (different formats)
•currency data (different formats)
•binary data (yes/no, 0/1, and so forth)
•nominal data (multiple unrelated values)
•ordinal data (multiple and related values)
Consider a dataset that contains real estate data, which can have as many as thirty columns (or even more), often with the following features:
•the number of bedrooms in a house: numeric value and a discrete value
•the number of square feet: a numeric value and (probably) a continuous value
•the name of the city: character data
•the construction date: a date value
•the selling price: a currency value and probably a continuous value
•the “for sale” status: binary data (either “yes” or “no”)
An example of nominal data is the seasons in a year: although many countries have four distinct seasons, some countries have only two distinct seasons. However, seasons can be associated with different temperature ranges (summer versus winter). An example of ordinal data is an employee pay grade: 1=entry level, 2=one year of experience, and so forth. Another example of nominal data is a set of colors, such as {Red, Green, Blue}.
An example of binary data is the pair {Male, Female}, and some datasets contain a feature with these two values. If such a feature is required for training a model, first convert {Male, Female} to a numeric counterpart, such as {0, 1}. Similarly, if you need to include a feature whose values are the previous set of colors, you can replace {Red, Green, Blue} with the values {0, 1, 2}.
PREPARING DATASETS
If you have the good fortune to inherit a dataset that is in pristine condition, then data cleaning tasks (discussed later) are vastly simplified: in fact, it might not be necessary to perform any data cleaning for the dataset. On the other hand, if you need to create a dataset that combines data from multiple datasets that contain different formats for dates and currency, then you need to perform a conversion to a common format.
If you need to train a model that includes features that have categorical data, then you need to convert that categorical data to numeric data. For instance, the Titanic dataset contains a feature called “gender,” which is either male or female. Later in this chapter, we show how to “map” male to 0 and female to 1 using Pandas.
Discrete Data Versus Continuous Data
As a simple rule of thumb: discrete data is a set of values that can be counted, whereas continuous data must be measured. Discrete data can reasonably fit in a drop-down list of values, but there is no exact value for making such a determination. One person might think that a list of 500 values is discrete, whereas another person might think it’s continuous.
For example, the list of provinces of Canada and the list of states of the United States are discrete data values, but is the same true for the number of countries in the world (roughly 200) or for the number of languages in the world (more than 7,000)?
Values for temperature, humidity, and barometric pressure are considered continuous. Currency is also treated as continuous, even though there is a measurable difference between two consecutive values. The smallest unit of U.S. currency is one penny, which is 1/100th of a dollar (accounting-based measurements use the “mil,” which is 1/1,000th of a dollar).
Continuous data types can have subtle differences. For example, someone who is 200 centimeters tall is twice as tall as someone who is 100 centimeters tall; the same is true for 100 kilograms versus 50 kilograms. However, temperature is different: 80 degrees Fahrenheit is not twice as hot as 40 degrees Fahrenheit.
Furthermore, keep in mind that the meaning of the word “continuous” in mathematics is not necessarily the same as continuous in machine learning. In the former, a continuous variable (let’s say in the 2D Euclidean plane) can have an uncountably infinite number of values. A feature in a dataset that can have more values than can be reasonably displayed in a drop-down list is treated as though it’s a continuous variable.
For instance, values for stock prices are discrete: they must differ by at least a penny (or some other minimal unit of currency), which is to say, it’s meaningless to say that the stock price changes by one-millionth of a penny. However, since there are so many possible stock values, it’s treated as a continuous variable. The same comments apply to car mileage, ambient temperature, and barometric pressure.
“Binning” Continuous Data
Binning refers to subdividing a set of values into multiple intervals, and then treating all the numbers in the same interval as though they had the same value.
As a simple example, suppose that a feature in a dataset contains the age of people in a dataset. The range of values is approximately between 0 and 120, and we could bin them into 12 equal intervals, where each consists of 10 values: 0 through 9, 10 through 19, 20 through 29, and so forth.
However, partitioning the values of people’s ages as described in the preceding paragraph can be problematic. Suppose that person A, person B, and person C are 29, 30, and 39, respectively. Then person A and person B are probably more similar to each other than person B and person C, but because of the way in which the ages are partitioned, B is classified as closer to C than to A. In fact, binning can increase Type I errors (false positive) and Type II errors (false negative), as discussed in this blog post (along with some alternatives to binning):
https://medium.com/@peterflom/why-binning-continuous-data-is-almost-always-a-mistake-ad0b3a1d141f.
As another example, using quartiles is even more coarse-grained than the earlier age-related binning example. The issue with binning pertains to the consequences of classifying people in different bins, even though they are in close proximity to each other. For instance, some people struggle financially because they earn a meager wage, and they are disqualified from financial assistance because their salary is higher than the cutoff point for receiving any assistance.
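The following code block is a minimal sketch of the binning approach described above, using Pandas (the column name and the age values are hypothetical):

import pandas as pd

# hypothetical dataset with an "age" feature
df = pd.DataFrame({"age": [3, 17, 29, 30, 39, 64, 88, 102]})

# partition the ages into 12 equal-width bins: 0-9, 10-19, ..., 110-119
bins = list(range(0, 130, 10))
labels = ["%d-%d" % (lo, lo + 9) for lo in bins[:-1]]
df["age_bin"] = pd.cut(df["age"], bins=bins, labels=labels, right=False)
print(df)

Notice that the ages 29 and 30 land in different bins even though they differ by only one year, which illustrates the drawback of binning described in this section.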
Scaling Numeric Data via Normalization
The range of values for a feature can vary significantly, and those values often need to be scaled to a smaller range, such as values in the range [−1, 1] or [0, 1], which you can do via the tanh function or the sigmoid function, respectively.
For example, measuring a person’s height in terms of meters involves a range of values between 0.50 meters and 2.5 meters (in the vast majority of cases), whereas measuring height in terms of centimeters ranges between 50 centimeters and 250 centimeters: these two units differ by a factor of 100. A person’s weight in kilograms generally varies between 5 kilograms and 200 kilograms, whereas measuring weight in grams differs by a factor of 1,000. Distances between objects can be measured in meters or in kilometers, which also differ by a factor of 1,000.
In general, use units of measure so that the data values in multiple features belong to a similar range of values. In fact, some machine learning algorithms require scaled data, often in the range of [0, 1] or [−1, 1]. In addition to the tanh and sigmoid function, there are other techniques for scaling data, such as standardizing data (think Gaussian distribution) and normalizing data (linearly scaled so that the new range of values is in [0, 1]).
The following examples involve a floating point variable X with different ranges of values that will be scaled so that the new values are in the interval [0, 1].
•Example 1: If the values of X are in the range [0, 2], then X/2 is in the range [0, 1].
•Example 2: If the values of X are in the range [3, 6], then X − 3 is in the range [0, 3], and (X − 3)/3 is in the range [0, 1].
•Example 3: If the values of X are in the range [−10, 20], then X + 10 is in the range [0, 30], and (X + 10)/30 is in the range of [0, 1].
In general, suppose that X is a random variable whose values are in the range [a,b], where a < b. You can scale the data values by performing two steps:
Step 1: X − a is in the range [0, b − a]
Step 2: (X − a)/(b − a) is in the range [0, 1]
If X is a random variable that has the values {x1,x2,x3,...,xn}, then the formula for normalization involves mapping each xi value to (xi – min)/(max–min), where min is the minimum value of X and max is the maximum value of X.
As a simple example, suppose that the random variable X has the values {-1,0,1}. Then min and max are −1 and 1, respectively, and the normalization of {-1,0,1} is the set of values {(-1-(-1))/2,(0-(-1))/2, (1-(-1))/2}, which equals {0,1/2,1}.
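The following code block is a minimal sketch that normalizes the preceding set of values both manually and via the MinMaxScaler class in Sklearn (which expects a two-dimensional array):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([-1.0, 0.0, 1.0])

# manual min-max normalization: (x - min) / (max - min)
X_norm = (X - X.min()) / (X.max() - X.min())
print(X_norm)                                   # [0.  0.5 1. ]

# the same result via Sklearn
scaler = MinMaxScaler()
print(scaler.fit_transform(X.reshape(-1, 1)).ravel())   # [0.  0.5 1. ]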
Scaling Numeric Data via Standardization
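As mentioned earlier in this chapter, standardizing data rescales the values of a feature so that they have a mean of 0 and a standard deviation of 1 (think Gaussian distribution), by mapping each value x to (x − mean)/std. The following code block is a minimal sketch (with illustrative values) that standardizes a feature both manually and via the StandardScaler class in Sklearn:

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# manual standardization: (x - mean) / standard deviation
X_std = (X - X.mean()) / X.std()
print(X_std)                                    # mean is 0, standard deviation is 1

# the same result via Sklearn (which expects a two-dimensional array)
scaler = StandardScaler()
print(scaler.fit_transform(X.reshape(-1, 1)).ravel())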
What to Look for in Categorical Data
This section contains various suggestions for handling inconsistent data values, and you can determine which ones to adopt based on any additional factors that are relevant to your particular task. For example, consider dropping columns that have very low cardinality (equal to or close to 1), as well as numeric columns with zero or very low variance.
Next, check the contents of categorical columns for inconsistent spellings or errors. A good example pertains to the gender category, which can consist of a combination of the following values:
male
Male
female
Female
m
f
M
F
The preceding categorical values for gender can be replaced with two categorical values (unless you have a valid reason to retain some of the other values). Moreover, if you are training a model whose analysis involves a single gender, then you need to determine which rows (if any) of a dataset must be excluded. Also check categorical data columns for redundant or missing white spaces.
Check for columns whose values have multiple data types, such as a numeric column in which some numbers are stored as numeric values and others as strings or objects. Ensure consistent data formats (numbers stored consistently as integers or as floating point numbers), and ensure that dates have the same format (for example, do not mix the mm/dd/yyyy date format with another date format, such as dd/mm/yyyy).
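The following code block is a minimal sketch (the column name and values are hypothetical) that uses Pandas to strip white spaces and map the inconsistent gender values shown above to two categorical values:

import pandas as pd

# hypothetical column with inconsistent spellings and stray white spaces
df = pd.DataFrame({"gender": ["male", "Male ", " f", "F", "M", "female"]})

# strip white spaces, convert to lowercase, and map every variant to M or F
canonical = {"male": "M", "m": "M", "female": "F", "f": "F"}
df["gender"] = df["gender"].str.strip().str.lower().map(canonical)
print(df["gender"].unique())                    # ['M' 'F']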
Mapping Categorical Data to Numeric Values
Character data is often called categorical data, examples of which include people’s names, home or work addresses, and email addresses. Many types of categorical data involve short lists of values. For example, the days of the week and the months in a year involve seven and twelve distinct values, respectively. Notice that the days of the week have a relationship: For example, each day has a previous day and a next day. However, the colors of an automobile are independent of each other: the color red is not “better” or “worse” than the color blue.
There are several well-known techniques for mapping categorical values to a set of numeric values. A simple example where you need to perform this conversion involves the gender feature in the Titanic dataset. This feature is one of the relevant features for training a machine learning model. The gender feature has {M, F} as its set of possible values. As you will see later in this chapter, Pandas makes it very easy to convert the set of values {M, F} to the set of values {0, 1}.
Another mapping technique involves mapping a set of categorical values to a set of consecutive integer values. For example, the set {Red, Green, Blue} can be mapped to the set of integers {0, 1, 2}. The set {Male, Female} can be mapped to the set of integers {0, 1}. The days of the week can be mapped to {0, 1, 2, 3, 4, 5, 6}. Note that the first day of the week depends on the country: In some cases it’s Sunday, and in other cases it’s Monday.
Another technique is called one-hot encoding, which converts each value to a vector (check Wikipedia if you need a refresher regarding vectors). Thus, {Male, Female} can be represented by the vectors [1,0] and [0,1], and the colors {Red, Green, Blue} can be represented by the vectors [1,0,0], [0,1,0], and [0,0,1]. If you vertically “line up” the two vectors for gender, they form a 2 × 2 identity matrix, and doing the same for the colors forms a 3 × 3 identity matrix, as shown here:
[1,0,0]
[0,1,0]
[0,0,1]
If you are familiar with matrices, you probably noticed that the preceding set of vectors looks like the 3 × 3 identity matrix. In fact, this technique generalizes in a straightforward manner. Specifically, if you have n distinct categorical values, you can map each of those values to one of the vectors in an n × n identity matrix.
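The following code block is a minimal sketch that performs one-hot encoding on a hypothetical color column via the get_dummies() method in Pandas; each row of the result is one of the vectors shown above:

import pandas as pd

# hypothetical feature with three categorical values
df = pd.DataFrame({"color": ["Red", "Green", "Blue", "Green"]})

# one column per category; each row contains a single 1 and the rest 0
one_hot = pd.get_dummies(df["color"])
print(one_hot)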
As another example, the set of titles {"Intern", "Junior", "Mid-Range", "Senior", "Project Leader", "Dev Manager"} has a hierarchical relationship in terms of salaries. Another set of categorical data involves the seasons of the year: {"Spring", "Summer", "Autumn", "Winter"}, and while these values are generally independent of each other, there are cases in which the season is significant. For example, the values for the monthly rainfall, average temperature, crime rate, or foreclosure rate can depend on the season, month, week, or even the day of the year.
If a feature has a large number of categorical values, then one-hot encoding will produce many additional columns for each data point. Since the majority of the values in the new columns equal 0, this can increase the sparsity of the dataset, which in turn can result in more overfitting and hence adversely affect the accuracy of machine learning algorithms that you adopt during the training process.
One alternative is a sequence-based approach in which N categories are mapped to the integers 1, 2, . . . , N. Another approach involves examining the row frequency of each categorical value. For example, suppose that N equals 20, and there are three categorical values that occur in 95% of the values for a given feature. You can try the following:
1.Assign the values 1, 2, and 3 to those three categorical values.
2.Assign numeric values that reflect the relative frequency of those categorical values.
3.Assign the category “OTHER” to the remaining categorical values.
4.Delete the rows whose categorical values belong to the 5%.
Working with Dates
The format for a calendar date varies among different countries, and this belongs to something called localization of data (not to be confused with i18n, which is data internationalization). Some examples of date formats are shown as follows (and the first four are probably the most common):
MM/DD/YY
MM/DD/YYYY
DD/MM/YY
DD/MM/YYYY
YY/MM/DD
M/D/YY
D/M/YY
YY/M/D
MMDDYY
DDMMYY
YYMMDD
If you need to combine data from datasets that contain different date formats, then converting the disparate date formats to a single common date format will ensure consistency.
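The following code block is a minimal sketch (the date strings are hypothetical) that uses Pandas to convert two of the preceding date formats to a single common format:

import pandas as pd

# hypothetical date strings in the MM/DD/YYYY and DD/MM/YYYY formats
us_dates = pd.to_datetime(pd.Series(["03/15/2021", "12/01/2021"]), format="%m/%d/%Y")
eu_dates = pd.to_datetime(pd.Series(["15/03/2021", "01/12/2021"]), format="%d/%m/%Y")

# both Series now contain the same datetime values and can be combined safely
print(us_dates.equals(eu_dates))                # True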
Working with Currency
The format for currency depends on the country, which includes different interpretations for a “,” and “.” in currency (and decimal values in general). For example, 1,124.78 equals “one thousand one hundred twenty-four point seven eight” in the United States, whereas 1.124,78 has the same meaning in Europe (i.e., the “.” symbol and the “,” symbol are interchanged).
If you need to combine data from datasets that contain different currency formats, then you probably need to convert all the disparate currency formats to a single common currency format. There is another detail to consider: currency exchange rates can fluctuate on a daily basis, which in turn can affect the calculation of taxes, late fees, and so forth. Although you might be fortunate enough where you won’t have to deal with these issues, it’s still worth being aware of them.
MISSING DATA, ANOMALIES, AND OUTLIERS
Although missing data is not directly related to checking for anomalies and outliers, in general you will perform all three of these tasks. Each task involves a set of techniques to help you perform an analysis of the data in a dataset, and the following subsections describe some of those techniques.
Missing Data
How you decide to handle missing data depends on the specific dataset. Here are some ways to handle missing data (the first three techniques are manual techniques, and the other techniques are algorithms):
1.replace missing data with the mean/median/mode value
2.infer (“impute”) the value for missing data
3.delete rows with missing data
4.isolation forest (tree-based algorithm)
5.minimum covariance determinant
6.local outlier factor
7.one-class SVM (Support Vector Machines)
In general, replacing a missing numeric value with zero is a risky choice: this value is obviously incorrect if the values of a feature are between 1,000 and 5,000. For a feature that has numeric values, replacing a missing value with the average value is better than the value zero (unless the average equals zero); also consider using the median value. For categorical data, consider using the mode to replace a missing value.
If you are not confident that you can impute a “reasonable” value, consider dropping the row with a missing value, and then train a model with the imputed value and also with the deleted row.
One problem that can arise after removing rows with missing values is that the resulting dataset is too small. In this case, consider using SMOTE, which is discussed later in this chapter, in order to generate synthetic data.
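The following code block is a minimal sketch (the column names and values are hypothetical) that uses Pandas to replace missing numeric values with the mean, replace missing categorical values with the mode, and drop any remaining rows that contain missing values:

import numpy as np
import pandas as pd

# hypothetical dataset with missing numeric and categorical values
df = pd.DataFrame({"age":  [25, np.nan, 40, 33, np.nan],
                   "city": ["SF", "NY", None, "SF", "SF"]})

# numeric feature: replace missing values with the mean (or the median)
df["age"] = df["age"].fillna(df["age"].mean())

# categorical feature: replace missing values with the mode
df["city"] = df["city"].fillna(df["city"].mode()[0])

# alternatively, drop any rows that still contain missing values
df = df.dropna()
print(df)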
Anomalies and Outliers
In simplified terms, an outlier is an abnormal data value that is outside the range of “normal” values. For example, a person’s height in centimeters is typically between 30 centimeters and 250 centimeters. Hence, a data point (e.g., a row of data in a spreadsheet) with a height of 5 centimeters or a height of 500 centimeters is an outlier. The consequences of these outlier values are unlikely to involve a significant financial or physical loss (though they could adversely affect the accuracy of a trained model).
Anomalies are also outside the “normal” range of values (just like outliers), and they are typically more problematic than outliers: anomalies can have more severe consequences than outliers. For example, consider the scenario in which someone who lives in California suddenly makes a credit card purchase in New York. If the person is on vacation (or a business trip), then the purchase is an outlier (it’s outside the typical purchasing pattern), but it’s not an issue. However, if that person was in California when the credit card purchase was made, then it’s most likely to be credit card fraud, as well as an anomaly.
Unfortunately, there is no simple way to decide how to deal with anomalies and outliers in a dataset. Although you can drop rows that contain outliers, keep in mind that doing so might deprive the dataset—and therefore the trained model—of valuable information. You can try modifying the data values (described as follows), but again, this might lead to erroneous inferences in the trained model. Another possibility is to train a model with the dataset that contains anomalies and outliers, and then train a model with a dataset from which the anomalies and outliers have been removed. Compare the two results and see if you can infer anything meaningful regarding the anomalies and outliers.
Outlier Detection
Although the decision to keep or drop outliers is your decision to make, there are some techniques available that help you detect outliers in a dataset. This section contains a short list of some techniques, along with a very brief description and links for additional information.
Perhaps trimming is the simplest technique (apart from dropping outliers): it involves removing rows whose feature value is in the upper 5% or the lower 5% of the range. Winsorizing the data is an improvement over trimming: set the values in the top 5% equal to the value at the 95th percentile, and set the values in the bottom 5% equal to the value at the 5th percentile.
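The following code block is a minimal sketch of winsorizing (the values are hypothetical) that uses NumPy to clip values below the 5th percentile and above the 95th percentile:

import numpy as np

# hypothetical feature values with an outlier at the high end
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 500])

# winsorizing: clip values below the 5th percentile and above the 95th percentile
low, high = np.percentile(values, [5, 95])
print(np.clip(values, low, high))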
The Minimum Covariance Determinant is a covariance-based technique, and a Python-based code sample that uses this technique is available online:
https://scikit-learn.org/stable/modules/outlier_detection.html.
The Local Outlier Factor (LOF) technique is an unsupervised technique that calculates a local anomaly score via the kNN (k Nearest Neighbor) algorithm. Documentation and short code samples that use LOF are available online:
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LocalOutlierFactor.html.
Two other techniques involve the Huber and the Ridge classes, both of which are included as part of Sklearn. The Huber error is less sensitive to outliers because it’s calculated via the linear loss, similar to the MAE (Mean Absolute Error). A code sample that compares Huber and Ridge is available online:
https://scikit-learn.org/stable/auto_examples/linear_model/plot_huber_vs_ridge.html.
You can also explore the Theil-Sen estimator and RANSAC, which are “robust” against outliers:
https://scikit-learn.org/stable/auto_examples/linear_model/plot_theilsen.html and
https://en.wikipedia.org/wiki/Random_sample_consensus.
Four algorithms for outlier detection are discussed at the following site:
https://www.kdnuggets.com/2018/12/four-techniques-outlier-detection.html.
One other scenario involves “local” outliers. For example, suppose that you use kMeans (or some other clustering algorithm) and determine that a value is an outlier with respect to one of the clusters. While this value is not necessarily an “absolute” outlier, detecting such a value might be important for your use case.
What is Data Drift?
The value of data is based on its accuracy, its relevance, and its age. Data drift refers to data that has become less relevant over time. For example, online purchasing patterns in 2010 are probably not as relevant as data from 2020 because of various factors (such as the profile of different types of customers). Keep in mind that there might be multiple factors that can influence data drift in a specific dataset.
Two techniques are domain classifier and the black-box shift detector, both of which are discussed online:
https://blog.dataiku.com/towards-reliable-mlops-with-drift-detectors.
WHAT IS IMBALANCED CLASSIFICATION?
Imbalanced classification involves datasets with imbalanced classes. For example, suppose that class A has 99% of the data and class B has 1%. Which classification algorithm would you use? Unfortunately, classification algorithms don’t work well with this type of imbalanced dataset. Here is a list of several well-known techniques for handling imbalanced datasets:
•Random resampling rebalances the class distribution.
•Random oversampling duplicates data in the minority class.
•Random undersampling deletes examples from the majority class.
•SMOTE
Random resampling transforms the training dataset into a new dataset, which is effective for imbalanced classification problems.
The random undersampling technique removes samples from the dataset, and involves the following:
•randomly remove samples from majority class
•can be performed with or without replacement
•alleviates imbalance in the dataset
•may increase the variance of the classifier
•may discard useful or important samples
However, random undersampling does not work well with a dataset that has a 99%/1% split into two classes. Moreover, undersampling can result in losing information that is useful for a model.
Instead of random undersampling, another approach involves generating new samples for the minority class. The simplest such technique is random oversampling, which duplicates existing examples from the minority class.
A better technique involves the following:
•synthesize new examples from the minority class
•a type of data augmentation for tabular data
•this technique can be very effective
This technique is called SMOTE, which performs data augmentation (i.e., synthesizes new data samples) before you use a classification algorithm. SMOTE was initially developed by means of the kNN algorithm (other options are available), and it can be an effective technique for handling imbalanced classes.
Yet another option to consider is the Python package imbalanced-learn in the scikit-learn-contrib project. This project provides various re-sampling techniques for datasets that exhibit class imbalance. More details are available online:
https://github.com/scikit-learn-contrib/imbalanced-learn.
WHAT IS SMOTE?
SMOTE is a technique for synthesizing new samples for a dataset. This technique is based on linear interpolation:
•Step 1: Select samples that are close in the feature space.
•Step 2: Draw a line between the samples in the feature space.
•Step 3: Draw a new sample at a point along that line.
A more detailed explanation of the SMOTE algorithm is as follows:
•Select a random sample “a” from the minority class.
•Find k nearest neighbors for that example.
•Select a random neighbor “b” from the nearest neighbors.
•Create a line “L” that connects “a” and “b.”
•Randomly select one or more points “c” on line L.
If need be, you can repeat this process for the other (k-1) nearest neighbors to distribute the synthetic values more evenly among the nearest neighbors.
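The following code block is a minimal sketch that uses the SMOTE class in the imbalanced-learn package (mentioned in the previous section) to rebalance a hypothetical 95%/5% dataset; install the package via pip install imbalanced-learn if it is not already available:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# create a hypothetical imbalanced dataset with a 95%/5% class split
X, y = make_classification(n_samples=1000, n_features=4,
                           weights=[0.95, 0.05], random_state=0)
print(Counter(y))                               # roughly 950 versus 50 samples

# synthesize new minority-class samples (kNN-based by default)
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y_res))                           # both classes now have the same count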
SMOTE Extensions
The initial SMOTE algorithm is based on the kNN classification algorithm, which has been extended in various ways, such as replacing kNN with SVM. A list of SMOTE extensions is shown as follows:
•selective synthetic sample generation
•Borderline-SMOTE (kNN)
•Borderline-SMOTE (SVM)
•Adaptive Synthetic Sampling (ADASYN)
ANALYZING CLASSIFIERS (OPTIONAL)
This section is marked “optional” because its contents pertain to machine learning classifiers, which are not the focus of this book. However, it’s still worthwhile to glance through the material, or perhaps return to this section after you have a basic understanding of machine learning classifiers.
Several well-known techniques are available for analyzing the quality of machine learning classifiers. Two techniques are LIME and ANOVA, both of which are discussed in the following subsections.
What is LIME?
LIME is an acronym for Local Interpretable Model-Agnostic Explanations. LIME is a model-agnostic technique that can be used with machine learning models. In LIME, you make small random changes to data samples and then observe the manner in which predictions change (or not). The approach involves changing the input (slightly) and then observing what happens to the output.
By way of analogy, consider food inspectors who test for bacteria in truckloads of perishable food. Clearly, it’s infeasible to test every food item in a truck (or a train car), so inspectors perform “spot checks” that involve testing randomly selected items. In an analogous fashion, LIME makes small changes to input data in random locations and then analyzes the changes in the associated output values.
However, there are two caveats to keep in mind when you use LIME with input data for a given model:
1.The actual changes to input values are model-specific.
2.This technique works on input that is interpretable.
Examples of interpretable input include machine learning classifiers (such as trees and random forests) and NLP techniques such as BoW (Bag of Words). Non-interpretable input involves “dense” data, such as a word embedding (which is a vector of floating point numbers).
You could also substitute your model with another model that involves interpretable data, but then you need to evaluate how accurate the approximation is to the original model.
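The following code block is a minimal sketch of this idea that relies on the third-party lime package (not discussed in this book): a classifier is trained on interpretable tabular data, and then a single prediction is explained by perturbing the input and observing the output. Install the package via pip install lime; the dataset and classifier are illustrative choices.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from lime.lime_tabular import LimeTabularExplainer

# train a simple classifier on interpretable (tabular) data
data = load_iris()
clf = RandomForestClassifier(random_state=0).fit(data.data, data.target)

# explain one prediction by perturbing the input and observing the output
explainer = LimeTabularExplainer(data.data,
                                 feature_names=data.feature_names,
                                 class_names=list(data.target_names),
                                 mode="classification")
explanation = explainer.explain_instance(data.data[0], clf.predict_proba, num_features=4)
print(explanation.as_list())                    # feature/weight pairs for this prediction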
What is ANOVA?
ANOVA is an acronym for analysis of variance, which attempts to analyze the differences among the mean values of a sample that’s taken from a population. ANOVA enables you to test whether multiple mean values are equal. More importantly, ANOVA can assist in reducing Type I errors (false positives) and Type II errors (false negatives). For example, suppose that person A is diagnosed with cancer and person B is diagnosed as healthy, and that both diagnoses are incorrect. Then the result for person A is a false positive, whereas the result for person B is a false negative. In general, a false positive is much preferable to a false negative.
ANOVA pertains to the design of experiments and hypothesis testing, which can produce meaningful results in various situations. For example, suppose that a dataset contains a feature that can be partitioned into several “reasonably” homogenous groups. Next, analyze the variance in each group and perform comparisons with the goal of determining different sources of variance for the values of a given feature.
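The following code block is a minimal sketch of a one-way ANOVA via the f_oneway() function in SciPy (the three groups of values are hypothetical):

from scipy import stats

# hypothetical measurements of the same feature from three groups
group_a = [23, 25, 27, 22, 24]
group_b = [30, 32, 29, 31, 33]
group_c = [24, 26, 25, 23, 27]

# one-way ANOVA: test whether the mean values of the groups are equal
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f_stat, p_value)                          # a small p-value suggests the means differ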
THE BIAS-VARIANCE TRADE-OFF
This section is presented from the viewpoint of machine learning, but the concepts of bias and variance are highly relevant outside of machine learning.
Bias in machine learning can be due to an error from wrong assumptions in a learning algorithm. High bias might cause an algorithm to miss relevant relations between features and target outputs (underfitting). Prediction bias can occur because of “noisy” data, an incomplete feature set, or a biased training sample.
Error due to bias is the difference between the expected (or average) prediction of your model and the correct value that you want to predict. Repeat the model building process multiple times, and gather new data each time, and also perform an analysis to produce a new model. The resulting models have a range of predictions because the underlying datasets have a degree of randomness. Bias measures the extent to which the predictions for these models deviate from the correct value.
Variance in machine learning is the expected value of the squared deviation from the mean. High variance can cause an algorithm to model the random noise in the training data rather than the intended outputs (overfitting). Moreover, adding parameters to a model increases its complexity, increases the variance, and decreases the bias.
Dealing with bias and variance involves addressing underfitting and overfitting.
Error due to variance is the variability of a model prediction for a given data point. As before, repeat the entire model building process, and the variance is the extent to which predictions for a given point vary among different “instances” of the model.
If you have worked with datasets and performed data analysis, you already know that finding well-balanced samples can be difficult or highly impractical. Moreover, performing an analysis of the data in a dataset is vitally important, yet there is no guarantee that you can produce a dataset that is 100% “clean.”
A biased statistic is a statistic that is systematically different from the entity in the population that is being estimated. In more casual terminology, if a data sample “favors” or “leans” toward one aspect of the population, then the sample has bias. For example, if you prefer movies that are comedies, then clearly you are more likely to select a comedy instead of a dramatic movie or a science fiction movie. Thus, a frequency graph of the movie types in a sample of your movie selections will be more closely clustered around comedies.
However, if you have a wide-ranging set of preferences for movies, then the corresponding frequency graph will be more varied, and therefore have a larger spread of values. As a simple example, suppose that you are given an assignment that involves writing a term paper on a controversial subject that has many opposing viewpoints. Since you want a bibliography that supports your well-balanced term paper that takes into account multiple viewpoints, your bibliography will contain a wide variety of sources. In other words, your bibliography will have a larger variance and a smaller bias. However, if most (or all) the references in your bibliography espouse the same point of view, then you will have a smaller variance and a larger bias (it’s just an analogy, so it’s not a perfect counterpart to bias vs. variance).
The bias-variance trade-off can be stated in simple terms: in general, reducing the bias in samples can increase the variance, whereas reducing the variance tends to increase the bias.
Types of Bias in Data
In addition to the bias-variance trade-off that is discussed in the previous section, there are several types of bias, some of which are listed as follows:
•Availability Bias
•Confirmation Bias
•False Causality
•Sunk Cost Fallacy
•Survivorship Bias
Availability bias is akin to making a “rule” based on an exception. For example, there is a known link between smoking cigarettes and cancer, but there are exceptions. If you find someone who has smoked three packs of cigarettes on a daily basis for four decades and is still healthy, can you assert that smoking does not lead to cancer?
Confirmation bias refers to the tendency to focus on data that confirms one’s beliefs and simultaneously ignore data that contradicts a belief.
False causality occurs when you incorrectly assert that the occurrence of a particular event causes another event to occur. One of the most well-known examples involves ice cream consumption and violent crime in New York during the summer. More people eat ice cream in the summer, and violent crime also increases in the summer, but concluding that ice cream consumption “causes” violent crime is a false causality. Other factors, such as the increase in temperature, may be linked to the increase in crime. However, it’s important to distinguish between correlation and causality: the latter is a much stronger link than the former, and it’s also more difficult to establish causality than correlation.
Sunk cost refers to something (often money) that has been spent or incurred that cannot be recouped. A common example pertains to gambling at a casino: People fall into the pattern of spending more money in order to recoup a substantial amount of money that has already been lost. While there are situations in which people do recover their money, in many cases, people simply incur an even greater loss because they continue to spend their money.
Survivorship bias