Natural Language Processing and Machine Learning for Developers - Oswald Campesato - E-Book


Description

Unlock the potential of NLP and machine learning with this comprehensive guide. Learn advanced techniques, implement practical applications, and master tools like NumPy, Pandas, and transformer models.

Key Features

  • Comprehensive guide to NLP and machine learning
  • Practical examples and code samples
  • Covers techniques and applications from basic to advanced

Book Description

This book introduces developers to basic concepts in NLP and machine learning, providing numerous code samples to support the topics covered. The journey begins with introductory material on NumPy and Pandas, essential for data manipulation. Following this, chapters delve into NLP concepts, algorithms, and toolkits, providing a solid foundation in natural language processing.

As you progress, the book covers machine learning fundamentals and classifiers, demonstrating how these techniques are applied in NLP. Practical examples using TF2 and Keras illustrate how to implement various NLP tasks. Advanced topics include the Transformer architecture, BERT-based models, and the GPT family of models, showcasing the latest advancements in the field.

The final chapters and appendices offer a comprehensive overview of related topics, including data and statistics, Python3, regular expressions, and data visualization with Matplotlib and Seaborn. Companion files with source code and figures ensure a hands-on learning experience. This book equips you with the knowledge and tools needed to excel in NLP and machine learning.

What you will learn

  • Master NumPy and Pandas for data manipulation
  • Understand NLP concepts and techniques
  • Implement various NLP algorithms
  • Apply machine learning techniques
  • Use transformer models like BERT and GPT
  • Develop practical NLP applications

Who this book is for

This book is ideal for developers and data scientists who want to enhance their skills in NLP and machine learning. Basic knowledge of Python is recommended. Prior experience with data manipulation and machine learning concepts is beneficial.

You can read the e-book in Legimi apps or in any app that supports the following formats:

EPUB
MOBI

Page count: 988

Year of publication: 2024




NATURAL LANGUAGE PROCESSING AND MACHINE LEARNING FOR DEVELOPERS

Pocket Primer

Oswald Campesato

Copyright ©2021 by MERCURY LEARNING AND INFORMATION LLC. All rights reserved.

This publication, portions of it, or any accompanying software may not be reproduced in any way, stored in a retrieval system of any type, or transmitted by any means, media, electronic display or mechanical display, including, but not limited to, photocopy, recording, Internet postings, or scanning, without prior permission in writing from the publisher.

Publisher: David Pallai

MERCURY LEARNING AND INFORMATION

22841 Quicksilver Drive

Dulles, VA 20166

[email protected]

www.merclearning.com

800-232-0223

O. Campesato. Natural Language Processing and Machine Learning for Developers.

ISBN: 978-1-68392-618-4

The publisher recognizes and respects all marks used by companies, manufacturers, and developers as a means to distinguish their products. All brand names and product names mentioned in this book are trademarks or service marks of their respective companies. Any omission or misuse (of any kind) of service marks or trademarks, etc. is not an attempt to infringe on the property of others.

Library of Congress Control Number: 2021936681

21 22 23 3 2 1     Printed on acid-free paper in the United States of America.

Our titles are available for adoption, license, or bulk purchase by institutions, corporations, etc. For additional information, please contact the Customer Service Dept. at 800-232-0223 (toll free).

All of our titles are available in digital format at academiccourseware.com and other digital vendors. The sole obligation of MERCURY LEARNING AND INFORMATION to the purchaser is to replace the book, based on defective materials or faulty workmanship, but not based on the operation or functionality of the product.

I’d like to dedicate this book to my parents; may this bring joy and happiness into their lives.

Contents

Preface

Chapter 1   Introduction to NumPy

What is NumPy?

Useful NumPy Features

What are NumPy Arrays?

Working with Loops

Appending Elements to Arrays (1)

Appending Elements to Arrays (2)

Multiply Lists and Arrays

Doubling the Elements in a List

Lists and Exponents

Arrays and Exponents

Math Operations and Arrays

Working with “-1” Subranges with Vectors

Working with “-1” Subranges with Arrays

Other Useful NumPy Methods

Arrays and Vector Operations

NumPy and Dot Products (1)

NumPy and Dot Products (2)

NumPy and the “Norm” of Vectors

NumPy and Other Operations

NumPy and the reshape() Method

Calculating the Mean and Standard Deviation

Trimmed Mean and Weighted Mean

Code Sample with Mean and Standard Deviation

Working with Lines in the Plane (Optional)

Plotting a Line with NumPy and Matplotlib

Plotting a Quadratic with NumPy and Matplotlib

What is Linear Regression?

What is Multivariate Analysis?

What about Nonlinear Datasets?

The MSE Formula

Other Error Types

Nonlinear Least Squares

Calculating the MSE Manually

Find the Best-Fitting Line with NumPy

Calculating MSE by Successive Approximation (1)

Calculating MSE by Successive Approximation (2)

What is Jax?

Google Colaboratory

Uploading CSV Files in Google Colaboratory

Summary

Chapter 2   Introduction to Pandas

What is Pandas?

Pandas Options and Settings

Pandas Data Frames

Data Frames and Data Cleaning Tasks

Alternatives to Pandas

A Pandas Data Frame with NumPy Example

Describing a Pandas Data Frame

Pandas Boolean Data Frames

Transposing a Pandas Data Frame

Pandas Data Frames and Random Numbers

Reading CSV Files in Pandas

The loc() and iloc() Methods in Pandas

Converting Categorical Data to Numeric Data

Matching and Splitting Strings in Pandas

Converting Strings to Dates in Pandas

Merging and Splitting Columns in Pandas

Combining Pandas Data Frames

Data Manipulation with Pandas Data Frames (1)

Data Manipulation with Pandas Data Frames (2)

Data Manipulation with Pandas Data Frames (3)

Pandas Data Frames and CSV Files

Managing Columns in Data Frames

Switching Columns

Appending Columns

Deleting Columns

Inserting Columns

Scaling Numeric Columns

Managing Rows in Pandas

Selecting a Range of Rows in Pandas

Finding Duplicate Rows in Pandas

Inserting New Rows in Pandas

Handling Missing Data in Pandas

Multiple Types of Missing Values

Test for Numeric Values in a Column

Replacing NaN Values in Pandas

Sorting Data Frames in Pandas

Working with groupby() in Pandas

Working with apply() and applymap() in Pandas

Handling Outliers in Pandas

Pandas Data Frames and Scatterplots

Pandas Data Frames and Simple Statistics

Aggregate Operations in Pandas Data Frames

Aggregate Operations with the titanic.csv Dataset

Save Data Frames as CSV Files and Zip Files

Pandas Data Frames and Excel Spreadsheets

Working with JSON-based Data

Python Dictionary and JSON

Python, Pandas, and JSON

Pandas and Regular Expressions (Optional)

Useful One-Line Commands in Pandas

What is Method Chaining?

Pandas and Method Chaining

Pandas Profiling

What is Texthero?

Summary

Chapter 3   NLP Concepts (I)

The Origin of Languages

Language Fluency

Major Language Groups

Peak Usage of Some Languages

Languages and Regional Accents

Languages and Slang

Languages and Dialects

The Complexity of Natural Languages

Word Order in Sentences

What about Verbs?

Auxiliary Verbs

What are Case Endings?

Languages and Gender

Singular and Plural Forms of Nouns

Changes in Spelling of Words

Japanese Grammar

Japanese Postpositions (Particles)

Ambiguity in Japanese Sentences

Japanese Nominalization

Google Translate and Japanese

Japanese and Korean

Vowel-Optional Languages and Word Direction

Mutating Consonant Spelling

Expressing Negative Opinions

Phonetic Languages

Phonemes and Morphemes

English Words of Greek and Latin Origin

Multiple Ways to Pronounce Consonants

The Letter “j” in Various Languages

“Hard” versus “Soft” Consonant Sounds

“Ess,” “zee,” and “sh” Sounds

Three Consecutive Consonants

Diphthongs and Triphthongs in English

Semi-Vowels in English

Challenging English Sounds

English in Canada, UK, Australia, and the United States

English Pronouns and Prepositions

What is NLP?

The Evolution of NLP

A Wide-Angle View of NLP

NLP Applications and Use Cases

NLU and NLG

What is Text Classification?

Information Extraction and Retrieval

Word Sense Disambiguation

NLP Techniques in ML

NLP Steps for Training a Model

Text Normalization and Tokenization

Word Tokenization in Japanese

Text Tokenization with Unix Commands

Handling Stop Words

What is Stemming?

Singular versus Plural Word Endings

Common Stemmers

Stemmers and Word Prefixes

Over Stemming and Under Stemming

What is Lemmatization?

Stemming/Lemmatization Caveats

Limitations of Stemming and Lemmatization

Working with Text: POS

POS Tagging

POS Tagging Techniques

Working with Text: NER

Abbreviations and Acronyms

NER Techniques

What is Topic Modeling?

Keyword Extraction, Sentiment Analysis, and Text Summarization

Summary

Chapter 4   NLP Concepts (II)

What is Word Relevance?

What is Text Similarity?

Sentence Similarity

Sentence Encoders

Working with Documents

Document Classification

Document Similarity (doc2vec)

Techniques for Text Similarity

Similarity Queries

What is Text Encoding?

Text Encoding Techniques

Document Vectorization

One-Hot Encoding (OHE)

Index-Based Encoding

Additional Encoders

The BoW Algorithm

What are n-grams?

Calculating Probabilities with N-grams

Calculating tf, idf, and tf-idf

What is Term Frequency (TF)?

What is Inverse Document Frequency (IDF)?

What is tf-idf?

Limitations of tf-idf

Pointwise Mutual Information (PMI)

The Context of Words in a Document

What is Semantic Context?

Textual Entailment

Discrete, Distributed, and Contextual Word Representations

What is Cosine Similarity?

Text Vectorization (aka Word Embeddings)

Overview of Word Embeddings and Algorithms

Word Embeddings

Word Embedding Algorithms

What is Word2vec?

The Intuition for Word2vec

The Word2vec Architecture

Limitations of Word2vec

The CBoW Architecture

What are Skip-grams?

Skip-gram Example

The Skip-gram Architecture

Neural Network Reduction

What is GloVe?

Working with GloVe

What is FastText?

Comparison of Word Embeddings

What is Topic Modeling?

Topic Modeling Algorithms

LDA and Topic Modeling

Text Classification versus Topic Modeling

Language Models and NLP

How to Create a Language Model

Vector Space Models

Term-Document Matrix

Tradeoffs of the VSM

NLP and Text Mining

Text Extraction Preprocessing and N-Grams

Relation Extraction and Information Extraction

What is a BLEU Score?

ROUGE Score: An Alternative to BLEU

Summary

Chapter 5   Algorithms and Toolkits (I)

Cleaning Data with Regular Expressions

Handling Contracted Words

Python Code Samples of BoW

One-Hot Encoding Examples

Sklearn and Word Embedding Examples

What is BeautifulSoup?

Web Scraping with Pure Regular Expressions

What is Scrapy?

What is SpaCy?

SpaCy and Stop Words

SpaCy and Tokenization

SpaCy and Lemmatization

SpaCy and NER

SpaCy Pipelines

SpaCy and Word Vectors

The scispaCy Library (Optional)

Summary

Chapter 6   Algorithms and Toolkits (II)

What is NLTK?

NLTK and BoW

NLTK and Stemmers

NLTK and Lemmatization

NLTK and Stop Words

What is Wordnet?

Synonyms and Antonyms

NLTK, lxml, and XPath

NLTK and n-grams

NLTK and POS (1)

NLTK and POS (2)

NLTK and Tokenizers

NLTK and Context-Free Grammars (Optional)

What is Gensim?

Gensim and tf-idf Example

Saving a Word2vec Model in Gensim

An Example of Topic Modeling

A Brief Comparison of Popular Python-Based NLP Libraries

Miscellaneous Libraries

Summary

Chapter 7   Introduction to Machine Learning

What is Machine Learning?

Learning Style of Machine Learning Algorithms

Types of Machine Learning Algorithms

Machine Learning Tasks

Preparing a Dataset and Training a Model

Feature Engineering, Selection, and Extraction

Feature Engineering

Feature Selection

Feature Extraction

Model Selection

Working with Datasets

Training Data versus Test Data

What is Cross-Validation?

Overfitting versus Underfitting

What is Regularization?

ML and Feature Scaling

Data Normalization Techniques

Metrics in Machine Learning

R-Squared and its Limitations

Confusion Matrix

Precision, Recall, and Specificity

The ROC Curve and AUC

Metrics for Model Evaluation and Selection

What is Linear Regression?

Linear Regression versus Curve-Fitting

When are Solutions Exact Values?

What is Multivariate Analysis?

Other Types of Regression

Working with Lines in the Plane (Optional)

Scatter Plots with NumPy and Matplotlib (1)

Why the “Perturbation Technique” is Useful

Scatter Plots with NumPy and Matplotlib (2)

A Quadratic Scatterplot with NumPy and Matplotlib

The Mean Squared Error (MSE) Formula

A List of Error Types

Nonlinear Least Squares

Calculating the MSE Manually

Approximating Linear Data with np.linspace()

What are Ensemble Methods?

Four Types of Ensemble Methods

Bagging

Boosting

Stacked Models and Blending Models

What is Bootstrapping?

Common Boosting Algorithms

Hyperparameter Optimization

Grid Search

Randomized Search

Bayesian Optimization

AutoML, AutoML-Zero, and AutoNLP

Miscellaneous Topics

What is Causality?

What is Explainability?

What is Interpretability?

Summary

Chapter 8   Classifiers in Machine Learning

What is Classification?

What are Classifiers?

Common Classifiers

Binary versus Multiclass Classification

Multilabel Classification

What are Linear Classifiers?

What is kNN?

How to Handle a Tie in kNN

SMOTE and kNN

kNN for Data Imputation

What are Decision Trees?

Trade-offs with Decision Trees

Decision Tree Algorithms

Decision Tree Code Samples

Decision Trees, Gini Impurity, and Entropy

What are Random Forests?

What are Support Vector Machines?

Trade-offs of SVMs

What is a Bayesian Classifier?

Types of Naïve Bayes Classifiers

Training Classifiers

Evaluating Classifiers

Trade-offs for ML Algorithms

What are Activation Functions?

Why Do we Need Activation Functions?

How Do Activation Functions Work?

Common Activation Functions

Activation Functions in Python

Keras Activation Functions

The ReLU and ELU Activation Functions

The Advantages and Disadvantages of ReLU

ELU

Sigmoid, Softmax, and Hardmax Similarities

Softmax

Softplus

Tanh

Sigmoid, Softmax, and HardMax Differences

Hyperparameters for Neural Networks

The Loss Function Hyperparameter

The Optimizer Hyperparameter

The Learning Rate Hyperparameter

The Dropout Rate Hyperparameter

What is Backward Error Propagation?

What is Logistic Regression?

Setting a Threshold Value

Logistic Regression: Important Assumptions

Linearly Separable Data

Keras, Logistic Regression, and Iris Dataset

Sklearn and Linear Regression

SciPy and Linear Regression

Keras and Linear Regression

Summary

Chapter 9   NLP Applications

What is Text Summarization?

Extractive Text Summarization

Abstractive Text Summarization

Text Summarization with gensim and SpaCy

What are Recommender Systems?

Movie Recommender Systems

Factoring the Rating Matrix R

Content-Based Recommendation Systems

Analyzing only the Description of the Content

Building User Profiles and Item Profiles

Collaborative Filtering Algorithm

User–User Collaborative Filtering

Item–Item Collaborative Filtering

Recommender System with Surprise

Recommender Systems and Reinforcement Learning (Optional)

Basic Reinforcement Learning in Five Minutes

What is RecSim?

What is Sentiment Analysis?

Useful Tools for Sentiment Analysis

Aspect-Based Sentiment Analysis

Deep Learning and Sentiment Analysis

Sentiment Analysis with Naïve Bayes

Sentiment Analysis in NLTK and VADER

Sentiment Analysis with Textblob

Sentiment Analysis with Flair

Detecting Spam

Logistic Regression and Sentiment Analysis

Working with COVID-19

What are Chatbots?

Open Domain Chatbots

Chatbot Types

Logic Flow of Chatbots

Chatbot Abuses

Useful Links

Summary

Chapter 10   NLP and TF2/Keras

Term-Document Matrix

Text Classification Algorithms in Machine Learning

A Keras-Based Tokenizer

TF2 and Tokenization

TF2 and Encoding

A Keras-Based Word Embedding

An Example of BoW with TF2

The 20newsgroup Dataset

Text Classification with the kNN Algorithm

Text Classification with a Decision Tree Algorithm

Text Classification with a Random Forest Algorithm

Text Classification with the SVC Algorithm

Text Classification with the Naïve Bayes Algorithm

Text Classification with the kMeans Algorithm

TF2/Keras and Word Tokenization

TF2/Keras and Word Encodings

Text Summarization with TF2/Keras and Reuters Dataset

Summary

Chapter 11   Transformer, BERT, and GPT

What is Attention?

Types of Word Embeddings

Types of Attention and Algorithms

An Overview of the Transformer Architecture

The Transformers Library from HuggingFace

Transformer and NER Tasks

Transformer and QnA Tasks

Transformer and Sentiment Analysis Tasks

Transformer and Mask Filling Tasks

What is T5?

What is BERT?

BERT Features

How is BERT Trained?

How BERT Differs from Earlier NLP Techniques

The Inner Workings of BERT

What is MLM?

What is NSP?

Special Tokens

BERT Encoding: Sequence of Steps

Subword Tokenization

Sentence Similarity in BERT

Word Context in BERT

Generating BERT Tokens (1)

Generating BERT Tokens (2)

The BERT Family

Surpassing Human Accuracy: deBERTa

What is Google Smith?

Introduction to GPT

Installing the Transformers Package

Working with GPT-2

What is GPT-3?

What is the Goal?

GPT-3 Task Strengths and Mistakes

GPT-3 Architecture

GPT versus BERT

Zero-Shot, One-Shot, and Few Shot Learners

GPT Task Performance

The Switch Transformer: One Trillion Parameters

Looking Ahead

Summary

Appendix A   Data and Statistics

What are Datasets?

Data Preprocessing

Data Types

Preparing Datasets

Continuous versus Discrete Data

“Binning” Continuous Data

Scaling Numeric Data via Normalization

Scaling Numeric Data via Standardization

What to Look for in Categorical Data

Mapping Categorical Data to Numeric Values

Working with Dates

Working with Currency

Missing Data, Anomalies, and Outliers

Anomalies and Outliers

Outlier Detection

Missing Data: MCAR, MAR, and MNAR

What is Data Drift?

What is Imbalanced Classification?

Undersampling and Oversampling

Limitations of Resampling

What is SMOTE?

SMOTE Extensions

Analyzing Classifiers

What is LIME?

What is ANOVA?

What is a Probability?

Calculating the Expected Value

Random Variables

Discrete versus Continuous Random Variables

Well-Known Probability Distributions

Fundamental Concepts in Statistics

The Mean

The Median

The Mode

The Variance and Standard Deviation

Population, Sample, and Population Variance

Chebyshev’s Inequality

What is a p-Value?

The Moments of a Function (Optional)

Skewness

Kurtosis

Data and Statistics

The Central Limit Theorem

Correlation versus Causation

Statistical Inferences

The Bias-Variance Trade-off

Types of Bias in Data

Gini Impurity, Entropy, and Perplexity

What is Gini Impurity?

What is Entropy?

Calculating Gini Impurity and Entropy Values

Multidimensional Gini Index

What is Perplexity?

Cross-Entropy and KL Divergence

What is Cross Entropy?

What is KL Divergence?

What’s their Purpose?

Covariance and Correlation Matrices

Covariance Matrix

Covariance Matrix: An Example

Correlation Matrix

Eigenvalues and Eigenvectors

Calculating Eigenvectors: A Simple Example

Gauss Jordan Elimination (Optional)

Principal Component Analysis (PCA)

The New Matrix of Eigenvectors

Dimensionality Reduction

Dimensionality Reduction Techniques

The Curse of Dimensionality

What are Manifolds (Optional)?

Singular Value Decomposition (SVD)

Locally Linear Embedding (LLE)

UMAP

t-SNE (“tee-snee”)

PHATE

Linear Versus Nonlinear Reduction Techniques

Types of Distance Metrics

Other Well-Known Distance Metrics

Pearson Correlation Coefficient

Jaccard Index (or Similarity)

Locality Sensitive Hashing (Optional)

What is Sklearn?

Sklearn, Pandas, and the IRIS Dataset

Sklearn and Outlier Detection

What is Bayesian Inference?

Bayes Theorem

Some Bayesian Terminology

What is MAP?

Why Use Bayes Theorem?

What are Vector Spaces?

Summary

Appendix B   Introduction to Python

Tools for Python

easy_install and pip

virtualenv

IPython

Python Installation

Setting the PATH Environment Variable (Windows Only)

Launching Python on Your Machine

The Python Interactive Interpreter

Python Identifiers

Lines, Indentation, and Multilines

Quotation and Comments in Python

Saving Your Code in a Module

Some Standard Modules in Python

The help() and dir() Functions

Compile Time and Runtime Code Checking

Simple Data Types in Python

Working with Numbers

Working with Other Bases

The chr() Function

The round() Function in Python

Formatting Numbers in Python

Working with Fractions

Unicode and UTF-8

Working with Unicode

Working with Strings

Comparing Strings

Formatting Strings in Python

Uninitialized Variables and the Value None in Python

Slicing and Splicing Strings

Testing for Digits and Alphabetic Characters

Search and Replace a String in Other Strings

Remove Leading and Trailing Characters

Printing Text without NewLine Characters

Text Alignment

Working with Dates

Converting Strings to Dates

Exception Handling in Python

Handling User Input

Python and Emojis (Optional)

Command-Line Arguments

Summary

Appendix C   Introduction to Regular Expressions

What are Regular Expressions?

Metacharacters in Python

Character Sets in Python

Working with “^” and “\”

Character Classes in Python

Matching Character Classes with the re Module

Using the re.match() Method

Options for the re.match() Method

Matching Character Classes with the re.search() Method

Matching Character Classes with the findall() Method

Finding Capitalized Words in a String

Additional Matching Function for Regular Expressions

Grouping with Character Classes in Regular Expressions

Using Character Classes in Regular Expressions

Matching Strings with Multiple Consecutive Digits

Reversing Words in Strings

Modifying Text Strings with the re Module

Splitting Text Strings with the re.split() Method

Splitting Text Strings Using Digits and Delimiters

Substituting Text Strings with the re.sub() Method

Matching the Beginning and the End of Text Strings

Compilation Flags

Compound Regular Expressions

Counting Character Types in a String

Regular Expressions and Grouping

Simple String Matches

Additional Topics for Regular Expressions

Summary

Exercises

Appendix D   Introduction to Keras

What is Keras?

Working with Keras Namespaces in TF 2

Working with the tf.keras.layers Namespace

Working with the tf.keras.activations Namespace

Working with the tf.keras.datasets Namespace

Working with the tf.keras.experimental Namespace

Working with Other tf.keras Namespaces

TF 2 Keras versus “Standalone” Keras

Creating a Keras-Based Model

Keras and Linear Regression

Keras, MLPs, and MNIST

Keras, CNNs, and cifar10

Resizing Images in Keras

Keras and Early Stopping (1)

Keras and Early Stopping (2)

Keras and Metrics

Saving and Restoring Keras Models

Summary

Appendix E   Introduction to TensorFlow 2

What is TF 2?

TF 2 Use Cases

TF 2 Architecture: The Short Version

TF 2 Installation

TF 2 and the Python REPL

Other TF 2-Based Toolkits

TF 2 Eager Execution

TF 2 Tensors, Data Types, and Primitive Types

TF 2 Data Types

TF 2 Primitive Types

Constants in TF 2

Variables in TF 2

The tf.rank() API

The tf.shape() API

Variables in TF 2 (Revisited)

TF 2 Variables versus Tensors

What is @tf.function in TF 2?

How Does @tf.function Work?

A Caveat about @tf.function in TF 2

The tf.print() Function and Standard Error

Working with @tf.function in TF 2

An Example without @tf.function

An Example with @tf.function

Overloading Functions with @tf.function

What is AutoGraph in TF 2?

Arithmetic Operations in TF 2

Caveats for Arithmetic Operations in TF 2

TF 2 and Built-In Functions

Calculating Trigonometric Values in TF 2

Calculating Exponential Values in TF 2

Working with Strings in TF 2

Working with Tensors and Operations in TF 2

Second-Order Tensors in TF 2 (1)

Second-Order Tensors in TF 2 (2)

Multiplying Two Second-Order Tensors in TF

Convert Python Arrays to TF Tensors

Conflicting Types in TF 2

Differentiation and tf.GradientTape in TF 2

Examples of tf.GradientTape

Using Nested Loops with tf.GradientTape

Other Tensors with tf.GradientTape

A Persistent Gradient Tape

What is Trax?

Google Colaboratory

Other Cloud Platforms

GCP SDK

TF2 and tf.data.Dataset

The TF 2 tf.data.Dataset

Creating a Pipeline

A Simple TF 2 tf.data.Dataset

What are Lambda Expressions?

Working with Generators in TF 2

Summary

Appendix F   Data Visualization

What is Data Visualization?

Types of Data Visualization

What is Matplotlib?

Horizontal Lines in Matplotlib

Slanted Lines in Matplotlib

Parallel Slanted Lines in Matplotlib

A Grid of Points in Matplotlib

A Dotted Grid in Matplotlib

Lines in a Grid in Matplotlib

A Colored Grid in Matplotlib

A Colored Square in an Unlabeled Grid in Matplotlib

Randomized Data Points in Matplotlib

A Histogram in Matplotlib

A Set of Line Segments in Matplotlib

Plotting Multiple Lines in Matplotlib

Trigonometric Functions in Matplotlib

Display IQ Scores in Matplotlib

Plot a Best-Fitting Line in Matplotlib

Introduction to Sklearn (scikit-learn)

The Digits Dataset in Sklearn

The Iris Dataset in Sklearn

Sklearn, Pandas, and the Iris Dataset

The Iris Dataset in Sklearn (Optional)

The faces Dataset in Sklearn (Optional)

Working with Seaborn

Features of Seaborn

Seaborn Built-in Datasets

The Iris Dataset in Seaborn

The Titanic Dataset in Seaborn

Extracting Data from the Titanic Dataset in Seaborn (1)

Extracting Data from the Titanic Dataset in Seaborn (2)

Visualizing a Pandas Dataset in Seaborn

Data Visualization in Pandas

Summary

Index

Preface

 

What Is the Primary Value Proposition for This Book?

This book offers a fast-paced introduction to as much relevant information about NLP and machine learning as can reasonably be included in a book of this size. Some chapters discuss topics in great detail (such as the first half of Chapter 3), and other chapters contain advanced statistical concepts that you can safely omit during your first pass through this book. The book casts a wide net to help developers who have a range of technical backgrounds, which is the rationale for the inclusion of numerous topics. Regardless of your background, please keep in mind the following points: you will not become an expert in machine learning or NLP by reading this book, and you should be prepared to read some of its content multiple times.

However, you will be exposed to many NLP and machine learning topics, many of which are presented in a cursory manner, for two reasons. First, it’s important that you be exposed to these concepts. In some cases, you will find topics that pique your interest and motivate you to learn more about them through self-study; in other cases, you will probably be satisfied with a brief introduction.

Second, a full treatment of all the topics covered in this book would probably triple its size, and few people are interested in reading 1,000-page technical books. Consequently, the book provides a broad view of the NLP and machine learning landscape, based on the belief that this approach will be more beneficial for readers who are already experienced developers but need to learn about NLP and machine learning.

The Target Audience

The book is intended primarily for people who have a solid background as software developers. Specifically, it is for developers who are accustomed to searching online for more detailed information about technical topics. If you are a beginner, there are other books that are more suitable for you, and you can find them by performing an online search.

The book is also intended to reach an international audience of readers with highly diverse backgrounds in various age groups. Many of these readers can read English, but it is not their native language (it could be their second, third, or even fourth language). Consequently, this book uses standard English rather than colloquial expressions that might be confusing to those readers. As you know, people learn through different types of imitation, which include reading, writing, and hearing new material. This book takes these points into consideration in order to provide a comfortable and meaningful learning experience for the intended readers.

Why Such a Massive Number of Topics in This Book?

As mentioned in the response to the previous question, this book is intended for developers who want to learn NLP concepts and machine learning. Since this encompasses people with vastly different technical backgrounds, there are readers who “don’t know what they don’t know” regarding NLP. Therefore, this book exposes readers to a plethora of NLP-related concepts, after which they can decide which topics to study in greater depth. Consequently, this book does not take a “zero-to-hero” approach, nor is it necessary to master all the topics that are discussed in the chapters and the appendices; rather, they are a go-to source of information to help you decide where you want to invest your time and effort.

As you might already know, learning often takes place through an iterative and repetitive approach whereby the cumulative exposure leads to a greater level of comfort and understanding of technical concepts. For some readers, this will be the first step in their journey toward mastering NLP and machine learning.

Please read the document ChapterOutline.doc, which provides the rationale for each chapter as well as the sequence in which you can read the chapters in this book.

How Is the Book Organized and What Will I Learn?

Most of this book is organized as paired chapters: the first two chapters contain introductory material for NumPy and Pandas, followed by a pair of chapters that contain NLP concepts, and then another pair of chapters that contain Python code samples that illustrate the NLP concepts.

The next pair of chapters introduces machine learning concepts and algorithms (such as Decision Trees, Random Forests, and SVMs), followed by Chapter 9, which explores sentiment analysis, recommender systems, COVID-19 analysis, spam detection, and a short discussion of chatbots. Chapter 10 contains examples of performing NLP tasks using TF2 and Keras, and Chapter 11 presents the Transformer architecture, BERT-based models, and the GPT family of models, all of which were developed during the past three years and are, to varying degrees, considered SOTA (“state of the art”).

The appendices contain introductory material (including Python code samples) for various topics, including Python 3, regular expressions, Keras, TF2, Matplotlib, and Seaborn. Appendix A (which is the most extensive in terms of page count) covers myriad topics, such as working with datasets that contain different types of data, handling missing data, statistical concepts, how to handle imbalanced classes (SMOTE), how to analyze classifiers, covariance and correlation matrices, dimensionality reduction (including SVD and t-SNE), and a section that discusses Gini impurity, entropy, and KL divergence.

Why Is There Minimal Coverage of Deep Learning?

This book is for developers who are looking for an introduction to NLP, along with an introduction to machine learning. If you peruse the table of contents, you will see that this book covers a vast assortment of topics, and weighs in at around 600 pages. Books have a “tipping point” in terms of page count: few people have the time to read 1000-page books on technical topics, especially when the field is undergoing continual innovation.

With the preceding points in mind, the inclusion of an extensive section pertaining to deep learning is beyond the scope of an introductory book, and is better suited to a book called “Deep Learning and NLP” (or some other similar title).

Why Are the Code Samples Primarily in Python?

Most of the code samples are short (usually less than one page and sometimes less than half a page), and if need be, you can easily and quickly copy/paste the code into a new Jupyter notebook.

The machine learning code samples that perform more time-consuming computations are available as Python scripts as well as Jupyter notebooks. For the Python code samples that reference a CSV file, you do not need any additional code in the corresponding Jupyter notebook to access the CSV file. Moreover, the code samples execute quickly, so you won’t need to avail yourself of the free GPU that is provided in Google Colaboratory.

If you do decide to use Google Colaboratory, you can easily copy/paste the Python code into a notebook, and also use the upload feature to upload existing Jupyter notebooks. Keep in mind the following point: if the Python code references a CSV file, make sure that you include the appropriate code snippet (as explained in Chapter 1) to access the CSV file in the corresponding Jupyter notebook in Google Colaboratory.

How Much Keras Knowledge Is Needed for This Book?

Some exposure to Keras is helpful, and you can read Appendix D if Keras is new to you. In addition, one of the appendices provides an introduction to TensorFlow 2. Please keep in mind that Keras is well-integrated into TensorFlow 2 (in the tf.keras namespace), and it provides a layer of abstraction over “pure” TensorFlow that will enable you to develop prototypes more quickly.

Do I Need to Learn the Theory Portions of This Book?

Once again, the answer depends on the extent to which you plan to become involved in NLP and machine learning. In addition to creating a model, you will use various algorithms to see which ones provide the level of accuracy (or some other metric) that you need for your project. If you fall short, the theoretical aspects of machine learning can help you perform a “forensic” analysis of your model and your data, and ideally assist in determining how to improve your model.

How Were the Code Samples Created?

The code samples in this book were created and tested using Python 3 and the Keras API that is built into TensorFlow 2 on a MacBook Pro with OS X 10.12.6 (macOS Sierra). Regarding their content: the code samples are derived primarily from material that the author prepared for his Deep Learning and Keras graduate course. In some cases, there are code samples that incorporate short sections of code from discussions in online forums. The key point to remember is that the code samples follow the “Four Cs”: they must be Clear, Concise, Complete, and Correct to the extent that it’s possible to do so, given the size of this book.

Getting the Most from This Book

Some programmers learn well from prose, others learn well from sample code (and lots of it), which means that there’s no single style that can be used for everyone.

Moreover, some programmers want to run the code first, see what it does, and then return to the code to delve into the details (and others use the opposite approach).

Consequently, there are various types of code samples in this book: some are short, some are long, and other code samples “build” from earlier code samples.

What Do I Need to Know for This Book?

Current knowledge of Python 3.x is the most helpful skill. Knowledge of other programming languages (such as Java) can also be helpful because of the exposure to programming concepts and constructs. The less technical knowledge that you have, the more diligence will be required in order to understand the various topics that are covered.

If you want to be sure that you can grasp the material in this book, glance through some of the code samples to get an idea of how much is familiar to you and how much is new for you.

Doesn’t the Companion Disc Obviate the Need for This Book?

The companion files contain all the code samples to save you time and effort from the error-prone process of manually typing code into a text file. In addition, there are situations in which you might not have easy access to these files. Furthermore, the code samples in the book provide explanations that are not available on the companion files.

The companion files are available for downloading by writing to the publisher at [email protected].

Does This Book Contain Production-Level Code Samples?

The primary purpose of the code samples is to show you Python-based libraries for solving a variety of NLP-related tasks in conjunction with machine learning. Clarity has higher priority than writing more compact code that is more difficult to understand (and possibly more prone to bugs). If you decide to use any of the code on a production website, you ought to subject that code to the same rigorous analysis as the other parts of your code base.

What Are the Non-Technical Prerequisites for This Book?

Although the answer to this question is more difficult to quantify, it’s especially important to have a strong desire to learn about machine learning, along with the motivation and discipline to read and understand the code samples.

Even simple machine learning APIs can be a challenge the first time you encounter them, so be prepared to read the code samples several times.

How Do I Set Up a Command Shell?

If you are a Mac user, there are three ways to do so. The first method is to use Finder to navigate to Applications > Utilities and then double click on the Terminal application. The second method applies if you already have a command shell available: you can launch a new command shell by typing the following command:

open /Applications/Utilities/Terminal.app

A third method for Mac users is to open a new command shell from a command shell that is already visible simply by pressing command+n in that command shell, and your Mac will launch another command shell.

If you are a PC user, you can install Cygwin (open source, https://cygwin.com/), which simulates bash commands, or use another toolkit such as MKS (a commercial product). Please read the online documentation that describes the download and installation process. Note that custom aliases are not automatically set if they are defined in a file other than the main start-up file (such as .bash_login).

Companion Files

All the code samples and figures in this book may be obtained by writing to the publisher at [email protected].

Other Books by the Author

This book contains several appendices that are portions from the following books that are also published by Mercury Learning and Information:

Python Pocket Primer:

9781938549854

Regular Expressions Pocket Primer:

9781683922278

Data Cleaning Pocket Primer:

9781683922179

What Are the “Next Steps” After Finishing This Book?

The answer to this question varies widely, mainly because the answer depends heavily on your objectives. If you are interested primarily in NLP, then you can learn more advanced concepts, such as attention, transformers, and the BERT-related models.

If you are primarily interested in machine learning, there are some subfields of machine learning, such as deep learning and reinforcement learning (and deep reinforcement learning) that might appeal to you. Fortunately, there are many resources available, and you can perform an Internet search for those resources. One other point: the aspects of machine learning for you to learn depend on who you are: the needs of a machine learning engineer, data scientist, manager, student, or software developer are all different.

Oswald Campesato
April 2021

CHAPTER 1

INTRODUCTION TO NUMPY

 

This chapter provides a quick introduction to the Python NumPy package, which provides very useful functionality, not only for Python scripts, but also for Python-based scripts with TensorFlow. This chapter contains NumPy code samples with loops, arrays, and lists. You will also learn about dot products, the reshape() method (very useful!), how to plot with Matplotlib (discussed in Appendix F), and examples of linear regression.

The first part of this chapter briefly introduces NumPy and some of its useful features. It then presents examples of working with arrays in NumPy, and contrasts some of the APIs for lists with the same APIs for arrays. In addition, you will see how easy it is to compute the exponent-related values (square, cube, and so forth) of elements in an array.

The second part of the chapter introduces subranges, which are very useful (and frequently used) for extracting portions of datasets in machine learning tasks. In particular, you will see code samples that handle negative (-1) subranges for vectors as well as for arrays, because they are interpreted one way for vectors and a different way for arrays.

The third part of this chapter delves into other NumPy methods, including the reshape() method, which is extremely useful (and very common) when working with image files: some TensorFlow APIs require converting a 2D array of (R,G,B) values into a corresponding one-dimensional vector.
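The flattening step mentioned above can be sketched as follows. This is only a minimal illustration, assuming a tiny hypothetical 2x3 "image" whose pixels are (R,G,B) triples; the variable names and dimensions are invented for the example and do not come from the book's listings.

```python
import numpy as np

# Hypothetical 2x3 "image": each pixel holds an (R, G, B) triple.
height, width = 2, 3
image = np.arange(height * width * 3).reshape(height, width, 3)

# Flatten into a one-dimensional vector; -1 lets NumPy infer the length.
flat = image.reshape(-1)

print(image.shape)  # (2, 3, 3)
print(flat.shape)   # (18,)

# Reshaping back recovers the original layout.
restored = flat.reshape(height, width, 3)
print(np.array_equal(image, restored))  # True
```

Passing -1 for one dimension asks NumPy to work out that dimension from the total number of elements, which avoids hard-coding the length of the flattened vector.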

The fourth part of this chapter delves into linear regression, the mean squared error (MSE), and how to calculate MSE with the NumPy linspace() API.
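As a rough preview of that idea, the sketch below generates evenly spaced x values with linspace() and computes the MSE of a candidate line. The data, the helper function mse(), and the slope and intercept values are all assumptions made for illustration, not code from the book.

```python
import numpy as np

# Hypothetical data: 11 evenly spaced x values on [0, 10],
# with y lying exactly on the line y = 2x + 1.
x = np.linspace(0, 10, 11)
y = 2 * x + 1

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared differences."""
    return np.mean((y_true - y_pred) ** 2)

# A candidate line with slope m and intercept b (illustrative values).
m, b = 2.0, 0.0
y_pred = m * x + b

print(mse(y, y_pred))     # 1.0: every prediction is off by exactly 1
print(mse(y, 2 * x + 1))  # 0.0: the true line fits perfectly
```

A smaller MSE means the candidate line lies closer to the data, which is why MSE is commonly used to compare candidate lines in linear regression.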

What is NumPy?

NumPy is a Python module that provides many convenience methods and also better performance than equivalent pure-Python code that loops over lists. NumPy provides a core library for scientific computing in Python, with performant multidimensional arrays and good vectorized math functions, along with support for linear algebra and random numbers.
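The features listed above can be illustrated with a short sketch. The array contents and the seed value are arbitrary choices for this example.

```python
import numpy as np

values = np.array([1, 2, 3, 4])

# Vectorized math: the operation is applied to every element, no loop needed.
squares = values ** 2
cubes = values ** 3

# Linear algebra: the dot product of the vector with itself.
dot = np.dot(values, values)  # 1 + 4 + 9 + 16

print(squares)  # [ 1  4  9 16]
print(cubes)    # [ 1  8 27 64]
print(dot)      # 30

# Random numbers: a seeded generator gives repeatable draws in [0, 1).
rng = np.random.default_rng(0)
samples = rng.random(3)
print(samples.shape)  # (3,)
```

The same squares computed with a plain Python list would need a loop or a list comprehension, such as [v ** 2 for v in [1, 2, 3, 4]].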

NumPy is modeled after MATLAB, with support for lists, arrays, and so forth. NumPy is easier to use than MATLAB, and it’s very common in TensorFlow 2.x code as well as Python code. Moreover, Chapter 2 contains code samples that combine NumPy with Pandas.

Useful NumPy Features

The NumPy package provides the ndarray object that encapsulates multidimensional arrays of homogeneous data types. Many ndarray operations are performed in compiled code in order to improve performance.

NumPy arrays have the following properties:

They have a fixed size

Elements have the same data type

Elements have the same size (except for objects)

Changing the size of an array involves creating a new array
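The following sketch illustrates these properties on a small, made-up array. Note that assigning to an existing element happens in place, whereas an operation that changes the number of elements (here np.append()) returns a new array.

```python
import numpy as np

arr = np.array([1, 2, 3])
print(arr.dtype.kind)  # i: every element shares one integer type
print(arr.itemsize)    # bytes per element (platform dependent)

# Assigning to an element changes the array in place;
# its size and data type stay the same.
arr[0] = 10
print(arr)  # [10  2  3]

# "Growing" the array returns a new array; the original is untouched.
bigger = np.append(arr, 4)
print(bigger)                   # [10  2  3  4]
print(arr.shape, bigger.shape)  # (3,) (4,)
print(bigger is arr)            # False
```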

Now that you have a general idea about NumPy, let’s delve into some examples that illustrate how to work with NumPy arrays, which is the topic of the next section.

Working with “-1” Subranges with Arrays

Listing 1.11 displays the contents of np2darray2.py that illustrates how to select different ranges of elements in a two-dimensional NumPy array.

LISTING 1.11: np2darray2.py

import numpy as np
 
# -1 => "the last element in . . ." (row or col)
 
arr1  = np.array([(1,2,3),(4,5,6),(7,8,9),(10,11,12)])
print('arr1:',        arr1)
print('arr1[-1,:]:',  arr1[-1,:])
print('arr1[:,-1]:',  arr1[:,-1])
print('arr1[-1:,-1]:',arr1[-1:,-1])

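Listing 1.11 contains a NumPy array called arr1 followed by four print statements: the first displays all of arr1, and the remaining three display different subranges of values in arr1. Specifically, arr1[-1,:] is the last row, arr1[:,-1] is the last column, and arr1[-1:,-1] is a one-element array that contains the last element of the last row. The output from launching Listing 1.11 with Python 3 is here: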

arr1: [[ 1  2  3]
 [ 4  5  6]
 [ 7  8  9]
 [10 11 12]]
arr1[-1,:]: [10 11 12]
arr1[:,-1]: [ 3  6  9 12]
arr1[-1:,-1]: [12]
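To see how a -1 index behaves for a vector compared with a two-dimensional array, consider the following sketch. The vector vec is a hypothetical addition for comparison; arr reuses the values of arr1 from Listing 1.11.

```python
import numpy as np

vec = np.array([1, 2, 3, 4])
arr = np.array([(1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12)])

# For a vector, -1 selects the last element.
print(vec[-1])   # 4 (a single value)
print(vec[-1:])  # [4] (a one-element vector)

# For a 2D array, a single -1 selects the entire last row.
print(arr[-1])      # [10 11 12]
print(arr[-1, -1])  # 12 (last element of the last row)

# A slice such as -1: keeps the dimension that a plain -1 would drop.
print(arr[-1:, -1])       # [12]
print(arr[-1:, :].shape)  # (1, 3)
```

The difference between -1 and -1: matters when later code expects a particular number of dimensions, because the plain index removes a dimension and the slice preserves it.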

Other Useful NumPy Methods

In addition to the NumPy methods that you saw in the code samples prior to this section, the following (often intuitively named) NumPy methods are also very useful.

The method np.zeros() initializes an array with 0 values.

The method np.ones() initializes an array with 1 values.

The method np.empty() allocates an array without initializing its values (the contents are arbitrary until you assign them).

The method np.arange() provides a range of numbers.

The method np.shape() displays the shape of an object.

The method np.reshape() <= very useful!

The method np.linspace() <= useful in regression

The method np.mean() computes the mean of a set of numbers.

The method np.std() computes the standard deviation of a set of numbers.

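Although np.zeros() and np.empty() both create an array of the requested shape, only np.zeros() guarantees that every element is 0. The np.empty() method skips initialization, so it can be slightly faster, but you must assign every element before reading it. You could also use np.full(size, 0), but this method is typically slower than np.zeros().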
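The methods in the preceding list can be tried out with a short sketch such as the one below. The shapes and values are arbitrary choices for illustration; the array created by np.empty() is filled explicitly because its initial contents are not guaranteed.

```python
import numpy as np

zeros = np.zeros((2, 3))     # 2x3 array of 0.0
ones = np.ones((2, 3))       # 2x3 array of 1.0
scratch = np.empty((2, 3))   # contents are arbitrary until assigned
scratch[:] = 7               # fill every element before reading
filled = np.full((2, 3), 0)  # array with an explicit fill value

evens = np.arange(0, 10, 2)        # 0, 2, 4, 6, 8
grid = np.arange(6).reshape(2, 3)  # six numbers arranged as 2 rows x 3 columns
points = np.linspace(0, 1, 5)      # 0, 0.25, 0.5, 0.75, 1

print(np.shape(grid))  # (2, 3)
print(evens)
print(points)
print(np.mean(evens), np.std(evens))
```

Note that np.arange() takes a step size and excludes the end value, whereas np.linspace() takes the number of points and includes the end value by default.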

The reshape() method and the linspace() method are very useful for changing the dimensions of an array and generating a list of numeric values, respectively. The reshape() method appears in TensorFlow code, and the linspace() method is useful for generating a set of numbers in linear regression (discussed in Chapter 8). The mean() and std() methods are useful for calculating the mean and the standard deviation of a set of numbers. For example, you can use these two methods in order to rescale the values in a Gaussian distribution so that their mean is 0 and the standard deviation is 1. This process is called standardizing a Gaussian distribution.
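Standardization can be sketched in a few lines. The sample below is synthetic: the seed, the mean of 50, and the standard deviation of 10 are assumptions chosen only to make the example concrete.

```python
import numpy as np

# Synthetic Gaussian sample (seed, mean, and spread chosen arbitrarily).
rng = np.random.default_rng(42)
data = rng.normal(loc=50.0, scale=10.0, size=1000)

# Subtract the mean and divide by the standard deviation.
standardized = (data - np.mean(data)) / np.std(data)

print(np.isclose(np.mean(standardized), 0.0))  # True
print(np.isclose(np.std(standardized), 1.0))   # True
```

The comparison uses np.isclose() because floating point arithmetic leaves the mean a tiny distance from exactly 0.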