Unlock the potential of NLP and machine learning with this comprehensive guide. Learn advanced techniques, implement practical applications, and master tools like NumPy, Pandas, and transformer models.
This book is ideal for developers and data scientists who want to enhance their skills in NLP and machine learning. Basic knowledge of Python is recommended. Prior experience with data manipulation and machine learning concepts is beneficial.
Page count: 988
Publication year: 2024
Pocket Primer
Oswald Campesato
Copyright ©2021 by MERCURY LEARNING AND INFORMATION LLC. All rights reserved.
This publication, portions of it, or any accompanying software may not be reproduced in any way, stored in a retrieval system of any type, or transmitted by any means, media, electronic display or mechanical display, including, but not limited to, photocopy, recording, Internet postings, or scanning, without prior permission in writing from the publisher.
Publisher: David Pallai
MERCURY LEARNING AND INFORMATION
22841 Quicksilver Drive
Dulles, VA 20166
www.merclearning.com
800-232-0223
O. Campesato. Natural Language Processing and Machine Learning for Developers.
ISBN: 978-1-68392-618-4
The publisher recognizes and respects all marks used by companies, manufacturers, and developers as a means to distinguish their products. All brand names and product names mentioned in this book are trademarks or service marks of their respective companies. Any omission or misuse (of any kind) of service marks or trademarks, etc. is not an attempt to infringe on the property of others.
Library of Congress Control Number: 2021936681
21 22 23  3 2 1   Printed on acid-free paper in the United States of America.
Our titles are available for adoption, license, or bulk purchase by institutions, corporations, etc. For additional information, please contact the Customer Service Dept. at 800-232-0223 (toll free).
All of our titles are available in digital format at academiccourseware.com and other digital vendors. The sole obligation of MERCURY LEARNING AND INFORMATION to the purchaser is to replace the book, based on defective materials or faulty workmanship, but not based on the operation or functionality of the product.
I’d like to dedicate this book to my parents: may this bring joy and happiness into their lives.
Preface
Chapter 1 Introduction to NumPy
What is NumPy?
Useful NumPy Features
What are NumPy Arrays?
Working with Loops
Appending Elements to Arrays (1)
Appending Elements to Arrays (2)
Multiply Lists and Arrays
Doubling the Elements in a List
Lists and Exponents
Arrays and Exponents
Math Operations and Arrays
Working with “-1” Subranges with Vectors
Working with “-1” Subranges with Arrays
Other Useful NumPy Methods
Arrays and Vector Operations
NumPy and Dot Products (1)
NumPy and Dot Products (2)
NumPy and the “Norm” of Vectors
NumPy and Other Operations
NumPy and the reshape() Method
Calculating the Mean and Standard Deviation
Trimmed Mean and Weighted Mean
Code Sample with Mean and Standard Deviation
Working with Lines in the Plane (Optional)
Plotting a Line with NumPy and Matplotlib
Plotting a Quadratic with NumPy and Matplotlib
What is Linear Regression?
What is Multivariate Analysis?
What about Nonlinear Datasets?
The MSE Formula
Other Error Types
Nonlinear Least Squares
Calculating the MSE Manually
Find the Best-Fitting Line with NumPy
Calculating MSE by Successive Approximation (1)
Calculating MSE by Successive Approximation (2)
What is JAX?
Google Colaboratory
Uploading CSV Files in Google Colaboratory
Summary
Chapter 2 Introduction to Pandas
What is Pandas?
Pandas Options and Settings
Pandas Data Frames
Data Frames and Data Cleaning Tasks
Alternatives to Pandas
A Pandas Data Frame with NumPy Example
Describing a Pandas Data Frame
Pandas Boolean Data Frames
Transposing a Pandas Data Frame
Pandas Data Frames and Random Numbers
Reading CSV Files in Pandas
The loc() and iloc() Methods in Pandas
Converting Categorical Data to Numeric Data
Matching and Splitting Strings in Pandas
Converting Strings to Dates in Pandas
Merging and Splitting Columns in Pandas
Combining Pandas Data Frames
Data Manipulation with Pandas Data Frames (1)
Data Manipulation with Pandas Data Frames (2)
Data Manipulation with Pandas Data Frames (3)
Pandas Data Frames and CSV Files
Managing Columns in Data Frames
Switching Columns
Appending Columns
Deleting Columns
Inserting Columns
Scaling Numeric Columns
Managing Rows in Pandas
Selecting a Range of Rows in Pandas
Finding Duplicate Rows in Pandas
Inserting New Rows in Pandas
Handling Missing Data in Pandas
Multiple Types of Missing Values
Test for Numeric Values in a Column
Replacing NaN Values in Pandas
Sorting Data Frames in Pandas
Working with groupby() in Pandas
Working with apply() and applymap() in Pandas
Handling Outliers in Pandas
Pandas Data Frames and Scatterplots
Pandas Data Frames and Simple Statistics
Aggregate Operations in Pandas Data Frames
Aggregate Operations with the titanic.csv Dataset
Save Data Frames as CSV Files and Zip Files
Pandas Data Frames and Excel Spreadsheets
Working with JSON-based Data
Python Dictionary and JSON
Python, Pandas, and JSON
Pandas and Regular Expressions (Optional)
Useful One-Line Commands in Pandas
What is Method Chaining?
Pandas and Method Chaining
Pandas Profiling
What is Texthero?
Summary
Chapter 3 NLP Concepts (I)
The Origin of Languages
Language Fluency
Major Language Groups
Peak Usage of Some Languages
Languages and Regional Accents
Languages and Slang
Languages and Dialects
The Complexity of Natural Languages
Word Order in Sentences
What about Verbs?
Auxiliary Verbs
What are Case Endings?
Languages and Gender
Singular and Plural Forms of Nouns
Changes in Spelling of Words
Japanese Grammar
Japanese Postpositions (Particles)
Ambiguity in Japanese Sentences
Japanese Nominalization
Google Translate and Japanese
Japanese and Korean
Vowel-Optional Languages and Word Direction
Mutating Consonant Spelling
Expressing Negative Opinions
Phonetic Languages
Phonemes and Morphemes
English Words of Greek and Latin Origin
Multiple Ways to Pronounce Consonants
The Letter “j” in Various Languages
“Hard” versus “Soft” Consonant Sounds
“Ess,” “zee,” and “sh” Sounds
Three Consecutive Consonants
Diphthongs and Triphthongs in English
Semi-Vowels in English
Challenging English Sounds
English in Canada, UK, Australia, and the United States
English Pronouns and Prepositions
What is NLP?
The Evolution of NLP
A Wide-Angle View of NLP
NLP Applications and Use Cases
NLU and NLG
What is Text Classification?
Information Extraction and Retrieval
Word Sense Disambiguation
NLP Techniques in ML
NLP Steps for Training a Model
Text Normalization and Tokenization
Word Tokenization in Japanese
Text Tokenization with Unix Commands
Handling Stop Words
What is Stemming?
Singular versus Plural Word Endings
Common Stemmers
Stemmers and Word Prefixes
Overstemming and Understemming
What is Lemmatization?
Stemming/Lemmatization Caveats
Limitations of Stemming and Lemmatization
Working with Text: POS
POS Tagging
POS Tagging Techniques
Working with Text: NER
Abbreviations and Acronyms
NER Techniques
What is Topic Modeling?
Keyword Extraction, Sentiment Analysis, and Text Summarization
Summary
Chapter 4 NLP Concepts (II)
What is Word Relevance?
What is Text Similarity?
Sentence Similarity
Sentence Encoders
Working with Documents
Document Classification
Document Similarity (doc2vec)
Techniques for Text Similarity
Similarity Queries
What is Text Encoding?
Text Encoding Techniques
Document Vectorization
One-Hot Encoding (OHE)
Index-Based Encoding
Additional Encoders
The BoW Algorithm
What are n-grams?
Calculating Probabilities with N-grams
Calculating tf, idf, and tf-idf
What is Term Frequency (TF)?
What is Inverse Document Frequency (IDF)?
What is tf-idf?
Limitations of tf-idf
Pointwise Mutual Information (PMI)
The Context of Words in a Document
What is Semantic Context?
Textual Entailment
Discrete, Distributed, and Contextual Word Representations
What is Cosine Similarity?
Text Vectorization (aka Word Embeddings)
Overview of Word Embeddings and Algorithms
Word Embeddings
Word Embedding Algorithms
What is Word2vec?
The Intuition for Word2vec
The Word2vec Architecture
Limitations of Word2vec
The CBoW Architecture
What are Skip-grams?
Skip-gram Example
The Skip-gram Architecture
Neural Network Reduction
What is GloVe?
Working with GloVe
What is FastText?
Comparison of Word Embeddings
What is Topic Modeling?
Topic Modeling Algorithms
LDA and Topic Modeling
Text Classification versus Topic Modeling
Language Models and NLP
How to Create a Language Model
Vector Space Models
Term-Document Matrix
Tradeoffs of the VSM
NLP and Text Mining
Text Extraction Preprocessing and N-Grams
Relation Extraction and Information Extraction
What is a BLEU Score?
ROUGE Score: An Alternative to BLEU
Summary
Chapter 5 Algorithms and Toolkits (I)
Cleaning Data with Regular Expressions
Handling Contracted Words
Python Code Samples of BoW
One-Hot Encoding Examples
Sklearn and Word Embedding Examples
What is BeautifulSoup?
Web Scraping with Pure Regular Expressions
What is Scrapy?
What is SpaCy?
SpaCy and Stop Words
SpaCy and Tokenization
SpaCy and Lemmatization
SpaCy and NER
SpaCy Pipelines
SpaCy and Word Vectors
The scispaCy Library (Optional)
Summary
Chapter 6 Algorithms and Toolkits (II)
What is NLTK?
NLTK and BoW
NLTK and Stemmers
NLTK and Lemmatization
NLTK and Stop Words
What is WordNet?
Synonyms and Antonyms
NLTK, lxml, and XPath
NLTK and n-grams
NLTK and POS (1)
NLTK and POS (2)
NLTK and Tokenizers
NLTK and Context-Free Grammars (Optional)
What is Gensim?
Gensim and tf-idf Example
Saving a Word2vec Model in Gensim
An Example of Topic Modeling
A Brief Comparison of Popular Python-Based NLP Libraries
Miscellaneous Libraries
Summary
Chapter 7 Introduction to Machine Learning
What is Machine Learning?
Learning Style of Machine Learning Algorithms
Types of Machine Learning Algorithms
Machine Learning Tasks
Preparing a Dataset and Training a Model
Feature Engineering, Selection, and Extraction
Feature Engineering
Feature Selection
Feature Extraction
Model Selection
Working with Datasets
Training Data versus Test Data
What is Cross-Validation?
Overfitting versus Underfitting
What is Regularization?
ML and Feature Scaling
Data Normalization Techniques
Metrics in Machine Learning
R-Squared and its Limitations
Confusion Matrix
Precision, Recall, and Specificity
The ROC Curve and AUC
Metrics for Model Evaluation and Selection
What is Linear Regression?
Linear Regression versus Curve-Fitting
When are Solutions Exact Values?
What is Multivariate Analysis?
Other Types of Regression
Working with Lines in the Plane (Optional)
Scatter Plots with NumPy and Matplotlib (1)
Why the “Perturbation Technique” is Useful
Scatter Plots with NumPy and Matplotlib (2)
A Quadratic Scatterplot with NumPy and Matplotlib
The Mean Squared Error (MSE) Formula
A List of Error Types
Nonlinear Least Squares
Calculating the MSE Manually
Approximating Linear Data with np.linspace()
What are Ensemble Methods?
Four Types of Ensemble Methods
Bagging
Boosting
Stacked Models and Blending Models
What is Bootstrapping?
Common Boosting Algorithms
Hyperparameter Optimization
Grid Search
Randomized Search
Bayesian Optimization
AutoML, AutoML-Zero, and AutoNLP
Miscellaneous Topics
What is Causality?
What is Explainability?
What is Interpretability?
Summary
Chapter 8 Classifiers in Machine Learning
What is Classification?
What are Classifiers?
Common Classifiers
Binary versus Multiclass Classification
Multilabel Classification
What are Linear Classifiers?
What is kNN?
How to Handle a Tie in kNN
SMOTE and kNN
kNN for Data Imputation
What are Decision Trees?
Trade-offs with Decision Trees
Decision Tree Algorithms
Decision Tree Code Samples
Decision Trees, Gini Impurity, and Entropy
What are Random Forests?
What are Support Vector Machines?
Trade-offs of SVMs
What is a Bayesian Classifier?
Types of Naïve Bayes Classifiers
Training Classifiers
Evaluating Classifiers
Trade-offs for ML Algorithms
What are Activation Functions?
Why Do we Need Activation Functions?
How Do Activation Functions Work?
Common Activation Functions
Activation Functions in Python
Keras Activation Functions
The ReLU and ELU Activation Functions
The Advantages and Disadvantages of ReLU
ELU
Sigmoid, Softmax, and Hardmax Similarities
Softmax
Softplus
Tanh
Sigmoid, Softmax, and Hardmax Differences
Hyperparameters for Neural Networks
The Loss Function Hyperparameter
The Optimizer Hyperparameter
The Learning Rate Hyperparameter
The Dropout Rate Hyperparameter
What is Backward Error Propagation?
What is Logistic Regression?
Setting a Threshold Value
Logistic Regression: Important Assumptions
Linearly Separable Data
Keras, Logistic Regression, and Iris Dataset
Sklearn and Linear Regression
SciPy and Linear Regression
Keras and Linear Regression
Summary
Chapter 9 NLP Applications
What is Text Summarization?
Extractive Text Summarization
Abstractive Text Summarization
Text Summarization with gensim and SpaCy
What are Recommender Systems?
Movie Recommender Systems
Factoring the Rating Matrix R
Content-Based Recommendation Systems
Analyzing only the Description of the Content
Building User Profiles and Item Profiles
Collaborative Filtering Algorithm
User–User Collaborative Filtering
Item–Item Collaborative Filtering
Recommender System with Surprise
Recommender Systems and Reinforcement Learning (Optional)
Basic Reinforcement Learning in Five Minutes
What is RecSim?
What is Sentiment Analysis?
Useful Tools for Sentiment Analysis
Aspect-Based Sentiment Analysis
Deep Learning and Sentiment Analysis
Sentiment Analysis with Naïve Bayes
Sentiment Analysis in NLTK and VADER
Sentiment Analysis with Textblob
Sentiment Analysis with Flair
Detecting Spam
Logistic Regression and Sentiment Analysis
Working with COVID-19
What are Chatbots?
Open Domain Chatbots
Chatbot Types
Logic Flow of Chatbots
Chatbot Abuses
Useful Links
Summary
Chapter 10 NLP and TF2/Keras
Term-Document Matrix
Text Classification Algorithms in Machine Learning
A Keras-Based Tokenizer
TF2 and Tokenization
TF2 and Encoding
A Keras-Based Word Embedding
An Example of BoW with TF2
The 20newsgroup Dataset
Text Classification with the kNN Algorithm
Text Classification with a Decision Tree Algorithm
Text Classification with a Random Forest Algorithm
Text Classification with the SVC Algorithm
Text Classification with the Naïve Bayes Algorithm
Text Classification with the kMeans Algorithm
TF2/Keras and Word Tokenization
TF2/Keras and Word Encodings
Text Summarization with TF2/Keras and Reuters Dataset
Summary
Chapter 11 Transformer, BERT, and GPT
What is Attention?
Types of Word Embeddings
Types of Attention and Algorithms
An Overview of the Transformer Architecture
The Transformers Library from HuggingFace
Transformer and NER Tasks
Transformer and QnA Tasks
Transformer and Sentiment Analysis Tasks
Transformer and Mask Filling Tasks
What is T5?
What is BERT?
BERT Features
How is BERT Trained?
How BERT Differs from Earlier NLP Techniques
The Inner Workings of BERT
What is MLM?
What is NSP?
Special Tokens
BERT Encoding: Sequence of Steps
Subword Tokenization
Sentence Similarity in BERT
Word Context in BERT
Generating BERT Tokens (1)
Generating BERT Tokens (2)
The BERT Family
Surpassing Human Accuracy: deBERTa
What is Google Smith?
Introduction to GPT
Installing the Transformers Package
Working with GPT-2
What is GPT-3?
What is the Goal?
GPT-3 Task Strengths and Mistakes
GPT-3 Architecture
GPT versus BERT
Zero-Shot, One-Shot, and Few-Shot Learners
GPT Task Performance
The Switch Transformer: One Trillion Parameters
Looking Ahead
Summary
Appendix A Data and Statistics
What are Datasets?
Data Preprocessing
Data Types
Preparing Datasets
Continuous versus Discrete Data
“Binning” Continuous Data
Scaling Numeric Data via Normalization
Scaling Numeric Data via Standardization
What to Look for in Categorical Data
Mapping Categorical Data to Numeric Values
Working with Dates
Working with Currency
Missing Data, Anomalies, and Outliers
Anomalies and Outliers
Outlier Detection
Missing Data: MCAR, MAR, and MNAR
What is Data Drift?
What is Imbalanced Classification?
Undersampling and Oversampling
Limitations of Resampling
What is SMOTE?
SMOTE Extensions
Analyzing Classifiers
What is LIME?
What is ANOVA?
What is a Probability?
Calculating the Expected Value
Random Variables
Discrete versus Continuous Random Variables
Well-Known Probability Distributions
Fundamental Concepts in Statistics
The Mean
The Median
The Mode
The Variance and Standard Deviation
Population, Sample, and Population Variance
Chebyshev’s Inequality
What is a p-Value?
The Moments of a Function (Optional)
Skewness
Kurtosis
Data and Statistics
The Central Limit Theorem
Correlation versus Causation
Statistical Inferences
The Bias-Variance Trade-off
Types of Bias in Data
Gini Impurity, Entropy, and Perplexity
What is Gini Impurity?
What is Entropy?
Calculating Gini Impurity and Entropy Values
Multidimensional Gini Index
What is Perplexity?
Cross-Entropy and KL Divergence
What is Cross Entropy?
What is KL Divergence?
What’s their Purpose?
Covariance and Correlation Matrices
Covariance Matrix
Covariance Matrix: An Example
Correlation Matrix
Eigenvalues and Eigenvectors
Calculating Eigenvectors: A Simple Example
Gauss Jordan Elimination (Optional)
Principal Component Analysis (PCA)
The New Matrix of Eigenvectors
Dimensionality Reduction
Dimensionality Reduction Techniques
The Curse of Dimensionality
What are Manifolds (Optional)?
Singular Value Decomposition (SVD)
Locally Linear Embedding (LLE)
UMAP
t-SNE (“tee-snee”)
PHATE
Linear Versus Nonlinear Reduction Techniques
Types of Distance Metrics
Other Well-Known Distance Metrics
Pearson Correlation Coefficient
Jaccard Index (or Similarity)
Locality-Sensitive Hashing (Optional)
What is Sklearn?
Sklearn, Pandas, and the IRIS Dataset
Sklearn and Outlier Detection
What is Bayesian Inference?
Bayes Theorem
Some Bayesian Terminology
What is MAP?
Why Use Bayes Theorem?
What are Vector Spaces?
Summary
Appendix B Introduction to Python
Tools for Python
easy_install and pip
virtualenv
IPython
Python Installation
Setting the PATH Environment Variable (Windows Only)
Launching Python on Your Machine
The Python Interactive Interpreter
Python Identifiers
Lines, Indentation, and Multilines
Quotation and Comments in Python
Saving Your Code in a Module
Some Standard Modules in Python
The help() and dir() Functions
Compile Time and Runtime Code Checking
Simple Data Types in Python
Working with Numbers
Working with Other Bases
The chr() Function
The round() Function in Python
Formatting Numbers in Python
Working with Fractions
Unicode and UTF-8
Working with Unicode
Working with Strings
Comparing Strings
Formatting Strings in Python
Uninitialized Variables and the Value None in Python
Slicing and Splicing Strings
Testing for Digits and Alphabetic Characters
Search and Replace a String in Other Strings
Remove Leading and Trailing Characters
Printing Text without NewLine Characters
Text Alignment
Working with Dates
Converting Strings to Dates
Exception Handling in Python
Handling User Input
Python and Emojis (Optional)
Command-Line Arguments
Summary
Appendix C Introduction to Regular Expressions
What are Regular Expressions?
Metacharacters in Python
Character Sets in Python
Working with “^” and “\”
Character Classes in Python
Matching Character Classes with the re Module
Using the re.match() Method
Options for the re.match() Method
Matching Character Classes with the re.search() Method
Matching Character Classes with the findall() Method
Finding Capitalized Words in a String
Additional Matching Function for Regular Expressions
Grouping with Character Classes in Regular Expressions
Using Character Classes in Regular Expressions
Matching Strings with Multiple Consecutive Digits
Reversing Words in Strings
Modifying Text Strings with the re Module
Splitting Text Strings with the re.split() Method
Splitting Text Strings Using Digits and Delimiters
Substituting Text Strings with the re.sub() Method
Matching the Beginning and the End of Text Strings
Compilation Flags
Compound Regular Expressions
Counting Character Types in a String
Regular Expressions and Grouping
Simple String Matches
Additional Topics for Regular Expressions
Summary
Exercises
Appendix D Introduction to Keras
What is Keras?
Working with Keras Namespaces in TF 2
Working with the tf.keras.layers Namespace
Working with the tf.keras.activations Namespace
Working with the tf.keras.datasets Namespace
Working with the tf.keras.experimental Namespace
Working with Other tf.keras Namespaces
TF 2 Keras versus “Standalone” Keras
Creating a Keras-Based Model
Keras and Linear Regression
Keras, MLPs, and MNIST
Keras, CNNs, and cifar10
Resizing Images in Keras
Keras and Early Stopping (1)
Keras and Early Stopping (2)
Keras and Metrics
Saving and Restoring Keras Models
Summary
Appendix E Introduction to TensorFlow 2
What is TF 2?
TF 2 Use Cases
TF 2 Architecture: The Short Version
TF 2 Installation
TF 2 and the Python REPL
Other TF 2-Based Toolkits
TF 2 Eager Execution
TF 2 Tensors, Data Types, and Primitive Types
TF 2 Data Types
TF 2 Primitive Types
Constants in TF 2
Variables in TF 2
The tf.rank() API
The tf.shape() API
Variables in TF 2 (Revisited)
TF 2 Variables versus Tensors
What is @tf.function in TF 2?
How Does @tf.function Work?
A Caveat about @tf.function in TF 2
The tf.print() Function and Standard Error
Working with @tf.function in TF 2
An Example without @tf.function
An Example with @tf.function
Overloading Functions with @tf.function
What is AutoGraph in TF 2?
Arithmetic Operations in TF 2
Caveats for Arithmetic Operations in TF 2
TF 2 and Built-In Functions
Calculating Trigonometric Values in TF 2
Calculating Exponential Values in TF 2
Working with Strings in TF 2
Working with Tensors and Operations in TF 2
Second-Order Tensors in TF 2 (1)
Second-Order Tensors in TF 2 (2)
Multiplying Two Second-Order Tensors in TF
Convert Python Arrays to TF Tensors
Conflicting Types in TF 2
Differentiation and tf.GradientTape in TF 2
Examples of tf.GradientTape
Using Nested Loops with tf.GradientTape
Other Tensors with tf.GradientTape
A Persistent Gradient Tape
What is Trax?
Google Colaboratory
Other Cloud Platforms
GCP SDK
TF2 and tf.data.Dataset
The TF 2 tf.data.Dataset
Creating a Pipeline
A Simple TF 2 tf.data.Dataset
What are Lambda Expressions?
Working with Generators in TF 2
Summary
Appendix F Data Visualization
What is Data Visualization?
Types of Data Visualization
What is Matplotlib?
Horizontal Lines in Matplotlib
Slanted Lines in Matplotlib
Parallel Slanted Lines in Matplotlib
A Grid of Points in Matplotlib
A Dotted Grid in Matplotlib
Lines in a Grid in Matplotlib
A Colored Grid in Matplotlib
A Colored Square in an Unlabeled Grid in Matplotlib
Randomized Data Points in Matplotlib
A Histogram in Matplotlib
A Set of Line Segments in Matplotlib
Plotting Multiple Lines in Matplotlib
Trigonometric Functions in Matplotlib
Display IQ Scores in Matplotlib
Plot a Best-Fitting Line in Matplotlib
Introduction to Sklearn (scikit-learn)
The Digits Dataset in Sklearn
The Iris Dataset in Sklearn
Sklearn, Pandas, and the Iris Dataset
The Iris Dataset in Sklearn (Optional)
The faces Dataset in Sklearn (Optional)
Working with Seaborn
Features of Seaborn
Seaborn Built-in Datasets
The Iris Dataset in Seaborn
The Titanic Dataset in Seaborn
Extracting Data from the Titanic Dataset in Seaborn (1)
Extracting Data from the Titanic Dataset in Seaborn (2)
Visualizing a Pandas Dataset in Seaborn
Data Visualization in Pandas
Summary
Index
This book contains a fast-paced introduction to as much relevant information about NLP and machine learning as can reasonably be included in a book of this size. Some chapters contain topics that are discussed in great detail (such as the first half of Chapter 3), and other chapters contain advanced statistical concepts that you can safely omit during your first pass through this book. The book casts a wide net to help developers who have a range of technical backgrounds, which is the rationale for the inclusion of numerous topics. Regardless of your background, please keep in mind the following point: you will not become an expert in machine learning or NLP simply by reading this book, so be prepared to read some of its content multiple times.
However, you will be exposed to many NLP and machine learning topics, some of which are presented in a cursory manner, for two reasons. First, it’s important that you be exposed to these concepts. In some cases, you will find topics that pique your interest and motivate you to learn more about them through self-study; in other cases, a brief introduction will probably suffice.
Second, a full treatment of all the topics covered in this book would probably triple its size, and few people are interested in reading 1,000-page technical books. Consequently, the book provides a broad view of the NLP and machine learning landscape, based on the belief that this approach will be more beneficial for readers who are already experienced developers but need to learn about NLP and machine learning.
The book is intended primarily for people who have a solid background as software developers. Specifically, it is for developers who are accustomed to searching online for more detailed information about technical topics. If you are a beginner, there are other books that are more suitable for you, and you can find them by performing an online search.
The book is also intended to reach an international audience of readers with highly diverse backgrounds in various age groups. Many readers can read English even though it is not their native spoken language (English could be their second, third, or even fourth language). Consequently, this book uses standard English rather than colloquial expressions that might be confusing to those readers. In addition, people learn in different ways, including reading, writing, and hearing new material, and this book takes these points into consideration in order to provide a comfortable and meaningful learning experience for its intended readers.
As mentioned previously, this book is intended for developers who want to learn NLP concepts and machine learning. Since this encompasses people with vastly different technical backgrounds, there are readers who “don’t know what they don’t know” regarding NLP. Therefore, the book exposes people to a plethora of NLP-related concepts, after which they can decide which topics to select for greater study. Consequently, this book does not take a “zero-to-hero” approach, nor is it necessary to master all the topics that are discussed in the chapters and the appendices; rather, the chapters and appendices are a go-to source of information to help you decide where you want to invest your time and effort.
As you might already know, learning often takes place through an iterative and repetitive approach whereby the cumulative exposure leads to a greater level of comfort and understanding of technical concepts. For some readers, this will be the first step in their journey toward mastering NLP and machine learning.
Please read the document ChapterOutline.doc that provides the rationale for each chapter, as well as the sequence in which you can read the chapters in this book.
Most of this book is organized as paired chapters: the first two chapters contain introductory material for NumPy and Pandas, followed by a pair of chapters that contain NLP concepts, and then another pair of chapters that contain Python code samples that illustrate the NLP concepts.
The next pair of chapters introduces machine learning concepts and algorithms (such as decision trees, random forests, and SVMs), followed by Chapter 9, which explores sentiment analysis, recommender systems, COVID-19 analysis, spam detection, and a short discussion of chatbots. Chapter 10 contains examples of performing NLP tasks using TF2 and Keras, and Chapter 11 presents the Transformer architecture, BERT-based models, and the GPT family of models, all of which were developed during the past three years and are, to varying degrees, considered SOTA (“state of the art”).
The appendices contain introductory material (including Python code samples) for various topics, including Python 3, regular expressions, Keras, TF2, Matplotlib, and Seaborn. Appendix A (the most extensive in terms of page count) contains myriad topics, such as working with datasets that contain different types of data, handling missing data, statistical concepts, how to handle imbalanced features (SMOTE), how to analyze classifiers, covariance and correlation matrices, dimensionality reduction (including SVD and t-SNE), and a section that discusses Gini impurity, entropy, and KL divergence.
This book is for developers who are looking for an introduction to NLP, along with an introduction to machine learning. If you peruse the table of contents, you will see that this book covers a vast assortment of topics and weighs in at around 600 pages. Books have a “tipping point” in terms of page count: few people have the time to read 1,000-page books on technical topics, especially when the field is undergoing continual innovation.
With the preceding points in mind, an extensive section on deep learning is beyond the scope of an introductory book, and is better suited to a book called “Deep Learning and NLP” (or some similar title).
Most of the code samples are short (usually less than one page and sometimes less than half a page), and if need be, you can easily and quickly copy/paste the code into a new Jupyter notebook.
The machine learning code samples that perform more time-consuming computations are available as Python scripts as well as Jupyter notebooks. For the Python code samples that reference a CSV file, you do not need any additional code in the corresponding Jupyter notebook to access the CSV file. Moreover, the code samples execute quickly, so you won’t need to avail yourself of the free GPU that is provided in Google Colaboratory.
If you do decide to use Google Colaboratory, you can easily copy/paste the Python code into a notebook, and also use the upload feature to upload existing Jupyter notebooks. Keep in mind the following point: if the Python code references a CSV file, make sure that you include the appropriate code snippet (as explained in Chapter 1) to access the CSV file in the corresponding Jupyter notebook in Google Colaboratory.
Some exposure to Keras is helpful, and you can read Appendix D if Keras is new to you. In addition, one of the appendices provides an introduction to TensorFlow 2. Please keep in mind that Keras is well-integrated into TensorFlow 2 (in the tf.keras namespace), and it provides a layer of abstraction over “pure” TensorFlow that will enable you to develop prototypes more quickly.
Once again, the answer depends on the extent to which you plan to become involved in NLP and machine learning. In addition to creating a model, you will use various algorithms to see which ones provide the level of accuracy (or some other metric) that you need for your project. If you fall short, the theoretical aspects of machine learning can help you perform a “forensic” analysis of your model and your data, and ideally assist in determining how to improve your model.
The code samples in this book were created and tested using Python 3 and the Keras that’s built into TensorFlow 2, on a MacBook Pro with OS X 10.12.6 (macOS Sierra). Regarding their content: the code samples are derived primarily from material that the author created for his Deep Learning and Keras graduate course. In some cases, code samples incorporate short sections of code from discussions in online forums. The key point to remember is that the code samples follow the “Four Cs”: they must be Clear, Concise, Complete, and Correct to the extent that it’s possible to do so, given the size of this book.
Some programmers learn well from prose, others learn well from sample code (and lots of it), which means that there’s no single style that can be used for everyone.
Moreover, some programmers want to run the code first, see what it does, and then return to the code to delve into the details (and others use the opposite approach).
Consequently, there are various types of code samples in this book: some are short, some are long, and other code samples “build” from earlier code samples.
Current knowledge of Python 3.x is the most helpful skill. Knowledge of other programming languages (such as Java) can also be helpful because of the exposure to programming concepts and constructs. The less technical knowledge that you have, the more diligence will be required in order to understand the various topics that are covered.
If you want to be sure that you can grasp the material in this book, glance through some of the code samples to get an idea of how much is familiar to you and how much is new for you.
The companion files contain all the code samples, sparing you the time, effort, and error-prone process of manually typing code into a text file. In addition, there are situations in which you might not have easy access to these files. Furthermore, the code samples in the book are accompanied by explanations that are not available in the companion files.
The companion files are available for downloading by writing to the publisher at [email protected].
The primary purpose of the code samples is to show you Python-based libraries for solving a variety of NLP-related tasks in conjunction with machine learning. Clarity has higher priority than writing more compact code that is more difficult to understand (and possibly more prone to bugs). If you decide to use any of the code in a production website, you ought to subject that code to the same rigorous analysis as the other parts of your code base.
Although the answer to this question is more difficult to quantify, it’s especially important to have a strong desire to learn about machine learning, along with the motivation and discipline to read and understand the code samples.
Even simple machine learning APIs can be a challenge the first time you encounter them, so be prepared to read the code samples several times.
If you are a Mac user, there are three ways to do so. The first method is to use Finder to navigate to Applications > Utilities and then double-click on the Terminal application. Next, if you already have a command shell available, you can launch a new command shell by typing the following command:
A second method for Mac users is to open a new command shell from a command shell that is already visible: simply press Command+N in that command shell, and your Mac will launch another command shell.
If you are a PC user, you can install Cygwin (open source, https://cygwin.com/), which simulates bash commands, or use another toolkit such as MKS (a commercial product). Please read the online documentation that describes the download and installation process. Note that custom aliases are not automatically set if they are defined in a file other than the main start-up file (such as .bash_login).
All the code samples and figures in this book may be obtained by writing to the publisher at [email protected].
This book contains several appendices that are portions from the following books that are also published by Mercury Learning and Information:
Python Pocket Primer: 9781938549854
Regular Expressions Pocket Primer: 9781683922278
Data Cleaning Pocket Primer: 9781683922179
The answer to this question varies widely, mainly because the answer depends heavily on your objectives. If you are interested primarily in NLP, then you can learn more advanced concepts, such as attention, transformers, and the BERT-related models.
If you are primarily interested in machine learning, there are some subfields of machine learning, such as deep learning and reinforcement learning (and deep reinforcement learning) that might appeal to you. Fortunately, there are many resources available, and you can perform an Internet search for those resources. One other point: the aspects of machine learning for you to learn depend on who you are: the needs of a machine learning engineer, data scientist, manager, student, or software developer are all different.
Oswald Campesato
April 2021
This chapter provides a quick introduction to the Python NumPy package, which provides very useful functionality, not only for Python scripts, but also for Python-based scripts with TensorFlow. This chapter contains NumPy code samples with loops, arrays, and lists. You will also learn about dot products, the reshape() method (very useful!), how to plot with Matplotlib (discussed in Appendix F), and examples of linear regression.
The first part of this chapter briefly introduces NumPy and some of its useful features, and contains examples of working with arrays in NumPy that contrast some of the APIs for lists with the same APIs for arrays. In addition, you will see how easy it is to compute the exponent-related values (square, cube, and so forth) of elements in an array.
The second part of the chapter introduces subranges, which are very useful (and frequently used) for extracting portions of datasets in machine learning tasks. In particular, you will see code samples that handle negative (-1) subranges for vectors as well as for arrays, because they are interpreted one way for vectors and a different way for arrays.
The third part of this chapter delves into other NumPy methods, including the reshape() method, which is extremely useful (and very common) when working with image files: some TensorFlow APIs require converting a 2D array of (R,G,B) values into a corresponding one-dimensional vector.
The fourth part of this chapter delves into linear regression, the mean squared error (MSE), and how to calculate MSE with the NumPy linspace() API.
NumPy is a Python module that provides many convenience methods as well as better performance than standard Python sequences. NumPy provides a core library for scientific computing in Python, with performant multidimensional arrays and good vectorized math functions, along with support for linear algebra and random numbers.
NumPy is modeled after MATLAB, with support for lists, arrays, and so forth. NumPy is easier to use than MATLAB, and it’s very common in TensorFlow 2.x code as well as Python code. Moreover, Chapter 2 contains code samples that combine NumPy with Pandas.
The NumPy package provides the ndarray object that encapsulates multidimensional arrays of homogeneous data types. Many ndarray operations are performed in compiled code in order to improve performance.
NumPy arrays have the following properties:
They have a fixed size
Elements have the same data type
Elements have the same size (except for objects)
Modifying an array involves creating a new array
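As a minimal sketch of these properties (this snippet is not one of the book’s listings), consider the following:

import numpy as np

arr = np.array([1, 2, 3, 4])
print(arr.dtype)          # one data type shared by every element (e.g., int64)
arr[0] = 3.7              # assigning a float into an integer array truncates it to 3
print(arr)                # [3 2 3 4]

arr2 = np.append(arr, 5)  # np.append() allocates and returns a new array...
print(arr2)               # [3 2 3 4 5]
print(len(arr))           # ...while the original array keeps its fixed size of 4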
Now that you have a general idea about NumPy, let’s delve into some examples that illustrate how to work with NumPy arrays, which is the topic of the next section.
Listing 1.11 displays the contents of np2darray2.py that illustrates how to select different ranges of elements in a two-dimensional NumPy array.
LISTING 1.11: np2darray2.py
Listing 1.11 contains a NumPy array called arr1, followed by four print statements, each of which displays a different subrange of values in arr1.
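A minimal sketch in the same spirit, assuming a hypothetical 3x3 array (the values in the book’s listing may differ), looks like this:

import numpy as np

arr1 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

print(arr1[0, :])    # the first row:    [1 2 3]
print(arr1[:, 0])    # the first column: [1 4 7]
print(arr1[1:, :])   # every row after row 0
print(arr1[:2, 1:])  # rows 0-1, columns 1-2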
In addition to the NumPy methods that you saw in the code samples prior to this section, the following (often intuitively named) NumPy methods are also very useful.
The method np.zeros() initializes an array with 0 values.
The method np.ones() initializes an array with 1 values.
The method np.empty() allocates an array without initializing its elements (their initial contents are arbitrary).
The method np.arange() provides a range of numbers.
The method np.shape() returns the shape of an object.
The method np.reshape() changes the dimensions of an array (very useful!).
The method np.linspace() generates evenly spaced numbers (useful in regression).
The method np.mean() computes the mean of a set of numbers.
The method np.std() computes the standard deviation of a set of numbers.
Although np.zeros() and np.empty() both allocate a 2D array, np.empty() does not set the array’s elements to 0 (their contents are arbitrary), which is why it typically requires less execution time than np.zeros(). You could also use np.full(size, 0), but this method is the slowest of the three.
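The following minimal sketch (not one of the book’s listings) exercises the allocation- and shape-related methods listed above:

import numpy as np

z = np.zeros((2, 3))     # a 2x3 array of 0 values
o = np.ones((2, 3))      # a 2x3 array of 1 values
e = np.empty((2, 3))     # allocated but uninitialized: its contents are arbitrary

r = np.arange(12)        # the integers 0 through 11
print(np.shape(r))       # (12,)
m = r.reshape(3, 4)      # the same 12 values viewed as a 3x4 array
print(m)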
The reshape() method and the linspace() method are very useful for changing the dimensions of an array and for generating a list of numeric values, respectively. The reshape() method appears in TensorFlow code, and the linspace() method is useful for generating a set of numbers in linear regression (discussed in Chapter 8). The mean() and std() methods are useful for calculating the mean and the standard deviation of a set of numbers. For example, you can use these two methods to rescale the values in a Gaussian distribution so that their mean is 0 and their standard deviation is 1, a process called standardizing a Gaussian distribution.
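A minimal sketch of that standardization step, using a synthetic Gaussian sample (the seed and distribution parameters are arbitrary choices for illustration):

import numpy as np

np.random.seed(0)
data = np.random.normal(loc=5.0, scale=2.0, size=1000)  # synthetic Gaussian: mean ~5, std ~2

standardized = (data - np.mean(data)) / np.std(data)    # shift to mean 0, rescale to std 1
print(np.mean(standardized))   # approximately 0.0
print(np.std(standardized))    # 1.0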