Natural Language Processing using R Pocket Primer - Oswald Campesato - E-Book


Description

This book is for developers seeking an overview of basic concepts in Natural Language Processing (NLP). It caters to a technical audience, offering numerous code samples and listings to illustrate the wide range of topics covered. The journey begins with managing data relevant to NLP, followed by two chapters on fundamental NLP concepts. This foundation is reinforced with Python code samples that bring these concepts to life.
The book then delves into practical NLP applications, such as sentiment analysis, recommender systems, COVID-19 analysis, spam detection, and chatbots. These examples provide real-world context and demonstrate how NLP techniques can be applied to solve common problems. The final chapter introduces advanced topics, including the Transformer architecture, BERT-based models, and the GPT family, highlighting the latest state-of-the-art developments in the field.
Appendices offer additional resources, including Python code samples on regular expressions and probability/statistical concepts, ensuring a well-rounded understanding. Companion files with source code and figures enhance the learning experience, making this book a comprehensive guide for mastering NLP techniques and applications.

You can read the e-book in Legimi apps or in any app that supports the following formats:

EPUB
MOBI

Page count: 359

Publication year: 2024




NATURAL LANGUAGE PROCESSING USING R

Pocket Primer

LICENSE, DISCLAIMER OF LIABILITY, AND LIMITED WARRANTY

By purchasing or using this book and companion files (the “Work”), you agree that this license grants permission to use the contents contained herein, including the disc, but does not give you the right of ownership to any of the textual content in the book/disc or ownership to any of the information or products contained in it. This license does not permit uploading of the Work onto the Internet or on a network (of any kind) without the written consent of the Publisher. Duplication or dissemination of any text, code, simulations, images, etc. contained herein is limited to and subject to licensing terms for the respective products, and permission must be obtained from the Publisher or the owner of the content, etc., in order to reproduce or network any portion of the textual material (in any media) that is contained in the Work.

MERCURY LEARNING AND INFORMATION (“MLI” or “the Publisher”) and anyone involved in the creation, writing, or production of the companion disc, accompanying algorithms, code, or computer programs (“the software”), and any accompanying Web site or software of the Work, cannot and do not warrant the performance or results that might be obtained by using the contents of the Work. The author, developers, and the Publisher have used their best efforts to ensure the accuracy and functionality of the textual material and/or programs contained in this package; we, however, make no warranty of any kind, express or implied, regarding the performance of these contents or programs. The Work is sold “as is” without warranty (except for defective materials used in manufacturing the book or due to faulty workmanship).

The author, developers, and the publisher of any accompanying content, and anyone involved in the composition, production, and manufacturing of this work will not be liable for damages of any kind arising out of the use of (or the inability to use) the algorithms, source code, computer programs, or textual material contained in this publication. This includes, but is not limited to, loss of revenue or profit, or other incidental, physical, or consequential damages arising out of the use of this Work.

The sole remedy in the event of a claim of any kind is expressly limited to replacement of the book and/or disc, and only at the discretion of the Publisher. The use of “implied warranty” and certain “exclusions” vary from state to state, and might not apply to the purchaser of this product.

Companion files for this title are available by writing to the publisher at info@merclearning.com.

NATURAL LANGUAGE PROCESSING USING R

Pocket Primer

Oswald Campesato

MERCURY LEARNING AND INFORMATION

Dulles, Virginia

Boston, Massachusetts

New Delhi

Copyright ©2022 by MERCURY LEARNING AND INFORMATION LLC. All rights reserved.

This publication, portions of it, or any accompanying software may not be reproduced in any way, stored in a retrieval system of any type, or transmitted by any means, media, electronic display or mechanical display, including, but not limited to, photocopy, recording, Internet postings, or scanning, without prior permission in writing from the publisher.

Publisher: David Pallai

MERCURY LEARNING AND INFORMATION

22841 Quicksilver Drive

Dulles, VA 20166

info@merclearning.com

www.merclearning.com

800-232-0223

O. Campesato. Natural Language Processing Using R Pocket Primer.

ISBN: 978-1-68392-730-3

The publisher recognizes and respects all marks used by companies, manufacturers, and developers as a means to distinguish their products. All brand names and product names mentioned in this book are trademarks or service marks of their respective companies. Any omission or misuse (of any kind) of service marks or trademarks, etc. is not an attempt to infringe on the property of others.

Library of Congress Control Number: 2021950959

22 23 24 3 2 1

This book is printed on acid-free paper in the United States of America.

Our titles are available for adoption, license, or bulk purchase by institutions, corporations, etc. For additional information, please contact the Customer Service Dept. at 800-232-0223 (toll free).

All of our titles are available in digital format at academiccourseware.com and other digital vendors. Companion files (figures and code listings) for this title are available by contacting info@merclearning.com. The sole obligation of MERCURY LEARNING AND INFORMATION to the purchaser is to replace the disc, based on defective materials or faulty workmanship, but not based on the operation or functionality of the product.

I’d like to dedicate this book to my parents – may this bring joy and happiness into their lives.

CONTENTS

Preface

Chapter 1: Introduction to R

What is R?

Features of R

Installing R, RStudio, and RStudio Cloud

Variable Names, Operators, and Data Types in R

Assigning Values to Variables in R

Operators in R

Data Types in R

Working with Strings in R

Uppercase and Lowercase Strings

Other String-Related Tasks

Working with Vectors in R

Finding NULL Values in a Vector in R

Updating NA Values in a Vector in R

Sorting a Vector of Elements in R

Working with the Built-in Letters Variable in R

Working with Lists in R

Useful Vector-Related Functions in R

Working with Matrices in R (1)

Working with Matrices in R (2)

Working with Matrices in R (3)

Working with Matrices in R (3)

Working with Matrices in R (4)

Updating Matrix Elements

Logical Constraints and Matrices

Assigning Values to Matrix Elements

Working with Matrices in R (5)

Working with Dates in R

The seq Function in R

Summary

Chapter 2: Loops, Conditional Logic, and Dataframes

Working with Simple Loops in R

Working with Other Types of Loops in R

Working with Nested Loops in R

Working with While Loops in R

Working with Conditional Logic in R

Compound Conditional Logic

Check if a Number is Prime in R

Check if Numbers in an Array are Prime in R

Check for Leap Years in R

Well-formed Triangle Values in R

What are Factors in R?

What are Data Frames in R?

Working with Dataframes in R (1)

Working with Data Frames in R (2)

Working with Data Frames in R (3)

Working with Data Frames in R (4)

Working with Data Frames in R (5)

Reading Excel Files in R

Reading SQLITE Tables in R

Reading Text Files in R

Saving and Restoring Objects in R

Data Visualization in R

Working with Bar Charts in R (1)

Working with Bar Charts in R (2)

Working with Line Graphs in R (1)

Working with Line Graphs in R (2)

Working with Multi-Line Graphs in R

Working with Histograms in R

Working with Scatter Plots in R (1)

Working with Scatter Plots in R (2)

Working with Box Plots in R

Working with Pie Charts in R (1)

Working with Pie Charts in R (2)

Summary

Chapter 3: Working with Functions in R

NaN and Functions in R

Math-Related Functions in R

String-Related Functions in R

The gsub() Function in R

Miscellaneous Built-in Functions

Set Functions in R

The “Apply” Family of Built-in Functions

The “Must Learn” dplyr Package in R

Other Useful R Packages

The Pipe Operator %>%

Working with CSV Files in R

Working with XML in R

Reading an XML Document into an R Dataframe

Working with JSON in R

Reading a JSON File into an R Dataframe

Statistical Functions in R

Summary Functions in R

Defining a Custom Function in R

Recursion in R

Calculating Factorial Values in R (non-recursive)

Calculating Factorial Values in R (recursive)

Calculating Fibonacci Numbers in R (non-recursive)

Calculating Fibonacci Numbers in R (recursive)

Convert a Decimal Integer to a Binary Integer in R

Calculating the GCD of Two Integers in R

Calculating the LCM of Two Integers in R

Summary

Chapter 4: NLP Concepts (I)

What is NLP?

The Evolution of NLP

A Wide-Angle View of NLP

NLP Applications and Use Cases

NLU and NLG

What is Text Classification?

Information Extraction and Retrieval

Word Sense Disambiguation

NLP Techniques in ML

NLP Steps for Training a Model

Text Normalization and Tokenization

Word Tokenization in Japanese

Text Tokenization with Unix Commands

Handling Stop Words

What is Stemming?

Singular vs. Plural Word Endings

Common Stemmers

Stemmers and Word Prefixes

Over Stemming and Under Stemming

What is Lemmatization?

Stemming/Lemmatization Caveats

Limitations of Stemming and Lemmatization

Working with Text: POS

POS Tagging

POS Tagging Techniques

Working with Text: NER

Abbreviations and Acronyms

NER Techniques

What is Topic Modeling?

Keyword Extraction, Sentiment Analysis, and Text Summarization

Summary

Chapter 5: NLP Concepts (II)

What is Word Relevance?

What is Text Similarity?

Sentence Similarity

Sentence Encoders

Working with Documents

Document Classification

Document Similarity (doc2vec)

Techniques for Text Similarity

Similarity Queries

What is Text Encoding?

Text Encoding Techniques

Document Vectorization

One-Hot Encoding (OHE)

Index-Based Encoding

Additional Encoders

The BoW Algorithm

What are N-grams?

Calculating Probabilities with n-grams

Calculating tf, idf, and tf-idf

What is Term Frequency (TF)?

What is Inverse Document Frequency (IDF)?

What is tf-idf?

Limitations of tf-idf

What is BM25?

Pointwise Mutual Information (PMI)

The Context of Words in a Document

What is Semantic Context?

Textual Entailment

Discrete, Distributed, and Contextual Word Representations

What is Cosine Similarity?

Text Vectorization (aka Word Embeddings)

Overview of Word Embeddings and Algorithms

Word Embeddings

Word Embedding Algorithms

What is word2vec?

The Intuition for word2vec

The word2vec Architecture

Limitations of word2vec

The CBoW Architecture

What are Skip-grams?

An Example of Skip-grams

The Skip-gram Architecture

Neural Network Reduction

What is GloVe?

Working with GloVe

What is fastText?

Comparison of Word Embeddings

What is Topic Modeling?

Topic Modeling Algorithms

LDA and Topic Modeling

Text Classification vs Topic Modeling

Language Models and NLP

How to Create a Language Model

Vector Space Models

Term-Document Matrix

Tradeoffs of the VSM

NLP and Text Mining

Text Extraction Preprocessing and N-Grams

Relation Extraction and Information Extraction

What is a BLEU Score?

ROUGE Score: An Alternative to BLEU

Summary

Chapter 6: NLP in R

Launch R Scripts from the Command Line

Installing RStudio Packages

NLP Packages in R

Common Tasks for Cleaning NLP Datasets

Does the Language Make a Difference?

Cleaning NLP Data in R

Tokenization

Remove Punctuation in Strings

Convert Strings to Lowercase and Uppercase

Convert File Data to Lowercase and Uppercase

Stop Words

Stemming in R

Lemmatization

POS (Parts Of Speech) with SpaCy in R

POS in R

NER in R

The tf-idf Algorithm

Working with N-Grams

Topic Modeling in R

Working With word2vec in R

Summary

Chapter 7: Transformer, BERT, and GPT

What is Attention?

Types of Word Embeddings

Types of Attention and Algorithms

An Overview of the Transformer Architecture

The Transformers Library from HuggingFace

Transformer and NER Tasks

Transformer and QnA Tasks

Transformer and Sentiment Analysis Tasks

Transformer and Mask Filling Tasks

What is T5?

What is BERT?

BERT Features

How is BERT Trained?

How BERT Differs from Earlier NLP Models

The Inner Workings of BERT

What is MLM?

What is NSP?

Special Tokens

BERT Encoding: Sequence of Steps

Subword Tokenization

Sentence Similarity in BERT

Word Context in BERT

Generating BERT Tokens (1)

Generating BERT Tokens (2)

The BERT Family

Surpassing Human Accuracy: deBERTa

What is Google Smith?

Introduction to GPT

Installing the Transformers Package

Working with GPT-2

GPT-2 versus BERT

What is GPT-3?

GPT-3 Task Strengths and Mistakes

GPT-3 Architecture

The GPT-3 Playground

Accessing the GPT-3 Playground

What is the Goal of GPT-3?

Zero-Shot, One-Shot, and Few Shot Learners

GPT-3 Task Performance

The Switch Transformer: One Trillion Parameters

Looking Ahead

Summary

Appendix: Intro to Probability and Statistics

Index

PREFACE

What Is the Value Proposition for This Book?

This book contains a fast-paced introduction to as much relevant information about NLP using R as can reasonably be included in a book of this size. Some chapters cover topics in great detail with many code samples, whereas other chapters (such as Chapter 4) present the theoretical foundations of NLP concepts.

This book helps developers who have a wide range of technical backgrounds, which is the rationale for the inclusion of a plethora of topics. Regardless of your background, please remember the following point: this book is essentially a stepping stone for your study of NLP.

You will be exposed to various NLP and machine learning topics in this book, some of which are presented in a cursory manner for two reasons. First, it’s important that you be exposed to these concepts. In some cases, you will find topics that might pique your interest, and hence motivate you to learn more about them through self-study; in other cases, you will probably be satisfied with a brief introduction. Hence, you can decide whether to delve into more detail regarding the topics in this book.

Second, a full treatment of all the topics that are covered in this book would probably triple its page count, and few people are interested in reading long technical books. Hence, this book provides a decent view of the NLP and machine learning landscape, based on the belief that this approach will be more beneficial for experienced developers who want to learn about NLP and machine learning.

The Target Audience

This book is intended primarily for people who have a solid background as software developers. Specifically, this book is for developers who are accustomed to searching online for more detailed information about technical topics. If you are a beginner, there are other books that are more suitable for you, and you can find them by performing an online search.

This book is also intended to reach an international audience of readers with highly diverse backgrounds in various age groups. This book uses standard English rather than colloquial expressions that might be confusing to those readers. People learn in different ways, including reading, writing, or hearing new material. This book tries to take these approaches into consideration to provide a comfortable and meaningful learning experience for the intended readers.

Do I Need to Learn the Theory Portions of This Book?

Once again, the answer depends on the extent to which you plan to become involved in NLP and machine learning. In addition to creating a model, you will use algorithms to see which ones provide the level of accuracy (or some other metric) that you need for your project. The theoretical aspects of machine learning can help you perform a forensic analysis of your model and your data, and ideally assist in determining how to improve your model.

Why Is There a Python-Based Chapter in This Book?

Chapter 7 is devoted to the Transformer architecture, the BERT model, and GPT-related models. The reason for the inclusion of Python-based code samples in this chapter is simple: there is a plethora of Python-based code available to illustrate how to use the NLP-related APIs of these models, whereas R-based code samples are typically unavailable. Most of the code samples in Chapter 7 require Python 3.7.

In addition, many of the R-based code samples in Chapter 6 are wrappers around Python-based code, which will necessitate installing Python 3 and other Python-based NLP libraries. The installation details are provided in Chapter 6.

Getting the Most From This Book

Some programmers learn well from prose and others learn well from sample code (and lots of it), which means that there’s no single style that can be used for everyone.

Moreover, some programmers want to run the code first, see what it does, and then return to the code to delve into the details (and others use the opposite approach).

Consequently, there are various types of code samples in this book: some are short, some are long, and other code samples build on earlier code samples.

What Do I Need to Know for This Book?

Although this book is introductory in nature, some knowledge of R for the first three chapters is helpful. In addition, some knowledge of Python 3.x for the code samples in Chapter 7 would also be helpful. Knowledge of other programming languages (such as Java) can also be helpful because of the exposure to programming concepts and constructs. The less technical knowledge that you have, the more diligence will be required to understand the various topics that are covered.

If you want to be sure that you can grasp the material in this book, glance through some of the code samples to get an idea of how much is familiar to you and how much is new for you.

Does This Book Contain Production-Level Code Samples?

The code samples in this book are for basic NLP tasks. The primary purpose of the code samples is to show you how to solve various NLP-related tasks, some of which are performed in conjunction with machine learning. Moreover, clarity has a higher priority than writing more compact code that is more difficult to understand (and possibly more prone to bugs). If you decide to use any of the code in this book in a production website, you should subject that code to the same rigorous analysis as the other parts of your code base.

What Are the Non-Technical Prerequisites for This Book?

Although the answer to this question is more difficult to quantify, it’s important to have a strong desire to learn about NLP, along with the motivation and discipline to read and understand the code samples. As a reminder, even simple machine learning APIs can be a challenge to understand the first time you encounter them, so be prepared to read the code samples several times.

How Do I Set up a Command Shell?

If you are a Mac user, there are three ways to do so. The first method is to use Finder to navigate to Applications > Utilities and then double click on the Terminal application. The second method applies if you already have a command shell available: you can launch a new command shell by typing the following command:

open /Applications/Utilities/Terminal.app

The third method is to open a new command shell from a command shell that is already visible by pressing command+n in that command shell, and your Mac will launch another command shell.

If you are a PC user, you can install Cygwin (open source, available at https://cygwin.com/), which simulates bash commands, or use another toolkit such as MKS (a commercial product). Please read the online documentation that describes the download and installation process. Note that custom aliases are not automatically set if they are defined in a file other than the main start-up file (such as .bash_login).
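The note about aliases can be illustrated with a minimal sketch. It assumes a bash-compatible shell; the alias name primer_ll and the temporary file are hypothetical stand-ins for your own alias and alias file. The sketch checks whether the alias is known to the current shell before and after the file is explicitly sourced.

```shell
# Hypothetical alias file: a temporary file stands in for any file
# other than the main start-up file.
alias_file="$(mktemp)"
printf "alias primer_ll='ls -l'\n" > "$alias_file"

# Before the file is sourced, the current shell does not know the alias.
if alias primer_ll >/dev/null 2>&1; then
  before="defined"
else
  before="undefined"
fi

# Sourcing the file defines the alias in the current shell. A line like
# this one is what you would add to your main start-up file.
. "$alias_file"

if alias primer_ll >/dev/null 2>&1; then
  after="defined"
else
  after="undefined"
fi

echo "before: $before, after: $after"

# Clean up the temporary file.
rm -f "$alias_file"
```

In practice, the equivalent step is to add a line that sources your alias file to the main start-up file, so that every new command shell picks up the aliases.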

Companion Files

All the code samples and figures in this book may be obtained by writing to the publisher at info@merclearning.com.

What Are the “Next Steps” After Finishing This Book?

The answer to this question varies widely, mainly because the answer depends heavily on your objectives. If you are interested primarily in NLP, you can learn more advanced concepts, such as attention, transformers, and the BERT-related models.

If you are primarily interested in machine learning, there are some subfields of machine learning, such as deep learning and reinforcement learning (and deep reinforcement learning) that might appeal to you. Fortunately, there are many resources available, and you can perform an Internet search for those resources. One other point: the aspects of machine learning for you to learn depend on who you are. The needs of a machine learning engineer, data scientist, manager, student, or software developer are all different.

O. Campesato
January 2022


