The Handbook of NLP with Gensim - Chris Kuo - E-Book

Chris Kuo

Description

Navigating the terrain of NLP research and applying it practically can be a formidable task, but The Handbook of NLP with Gensim makes it easier. This book demystifies NLP and equips you with hands-on strategies spanning healthcare, e-commerce, finance, and more, enabling you to leverage Gensim in real-world scenarios.
You’ll begin by exploring the motivation and techniques for representing text, such as bag-of-words, TF-IDF, and word embeddings. This book will then guide you through topic modeling using methods such as Latent Semantic Analysis (LSA) for dimensionality reduction and discovering latent semantic relationships in text data, Latent Dirichlet Allocation (LDA) for probabilistic topic modeling, and Ensemble LDA to enhance topic modeling stability and accuracy.
Next, you’ll learn text summarization techniques with Word2Vec and Doc2Vec to build the modeling pipeline and optimize models using hyperparameters. As you get acquainted with practical applications in various industries, this book will inspire you to design innovative projects. Alongside topic modeling, you’ll also explore named entity handling and NER tools, modeling procedures, and tools for effective topic modeling applications.
By the end of this book, you’ll have mastered the techniques essential to create applications with Gensim and integrate NLP into your business processes.

Page count: 382

Year of publication: 2023



The Handbook of NLP with Gensim

Leverage topic modeling to uncover hidden patterns, themes, and valuable insights within textual data

Chris Kuo

BIRMINGHAM—MUMBAI

The Handbook of NLP with Gensim

Copyright © 2023 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Group Product Manager: Niranjan Naikwadi

Publishing Product Manager: Anant Jain

Book Project Manager: Hemangi Lotlikar

Senior Editor: Rohit Singh

Technical Editor: Rahul Limbachiya

Copy Editor: Safis Editing

Proofreader: Safis Editing

Indexer: Rekha Nair

Production Designer: Vijay Kamble

DevRel Marketing Executive: Vinishka Kalra

First published: October 2023

Production reference: 2131023

Published by Packt Publishing Ltd.

Grosvenor House

11 St Paul’s Square

Birmingham

B3 1RB, UK.

ISBN 978-1-80324-494-5

www.packtpub.com

To God be the glory

– Chris Kuo

Contributors

About the author

Chris Kuo is a data scientist and an adjunct professor with over 23 years of experience. He led various data science solutions including customer analytics, health analytics, fraud detection, and litigation. He is also an inventor of a U.S. patent. He has worked at several Fortune 500 companies in the insurance and retail industries.

Chris teaches at Columbia University and has taught at Boston University and other universities. He has published articles in economic and management journals and served as a journal reviewer. He is the author of The eXplainable A.I., Modern Time Series Anomaly Detection, Transfer Learning for Image Classification, and The Handbook of Anomaly Detection. He received his undergraduate degree in Nuclear Engineering and Ph.D. in Economics.

About the reviewers

Amreth Chandrasehar is a director at Informatica, where he is responsible for ML engineering, observability, and SRE teams. Over the last few years, he has played a key role in cloud migration, generative AI, observability, and ML adoption at various organizations. He is also a co-creator of the Conducktor Platform, serving T-Mobile’s 100 million+ customers, and a Tech/Customer Advisory board member at various companies on observability.

Amreth has also co-created the open source Kardio.io, a service health dashboard tool. He has been invited to speak at several key conferences and has won several awards within the company and was recently awarded 3 Gold Awards at Globee, Stevie, and International Achievers Awards for his contributions in observability and generative AI.

I would like to thank my wife, Ashwinya Mani, and my son, Athvik A, for their patience and support provided during my review of this book.

Devashish Deshpande is a Machine Learning Engineer with expertise in Natural Language Processing. He possesses an undergraduate degree in Computer Science as well as an advanced master’s degree in Artificial Intelligence. Contributing early on to popular open source libraries, such as Gensim and Scikit-Learn, helped him gain experience in how research papers can be translated to code and deployed and used by large communities. Later, his corporate experience helped in developing strong software engineering fundamentals. Currently, Devashish works on researching and developing better machine learning models and deploying them in a performant and practical way.

Preface

With the arrival of ChatGPT in late 2022 and GPT-4 in early 2023, interest in natural language processing (NLP), including large language models (LLMs), has surged. You will find this book helpful if you want to get started with NLP, learn and apply the NLP techniques that have matured over the past few decades, or understand the differences between pre-LLM and LLM techniques. Over the past four decades of NLP development, many commercial NLP products have been built on pre-LLM techniques such as Word2Vec, Doc2Vec, Latent Semantic Analysis (LSA), also called Latent Semantic Indexing (LSI), Latent Dirichlet Allocation (LDA), and Ensemble LDA.

With the help of this book, you will not only get started building NLP models but also gain some background knowledge of LLMs. We believe the concepts covered in this book form a necessary bridge for anyone who is new to NLP, wants to build NLP products, or wants to learn about LLMs.

Why read this book?

To assist you in learning fundamental NLP concepts and building your NLP applications, we will start with the NLP concepts and techniques that enable commercial NLP applications. This guide covers both theory and coding practice. It presents NLP topics in a way that both beginners and experienced data scientists can benefit from.

Many of the techniques mentioned earlier, such as Word2Vec, Doc2Vec, LSA, LDA, and Ensemble LDA, are included in the Python Gensim module. Gensim is an open source Python library widely used by NLP researchers and developers, together with other open source NLP modules, including NLTK, scikit-learn, and spaCy. We will learn how to build models using these modules. In addition, you will learn about BERTopic, a Transformer-based topic modeling technique, in a separate chapter, and study a BERTopic use case in the last chapter on NLP use cases.

You will also get to practice implementing your model for scoring and predictions. This implementation perspective enables you to work with data engineers closely in model deployment. We’ll conclude the book with a study of selected large-scale NLP use cases. We believe these use cases can inspire you to build your NLP applications.

What is Gensim?

New NLP learners may find the Gensim library cited in many tutorials. Gensim is an open source Python library to process unstructured texts using unsupervised machine learning algorithms. It was first created by Radim Řehůřek in 2011 and is now developed and maintained continually by 400+ contributors. It has been used in over 2000 research papers and student theses.

One of Gensim’s merits is its fast execution speed. Gensim attributes this advantage to its use of low-level BLAS libraries through NumPy, highly optimized Fortran/C, and multithreading under the hood. Memory independence is also one of its design objectives. Gensim supports data streaming, so it can process large corpora without loading a whole training corpus into RAM.
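
The streaming idea can be sketched in plain Python. This is only an illustration under stated assumptions: the class name StreamedCorpus and the one-document-per-line file layout are hypothetical, and no Gensim call is used. The point is that an object whose __iter__ method reads lazily keeps only one document in memory at a time and can still be traversed more than once.

```python
import os
import tempfile


class StreamedCorpus:
    """Yield one tokenized document per line of a text file (hypothetical helper)."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Reopening the file on every call makes the corpus re-iterable,
        # which matters for algorithms that pass over the data several times.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.lower().split()
                if tokens:  # skip blank lines
                    yield tokens


# Write a tiny demonstration file; a real corpus would already be on disk.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("Stocks rally on Wall Street\n\nChampions win the final match\n")
    demo_path = tmp.name

corpus = StreamedCorpus(demo_path)
first_pass = list(corpus)
second_pass = list(corpus)  # works because __iter__ reopens the file
os.remove(demo_path)
```

A plain generator function would be exhausted after one pass, which is why the sketch wraps the file in a class instead.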

Who this book is for

This book does not assume any prior knowledge of linguistics or NLP techniques, so it is suitable for anyone who wants to learn NLP. Data scientists and professionals who want to develop NLP applications will also find it helpful. If you are an NLP practitioner, you can use this book as a code reference when working on your projects. Students taking an upper-level NLP course can also use this book.

What this book covers

Chapter 1, Introduction to NLP, is an introductory chapter that explains the development from Natural Language Understanding (NLU) and Natural Language Generation (NLG) to NLP. It briefly introduces the core techniques, including text preprocessing, LSA/LSI, Word2Vec, Doc2Vec, LDA, Ensemble LDA, and BERTopic. It presents the open source NLP modules Gensim, scikit-learn, and spaCy.

Chapter 2, Text Representation, starts with the basic step of text representation. It explains the motivation from one-hot encoding to Bag-of-words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF). It demonstrates how to perform BoW and TF-IDF with Gensim, Scikit-learn, and NLTK.

Chapter 3, Text Wrangling and Preprocessing, presents the essential text preprocessing tasks: (a) tokenization, (b) lowercase conversion, (c) stop word removal, (d) punctuation removal, (e) stemming, and (f) lemmatization. It guides you through these preprocessing tasks with Gensim, spaCy, and NLTK.
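
The first four tasks can be sketched in plain Python without any NLP library. The function name preprocess and the tiny stop-word set below are assumptions made for illustration; stemming and lemmatization are left out because they need a library such as NLTK or spaCy.

```python
import re

# A deliberately tiny stop-word set, made up for illustration;
# NLP libraries ship much longer lists.
STOP_WORDS = {"the", "a", "an", "is", "are", "on", "in", "of", "and", "to"}


def preprocess(text):
    """Tokenize, lowercase, and drop punctuation and stop words."""
    lowered = text.lower()
    # Keeping only runs of letters tokenizes and strips punctuation in one step.
    # Note that this simple rule also discards digits and splits on apostrophes.
    tokens = re.findall(r"[a-z]+", lowered)
    return [token for token in tokens if token not in STOP_WORDS]


tokens = preprocess("The cats are sitting on the mat, and the dog is barking!")
# tokens is ['cats', 'sitting', 'mat', 'dog', 'barking']
```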

Chapter 4, Latent Semantic Analysis with scikit-learn, presents the theory of LSA/LSI. This chapter introduces Singular Value Decomposition (SVD), Truncated SVD, and the application of Truncated SVD to LSA/LSI. It uses scikit-learn to illustrate explicitly how Truncated SVD leads to LSA/LSI.
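
For reference, the standard factorization behind this chapter can be written as follows. The symbols are assumptions chosen for illustration: A is an m × n document-term matrix, and k is the number of latent dimensions kept.

```latex
A = U \Sigma V^{\top},
\qquad
A \approx A_k = U_k \Sigma_k V_k^{\top},
\quad k < \operatorname{rank}(A)
```

Here U_k and V_k hold the first k columns of U and V, and Σ_k holds the k largest singular values. With documents as rows of A, each row of U_k Σ_k is the k-dimensional representation of one document.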

Chapter 5, Cosine Similarity, is dedicated to explaining this fundamental measure in NLP. Like other metrics such as Euclidean distance and Manhattan distance, cosine similarity quantifies how close embedded data points are in the vector space. This chapter also outlines applications of cosine similarity in image comparison and querying.

Chapter 6, Latent Semantic Indexing with Gensim, builds an LSA/LSI model with Gensim. This chapter introduces the coherence score, which helps determine the optimal number of topics. It shows how to score new documents using cosine similarity as part of an information retrieval tool.
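
The measure itself is short enough to sketch in plain Python: the dot product of two vectors divided by the product of their lengths. The function name and the example vectors below are chosen for illustration only.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity([1, 2, 3], [2, 4, 6])  # parallel vectors
orthogonal = cosine_similarity([1, 0], [0, 1])  # perpendicular vectors
```

Parallel vectors score close to 1 regardless of their lengths, and perpendicular vectors score 0, which is why the measure suits comparing documents of different sizes.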

Chapter 7, Using Word2Vec, introduces the milestone Word2Vec technique and its two neural network architectural variations: Continuous Bag-of-Words (CBOW) and Skip-Gram (SG). It illustrates the concept and operation of word embedding in the vector space. It guides you through building a Word2Vec model and preparing it as part of an information retrieval tool. It visualizes the word vectors of a Word2Vec model with t-SNE and TensorBoard (by TensorFlow). This chapter ends with comparisons of Word2Vec with Doc2Vec, GloVe, and FastText.

Chapter 8, Doc2Vec with Gensim, presents the evolution from Word2Vec to Doc2Vec. It details the two neural network architectural variations: Paragraph Vector with Distributed Bag-of-Words (PV-DBOW) and Paragraph Vector with Distributed Memory (PV-DM). It guides you through building a Doc2Vec model and preparing it as part of an information retrieval tool.
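
The difference between the two architectures shows up in how training examples are formed from a sentence. The sketch below is plain Python with hypothetical function names; it only builds the (input, output) examples and does not train a network. Real Word2Vec implementations add further steps, such as negative sampling, that are omitted here.

```python
def context_windows(tokens, window=2):
    """Yield (target, context_words) for every position in the sentence."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right


def cbow_examples(tokens, window=2):
    # CBOW: the whole context predicts the target word.
    return [(context, target) for target, context in context_windows(tokens, window)]


def skipgram_examples(tokens, window=2):
    # Skip-gram: the target word predicts each context word separately.
    return [
        (target, word)
        for target, context in context_windows(tokens, window)
        for word in context
    ]


sentence = ["stocks", "rally", "on", "wall", "street"]
cbow = cbow_examples(sentence, window=1)
skipgram = skipgram_examples(sentence, window=1)
```

With a window of 1, the word "rally" produces one CBOW example, (["stocks", "on"], "rally"), but two skip-gram examples, ("rally", "stocks") and ("rally", "on").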

Chapter 9, Understanding Discrete Distributions, introduces a family of distributions including the Bernoulli, binomial, multinomial, beta, and Dirichlet distributions. Because the more complex distributions generalize the simpler ones, this sequence helps you understand the Dirichlet distribution. The fact that ‘Dirichlet’ is in the name of LDA signals its significance. This chapter prepares you for LDA in the next chapter.
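
The way the Dirichlet distribution generalizes the beta distribution can be sketched with the standard library alone. The helper name sample_dirichlet and the parameter values are assumptions for illustration; the sampling method, normalizing independent gamma draws, is a standard construction rather than anything specific to this book.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one probability vector from Dirichlet(alpha) via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(0)  # fixed seed so the run is repeatable

# With two parameters the Dirichlet reduces to a beta distribution:
# the first component of Dirichlet([a, b]) follows Beta(a, b).
beta_like = sample_dirichlet([2.0, 5.0], rng)

# With more parameters we get, for example, one document's mix over three topics.
topic_mix = sample_dirichlet([0.5, 0.5, 0.5], rng)
```

Each sample is a vector of non-negative numbers that sums to 1, which is what lets LDA use it as a distribution over topics.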

Chapter 10, Latent Dirichlet Allocation, presents the LDA algorithm, including the structural design of LDA, generative modeling, and Variational Expectation-Maximization.

Chapter 11, LDA Modeling, demonstrates how to build an LDA model, perform hyperparameter tuning, and determine the optimal number of topics. You will learn the steps to apply an LDA model to score new documents as part of an information retrieval tool.

Chapter 12, LDA Visualization, presents visualization for LDA. This chapter starts with design thinking about how to present the rich content of a topic model. Then it shows how to use pyLDAvis for visualization.

Chapter 13, The Ensemble LDA for Model Stability, investigates the root causes of the instability of LDA. It explains the Ensemble approach for LDA and the use of Checkback DBSCAN, a clustering algorithm, to deliver a stable set of topics.

Chapter 14, LDA and BERTopic, presents the BERTopic modeling technique, which uses an LLM-based BERT algorithm for word embeddings, UMAP for dimensionality reduction of the embeddings, HDBSCAN for topic clustering, c-TF-IDF for representing topics with words, and MMR to fine-tune the word representation of topics. It guides you through BERTopic modeling, visualization, and scoring new documents for topics.

Chapter 15, Real-World Use Cases, presents seven NLP projects in healthcare, medical, legal, finance, and social media. By studying these NLP solutions, you will be motivated to apply the code notebooks of this book to similar tasks or to your future applications.

To get the most out of this book

You’ll need to make sure that you have the following setup requirements fulfilled in order to follow the instructions given in this book:

Software/hardware covered in the book: Python version ≥ 3.7; Gensim

Operating system requirements: Windows, macOS, or Linux

If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book’s GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

Download the example code files

The Python notebooks are available for download at https://github.com/PacktPublishing/The-Handbook-of-NLP-with-Gensim. If there’s an update to the code, it will be updated in the GitHub repository. You are encouraged to use Google Colab, a free Jupyter Notebook environment that runs entirely in the cloud. Google Colab comes with popular machine learning libraries such as pandas, NumPy, TensorFlow, Keras, and OpenCV pre-installed.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Data for this book

The AG’s corpus of news articles, made public by A. Gulli, is a collection of more than 1 million news articles from more than 2,000 news sources. Zhang, Zhao, and LeCun sampled news articles from the “World”, “Sports”, “Business”, and “Sci/Tech” categories. The resulting dataset, ag_news, is frequently used and is available on Kaggle, PyTorch, Hugging Face, and TensorFlow. There are 120,000 news articles in the training sample and 7,600 in the testing sample. This dataset is used throughout the book.

Conventions used

There are a number of text conventions used throughout this book.

Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: “Mount the downloaded WebStorm-10*.dmg disk image file as another disk in your system.”

A block of code is set as follows:

import gensim
from gensim.utils import simple_preprocess
from gensim.corpora import Dictionary
import pprint

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

from gensim.summarization import keywords

Any command-line input or output is written as follows:

pip install gensim==3.8.3

Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: “Without natural language processing (NLP) tools, the marketing team can only do basic operations with these text messages and data.”

Tips or important notes

Appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Share Your Thoughts

Once you’ve read The Handbook of NLP with Gensim, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.

Download a free PDF copy of this book

Thanks for purchasing this book!

Do you like to read on the go but are unable to carry your print books everywhere?

Is your eBook purchase not compatible with the device of your choice?

Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.

Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.

The perks don’t stop there. You can also get exclusive access to discounts, newsletters, and great free content in your inbox daily.

Follow these simple steps to get the benefits:

Scan the QR code or visit the link below

https://packt.link/free-ebook/9781803244945

Submit your proof of purchase

That’s it! We’ll send your free PDF and other benefits to your email directly.

Part 1: NLP Basics

In this part, you will get an overview of NLP. You will understand the concept of text representation and learn two basic forms of word embeddings. You will learn the key steps in NLP preprocessing, including tokenization, lowercase conversion, stop word removal, punctuation removal, stemming, and lemmatization. You will learn how to code with spaCy, NLTK, and Gensim, and how to build a pipeline that you can reuse for future NLP preprocessing.

This part contains the following chapters:

Chapter 1, Introduction to NLP

Chapter 2, Text Representation

Chapter 3, Text Wrangling and Preprocessing