Machine Learning Techniques for Text

Nikos Tsourakis
Description

Take your Python text processing skills to another level by learning about the latest natural language processing and machine learning techniques with this full color guide


Key Features


Learn how to acquire and process textual data and visualize the key findings


Obtain deeper insight into the most commonly used algorithms and techniques and understand their tradeoffs


Implement models for solving real-world problems and evaluate their performance


Book Description


With the ever-increasing demand for machine learning and programming professionals, it's prime time to invest in the field. This book will help you in this endeavor, focusing specifically on text data and human language by steering a middle path among the various textbooks that present complicated theoretical concepts or focus disproportionately on Python code.


A good metaphor this work builds upon is the relationship between an experienced craftsperson and their trainee. Based on the current problem, the former picks a tool from the toolbox, explains its utility, and puts it into action. This approach will help you to identify at least one practical use for each method or technique presented. The content unfolds in ten chapters, each discussing one specific case study. For this reason, the book is solution-oriented. It's accompanied by Python code in the form of Jupyter notebooks to help you obtain hands-on experience. A recurring pattern in the chapters of this book is helping you get some intuition on the data and then implement and contrast various solutions.


By the end of this book, you'll be able to understand and apply various techniques with Python for text preprocessing, text representation, dimensionality reduction, machine learning, language modeling, visualization, and evaluation.


What you will learn


Understand fundamental concepts of machine learning for text


Discover how text data can be represented and build language models


Perform exploratory data analysis on text corpora


Use text preprocessing techniques and understand their trade-offs


Apply dimensionality reduction for visualization and classification


Incorporate and fine-tune algorithms and models for machine learning


Evaluate the performance of the implemented systems


Know the tools for retrieving text data and visualizing the machine learning workflow


Who this book is for


This book is for professionals in the areas of computer science, programming, data science, informatics, business analytics, statistics, language technology, and more who aim for a gentle career shift into machine learning for text. Students in relevant disciplines who seek a textbook in the field will benefit from the practical aspects of the content and how the theory is presented. Finally, professors teaching a similar course will be able to pick pertinent topics in terms of content and difficulty. Beginner-level knowledge of Python programming is needed to get started with this book.

You can read this e-book in Legimi apps or any app that supports the following format:

EPUB

Page count: 517

Year of publication: 2022




Machine Learning Techniques for Text

Apply modern techniques with Python for text processing, dimensionality reduction, classification, and evaluation

Nikos Tsourakis

BIRMINGHAM—MUMBAI

Machine Learning Techniques for Text

Copyright © 2022 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Publishing Product Manager: Ali Abidi

Content Development Editor: Shreya Moharir

Technical Editor: Rahul Limbachiya

Copy Editor: Safis Editing

Project Coordinator: Farheen Fathima

Proofreader: Safis Editing

Indexer: Manju Arasan

Production Designer: Vijay Kamble

Marketing Coordinator: Shifa Ansari

First published: October 2022

Production reference: 3111122

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK.

ISBN 978-1-80324-238-5

www.packt.com

This book is dedicated to my parents, Vasileios and Zoi

Acknowledgments

Why bother writing a book on topics for which there is already a vast amount of information available? My main driving force was to share knowledge in a field I love, in a way I would have liked to have been exposed to it several years ago. Creating a book for a single reader who long ago ceased to exist (my past self) has no merit. Instead, I wanted to offer a practical resource to a broader audience, which required feedback from colleagues in active conversations, who also functioned as a sounding board for my ideas. These people directly or indirectly affected the current book’s structure and content, and to them I am overwhelmingly indebted.

First, I would like to thank Vassilis Digalakis, for opening the door and welcoming me to a new playground. Nikos Chatzichrisafis, for providing a unique landing point in the field. Pierrette Bouillon, for enlarging the space with more toys, and Manny Rayner for the company during gameplay.

Special thanks go to my colleagues at the International Institute in Geneva, and particularly to Dogan Guven (for architecting our Business Analytics program), Andrea di Mauro (for the inspiration), and Ioanna Liouka (for the opportunity).

It would be negligent to omit my master’s students in the Text Mining and Python course, who unwittingly became testing subjects for a large part of the content.

Many thanks to my colleagues at the University of Geneva for fostering a productive multidisciplinary work environment. The technologies described in this book find practical usage in our research tasks. My daily interaction with such competent researchers in the field is hopefully reflected in the quality of the current book.

I can only recall positive sentiments from my collaboration with the Packt team. Their professionalism and interpersonal interaction gave me the freedom to create a book as I imagined. But, on the other hand, they provided a large bounding box and direction that prevented unintended ricochets. In particular, I would like to thank Shreya Moharir (for making the book appropriate for a global audience), Aparna Nair (for maintaining the right pace), and Ali Abidi (for orchestrating the whole process). In addition, the two reviewers, Ved Mathai and Saurabh Shahane, did their utmost to highlight all the unintentional pitfalls in my initial drafts and provided a genuine quality boost to the outcome. Finally, I would also like to thank Costas Boulis for reading part of this work and providing valuable feedback.

Last but not least, I would like to acknowledge my wife Kyriaki and my son Vassili for their love and support during the compilation of this work. Without them, the book would have been finished a little earlier, but it wouldn’t have meant nearly as much.

All these people have assisted me in one way or another in creating a better book.

Enjoy the ride!

Contributors

About the author

Nikos Tsourakis is a professor of computer science and business analytics at the International Institute in Geneva, Switzerland, and a research associate at the University of Geneva. He has over 20 years of experience designing, building, and evaluating intelligent systems using speech and language technologies. He has also co-authored over 50 research publications in the area. In the past, he worked as a software engineer, developing products for major telecommunication vendors. He also served as an expert for the European Commission and is currently a certified educator at the Amazon Web Services Academy. He holds a degree in electronic and computer engineering, a master’s in management, and a PhD in multilingual information processing.

About the reviewers

Ved Mathai is a graduate of Manipal Institute of Technology and has a postgraduate degree in information technology from the International Institute of Information Technology, Bangalore. He has worked at numerous start-ups. He worked on semantics and machine learning at DataWeave, as a senior NLP engineer for 4 years at Slang Labs, and, most recently, as the CTO at Navanc Data Sciences. When he is not programming, he can be found watching Formula One or running in the park while listening to a podcast.

Saurabh Shahane is a data scientist turned entrepreneur. Currently, he is the CEO of The Machine Learning Company (TMLC). With TMLC, he is creating a data science ecosystem for both industries and educational organizations. He is an adjunct professor on the AI faculty at Symbiosis Institute of Technology and is also a Kaggle Grandmaster. He has a blend of academic and industry experience, having worked with industrialists and researchers from domains such as pharmaceuticals, sports, finance, and business to promote and release research work and practical data strategies.

Table of Contents

Preface

1

Introducing Machine Learning for Text

The language phenomenon

The data explosion

The era of AI

Relevant research fields

The machine learning paradigm

Taxonomy of machine learning techniques

Supervised learning

Unsupervised learning

Semi-supervised learning

Reinforcement learning

Visualization of the data

Evaluation of the results

Summary

2

Detecting Spam Emails

Technical requirements

Understanding spam detection

Explaining feature engineering

Extracting word representations

Using label encoding

Using one-hot encoding

Using token count encoding

Using tf-idf encoding

Executing data preprocessing

Tokenizing the input

Removing stop words

Stemming the words

Lemmatizing the words

Performing classification

Getting the data

Creating the train and test sets

Preprocessing the data

Extracting the features

Introducing the Support Vector Machines algorithm

Understanding Bayes’ theorem

Measuring classification performance

Calculating accuracy

Calculating precision and recall

Calculating the F-score

Creating ROC and AUC

Creating precision-recall curves

Summary

3

Classifying Topics of Newsgroup Posts

Technical requirements

Understanding topic classification

Performing exploratory data analysis

Executing dimensionality reduction

Understanding principal component analysis

Understanding linear discriminant analysis

Putting PCA and LDA into action

Introducing the k-nearest neighbors algorithm

Performing feature extraction

Performing cross-validation

Performing classification

Comparison to the baseline model

Introducing the random forest algorithm

Constructing a decision tree

Performing classification

Extracting word embedding representation

Understanding word embedding

Performing vector arithmetic

Performing classification

Using the fastText tool

Summary

4

Extracting Sentiments from Product Reviews

Technical requirements

Understanding sentiment analysis

Performing exploratory data analysis

Using the Software dataset

Exploiting the ratings of products

Extracting the word count of reviews

Exploiting the helpfulness score

Introducing linear regression

Putting linear regression into action

Introducing logistic regression

Understanding gradient descent

Using logistic regression

Creating training and test sets

Performing classification

Applying regularization

Introducing deep neural networks

Understanding logic gates

Understanding perceptrons

Understanding artificial neurons

Creating artificial neural networks

Training artificial neural networks

Performing classification

Summary

5

Recommending Music Titles

Technical requirements

Understanding recommender systems

Performing exploratory data analysis

Cleaning the data

Extracting information from the data

Understanding the Pearson correlation

Introducing content-based filtering

Extracting music recommendations

Introducing collaborative filtering

Using memory-based collaborative recommenders

Applying SVD

Clustering handwritten text

Applying t-SNE

Using model-based collaborative systems

Introducing autoencoders

Summary

6

Teaching Machines to Translate

Technical requirements

Understanding machine translation

Introducing rule-based machine translation

Using direct machine translation

Using transfer-based machine translation

Using interlingual machine translation

Introducing example-based machine translation

Introducing statistical machine translation

Modeling the translation problem

Creating the models

Introducing sequence-to-sequence learning

Deciphering the encoder/decoder architecture

Understanding long short-term memory units

Putting seq2seq in action

Measuring translation performance

Summary

7

Summarizing Wikipedia Articles

Technical requirements

Understanding text summarization

Introducing web scraping

Scraping popular quotes

Scraping book reviews

Scraping Wikipedia articles

Performing extractive summarization

Performing abstractive summarization

Introducing the attention mechanism

Introducing transformers

Putting the transformer into action

Measuring summarization performance

Summary

8

Detecting Hateful and Offensive Language

Technical requirements

Introducing social networks

Understanding BERT

Pre-training phase

Fine-tuning phase

Putting BERT into action

Introducing boosting algorithms

Understanding AdaBoost

Understanding gradient boosting

Understanding XGBoost

Creating validation sets

Learning the myth of Icarus

Extracting the datasets

Treating imbalanced datasets

Classifying with BERT

Training the classifier

Applying early stopping

Understanding CNN

Adding pooling layers

Including CNN layers

Summary

9

Generating Text in Chatbots

Technical requirements

Understanding text generation

Creating a retrieval-based chatbot

Understanding language modeling

Understanding perplexity

Building a language model

Creating a generative chatbot

Using a pre-trained model

Creating the GUI

Creating the web chatbot

Fine-tuning a pre-trained model

Summary

10

Clustering Speech-to-Text Transcriptions

Technical requirements

Understanding text clustering

Preprocessing the data

Using speech-to-text

Introducing the K-means algorithm

Putting K-means into action

Introducing DBSCAN

Putting DBSCAN into action

Assessing DBSCAN

Introducing the hierarchical clustering algorithm

Putting hierarchical clustering into action

Introducing the LDA algorithm

Putting LDA into action

Summary

Index

Other Books You May Enjoy

2

Detecting Spam Emails

Electronic mail is a ubiquitous internet service for exchanging messages between people. A typical problem in this sphere of communication is identifying and blocking unsolicited and unwanted messages. Spam detectors undertake part of this role; ideally, they should not let spam escape uncaught while not obstructing any non-spam.

This chapter deals with this problem from a machine learning (ML) perspective and unfolds as a series of steps for developing and evaluating a typical spam detector. First, we elaborate on the limitations of performing spam detection using traditional programming. Next, we introduce the basic techniques for text representation and preprocessing. Finally, we implement two classifiers using an open source dataset and evaluate their performance based on standard metrics.

By the end of the chapter, you will be able to understand the nuts and bolts behind the different techniques and implement them in Python. But, more importantly, you should be capable of seamlessly applying the same pipeline to similar problems.

We go through the following topics:

Obtaining the data

Understanding its content

Preparing the datasets for analysis

Training classification models

Realizing the tradeoffs of the algorithms

Assessing the performance of the models

Technical requirements

The code of this chapter is available as a Jupyter Notebook in the book’s GitHub repository: https://github.com/PacktPublishing/Machine-Learning-Techniques-for-Text/tree/main/chapter-02.

The Notebook has a built-in step to install the Python modules required for the practical exercises in this chapter. In addition, on Windows, you need to download and install Microsoft C++ Build Tools from the following link: https://visualstudio.microsoft.com/visual-cpp-build-tools/.

Understanding spam detection

A spam detector is software that runs on the mail server or our local computer and checks the inbox to detect possible spam. As with traditional letterboxes, an inbox is a destination for electronic mail messages. Generally, any spam detector has unhindered access to this repository and can perform tens, hundreds, or even thousands of checks per day to decide whether an incoming email is spam or not. Fortunately, spam detection is a ubiquitous technology that filters out irrelevant and possibly dangerous electronic correspondence.

How would you implement such a filter from scratch? Before exploring the steps together, look at a contrived (and somewhat naive) spam email message in Figure 2.1. Can you identify some key signs that differentiate this spam from a non-spam email?

Figure 2.1 – A spam email message

Even before reading the content of the message, most of you can immediately identify the scam from the email’s subject field and decide not to open it in the first place. But let’s consider a few signs (coded as T1 to T4) that can indicate a malicious sender:

T1 – The text in the subject field is typical for spam. It is characterized by a manipulative style that creates unnecessary urgency and pressure.

T2 – The message begins with the phrase Dear MR tjones. The last word was probably extracted automatically from the recipient’s email address.

T3 – Bad spelling and the incorrect use of grammar are potential spam indicators.

T4 – The text in the body of the message contains sequences with multiple punctuation marks or capital letters.

We can implement a spam detector based on these four signs, which we will hereafter call triggers. The detector classifies an incoming email as spam if T1, T2, T3, and T4 are True simultaneously. The following example shows the pseudocode for the program:
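Before looking at the pseudocode, here is a minimal Python sketch of such a rule-based detector. The concrete checks (the keyword and misspelling lists, the salutation pattern, and the punctuation/capitals regex) are illustrative assumptions, not the book's actual rules; they only show how the four triggers could be combined with a logical AND:

```python
import re

def is_spam(subject: str, body: str) -> bool:
    """Classify an email as spam only if all four triggers fire simultaneously."""
    # T1 - subject uses typical high-pressure spam wording (illustrative keyword list)
    t1 = any(word in subject.lower() for word in ("urgent", "act now", "winner", "free"))
    # T2 - generic salutation likely built from the recipient's email address
    t2 = bool(re.match(r"dear\s+(mr|mrs|ms)\s+\w+", body.strip(), re.IGNORECASE))
    # T3 - crude spelling check against a tiny list of common misspellings
    t3 = any(typo in body.lower() for typo in ("recieve", "guarentee", "oportunity"))
    # T4 - runs of punctuation marks or long runs of capital letters in the body
    t4 = bool(re.search(r"[!?]{2,}|[A-Z]{5,}", body))
    return t1 and t2 and t3 and t4

subject = "URGENT: You are a winner!!!"
body = "Dear MR tjones, you can recieve your PRIZE MONEY now!!!"
print(is_spam(subject, body))  # → True
```

Such hand-crafted triggers are brittle (a spammer only needs to dodge one of the four rules), which is precisely the limitation that motivates the machine learning approach in the rest of the chapter.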