Take your Python text processing skills to another level by learning about the latest natural language processing and machine learning techniques with this full-color guide
Key Features
Learn how to acquire and process textual data and visualize the key findings
Obtain deeper insight into the most commonly used algorithms and techniques and understand their tradeoffs
Implement models for solving real-world problems and evaluate their performance
Book Description
With the ever-increasing demand for machine learning and programming professionals, it's prime time to invest in the field. This book will help you in this endeavor, focusing specifically on text data and human language by steering a middle path among the various textbooks that present complicated theoretical concepts or focus disproportionately on Python code.
A good metaphor this work builds upon is the relationship between an experienced craftsperson and their trainee. Based on the current problem, the former picks a tool from the toolbox, explains its utility, and puts it into action. This approach will help you to identify at least one practical use for each method or technique presented. The content unfolds in ten chapters, each discussing one specific case study. For this reason, the book is solution-oriented. It's accompanied by Python code in the form of Jupyter notebooks to help you obtain hands-on experience. A recurring pattern in the chapters of this book is helping you get some intuition on the data and then implement and contrast various solutions.
By the end of this book, you'll be able to understand and apply various techniques with Python for text preprocessing, text representation, dimensionality reduction, machine learning, language modeling, visualization, and evaluation.
What you will learn
Understand fundamental concepts of machine learning for text
Discover how text data can be represented and build language models
Perform exploratory data analysis on text corpora
Use text preprocessing techniques and understand their trade-offs
Apply dimensionality reduction for visualization and classification
Incorporate and fine-tune algorithms and models for machine learning
Evaluate the performance of the implemented systems
Know the tools for retrieving text data and visualizing the machine learning workflow
Who this book is for
This book is for professionals in the area of computer science, programming, data science, informatics, business analytics, statistics, language technology, and more who aim for a gentle career shift in machine learning for text. Students in relevant disciplines who seek a textbook in the field will benefit from the practical aspects of the content and how the theory is presented. Finally, professors teaching a similar course will be able to pick pertinent topics in terms of content and difficulty. Beginner-level knowledge of Python programming is needed to get started with this book.
Page count: 517
Publication year: 2022
Apply modern techniques with Python for text processing, dimensionality reduction, classification, and evaluation
Nikos Tsourakis
BIRMINGHAM—MUMBAI
Copyright © 2022 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Publishing Product Manager: Ali Abidi
Content Development Editor: Shreya Moharir
Technical Editor: Rahul Limbachiya
Copy Editor: Safis Editing
Project Coordinator: Farheen Fathima
Proofreader: Safis Editing
Indexer: Manju Arasan
Production Designer: Vijay Kamble
Marketing Coordinator: Shifa Ansari
First published: October 2022
Production reference: 3111122
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-80324-238-5
www.packt.com
This book is dedicated to my parents, Vasileios and Zoi
Why bother writing a book on topics for which there is already a vast amount of information available? My main driving force was to share knowledge in a beloved field in a way I would have liked to have been exposed to several years ago. Creating a book for a single reader who has long ago ceased to exist (my past self) has no merit. Instead, we wanted to offer a practical resource to a broader audience, which required feedback from colleagues in active conversations, who also functioned as a sounding board for our ideas. These people directly or indirectly affected the current book’s structure and content, and to these people I am overwhelmingly indebted.
First, I would like to thank Vassilis Digalakis for opening the door and welcoming me to a new playground, Nikos Chatzichrisafis for providing a unique landing point in the field, Pierrette Bouillon for enlarging the space with more toys, and Manny Rayner for the company during gameplay.
Special thanks go to my colleagues at the International Institute in Geneva, and particularly to Dogan Guven (for architecting our Business Analytics program), Andrea di Mauro (for the inspiration), and Ioanna Liouka (for the opportunity).
It would be negligent to omit my master’s students in the Text Mining and Python course, who unwittingly became testing subjects for a large part of the content.
Many thanks to my colleagues at the University of Geneva for fostering a productive multidisciplinary work environment. The technologies described in this book find practical usage in our research tasks. My daily interaction with such competent researchers in the field is hopefully reflected in the quality of the current book.
I can only recall positive sentiments from my collaboration with the Packt team. Their professionalism and interpersonal interaction gave me the freedom to create a book as I imagined. But, on the other hand, they provided a large bounding box and direction that prevented unintended ricochets. In particular, I would like to thank Shreya Moharir (for making the book appropriate for a global audience), Aparna Nair (for maintaining the right pace), and Ali Abidi (for orchestrating the whole process). In addition, the two reviewers, Ved Mathai and Saurabh Shahane, did their utmost to highlight all the unintentional pitfalls in my initial drafts and provided a genuine quality boost to the outcome. Finally, I would also like to thank Costas Boulis for reading part of this work and providing valuable feedback.
Last but not least, I would like to acknowledge my wife Kyriaki and my son Vassili for their love and support during the compilation of this work. Without them, the book would have finished a little earlier, but it wouldn’t have meant nearly as much.
All these people have assisted me in one way or another in creating a better book.
Enjoy the ride!
Nikos Tsourakis is a professor of computer science and business analytics at the International Institute in Geneva, Switzerland, and a research associate at the University of Geneva. He has over 20 years of experience designing, building, and evaluating intelligent systems using speech and language technologies. He has also co-authored over 50 research publications in the area. In the past, he worked as a software engineer, developing products for major telecommunication vendors. He also served as an expert for the European Commission and is currently a certified educator at the Amazon Web Services Academy. He holds a degree in electronic and computer engineering, a master’s in management, and a PhD in multilingual information processing.
Ved Mathai is a graduate of Manipal Institute of Technology and has a postgraduate degree in information technology from the International Institute of Information Technology, Bangalore. He has worked at numerous start-ups. He worked on semantics and machine learning at DataWeave, as a senior NLP engineer for 4 years at Slang Labs, and, most recently, as the CTO at Navanc Data Sciences. When he is not programming, he can be found watching Formula One or running in the park while listening to a podcast.
Saurabh Shahane is a data scientist-turned-entrepreneur. Currently, he is the CEO of The Machine Learning Company (TMLC). With TMLC, he is creating a data science ecosystem for both industries and educational organizations. He is an adjunct professor at the AI faculty at Symbiosis Institute of Technology and is also a Kaggle Grandmaster. He has a blend of academic and industry experience having worked with industrialists and researchers from domains such as pharmaceuticals, sports, finance, and business to promote and release research work and practical data strategies.
Electronic mail is a ubiquitous internet service for exchanging messages between people. A typical problem in this sphere of communication is identifying and blocking unsolicited and unwanted messages. Spam detectors undertake part of this role; ideally, they should not let spam escape uncaught while not obstructing any non-spam.
This chapter deals with this problem from a machine learning (ML) perspective and unfolds as a series of steps for developing and evaluating a typical spam detector. First, we elaborate on the limitations of performing spam detection using traditional programming. Next, we introduce the basic techniques for text representation and preprocessing. Finally, we implement two classifiers using an open source dataset and evaluate their performance based on standard metrics.
By the end of the chapter, you will be able to understand the nuts and bolts behind the different techniques and implement them in Python. But, more importantly, you should be capable of seamlessly applying the same pipeline to similar problems.
We go through the following topics:
Obtaining the data
Understanding its content
Preparing the datasets for analysis
Training classification models
Realizing the tradeoffs of the algorithms
Assessing the performance of the models
The code of this chapter is available as a Jupyter Notebook in the book’s GitHub repository: https://github.com/PacktPublishing/Machine-Learning-Techniques-for-Text/tree/main/chapter-02.
The Notebook has an in-built step to download the necessary Python modules required for the practical exercises in this chapter. Furthermore, for Windows, you need to download and install Microsoft C++ Build Tools from the following link: https://visualstudio.microsoft.com/visual-cpp-build-tools/.
A spam detector is software that runs on the mail server or our local computer and checks the inbox to detect possible spam. As with traditional letterboxes, an inbox is a destination for electronic mail messages. Generally, any spam detector has unhindered access to this repository and can perform tens, hundreds, or even thousands of checks per day to decide whether an incoming email is spam or not. Fortunately, spam detection is a ubiquitous technology that filters out irrelevant and possibly dangerous electronic correspondence.
How would you implement such a filter from scratch? Before exploring the steps together, look at a contrived (and somewhat naive) spam email message in Figure 2.1. Can you identify some key signs that differentiate this spam from a non-spam email?
Figure 2.1 – A spam email message
Even before reading the content of the message, most of you can immediately identify the scam from the email’s subject field and decide not to open it in the first place. But let’s consider a few signs (coded as T1 to T4) that can indicate a malicious sender:
T1 – The text in the subject field is typical for spam. It is characterized by a manipulative style that creates unnecessary urgency and pressure.
T2 – The message begins with the phrase Dear MR tjones. The last word was probably extracted automatically from the recipient’s email address.
T3 – Bad spelling and the incorrect use of grammar are potential spam indicators.
T4 – The text in the body of the message contains sequences with multiple punctuation marks or capital letters.
We can implement a spam detector based on these four signs, which we will hereafter call triggers. The detector classifies an incoming email as spam if T1, T2, T3, and T4 are True simultaneously. The following example shows the pseudocode for the program:
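As a rough Python sketch of the rule just described (an illustration, not the book's own pseudocode), each trigger can be written as a small function, and the detector flags an email only when all four return True. The keyword lists, the regular expression, the function names, and the sample message are hypothetical choices made purely for this example; real triggers would need far more careful definitions.

```python
import re

# Hypothetical word lists, chosen only for illustration.
URGENT_PHRASES = ("urgent", "act now", "last chance", "winner")
COMMON_MISSPELLINGS = {"recieve", "garanteed", "acount", "congradulations"}


def t1_urgent_subject(subject: str) -> bool:
    """T1: the subject uses wording that creates urgency or pressure."""
    lowered = subject.lower()
    return any(phrase in lowered for phrase in URGENT_PHRASES)


def t2_greeting_from_address(body: str, recipient: str) -> bool:
    """T2: the greeting reuses the local part of the recipient's address."""
    local_part = recipient.split("@")[0].lower()
    lines = body.strip().splitlines()
    first_line = lines[0].lower() if lines else ""
    return first_line.startswith("dear") and local_part in first_line


def t3_bad_spelling(body: str) -> bool:
    """T3: the body contains at least one word from the misspelling list."""
    words = re.findall(r"[a-z]+", body.lower())
    return any(word in COMMON_MISSPELLINGS for word in words)


def t4_shouting(body: str) -> bool:
    """T4: the body has repeated punctuation or a run of capital letters."""
    return re.search(r"[!?]{2,}|\b[A-Z]{4,}\b", body) is not None


def is_spam(subject: str, body: str, recipient: str) -> bool:
    """Classify as spam only if T1, T2, T3, and T4 are all True."""
    return (
        t1_urgent_subject(subject)
        and t2_greeting_from_address(body, recipient)
        and t3_bad_spelling(body)
        and t4_shouting(body)
    )


# Made-up example messages.
recipient = "tjones@example.com"
spam_subject = "URGENT: act now to claim your prize"
spam_body = "Dear MR tjones,\nYou will recieve your money TODAY!!!"

print(is_spam(spam_subject, spam_body, recipient))  # True
print(is_spam("Meeting notes", "Hi Tom,\nNotes attached.", recipient))  # False
```

Because the triggers are joined with a logical AND, a message that misses any single trigger is not flagged, which makes this rule-based detector easy to evade.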
