Mastering spaCy - Duygu Altınok - E-Book

Description

spaCy is an industrial-grade, efficient NLP Python library. It offers various pre-trained models and ready-to-use features. Mastering spaCy provides you with end-to-end coverage of spaCy's features and real-world applications.
You'll begin by installing spaCy and downloading models, before progressing to spaCy's features and prototyping real-world NLP apps. Next, you'll get familiar with spaCy's popular visualizer, displaCy. The book also equips you with practical illustrations for pattern matching and helps you advance into the world of semantics with word vectors. Statistical information extraction methods are also explained in detail. Later, you'll cover an interactive business case study that shows you how to combine all spaCy features to create a real-world NLP pipeline. You'll implement ML models for tasks such as sentiment analysis, intent recognition, and context resolution. The book further focuses on classification with popular frameworks such as TensorFlow's Keras API together with spaCy. You'll cover popular topics, including intent classification and sentiment analysis, apply them to popular datasets, and interpret the classification results.
By the end of this book, you'll be able to confidently use spaCy, including its linguistic features, word vectors, and classifiers, to create your own NLP apps.


Page count: 371




Mastering spaCy

An end-to-end practical guide to implementing NLP applications using the Python ecosystem

Duygu Altınok

BIRMINGHAM—MUMBAI

Mastering spaCy

Copyright © 2021 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Group Product Manager: Kunal Parikh

Publishing Product Manager: Ali Abidi

Senior Editor: Roshan Kumar

Content Development Editor: Tazeen Shaikh

Technical Editor: Sonam Pandey

Copy Editor: Safis Editing

Project Coordinator: Aparna Ravikumar Nair

Proofreader: Safis Editing

Indexer: Pratik Shirodkar

Production Designer: Joshua Misquitta

First published: July 2021

Production reference: 3211021

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK.

ISBN 978-1-80056-335-3

www.packt.com

To my mother, Ülker, for her life-long support and endless love. To my sister, for her support and inspiration. To my besties, Umutcan, Simge, and Aydan, for their friendship and support.

Contributors

About the author

Duygu Altınok is a senior Natural Language Processing (NLP) engineer with 12 years of experience in almost all areas of NLP, including search engine technology, speech recognition, text analytics, and conversational AI. She has published several papers in the NLP area at conferences such as LREC and CLNLP. She also enjoys working on open source projects and is a contributor to the spaCy library. Duygu earned her undergraduate degree in computer engineering from METU, Ankara, in 2010 and later earned her master's degree in mathematics from Bilkent University, Ankara, in 2012. She is currently a senior engineer at German Autolabs with a focus on conversational AI for voice assistants. Originally from Istanbul, Duygu currently resides in Berlin, Germany, with her cute dog Adele.

About the reviewers

Kevin Lu is currently a student studying software engineering at the University of Waterloo, with experience in full-stack web development, machine learning, computer vision, and natural language processing, and is the founder of the Python package PyATE (Python Automated Term Extraction). His interests include discrete mathematics, data science, algorithmic optimization, and deep learning. In the future, he is interested in pursuing research in NLP with deep learning and applications of it in accelerating academic research.

Usama Yaseen is currently a PhD candidate at Siemens AG (Munich) and the University of Munich. His research interests lie in data-efficient information extraction. Before starting his PhD, he was the lead data scientist at SAP SE, where he led a machine learning team focused on information extraction from semi-structured documents. He holds a master's from the Technical University of Munich in informatics; his master's thesis explored recurrent neural networks with external memory for question-answering systems. Overall, he has worked at Siemens (AG) (on corporate technology research), SAP SE (on machine learning), and Intel Corporation (on software development).

Souvik Roy is an NLP researcher. He primarily works on recurrent neural networks and transformer model compression methodologies such as pruning, quantization, tensor decomposition, and knowledge distillation to reduce the challenges faced by larger models, including longer training and inference times. He is passionate about working with textual data to solve underlying problems. Souvik has a master's in engineering from the University of Waterloo, specializing in text processing. Additionally, he has worked with Scribendi on document summarization and grammatical error correction. Since then, he has been working in diverse industrial research labs.

Carlos Fernando Schiaffin is passionate about analyzing and describing the underlying phenomena of human language. He is an NLP developer currently focused on conversational AI. He has a degree in linguistics and is a self-taught Python programmer. For more than five years, he has been working on NLP systems to try to understand and explain some of the speakers' linguistic behaviors. He started his career as a data tagger and soon went on to design annotation processes for linguistic data in Spanish, English, and Portuguese. Currently, he works with Rasa, spaCy, and others on the development of a conversational AI in Spanish. I thank Duygu Altınok for giving me the chance to participate in this book and my colleagues who always accompany my learning process.

Table of Contents

Preface

Section 1: Getting Started with spaCy

Chapter 1: Getting Started with spaCy

Technical requirements

Overview of spaCy

Rise of NLP

NLP with Python

Reviewing some useful string operations

Getting a high-level overview of the spaCy library

Tips for the reader

Installing spaCy

Installing spaCy with pip

Installing spaCy with conda

Installing spaCy on macOS/OS X

Installing spaCy on Windows

Troubleshooting while installing spaCy

Installing spaCy's statistical models

Installing language models

Visualization with displaCy

Getting started with displaCy

Entity visualizer

Visualizing within Python

Using displaCy in Jupyter notebooks

Exporting displaCy graphics as an image file

Summary

Chapter 2: Core Operations with spaCy

Technical requirements

Overview of spaCy conventions

Introducing tokenization

Customizing the tokenizer

Debugging the tokenizer

Sentence segmentation

Understanding lemmatization

Lemmatization in NLU

Understanding the difference between lemmatization and stemming

spaCy container objects

Doc

Token

Span

More spaCy features

Summary

Section 2: spaCy Features

Chapter 3: Linguistic Features

Technical requirements

What is POS tagging?

WSD

Verb tense and aspect in NLU applications

Understanding number, symbol, and punctuation tags

Introduction to dependency parsing

What is dependency parsing?

Dependency relations

Syntactic relations

Introducing NER

A real-world example

Merging and splitting tokens

Summary

Chapter 4: Rule-Based Matching

Token-based matching

Extended syntax support

Regex-like operators

Regex support

Matcher online demo

PhraseMatcher

EntityRuler

Combining spaCy models and matchers

Extracting IBAN and account numbers

Extracting phone numbers

Extracting mentions

Hashtag and emoji extraction

Expanding named entities

Combining linguistic features and named entities

Summary

Chapter 5: Working with Word Vectors and Semantic Similarity

Technical requirements

Understanding word vectors

One-hot encoding

Word vectors

Analogies and vector operations

How word vectors are produced

Using spaCy's pretrained vectors

The similarity method

Using third-party word vectors

Advanced semantic similarity methods

Understanding semantic similarity

Categorizing text with semantic similarity

Extracting key phrases

Extracting and comparing named entities

Summary

Chapter 6: Putting Everything Together: Semantic Parsing with spaCy

Technical requirements

Extracting named entities

Getting to know the ATIS dataset

Extracting named entities with Matcher

Using dependency trees for extracting entities

Using dependency relations for intent recognition

Linguistic primer

Extracting transitive verbs and their direct objects

Extracting multiple intents with conjunction relation

Recognizing the intent using wordlists

Semantic similarity methods for semantic parsing

Using synonyms lists for semantic similarity

Using word vectors to recognize semantic similarity

Putting it all together

Summary

Section 3: Machine Learning with spaCy

Chapter 7: Customizing spaCy Models

Technical requirements

Getting started with data preparation

Do spaCy models perform well enough on your data?

Does your domain include many labels that are absent in spaCy models?

Annotating and preparing data

Annotating data with Prodigy

Annotating data with Brat

spaCy training data format

Updating an existing pipeline component

Disabling the other statistical models

Model training procedure

Evaluating the updated NER

Saving and loading custom models

Training a pipeline component from scratch

Working with a real-world dataset

Summary

Chapter 8: Text Classification with spaCy

Technical requirements

Understanding the basics of text classification

Training the spaCy text classifier

Getting to know TextCategorizer class

Formatting training data for the TextCategorizer

Defining the training loop

Testing the new component

Training TextCategorizer for multilabel classification

Sentiment analysis with spaCy

Exploring the dataset

Training the TextClassifier component

Text classification with spaCy and Keras

What is a layer?

Sequential modeling with LSTMs

Keras Tokenizer

Embedding words

Neural network architecture for text classification

Summary

References

Chapter 9: spaCy and Transformers

Technical requirements

Transformers and transfer learning

Understanding BERT

BERT architecture

BERT input format

How is BERT trained?

Transformers and TensorFlow

HuggingFace Transformers

Using the BERT tokenizer

Obtaining BERT word vectors

Using BERT for text classification

Using Transformer pipelines

Transformers and spaCy

Summary

Chapter 10: Putting Everything Together: Designing Your Chatbot with spaCy

Technical requirements

Introduction to conversational AI

NLP components of conversational AI products

Getting to know the dataset

Entity extraction

Extracting city entities

Extracting date and time entities

Extracting phone numbers

Extracting cuisine types

Intent recognition

Pattern-based text classification

Classifying text with a character-level LSTM

Differentiating subjects from objects

Parsing the sentence type

Anaphora resolution

Summary

References

Other Books You May Enjoy

Section 1: Getting Started with spaCy

This section will begin with an overview of natural language processing (NLP) with Python and spaCy. You will learn how the book is organized and how to make the best use of the book. You will then start by installing spaCy and its statistical models and take a quick dive into the spaCy world. Basic operations, general conventions, and visualization are the core attractions of this section.

This section comprises the following chapters:

Chapter 1, Getting Started with spaCy

Chapter 2, Core Operations with spaCy