Mastering spaCy - Déborah Mesquita - E-Book

Description

Mastering spaCy, Second Edition is your comprehensive guide to building sophisticated NLP applications using the spaCy ecosystem. This revised edition embraces the latest advancements in NLP, featuring new chapters on large language models (LLMs) with spacy-llm, transformers integration, and end-to-end workflow management with Weasel.
With this new edition, you’ll learn to enhance NLP tasks using LLMs with spacy-llm, manage end-to-end workflows using Weasel, and integrate spaCy with third-party libraries such as Streamlit, FastAPI, and DVC. From training custom named entity recognition (NER) pipelines to categorizing emotions in Reddit posts, you’ll explore advanced topics like text classification and coreference resolution. This book takes you on a journey through spaCy’s capabilities, starting with the fundamentals of NLP, such as tokenization, named entity recognition, and dependency parsing. As you progress, you’ll delve into advanced topics like creating custom components, training domain-specific models, and building scalable NLP workflows.
By the end of the book, through practical examples, clear explanations, and tips and tricks, you will be empowered to build robust NLP pipelines and integrate them with web applications to build end-to-end solutions.

Page count: 265

Year of publication: 2025

Mastering spaCy

Build structured NLP solutions with custom components and models powered by spacy-llm

Déborah Mesquita

Duygu Altinok

Mastering spaCy

Copyright © 2025 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Group Product Manager: Niranjan Naikwadi

Publishing Product Manager: Tejashwini R

Book Project Manager: Aparna Ravikumar Nair

Content Engineer: Vandita Grover

Senior Content Development Editor: Priyanka Soam

Technical Editor: Seemanjay Ameriya

Copy Editor: Safis Editing

Proofreader: Priyanka Soam

Indexer: Manju Arasan

Production Designer: Nilesh Mohite

Growth Lead: Kunal Sawant

First published: July 2021

Second edition: February 2025

Production reference: 1240125

Published by Packt Publishing Ltd.

Grosvenor House

11 St Paul’s Square

Birmingham

B3 1RB, UK

ISBN 978-1-83588-046-3

www.packtpub.com

I’d like to thank everyone who directly or indirectly contributed to making this book happen. First and foremost, a huge shoutout to the team at Packt, who made the process of writing a book way less painful than it could’ve been. Special thanks to Aparna Nair and Priyanka Soam for being so understanding when I kept saying I needed more time to finish chapters, and to David for all the super valuable feedback on my very first chapter draft. I also want to thank Quincy Larson from FreeCodeCamp for accepting my submitted piece for Medium back in 2017 and editing it so well that it became a hit, even helping John Maeda learn TensorFlow. A huge thanks to my managers, Carlos Porto Filho and Talita Menezes Brognara, for always championing my work and supporting my growth—you’re the best managers anyone could ask for. To all my friends, thank you for sticking with me. Special thanks to Nicole Charron and Sabrina Guimarães for putting up with my daily complaints about having to finish a chapter, and to Suelen Mazza for always running late when we’d go out, giving me just a bit more time to write. Last but not least, to my family—thank you for loving me no matter my achievements. Augusta and Carlos, I’m so proud to have you as my parents. Thanks for teaching me to always be a good person; for me, that’s the most important lesson in life.

– Déborah Mesquita

Contributors

About the authors

Déborah Mesquita is a data science consultant and writer. With a BSc in computer science from UFPE, one of Brazil’s top computer science programs, she brings a diversified skill set refined through hands-on experience with various technologies. Déborah has consistently delivered exceptional results in various data science projects, navigating both the business and technical sides of each project. Her ability to translate complex concepts into simple language, coupled with her quick learning and broad vision, makes her an effective educator. Actively engaged in community initiatives, she works to ensure equitable access to knowledge, reflecting her belief that technology is not a panacea, but a powerful tool for societal improvement when used for that purpose. She writes a personal blog at deborahmesquita.com.

Duygu Altinok is a senior Natural Language Processing (NLP) engineer with 12 years of experience in almost all areas of NLP, including search engine technology, speech recognition, text analytics, and conversational AI. She has published several papers in the NLP domain at conferences such as LREC and CLNLP. She also enjoys working on open source projects and is a contributor to the spaCy library. Duygu earned her undergraduate degree in computer engineering from METU, Ankara, in 2010 and later earned her master’s degree in mathematics from Bilkent University, Ankara, in 2012. She is currently a senior engineer at German Autolabs with a focus on conversational AI for voice assistants. Originally from Istanbul, Duygu currently resides in Berlin, Germany, with her cute dog Adele.

About the reviewer

Souvik Roy is a senior data scientist at Sun Life Financial, specializing in NLP and machine learning to address challenges in the financial services domain. He has over four years of experience and a master’s degree in machine learning from the University of Waterloo. Souvik focuses on developing innovative solutions to optimize client experience interactions and enhance financial strategies. At Bell Canada, he improved cross-selling efficiency by 30% through advanced NLP solutions. He has also contributed to transformer model compression research at Huawei Noah’s Ark Lab to optimize inference on resource-constrained devices. He thanks the authors and Packt Publishing for the opportunity to contribute to this book.

Table of Contents

Preface

Part 1: Getting Started with spaCy

1

Getting Started with spaCy

Technical requirements

Overview of spaCy

A high-level overview of the spaCy library

Installing spaCy

Installing spaCy’s language models

Installing a language model

Visualization with displaCy

Getting started with displaCy

Entity visualizer

Using displaCy with pure Python

Using displaCy in Jupyter notebooks

Summary

2

Core Operations with spaCy

Technical requirements

Overview of spaCy conventions

Introducing Tokenization

Customizing the tokenizer

Debugging the tokenizer

Sentence segmentation

Understanding lemmatization

Lemmatization in NLU

spaCy container objects

Doc

Token

Span

More spaCy Token features

Summary

Part 2: Advanced Linguistic and Semantic Analysis

3

Extracting Linguistic Features

Technical requirements

What is POS tagging?

Word-Sense Disambiguation (WSD)

Introduction to dependency parsing

Dependency relations

Syntactic relations

Introducing NER

Merging and splitting tokens

Summary

4

Mastering Rule-Based Matching

Technical requirements

Token-based matching

Extended syntax support

Token attributes

Regex-like operators

Regex support

Matcher online demo

Creating patterns with PhraseMatcher

Creating patterns with SpanRuler

Combining spaCy models and matchers

Extracting an IBAN

Extracting phone numbers

Extracting mentions

Hashtag extraction

Expanding named entities

Summary

5

Extracting Semantic Representations with spaCy Pipelines

Technical requirements

Extracting named entities with SpanRuler

Getting to know the ATIS dataset

Defining LOCATION entities

Adding the SpanRuler component to our processing pipeline

Extracting dependency relations with DependencyMatcher

Linguistic primer

Matching patterns with the DependencyMatcher component

Creating a pipeline component using extension attributes

Running the pipeline with large datasets

Summary

6

Utilizing spaCy with Transformers

Technical requirements

Transformers and transfer learning

From LSTMs to Transformers

Text classification with spaCy

Training the TextCategorizer component

Using Hugging Face transformers in spaCy

The Transformer component

spaCy’s configuration system

Training the TextCategorizer with a config file

BERT and RoBERTa

Training the TextCategorizer with a transformer

Summary

Part 3: Customizing and Integrating NLP Workflows

7

Enhancing NLP Tasks Using LLMs with spacy-llm

Technical requirements

LLMs and prompt engineering basics

Text summarization with LLMs and spacy-llm

Creating custom spacy-llm tasks

Summary

8

Training an NER Component with Your Own Data

Technical requirements

Getting started with data preparation

Do spaCy models perform well enough on your data?

Does your domain include many labels that are absent in spaCy models?

Annotating and preparing data

Training an NER pipeline component

Evaluating the accuracy of the NER component

Training an NER component optimized for accuracy

Combining multiple NER components in the same pipeline

Creating a package for the trained pipeline

Creating a pipeline with different NER components

Summary

9

Creating End-to-End spaCy Workflows with Weasel

Technical requirements

Cloning and running a project template with Weasel

Modifying a project template for a different use case

Uploading and downloading project outputs to remote storage

Managing models with the DVC model registry

What is GitOps?

How DVC addresses common data science and ML challenges

From Weasel to DVC

Summary

10

Training an Entity Linker Model with spaCy

Technical requirements

Understanding the entity linking task

Best practices for creating a good NLP corpus

Training an EntityLinker component with spaCy

Training with a custom corpus reader

Testing the entity linking model

Summary

11

Integrating spaCy with Third-Party Libraries

Technical requirements

Building spaCy-powered Apps with Streamlit

Building NLP apps with spacy-streamlit

Building APIs for NLP models using FastAPI

Python type hinting 101

Creating an API for the spaCy model with FastAPI

Summary

Index

Other Books You May Enjoy

Part 1: Getting Started with spaCy

This section will introduce the basics of Natural Language Processing (NLP) with spaCy and guide you through the initial steps of setting up your environment. You’ll start by learning the core functionalities of spaCy, including its processing pipelines and data structures, providing a solid foundation for the more advanced topics that follow.

This part has the following chapters:

Chapter 1, Getting Started with spaCy

Chapter 2, Core Operations with spaCy