Flair is an easy-to-understand natural language processing (NLP) framework designed to facilitate training and distribution of state-of-the-art NLP models for named entity recognition, part-of-speech tagging, and text classification. Flair is also a text embedding library for combining different types of embeddings, such as document embeddings, Transformer embeddings, and the proposed Flair embeddings.
Natural Language Processing with Flair takes a hands-on approach to explaining and solving real-world NLP problems. You'll begin by installing Flair and learning about the basic NLP concepts and terminology. You will explore Flair's extensive features, such as sequence tagging, text classification, and word embeddings, through practical exercises. As you advance, you will train your own sequence labeling and text classification models and learn how to use hyperparameter tuning to choose the right training parameters. You will learn about the idea behind one-shot and few-shot learning through TARS, a novel text classification technique. Finally, you will solve several real-world NLP problems through hands-on exercises, as well as learn how to deploy Flair models to production.
By the end of this Flair book, you'll have developed a thorough understanding of typical NLP problems and you’ll be able to solve them with Flair.
A practical guide to understanding and solving NLP problems with Flair
Tadej Magajna
BIRMINGHAM—MUMBAI
Copyright © 2022 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Publishing Product Manager: Ali Abidi
Senior Editor: Nathanya Dias
Content Development Editor: Nazia Shaikh
Technical Editor: Devanshi Ayare
Copy Editor: Safis Editing
Project Coordinator: Aparna Ravikumar Nair
Proofreader: Safis Editing
Indexer: Sejal Dsilva
Production Designer: Aparna Bhagat
Marketing Coordinators: Abeer Dawe, Shifa Ansari
First published: April 2022
Production reference: 1210422
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-80107-231-1
www.packt.com
To my girlfriend and my family, who supported me throughout the entire process of writing this book. Special thanks to Prof. Dr. Alan Akbik, the original author of Flair, for help and for making the framework available to the public.
– Tadej Magajna
Tadej Magajna is a former lead machine learning engineer, former data scientist, and now a software engineer at Microsoft. He currently works in a team responsible for language model training and building language packs for keyboards such as Microsoft SwiftKey. He has a master's degree in computer science. He started his career at 15 as a web developer at a local media company and progressed toward more complex problems. He has tackled problems such as NLP market research, public transport bus and train capacity forecasting, and finally, language model training in his current role. Currently, he is based in his hometown of Ljubljana, Slovenia.
After studying engineering and physics, Pascal Tartarin held various senior management positions (sales, marketing, and R&D) mainly in the chemical and medical device industries in Europe and Asia Pacific. A keen lifelong learner, he later developed his expertise in data science and NLP. Over the last 6 years, he has built applications in the field of business (forecasts, CRM) based on ML, deep learning, NLP, and more recently, knowledge graphs (Neo4j).
Amardeep Kumar has sound experience in Natural Language Processing and machine learning systems. He has 2 years of industry experience working with different technical stacks in NLP and software development. Amardeep regularly contributes to open source software development as a developer, mentor, and reviewer in Google Summer of Code programs. He has published research papers at top-tier NLP conferences, such as ACL and EMNLP.
In this part, you will learn the basics of NLP and get an overview of the Flair framework. You will set up your environment, install Flair, and explore its basic features. You will learn how to extract knowledge from embeddings and use pre-trained sequence labeling models in Flair.
This part comprises the following chapters:
Chapter 1, Introduction to Flair
Chapter 2, Flair Base Types
Chapter 3, Embeddings in Flair
Chapter 4, Sequence Tagging

There are few Natural Language Processing (NLP) frameworks out there as easy to learn and as easy to work with as Flair. Packed with pre-trained models, excellent documentation, and readable syntax, it provides a gentle learning curve for NLP researchers who are not necessarily skilled in coding; software engineers without a strong theoretical foundation; students and graduates; as well as individuals with no prior knowledge who are simply interested in the topic. But before diving straight into coding, some background about the motivation behind Flair, the basic NLP concepts, and the different approaches to setting up your local environment may help you on your journey toward becoming a Flair NLP expert.
In Flair's official GitHub README, the framework is described as:
"A very simple framework for state-of-the-art Natural Language Processing"
This description may raise a few eyebrows. NLP researchers will immediately want to know in which specific tasks the framework achieves its state-of-the-art results. Engineers will be intrigued by the very simple label, but will wonder what steps are required to get up and running and what environments it can be used in. And those who are not knowledgeable in NLP will wonder whether they will be able to grasp the knowledge required to understand the problems Flair is trying to solve.
In this chapter, we will be answering all of these questions by covering the basic NLP concepts and terminology, providing an overview of Flair, and setting up our development environment with the help of the following sections:
A brief introduction to NLP
What is Flair?
Getting ready

To get started, you will need a development environment with Python 3.6+. Platform-specific instructions for installing Python can be found at https://docs.python-guide.org/starting/installation/.
You will not require a GPU-equipped development machine, though having one will significantly speed up some of the training-related exercises described later in the book.
You will require access to a command line. On Linux and macOS, simply start the Terminal application. On Windows, press Windows + R to open the Run box, type cmd and then click OK.
Flair's official GitHub repository is available via the following link: https://github.com/flairNLP/flair. In this chapter, we will install Flair version 0.11.
The code examples covered in this chapter can be found in this book's official GitHub repository in the following Jupyter notebook: https://github.com/PacktPublishing/Natural-Language-Processing-with-Flair/tree/main/Chapter01.
Before diving straight into what Flair is capable of and how to leverage its features, we will be going through a brief introduction to NLP to provide some context for readers who are not familiar with all the NLP techniques and tasks solved by Flair. NLP is a branch of artificial intelligence, linguistics, and software engineering that helps machines understand human language. When we humans read a sentence, our brains immediately make sense of many seemingly trivial problems such as the following:
Is the sentence written in a language I understand?
How can the sentence be split into words?
What is the relationship between the words?
What are the meanings of the individual words?
Is this a question or an answer?
Which part-of-speech categories are the words assigned to?
What is the abstract meaning of the sentence?

The human brain is excellent at solving these problems conjointly and often seamlessly, leaving us unaware that we made sense of all of these things simply by reading a sentence.
Even now, machines are still not as good as humans at solving all these problems at once. Therefore, to teach machines to understand human language, we have to split understanding of natural language into a set of smaller, machine-intelligible tasks that allow us to get answers to these questions one by one.
In this section, you will find a list of some important NLP tasks with emphasis on the tasks supported by Flair.
Tokenization is the process of breaking down a sentence or a document into meaningful units called tokens. A token can be a paragraph, a sentence, a collocation, or just a word.
For example, a word tokenizer would split the sentence Learning to use Flair into the list of tokens ["Learning", "to", "use", "Flair"].
Tokenization has to adhere to language-specific rules and is rarely a trivial task to solve. For example, with unspaced languages, where word boundaries aren't marked with spaces, it's very difficult to determine where one word ends and the next one starts. Well-defined token boundaries are a prerequisite for most NLP tasks that aim to process words, collocations, or sentences, including the tasks explained later in this chapter.
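As a quick illustration, here is a minimal sketch using Flair's Sentence class, which by default applies a word-level tokenizer to the text it is given (the example sentence is the one used above):

from flair.data import Sentence

# Creating a Sentence tokenizes the text with Flair's default word tokenizer
sentence = Sentence("Learning to use Flair")

# Print every token produced by the tokenizer
for token in sentence:
    print(token.text)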
Text vectorization is a process of transforming words, sentences, or documents in their written form into a numerical representation understandable to machines.
One of the simplest forms of text vectorization is one-hot encoding. It maps words to binary vectors of length equal to the number of words in the dictionary. All elements of the vector are 0 apart from the element that represents the word, which is set to 1 – hence the name one-hot.
For example, take the following dictionary:
Cat
Dog
Goat

The word cat would be the first word in our dictionary and its one-hot encoding would be [1, 0, 0]. The word dog would be the second word in our dictionary and its one-hot encoding would be [0, 1, 0]. And the word goat would be the third and last word in our dictionary and its one-hot encoding would be [0, 0, 1].
This approach, however, suffers from the problem of high dimensionality, as the length of the vector grows linearly with the number of words in the dictionary. It also doesn't capture any semantic meaning of the word. To counter this problem, most modern state-of-the-art approaches use representations called word or document embeddings. Each embedding is usually a fixed-length vector consisting of real numbers. While the numbers will at first seem unintelligible to a human, in some cases, a vector dimension may represent some abstract property of the word – for example, a dimension of a word-embedding vector could represent the general (positive or negative) sentiment of the word. Given two or more embeddings, we can compute how similar they are using a measure called cosine similarity. With many modern NLP solutions, including Flair, embeddings are used as the underlying input representation for higher-level NLP tasks such as named entity recognition.
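To make this concrete, here is a minimal sketch in plain Python with NumPy (not Flair-specific) that one-hot encodes the three-word dictionary above and compares two of the resulting vectors with cosine similarity:

import numpy as np

dictionary = ["cat", "dog", "goat"]

def one_hot(word):
    # Binary vector with a single 1 at the word's dictionary index
    vector = np.zeros(len(dictionary))
    vector[dictionary.index(word)] = 1
    return vector

def cosine_similarity(a, b):
    # 1.0 for vectors pointing in the same direction, 0.0 for orthogonal vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(one_hot("cat"))                                     # [1. 0. 0.]
print(cosine_similarity(one_hot("cat"), one_hot("dog")))  # 0.0

The zero similarity between cat and dog illustrates the shortcoming mentioned above: one-hot vectors treat every pair of distinct words as equally unrelated, whereas learned embeddings place semantically similar words close together.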
One of the main problems with early word embedding approaches was that words with multiple meanings (polysemic words) were limited to a single and constant embedding representation. One of the solutions to this problem in Flair is the use of contextual string embeddings where words are contextualized by their surrounding text, meaning that they will have a different representation given a different surrounding text.
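As a hedged sketch of what this looks like in practice (using the pre-trained Flair embeddings covered in Chapter 3; the two sentences and the choice of the news-forward model are illustrative), the same word receives a different vector depending on its context:

from flair.data import Sentence
from flair.embeddings import FlairEmbeddings

# Downloads the pre-trained forward character language model on first use
embedding = FlairEmbeddings("news-forward")

sentence_1 = Sentence("I deposited cash at the bank")
sentence_2 = Sentence("We had a picnic on the river bank")
embedding.embed(sentence_1)
embedding.embed(sentence_2)

# "bank" is the last token of both sentences, but its vectors differ
vector_1 = sentence_1.tokens[-1].embedding
vector_2 = sentence_2.tokens[-1].embedding
print((vector_1 - vector_2).abs().sum())  # non-zero: the two representations differ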
Named entity recognition (NER) is an NLP task or technique that identifies named entities in a text and tags them with their corresponding categories. Named entity categories include, but aren't limited to, places, person names, brands, time expressions, and monetary values.
The following figure illustrates NER using colored backgrounds and tags associated with the words:
Figure 1.1 – Visualization of NER tagging
In the previous example, we can see that three entities were identified and tagged. The first and third tags are particularly interesting because they both cover the same word, Berkeley, yet the first clearly refers to an organization whereas the third refers to a geographic location. The human brain is excellent at distinguishing between different entity types based on context and is able to do so almost seamlessly, whereas machines have struggled with it for decades. Recent advancements in contextual string embeddings, an essential part of Flair, made a huge leap forward in solving this problem.
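A minimal sketch of how such tagging might look in Flair, using one of its pre-trained English NER models (the model name ner and the example sentence are illustrative; pre-trained sequence labeling models are covered in detail in Chapter 4):

from flair.data import Sentence
from flair.models import SequenceTagger

# Downloads the pre-trained English NER model on first use
tagger = SequenceTagger.load("ner")

sentence = Sentence("George Washington went to Washington")
tagger.predict(sentence)

# Print every recognized entity span together with its predicted tag
for entity in sentence.get_spans("ner"):
    print(entity)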
Word-Sense Disambiguation (WSD) is an NLP technique concerned with identifying the intended sense of a given word with multiple meanings.
For example, take the given sentence:
George tried to return to Berlin to return his hat.
WSD would aim to identify the sense of the first use of the word return, referring to the act of giving something back, and the sense of the second return, referring to the act of going back to the same place.
Part-of-Speech (POS) tagging is a technique closely related to both WSD and NER that aims to tag words as corresponding to a particular part of speech such as nouns, verbs, adjectives, adverbs, and so on.
Figure 1.2 – Visualization of POS tagging
Actual POS taggers provide a lot more information with the tags than simply associating the words with noun/verb/adjective categories. For example, the Penn Treebank Project corpus, one of the most widely used POS-tagging corpora, distinguishes between 36 different types of POS tags.
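The sketch below shows how a pre-trained POS model could be applied in Flair (pos is one of the pre-trained model names distributed with the framework; the sentence and the exact tags produced are illustrative):

from flair.data import Sentence
from flair.models import SequenceTagger

# Downloads the pre-trained English POS tagger on first use
tagger = SequenceTagger.load("pos")

sentence = Sentence("A lovely day for tagging")
tagger.predict(sentence)

# Print the sentence with a POS tag attached to each token
print(sentence.to_tagged_string())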
Another NLP technique closely related to POS tagging is chunking. Unlike POS tagging, where we tag individual words, in chunking we identify complete short phrases such as noun phrases. In Figure 1.2, the phrase A lovely day can be considered a chunk as it is a noun phrase and, in its relationship to other words, works the same way as a noun.
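A similar sketch for chunking, assuming the pre-trained chunk model distributed with Flair (again, the sentence and the exact output are illustrative):

from flair.data import Sentence
from flair.models import SequenceTagger

# Downloads the pre-trained English chunking model on first use
tagger = SequenceTagger.load("chunk")

sentence = Sentence("A lovely day for a walk")
tagger.predict(sentence)

# Print the sentence with phrase-level chunk tags attached
print(sentence.to_tagged_string())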
Stemming and lemmatization are two closely related text normalization techniques used in NLP to reduce words to their common base forms. For example, the word play is the base form of the words playing, played, and plays.
The simpler of the two techniques, stemming, accomplishes this simply by cutting off the ends or beginnings of words. This simple solution often works, but is not foolproof. For example, the word ladies can never be transformed into the word lady by stemming alone. We therefore need a technique that understands the POS category of a word and takes its context into account. This technique is called lemmatization. The process of lemmatization can be demonstrated using the following example.
Take the following sentence:
this meeting was exhausting
Lemmatization reduces the previous sentence to the following:
this meeting be exhaust
It reduces the word was to be and the word exhausting to exhaust. Also note that the word meeting is used as a noun here and is therefore mapped to the same word, meeting, whereas if meeting were used as a verb, it would be reduced to meet.
A popular and easy-to-use library for performing lemmatization with Python is spaCy. Its models are trained on large corpora and are able to distinguish between different POS, yielding impressive results.
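A minimal sketch of lemmatizing the example sentence with spaCy is shown below (it assumes the small English model has been installed with python -m spacy download en_core_web_sm; the exact lemmas may vary slightly between model versions):

import spacy

# Load spaCy's small English pipeline, which includes a lemmatizer
nlp = spacy.load("en_core_web_sm")

doc = nlp("this meeting was exhausting")

# Print each token together with its lemma
for token in doc:
    print(token.text, "->", token.lemma_)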
Text classification is an NLP technique used to assign a text or a document to one or more classes or document types. Practical uses for text classification include spam filtering, language identification, sentiment analysis, and programming language identification from syntax.
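As a sketch of what text classification looks like in Flair (using the pre-trained English sentiment model as an example; the model name sentiment and the input sentence are illustrative, and training your own classifiers is covered later in the book):

from flair.data import Sentence
from flair.models import TextClassifier

# Downloads the pre-trained English sentiment classifier on first use
classifier = TextClassifier.load("sentiment")

sentence = Sentence("Flair makes NLP surprisingly approachable")
classifier.predict(sentence)

# Print the predicted class label (e.g., POSITIVE or NEGATIVE) and its confidence
print(sentence.labels)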
Having covered the basic NLP concepts and terminology, we can now move on to understanding what Flair is and how it manages to solve NLP tasks with state-of-the-art results.
Flair is a powerful NLP framework published as a Python package. It provides a simple interface that is friendly, easy to use, and caters to people from various backgrounds, including those with little prior programming knowledge. It is published under the MIT License, which is one of the most permissive free software licenses.
Flair as an NLP framework comes with a variety of tools and uses. It can be defined in the following ways:
It is an NLP framework used in NLP research for producing models that achieve state-of-the-art results across many NLP tasks, such as POS tagging, NER, and chunking, across several languages and datasets. In Flair's GitHub repository, you will find step-by-step instructions on how to reproduce these results.

It is a tool for training, validating, and distributing NER, POS tagging, chunking, word sense disambiguation, and text classification models. It features tools that help ease the training and validation processes, such as the automatic corpora downloading tool, and tools that facilitate model tuning, such as the hyperparameter optimization tool. It supports a growing number of languages.

It is a tool for downloading and using state-of-the-art pre-trained models. The models are downloaded seamlessly, meaning that they will be automatically downloaded the first time you use them and will remain stored for future use.

It is a platform for the proposed state-of-the-art Flair embeddings. The state-of-the-art results Flair achieves in many NLP tasks can by and large be attributed to its proposed contextual string embeddings, described in more detail in the paper Contextual String Embeddings for Sequence Labeling. The author refers to them as "the secret sauce" of Flair.

It is an NLP framework for working with biomedical data. A special section of Flair is dedicated solely to working with biomedical data and features a set of pre-trained models that achieve state-of-the-art results, as well as a number of corpora and comprehensive documentation on how to train custom biomedical tagging models.

It is a great practical introduction to NLP. Flair's extensive online documentation, simple interface, support for a large number of languages, and its ability to perform a lot of the tasks on non-GPU-equipped machines all make it an excellent entry point for someone aiming to learn about NLP through practical hands-on experimentation.

Now that you have a basic understanding of the features offered by the framework, as well as of the basic NLP concepts, you are ready to move on to the next step of setting up your development environment for Flair.
To be able to follow the instructions in this section, first make sure you have Python 3.6+ installed on your device as described in the Technical requirements section.
In Python, it's generally good practice to install packages in virtual environments so that the project dependencies you are currently working on will not affect your global Python dependencies or other projects you may work on in the future.
We will use the venv tool that is part of the Python Standard Library and requires no installation. To create a virtual environment, simply create a new directory, move into it, then run the following command:
$ python3 -m venv learning-flair
Then, to activate the virtual environment on Linux or macOS, run the following:
$ source learning-flair/bin/activate
If you are running Windows, run the following:
$ learning-flair\Scripts\activate.bat
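With the virtual environment created and activated, the next step would be to install Flair inside it, for example by pinning the version used in this chapter (a sketch of the install command):

$ pip install flair==0.11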