Flair is an easy-to-understand natural language processing (NLP) framework designed to facilitate training and distribution of state-of-the-art NLP models for named entity recognition, part-of-speech tagging, and text classification. Flair is also a text embedding library for combining different types of embeddings, such as document embeddings, Transformer embeddings, and the proposed Flair embeddings.
Natural Language Processing with Flair takes a hands-on approach to explaining and solving real-world NLP problems. You'll begin by installing Flair and learning about the basic NLP concepts and terminology. You will explore Flair's extensive features, such as sequence tagging, text classification, and word embeddings, through practical exercises. As you advance, you will train your own sequence labeling and text classification models and learn how to use hyperparameter tuning to choose the right training parameters. You will learn about the idea behind one-shot and few-shot learning through TARS, a novel text classification technique. Finally, you will solve several real-world NLP problems through hands-on exercises, as well as learn how to deploy Flair models to production.
By the end of this Flair book, you'll have developed a thorough understanding of typical NLP problems and you’ll be able to solve them with Flair.
A practical guide to understanding and solving NLP problems with Flair
Tadej Magajna
BIRMINGHAM—MUMBAI
Copyright © 2022 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Publishing Product Manager: Ali Abidi
Senior Editor: Nathanya Dias
Content Development Editor: Nazia Shaikh
Technical Editor: Devanshi Ayare
Copy Editor: Safis Editing
Project Coordinator: Aparna Ravikumar Nair
Proofreader: Safis Editing
Indexer: Sejal Dsilva
Production Designer: Aparna Bhagat
Marketing Coordinators: Abeer Dawe, Shifa Ansari
First published: April 2022
Production reference: 1210422
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-80107-231-1
www.packt.com
To my girlfriend and my family, who supported me throughout the entire process of writing this book. Special thanks to Prof. Dr. Alan Akbik, the original author of Flair, for help and for making the framework available to the public.
– Tadej Magajna
Tadej Magajna is a former lead machine learning engineer, former data scientist, and now a software engineer at Microsoft. He currently works in a team responsible for language model training and building language packs for keyboards such as Microsoft SwiftKey. He has a master's degree in computer science. He started his career at 15 as a web developer at a local media company and progressed toward more complex problems. He has tackled problems such as NLP market research, public transport bus and train capacity forecasting, and finally, language model training in his current role. Currently, he is based in his hometown of Ljubljana, Slovenia.
After studying engineering and physics, Pascal Tartarin held various senior management positions (sales, marketing, and R&D) mainly in the chemical and medical device industries in Europe and Asia Pacific. A keen lifelong learner, he later developed his expertise in data science and NLP. Over the last 6 years, he has built applications in the field of business (forecasts, CRM) based on ML, deep learning, NLP, and more recently, knowledge graphs (Neo4j).
Amardeep Kumar has sound experience in Natural Language Processing and machine learning systems. He has 2 years of industry experience working with different technical stacks in NLP and software development. Amardeep regularly contributes to open source software development as a developer, mentor, and reviewer in Google Summer of Code programs. He has published research papers at top-tier NLP conferences, such as ACL and EMNLP.
In this part, you will learn the basics of NLP and get an overview of the Flair framework. You will set up your environment, install Flair, and explore its basic features. You will learn how to extract knowledge from embeddings and use pre-trained sequence labeling models in Flair.
This part comprises the following chapters:
Chapter 1, Introduction to Flair
Chapter 2, Flair Base Types
Chapter 3, Embeddings in Flair
Chapter 4, Sequence Tagging

There are few Natural Language Processing (NLP) frameworks out there as easy to learn and as easy to work with as Flair. Packed with pre-trained models, excellent documentation, and readable syntax, it provides a gentle learning curve for NLP researchers who are not necessarily skilled in coding; software engineers without a strong theoretical foundation; students and graduates; as well as individuals with no prior knowledge who are simply interested in the topic. But before diving straight into coding, some background about the motivation behind Flair, the basic NLP concepts, and the different approaches to setting up your local environment may help you on your journey toward becoming a Flair NLP expert.
In Flair's official GitHub README, the framework is described as:
"A very simple framework for state-of-the-art Natural Language Processing"
This description may raise a few eyebrows. NLP researchers will immediately want to know in which specific tasks the framework achieves its state-of-the-art results. Engineers will be intrigued by the very simple label, but will wonder what steps are required to get up and running and what environments it can be used in. And those who are not knowledgeable in NLP will wonder whether they will be able to grasp the knowledge required to understand the problems Flair is trying to solve.
In this chapter, we will be answering all of these questions by covering the basic NLP concepts and terminology, providing an overview of Flair, and setting up our development environment with the help of the following sections:
A brief introduction to NLP
What is Flair?
Getting ready

To get started, you will need a development environment with Python 3.6+. Platform-specific instructions for installing Python can be found at https://docs.python-guide.org/starting/installation/.
You will not require a GPU-equipped development machine, though having one will significantly speed up some of the training-related exercises described later in the book.
You will require access to a command line. On Linux and macOS, simply start the Terminal application. On Windows, press Windows + R to open the Run box, type cmd and then click OK.
Flair's official GitHub repository is available via the following link: https://github.com/flairNLP/flair. In this chapter, we will install Flair version 0.11.
The code examples covered in this chapter can be found in this book's official GitHub repository in the following Jupyter notebook: https://github.com/PacktPublishing/Natural-Language-Processing-with-Flair/tree/main/Chapter01.
Before diving straight into what Flair is capable of and how to leverage its features, we will be going through a brief introduction to NLP to provide some context for readers who are not familiar with all the NLP techniques and tasks solved by Flair. NLP is a branch of artificial intelligence, linguistics, and software engineering that helps machines understand human language. When we humans read a sentence, our brains immediately make sense of many seemingly trivial problems such as the following:
Is the sentence written in a language I understand?
How can the sentence be split into words?
What is the relationship between the words?
What are the meanings of the individual words?
Is this a question or an answer?
Which part-of-speech categories are the words assigned to?
What is the abstract meaning of the sentence?

The human brain is excellent at solving these problems conjointly and often seamlessly, leaving us unaware that we made sense of all of these things simply by reading a sentence.
Even now, machines are still not as good as humans at solving all these problems at once. Therefore, to teach machines to understand human language, we have to split understanding of natural language into a set of smaller, machine-intelligible tasks that allow us to get answers to these questions one by one.
In this section, you will find a list of some important NLP tasks with emphasis on the tasks supported by Flair.
Tokenization is the process of breaking down a sentence or a document into meaningful units called tokens. A token can be a paragraph, a sentence, a collocation, or just a word.
For example, a word tokenizer would split the sentence Learning to use Flair into the list of tokens ["Learning", "to", "use", "Flair"].
Tokenization has to adhere to language-specific rules and is rarely a trivial task to solve. For example, with unspaced languages, where word boundaries aren't marked with spaces, it's very difficult to determine where one word ends and the next one starts. Well-defined token boundaries are a prerequisite for most NLP tasks that aim to process words, collocations, or sentences, including the tasks explained later in this chapter.
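As a quick illustration, here is a minimal sketch using Flair's Sentence class, which by default applies a word-level tokenizer to the text it is given (the example sentence is the one used above):

from flair.data import Sentence

# Creating a Sentence tokenizes the text with Flair's default word tokenizer
sentence = Sentence("Learning to use Flair")

# Print every token produced by the tokenizer
for token in sentence:
    print(token.text)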
Text vectorization is a process of transforming words, sentences, or documents in their written form into a numerical representation understandable to machines.
One of the simplest forms of text vectorization is one-hot encoding. It maps words to binary vectors of length equal to the number of words in the dictionary. All elements of the vector are 0 apart from the element that represents the word, which is set to 1 – hence the name one-hot.
For example, take the following dictionary:
Cat
Dog
Goat

The word cat would be the first word in our dictionary and its one-hot encoding would be [1, 0, 0]. The word dog would be the second word in our dictionary and its one-hot encoding would be [0, 1, 0]. And the word goat would be the third and last word in our dictionary and its one-hot encoding would be [0, 0, 1].
This approach, however, suffers from the problem of high dimensionality, as the length of the vector grows linearly with the number of words in the dictionary. It also doesn't capture any semantic meaning of the word. To counter this problem, most modern state-of-the-art approaches use representations called word or document embeddings. Each embedding is usually a fixed-length vector consisting of real numbers. While the numbers will at first seem unintelligible to a human, in some cases, a vector dimension may represent some abstract property of the word – for example, a dimension of a word-embedding vector could represent the general (positive or negative) sentiment of the word. Given two or more embeddings, we can compute how similar they are using a measure called cosine similarity. With many modern NLP solutions, including Flair, embeddings are used as the underlying input representation for higher-level NLP tasks such as named entity recognition.
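To make this concrete, here is a minimal sketch in plain Python with NumPy (not Flair-specific) that one-hot encodes the three-word dictionary above and compares two of the resulting vectors with cosine similarity:

import numpy as np

dictionary = ["cat", "dog", "goat"]

def one_hot(word):
    # Binary vector with a single 1 at the word's dictionary index
    vector = np.zeros(len(dictionary))
    vector[dictionary.index(word)] = 1
    return vector

def cosine_similarity(a, b):
    # 1.0 for vectors pointing in the same direction, 0.0 for orthogonal vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(one_hot("cat"))                                     # [1. 0. 0.]
print(cosine_similarity(one_hot("cat"), one_hot("dog")))  # 0.0

The zero similarity between cat and dog illustrates the shortcoming mentioned above: one-hot vectors treat every pair of distinct words as equally unrelated, whereas learned embeddings place semantically similar words close together.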
One of the main problems with early word embedding approaches was that words with multiple meanings (polysemic words) were limited to a single and constant embedding representation. One of the solutions to this problem in Flair is the use of contextual string embeddings where words are contextualized by their surrounding text, meaning that they will have a different representation given a different surrounding text.
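As a hedged sketch of what this looks like in practice (using the pre-trained Flair embeddings covered in Chapter 3; the two sentences and the choice of the news-forward model are illustrative), the same word receives a different vector depending on its context:

from flair.data import Sentence
from flair.embeddings import FlairEmbeddings

# Downloads the pre-trained forward character language model on first use
embedding = FlairEmbeddings("news-forward")

sentence_1 = Sentence("I deposited cash at the bank")
sentence_2 = Sentence("We had a picnic on the river bank")
embedding.embed(sentence_1)
embedding.embed(sentence_2)

# "bank" is the last token of both sentences, but its vectors differ
vector_1 = sentence_1.tokens[-1].embedding
vector_2 = sentence_2.tokens[-1].embedding
print((vector_1 - vector_2).abs().sum())  # non-zero: the two representations differ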
Named entity recognition (NER) is an NLP task or technique that identifies named entities in a text and tags them with their corresponding categories. Named entity categories include, but aren't limited to, places, person names, brands, time expressions, and monetary values.
The following figure illustrates NER using colored backgrounds and tags associated with the words:
Figure 1.1 – Visualization of NER tagging
In the previous example, we can see that three entities were identified and tagged. The first and third tags are particularly interesting because they both cover the same word, Berkeley, yet the first clearly refers to an organization whereas the third refers to a geographic location. The human brain is excellent at distinguishing between different entity types based on context and is able to do so almost seamlessly, whereas machines have struggled with it for decades. Recent advancements in contextual string embeddings, an essential part of Flair, made a huge leap forward in solving this problem.
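A minimal sketch of how such tagging might look in Flair, using one of its pre-trained English NER models (the model name ner and the example sentence are illustrative; pre-trained sequence labeling models are covered in detail in Chapter 4):

from flair.data import Sentence
from flair.models import SequenceTagger

# Downloads the pre-trained English NER model on first use
tagger = SequenceTagger.load("ner")

sentence = Sentence("George Washington went to Washington")
tagger.predict(sentence)

# Print every recognized entity span together with its predicted tag
for entity in sentence.get_spans("ner"):
    print(entity)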
Word-Sense Disambiguation (WSD) is an NLP technique concerned with identifying the intended sense of a given word with multiple meanings.
For example, take the given sentence:
George tried to return to Berlin to return his hat.
WSD would aim to identify the sense of the first use of the word return, referring to the act of giving something back, and the sense of the second return, referring to the act of going back to the same place.
Part-of-Speech (POS) tagging is a technique closely related to both WSD and NER that aims to tag words as corresponding to a particular part of speech such as nouns, verbs, adjectives, adverbs, and so on.
Figure 1.2 – Visualization of POS tagging
Actual POS taggers provide a lot more information with the tags than simply associating the words with noun/verb/adjective categories. For example, the Penn Treebank Project corpus, one of the most widely used POS-tagging corpora, distinguishes between 36 different types of POS tags.
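The sketch below shows how a pre-trained POS model could be applied in Flair (pos is one of the pre-trained model names distributed with the framework; the sentence and the exact tags produced are illustrative):

from flair.data import Sentence
from flair.models import SequenceTagger

# Downloads the pre-trained English POS tagger on first use
tagger = SequenceTagger.load("pos")

sentence = Sentence("A lovely day for tagging")
tagger.predict(sentence)

# Print the sentence with a POS tag attached to each token
print(sentence.to_tagged_string())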
Another NLP technique closely related to POS tagging is chunking. Unlike POS tagging, where we tag individual words, in chunking we identify complete short phrases such as noun phrases. In Figure 1.2, the phrase A lovely day can be considered a chunk as it is a noun phrase and, in its relationship to other words, works the same way as a noun.
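A similar sketch for chunking, assuming the pre-trained chunk model distributed with Flair (again, the sentence and the exact output are illustrative):

from flair.data import Sentence
from flair.models import SequenceTagger

# Downloads the pre-trained English chunking model on first use
tagger = SequenceTagger.load("chunk")

sentence = Sentence("A lovely day for a walk")
tagger.predict(sentence)

# Print the sentence with phrase-level chunk tags attached
print(sentence.to_tagged_string())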
Stemming and lemmatization are two closely related text normalization techniques used in NLP to reduce words to their common base forms. For example, the word play is the base form of the words playing, played, and plays.
The simpler of the two techniques, stemming, accomplishes this simply by cutting off the ends or beginnings of words. This simple solution often works, but is not foolproof. For example, the word ladies can never be transformed into the word lady by stemming alone. We therefore need a technique that understands the POS category of a word and takes its context into account. This technique is called lemmatization. The process of lemmatization can be demonstrated using the following example.
Take the following sentence:
this meeting was exhausting
Lemmatization reduces the previous sentence to the following:
this meeting be exhaust
It reduces the word was to be and the word exhausting to exhaust. Also note that the word meeting is used as a noun here and is therefore mapped to the same word, meeting, whereas if meeting were used as a verb, it would be reduced to meet.
A popular and easy-to-use library for performing lemmatization with Python is spaCy. Its models are trained on large corpora and are able to distinguish between different POS, yielding impressive results.
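A minimal sketch of lemmatizing the example sentence with spaCy is shown below (it assumes the small English model has been installed with python -m spacy download en_core_web_sm; the exact lemmas may vary slightly between model versions):

import spacy

# Load spaCy's small English pipeline, which includes a lemmatizer
nlp = spacy.load("en_core_web_sm")

doc = nlp("this meeting was exhausting")

# Print each token together with its lemma
for token in doc:
    print(token.text, "->", token.lemma_)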
Text classification is an NLP technique used to assign a text or a document to one or more classes or document types. Practical uses for text classification include spam filtering, language identification, sentiment analysis, and programming language identification from syntax.
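As a sketch of what text classification looks like in Flair (using the pre-trained English sentiment model as an example; the model name sentiment and the input sentence are illustrative, and training your own classifiers is covered later in the book):

from flair.data import Sentence
from flair.models import TextClassifier

# Downloads the pre-trained English sentiment classifier on first use
classifier = TextClassifier.load("sentiment")

sentence = Sentence("Flair makes NLP surprisingly approachable")
classifier.predict(sentence)

# Print the predicted class label (e.g., POSITIVE or NEGATIVE) and its confidence
print(sentence.labels)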
Having covered the basic NLP concepts and terminology, we can now move on to understanding what Flair is and how it manages to solve NLP tasks with state-of-the-art results.
Flair is a powerful NLP framework published as a Python package. It provides a simple interface that is friendly, easy to use, and caters to people from various backgrounds, including those with little prior programming knowledge. It is published under the MIT License, which is one of the most permissive free software licenses.
Flair as an NLP framework comes with a variety of tools and uses. It can be defined in the following ways:
It is an NLP framework used in NLP research for producing models that achieve state-of-the-art results across many NLP tasks, such as POS tagging, NER, and chunking, across several languages and datasets. In Flair's GitHub repository, you will find step-by-step instructions on how to reproduce these results.

It is a tool for training, validating, and distributing NER, POS tagging, chunking, word sense disambiguation, and text classification models. It features tools that help ease the training and validation processes, such as the automatic corpora downloading tool, and tools that facilitate model tuning, such as the hyperparameter optimization tool. It supports a growing number of languages.

It is a tool for downloading and using state-of-the-art pre-trained models. The models are downloaded seamlessly, meaning that they will be automatically downloaded the first time you use them and will remain stored for future use.

It is a platform for the proposed state-of-the-art Flair embeddings. The state-of-the-art results Flair achieves in many NLP tasks can by and large be attributed to its proposed contextual string embeddings, described in more detail in the paper Contextual String Embeddings for Sequence Labeling. The author refers to them as "the secret sauce" of Flair.

It is an NLP framework for working with biomedical data. A special section of Flair is dedicated solely to working with biomedical data and features a set of pre-trained models that achieve state-of-the-art results, as well as a number of corpora and comprehensive documentation on how to train custom biomedical tagging models.

It is a great practical introduction to NLP. Flair's extensive online documentation, simple interface, support for a large number of languages, and its ability to perform a lot of the tasks on non-GPU-equipped machines all make it an excellent entry point for someone aiming to learn about NLP through practical hands-on experimentation.

Now that you have a basic understanding of the features offered by the framework, as well as of the basic NLP concepts, you are ready to move on to the next step of setting up your development environment for Flair.
To be able to follow the instructions in this section, first make sure you have Python 3.6+ installed on your device as described in the Technical requirements section.
In Python, it's generally good practice to install packages in virtual environments so that the project dependencies you are currently working on will not affect your global Python dependencies or other projects you may work on in the future.
We will use the venv tool that is part of the Python Standard Library and requires no installation. To create a virtual environment, simply create a new directory, move into it, then run the following command:
$ python3 -m venv learning-flair
Then, to activate the virtual environment on Linux or macOS, run the following:
$ source learning-flair/bin/activate
If you are running Windows, run the following:
$ learning-flair\Scripts\activate.bat
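With the virtual environment created and activated, the next step would be to install Flair inside it, for example by pinning the version used in this chapter (a sketch of the install command):

$ pip install flair==0.11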