Foster your NLP applications with the help of deep learning, NLTK, and TensorFlow
Key Features
Book Description
Natural language processing (NLP) has found its application in various domains, such as web search, advertisements, and customer services, and with the help of deep learning, we can enhance its performance in these areas. Hands-On Natural Language Processing with Python teaches you how to leverage deep learning models for performing various NLP tasks, along with best practices in dealing with today's NLP challenges.
To begin with, you will understand the core concepts of NLP and deep learning, such as Convolutional Neural Networks (CNNs), recurrent neural networks (RNNs), semantic embedding, Word2vec, and more. You will learn how to perform a wide range of NLP tasks using neural networks, training and deploying them in your NLP applications. You will get accustomed to using RNNs and CNNs in various application areas, such as text classification and sequence labeling, which are essential for applications such as sentiment analysis, customer service chatbots, and anomaly detection. You will be equipped with the practical knowledge needed to implement deep learning in your linguistic applications using Python's popular deep learning library, TensorFlow.
By the end of this book, you will be well versed in building deep learning-backed NLP applications, along with overcoming NLP challenges with best practices developed by domain experts.
What you will learn
Who this book is for
Hands-On Natural Language Processing with Python is for you if you are a developer, machine learning engineer, or NLP engineer who wants to build deep learning applications that leverage NLP techniques. This comprehensive guide is also useful for deep learning users who want to extend their deep learning skills to building NLP applications. All you need is a basic knowledge of machine learning and Python to enjoy the book.
You can read this e-book in Legimi apps or in any app that supports the following format:
Page count: 306
Year of publication: 2018
Copyright © 2018 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Commissioning Editor: Pravin Dhandre
Acquisition Editor: Aman Singh
Content Development Editor: Snehal Kolte
Technical Editor: Sayli Nikalje
Copy Editor: Safis Editing
Project Coordinator: Manthan Patel
Proofreader: Safis Editing
Indexer: Pratik Shirodkar
Graphics: Jisha Chirayil
Production Coordinator: Nilesh Mohite
First published: July 2018
Production reference: 1160718
Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK.
ISBN 978-1-78913-949-5
www.packtpub.com
Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
Get a free eBook or video every month
Mapt is fully searchable
Copy and paste, print, and bookmark content
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
Intelligent digital assistants in the form of voice transcription, machine translation, conversational agents, and sentiment analysis are applied ubiquitously across various domains to facilitate human-computer interaction. Chatbots are becoming an integral part of many websites, while virtual assistants are gaining popularity in homes and offices. Consequently, given the numerous existing resources that cover these topics under the banner of natural language processing (NLP), a contribution that offers a comprehensive guide to both the fundamentals and the state of the art, and on top of that includes practical examples with the most popular frameworks and toolkits, is a rare find.
When I was first asked to write the foreword for this book, I was delighted to convey the level of passion that drove the authors to write it, yet uncertain how best to present an excellent source of up-to-date knowledge and a practical handbook of machine learning (ML) for NLP that truly stands out from the crowd.
The leading authors' reputation in ML needs no further explanation. Both were educated at world-class universities and have many years of leadership in ML development, confirming Rajesh and Rajalingappa's qualification to lead the authorship of this book. I have come to know them not only as knowledgeable individuals but also as passionate educators who convey the most sophisticated concepts in the simplest words. Raja's passion for helping start-ups get off the ground and offering his expertise to young companies with an open heart is admirable. I'm sure that, even as readers of this book, you can approach him with questions and be sure of a convincing response.
The book itself is very well organized and written to serve its purpose. From concrete examples that explain the fundamentals to code snippets that guide readers with different levels of deep learning background, the chapters are structured to retain the reader's full attention throughout. You will discover an exciting combination of the most popular techniques and state-of-the-art approaches to text processing and classification.
By reading this book, you can expect to learn how to perform common NLP tasks, such as preprocessing and exploratory analysis of text, using Python's Natural Language Toolkit. You will understand deep neural networks, Google's TensorFlow framework, and the building blocks of recurrent neural networks (RNNs), including Long Short-Term Memory. And you will grasp the notion of word embeddings, which allow for semantics in context.
Having taught the basics, the book further takes you through the development of architectures and deep neural network models for a variety of applications, including text classification, text generation and summarization, question-answering, language translation, speech recognition, and text-to-speech.
The book concludes by presenting various methods to deploy a trained model for NLP tasks, on a variety of platforms. By the end of your experience with the book, you will have learned the data science paradigm in NLP and can hopefully deploy deep learning models in commercial applications in a production environment as the authors envisioned.
Maryam Azh, PhDFounder of Overlay Technologies
Rajesh Arumugam is an ML developer at SAP, Singapore. Previously, he developed ML solutions for smart city development in areas such as passenger flow analysis in public transit systems and optimization of energy consumption in buildings when working with Centre for Social Innovation at Hitachi Asia, Singapore. He has published papers in conferences and has pending patents in storage and ML. He holds a PhD in computer engineering from Nanyang Technological University, Singapore.
Rajalingappaa Shanmugamani is a deep learning lead at SAP, Singapore. Previously, he worked and consulted at various start-ups for developing computer vision products. He has a masters from IIT Madras, where his thesis was based on applications of computer vision in manufacturing. He has published articles in peer-reviewed journals and conferences and applied for a few patents in ML. In his spare time, he teaches programming and machine learning to school students and engineers.
Karthik Muthusamy works for SAP, Singapore, as an ML researcher. He has designed and developed ML solutions for problems ranging from algorithms that guide autonomous vehicles to understanding semantic meanings of sentences in documents. He is currently a Google Developer Expert in ML. He gives talks and conducts workshops on ML for the developer community with an aim of reducing the entry barriers to developing ML applications. He graduated from Nanyang Technological University, Singapore, with a PhD in computer engineering.
Chaitanya Joshi is working toward a bachelor's in computer science at Nanyang Technological University, expected in 2019. He has experience in building deep learning solutions for automatic accounting at SAP, Singapore, and conversational chatbots at Evie.ai. He is also a research assistant with the dialog systems group at the Laboratory of Artificial Intelligence, Swiss Federal Institute of Technology, Lausanne (EPFL). His research at EPFL was recently published at the Conference on Neural Information Processing Systems (NIPS 2017) in Long Beach.
Auguste Byiringiro is an ML developer at SAP, Singapore, in the cash application team. Previously, he mainly worked in healthcare. At GE Healthcare, he built deep learning models to detect diseases in CT images. Then at Fibronostics, a start-up focused on non-invasive medical diagnosis, he heavily contributed to two products: LiverFASt, an ML-based tool to diagnose fatty liver disease, and HealthFACTR, a data-driven algorithm used by Gravity-Fitness First in Singapore to optimize the fitness, diet, and long-term health of its members.
Chintan Gajjar is an associate senior consultant at KNOWARTH Technologies. During his career, he has played a range of roles, developing ERP systems, search engines with Python, Single-Page Applications (SPAs), and mobile apps with Node.js, MongoDB, and AngularJS.
He received multiple awards in recognition of his valuable contributions to the team and the company. He has also contributed to the books Hadoop Backup and Recovery Solutions, MySQL 8 for Big Data, and MySQL 8 Administrator's Guide. He has a master's in computer applications from Ganpat University.
If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
Title Page
Copyright and Credits
Hands-On Natural Language Processing with Python
Packt Upsell
Why subscribe?
PacktPub.com
Foreword
Contributors
About the authors
About the reviewer
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Conventions used
Get in touch
Reviews
Getting Started
Basic concepts and terminologies in NLP
Text corpus or corpora
Paragraph
Sentences
Phrases and words
N-grams
Bag-of-words
Applications of NLP
Analyzing sentiment
Recognizing named entities
Linking entities
Translating text
Natural Language Inference
Semantic Role Labeling
Relation extraction
SQL query generation, or semantic parsing
Machine Comprehension
Textual Entailment
Coreference resolution
Searching
Question answering and chatbots
Converting text-to-voice
Converting voice-to-text
Speaker identification
Spoken dialog systems
Other applications
Summary
Text Classification and POS Tagging Using NLTK
Installing NLTK and its modules
Text preprocessing and exploratory analysis
Tokenization
Stemming
Removing stop words
Exploratory analysis of text
POS tagging
What is POS tagging?
Applications of POS tagging
Training a POS tagger
Training a sentiment classifier for movie reviews
Training a bag-of-words classifier
Summary
Deep Learning and TensorFlow
Deep learning
Perceptron
Activation functions
Sigmoid
Hyperbolic tangent
Rectified linear unit 
Neural network
One-hot encoding
Softmax
Cross-entropy
Training neural networks
Backpropagation
Gradient descent
Stochastic gradient descent
Regularization techniques
Dropout
Batch normalization
L1 and L2 normalization
Convolutional Neural Network
Kernel
Max pooling
Recurrent neural network
Long Short-Term Memory
TensorFlow
General Purpose – Graphics Processing Unit
CUDA
cuDNN
Installation
Hello world!
Adding two numbers
TensorBoard
The Keras library
Summary
Semantic Embedding Using Shallow Models
Word vectors
The classical approach
Word2vec
The CBOW model
The skip-gram model
A comparison of skip-gram and CBOW model architectures
Building a skip-gram model
Visualization of word embeddings
From word to document embeddings
Sentence2vec
Doc2vec
Visualization of document embeddings
Summary
Text Classification Using LSTM
Data for text classification
Topic modeling 
Topic modeling versus text classification
Deep learning meta architecture for text classification
Embedding layer
Deep representation
Fully connected part
Identifying spam in YouTube video comments using RNNs
Classifying news articles by topic using a CNN
Transfer learning using GloVe embeddings
Multi-label classification
Binary relevance
Deep learning for multi-label classification
Attention networks for document classification
Summary
Searching and Deduplicating Using CNNs
Data
Data description
Training the model
Encoding the text
Modeling with CNN
Training
Inference
Summary
Named Entity Recognition Using Character LSTM
NER with deep learning
Data
Model
Word embeddings
Walking through the code
Input
Word embedding
The effects of different pretrained word embeddings
Neural network architecture
Decoding predictions
The training step
Scope for improvement
Summary
Text Generation and Summarization Using GRUs
Generating text using RNNs
Generating Linux kernel code with a GRU
Text summarization
Extractive summarization
Summarization using gensim
Abstractive summarization
Encoder-decoder architecture
Encoder
Decoder
News summarization using GRU
Data preparation
Encoder network
Decoder network
Sequence to sequence
Building the graph
Training
Inference
TensorBoard visualization
State-of-the-art abstractive text summarization
Summary
Question-Answering and Chatbots Using Memory Networks
The Question-Answering task
Question-Answering datasets
Memory networks for Question-Answering
Memory network pipeline overview
Writing a memory network in TensorFlow
Class constructor
Input module
Question module
Memory module  
Output module
Putting it together
Extending memory networks for dialog modeling
Dialog datasets
The bAbI dialog dataset
Raw data format
Writing a chatbot in TensorFlow
Loading dialog datasets in the QA format
Vectorizing the data
Wrapping the memory network model in a chatbot class
Class constructor
Building a vocabulary for word embedding lookup
Training the chatbot model
Evaluating the chatbot on the testing set
Interacting with the chatbot
Putting it all together
Example of an interactive conversation
Literature on and related to memory networks
Summary
Machine Translation Using the Attention-Based Model
Overview of machine translation
Statistical machine translation
English to French using NLTK SMT models
Neural machine translation
Encoder-decoder network
Encoder-decoder with attention
NMT for French to English using attention
Data preparation
Encoder network
Decoder network
Sequence-to-sequence model
Building the graph
Training
Inference
TensorBoard visualization
Summary
Speech Recognition Using DeepSpeech
Overview of speech recognition
Building an RNN model for speech recognition
Audio signal representation
LSTM model for spoken digit recognition
TensorBoard visualization
Speech to text using the DeepSpeech architecture
Overview of the DeepSpeech model
Speech recordings dataset
Preprocessing the audio data
Creating the model
TensorBoard visualization
State-of-the-art in speech recognition
Summary
Text-to-Speech Using Tacotron
Overview of text to speech
Naturalness versus intelligibility 
How is the performance of a TTS system evaluated?
Traditional techniques – concatenative and parametric models
A few reminders on spectrograms and the mel scale
TTS in deep learning
WaveNet, in brief
Tacotron
The encoder
The attention-based decoder
The Griffin-Lim-based postprocessing module
Details of the architecture
Limitations
Implementation of Tacotron with Keras
The dataset
Data preparation
Preparation of text data
Preparation of audio data
Implementation of the architecture
Pre-net 
Encoder and postprocessing CBHG
Attention RNN
Decoder RNN
The attention mechanism
Full architecture, with attention
Training and testing
Summary
Deploying Trained Models
Increasing performance
Quantizing the weights
MobileNets
TensorFlow Serving
Exporting the trained model
Serving the exported model
Deploying in the cloud
Amazon Web Services
Google Cloud Platform
Deploying on mobile devices
iPhone
Android
Summary
Other Books You May Enjoy
Leave a review - let other readers know what you think
Before the advent of deep learning, traditional natural language processing (NLP) approaches had been widely used in tasks such as spam filtering, sentiment classification, and part of speech (POS) tagging. These classic approaches utilized statistical characteristics of sequences such as word count and co-occurrence, as well as simple linguistic features. However, the main disadvantage of these techniques was that they could not capture complex linguistic characteristics, such as context and intra-word dependencies.
Recent developments in neural networks and deep learning have given us powerful new tools to match human-level performance on NLP tasks and build products that deal with natural language. Deep learning for NLP is centered around the concept of word embeddings or vectors, popularized by models such as Word2vec, which encapsulate the meanings of words and phrases as dense vector representations. Word vectors, which are able to capture semantic information about words better than traditional one-hot representations, allow us to handle the temporal nature of language in an intuitive way when used in combination with a class of neural networks known as recurrent neural networks (RNNs). While RNNs can capture only local word dependencies, recently proposed vector-based operations for attention and alignment over word vector sequences allow neural networks to model global word dependencies, including context. Due to their capability to model the syntax and semantics of language, strong empirical performance, and ability to generalize to new data, neural networks have become the go-to model for building highly sophisticated commercial products, such as search engines, translation services, and dialog systems.
This book introduces the basic building blocks of deep learning models for NLP and explores cutting-edge techniques from recent literature. We take a problem-based approach, where we introduce new models as solutions to various NLP tasks. Our focus is on providing practical code implementations in Python that can be applied to your use cases to bring human capabilities into your applications.
This book is intended for developers who want to leverage NLP techniques to develop intelligent applications with rich human-centric interfaces. The book assumes introductory knowledge of machine learning (ML) or deep learning and intermediate Python programming skills. Our aim is to introduce cutting-edge techniques for NLP tasks, such as sentiment detection, conversational systems, language translation, speech-to-text, and much more, using the TensorFlow framework and Python.
The reader will go from the basic concepts of deep learning to state-of-the-art algorithms and best practices for dealing with natural language. Our focus is on implementing applications using real-world data and deploying deep learning models to add human capabilities to commercial applications in a production environment.
Chapter 1, Getting Started, explores the basic concepts of NLP and the various problems it tries to solve. We also look at some of the real-world applications to give the reader the feeling of the wide range of applications that leverage NLP.
Chapter 2, Text Classification and POS Tagging Using NLTK, introduces the popular NLTK Python library. We will be using NLTK to describe basic NLP tasks, such as tokenizing, stemming, tagging, and classic text classification. We also explore POS tagging with NLTK. We provide the reader with the tools and techniques necessary to prepare data for input into deep learning models.
Chapter 3, Deep Learning and TensorFlow, introduces the basic concepts of deep learning. This chapter will also help the reader to set up the environment and tools such as TensorFlow. At the end of the chapter, the reader will get an understanding of basic deep learning concepts, such as CNN, RNN, LSTM, attention-based models, and problems in NLP.
Chapter 4, Semantic Embedding Using Shallow Models, explores how to identify semantic relationships between words in a document, and in the process, we obtain a vector representation for the words in a corpus. The chapter describes developing word embedding models, such as CBOW, using neural networks. It also describes techniques for developing neural network models to obtain document vectors. At the end of this chapter, the reader will be familiar with training word, sentence, and document embeddings, and with visualizing them using simple networks.
Chapter 5, Text Classification Using LSTM, discusses various approaches for classifying text, a specific application of which is to classify the sentiments of words or phrases in a document. The chapter introduces the problem of text classification. Following this, we describe techniques for developing deep learning models using CNNs and LSTMs. The chapter also explains transfer learning for text classification using pretrained word embeddings. At the end, the reader will be familiar with implementing deep learning models for sentiment classification and spam detection, and with using pretrained word embeddings for their own classification tasks.
Chapter 6, Searching and Deduplicating Using CNNs, covers the problems of searching, matching and deduplicating documents and approaches used in solving them. The chapter describes developing deep learning models for searching text in a corpus. At the end of this chapter, you will learn to implement a CNN-based deep learning model for searching and deduplicating text.
Chapter 7, Named Entity Recognition Using Character LSTM, describes methods and approaches to perform Named Entity Recognition (NER), a sub-task of information extraction, to locate and classify entities in text of a document. The chapter introduces the problem of NER and the applications where it can be used. We then explain the implementation of a deep learning model using character-based LSTM for identifying named entities trained using labeled datasets.
Chapter 8, Text Generation and Summarization Using GRUs, covers the methods used for the task of generating text, an extension of which can be used to create summaries from text data. We then explain the implementation of a deep learning model for generating text. This is followed by a description of implementing GRU-based deep learning models to summarize text. At the end of this chapter, the reader will learn the techniques of implementing deep learning models for text generation and summarization.
Chapter 9, Question-Answering and Chatbots Using Memory Networks, describes how to train a deep learning model to answer questions and extend it to build a chatbot. The chapter introduces the problem of question answering and the approaches used in building an answering engine using deep learning models. We then describe how to leverage a question-answering engine to build a chatbot capable of answering questions like a conversation. At the end of this chapter, you will be able to implement an interactive chatbot.
Chapter 10, Machine Translation Using Attention-Based Models, covers various methods for translating text from one language to another, without the need to learn the grammatical structure of either language. The chapter introduces traditional machine translation approaches, such as Hidden Markov Model (HMM) based methods. We then explain the implementation of an encoder-decoder model with attention for translating text from French to English. At the end of this chapter, the reader will be able to implement deep learning models for translating text.
Chapter 11, Speech Recognition Using DeepSpeech, describes the problem of converting voice to text, as the beginning of a conversational interface. The chapter starts with feature extraction from speech data. This is followed by a brief introduction to the DeepSpeech architecture. We then explain the detailed implementation of the DeepSpeech architecture to transcribe speech to text. At the end of this chapter, the reader will be equipped with the knowledge to implement a speech-to-text deep learning model.
Chapter 12, Text to Speech Using Tacotron, describes the problem of converting text to speech. The chapter describes the implementation of the Tacotron model to convert text to voice. At the end, the reader will get familiar with the implementation of a text-to-speech model based on the Tacotron architecture.
Chapter 13, Deploying Trained Models, is the concluding chapter and describes model deployments in various cloud and mobile platforms.
The prerequisites for the book are basic knowledge of ML or deep learning and intermediate Python skills, although neither is mandatory. We have given a brief introduction to deep learning, touching upon topics such as multi-layer perceptrons, Convolutional Neural Networks (CNNs), and RNNs, in Chapter 3, Deep Learning and TensorFlow. It would be helpful if the reader knows general ML concepts, such as overfitting and model regularization, and classical models, such as linear regression and random forests. In the more advanced chapters, the reader might encounter in-depth code walkthroughs that expect at least a basic level of Python programming experience.
All the code examples in the book can be downloaded from the code book repository, as described in the next section. The examples mainly utilize open source tools and open data repositories, and were written for Python 3.5 or higher. The major libraries that are extensively used throughout the book are TensorFlow and NLTK. Detailed installation instructions for these packages can be found in Chapter 3, Deep Learning and TensorFlow, and Chapter 2, Text Classification and POS Tagging Using NLTK, respectively. Though a GPU is not required for the examples to run, it is advisable to have a system that has one. We recommend training the models from the second half of the book on a GPU, as the more complicated tasks involve bigger models and larger datasets.
You can download the example code files for this book from your account at www.packtpub.com. If you purchased this book elsewhere, you can visit www.packtpub.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:
1. Log in or register at www.packtpub.com.
2. Select the SUPPORT tab.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box and follow the onscreen instructions.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR/7-Zip for Windows
Zipeg/iZip/UnRarX for Mac
7-Zip/PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Hands-On-Natural-Language-Processing-with-Python. In case there's an update to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://www.packtpub.com/sites/default/files/downloads/HandsOnNaturalLanguageProcessingwithPython_ColorImages.pdf.
Feedback from our readers is always welcome.
General feedback: Email [email protected] and mention the book title in the subject of your message. If you have questions about any aspect of this book, please email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packtpub.com.
Natural language processing (NLP) is the field of understanding human language using computers. It involves the analysis and processing of large volumes of natural language data using computers to glean meaning and value for consumption in real-world applications. While NLP has been around since the 1950s, recent advances in machine learning (ML) and deep learning have driven tremendous growth in its practical applications. The majority of this book will focus on various real-world applications of NLP, such as text classification, and on sub-tasks of NLP, such as Named Entity Recognition (NER), with a particular emphasis on deep learning approaches. In this chapter, we will first introduce the basic concepts and terms in NLP. Following this, we will discuss some of the current applications that leverage NLP.
The following are some of the important terms and concepts in NLP, mostly related to language data. Getting familiar with them will help the reader get up to speed with the contents of the later chapters of the book:
Text corpus or corpora
Paragraph
Sentences
Phrases and words
N-grams
Bag-of-words
We will explain these in the following sections.
The language data that all NLP tasks depend upon is called the text corpus, or simply corpus. A corpus is a large body of text data in a language such as English or French. It can consist of a single document or a collection of documents. Sources of a text corpus include social network sites such as Twitter, blogs, open discussion forums such as Stack Overflow, books, and many others. Some tasks, such as machine translation, require a multilingual corpus. For example, we might need both the English and French translations of the same document content to develop a machine translation model. For speech tasks, we also need human voice recordings and the corresponding transcribed corpus.
In most of the later chapters, we will be using text corpora and speech recordings available from the internet or open source data repositories. For many NLP tasks, the corpus is split into chunks for further analysis. These chunks could be at the paragraph, sentence, or word level. We will touch upon these in the following sections.
A paragraph is the largest unit of text handled by an NLP task. Paragraph-level boundaries, by themselves, may not be of much use unless broken down into sentences, though a paragraph is sometimes treated as a context boundary. Tokenizers that can split a document into paragraphs are available in some Python libraries. We will look at such tokenizers in later chapters.
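As a minimal sketch (not one of the library tokenizers mentioned previously), a paragraph splitter can treat one or more blank lines as the boundary between paragraphs. This blank-line heuristic is an assumption that holds for most plain-text documents, but not for every format:

```python
import re

def split_paragraphs(document):
    """Split a document into paragraphs, using blank lines as boundaries."""
    # One or more blank lines (possibly containing spaces) ends a paragraph.
    parts = re.split(r"\n\s*\n", document)
    return [p.strip() for p in parts if p.strip()]

doc = "First paragraph line one.\nStill first paragraph.\n\nSecond paragraph."
print(split_paragraphs(doc))
# → ['First paragraph line one.\nStill first paragraph.', 'Second paragraph.']
```

Library tokenizers are preferable for real documents, where headings, lists, and hard-wrapped lines complicate the heuristic.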
Sentences are the next level of lexical unit in language data. A sentence encapsulates a complete thought along with its context, and is usually extracted from a paragraph based on boundaries determined by punctuation marks, such as the period. A sentence may also convey an opinion or sentiment. In general, sentences consist of parts of speech (POS) entities, such as nouns, verbs, and adjectives. There are tokenizers available to split paragraphs into sentences based on punctuation.
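A naive sentence splitter can be sketched with a regular expression that breaks on sentence-ending punctuation followed by whitespace. This is only an illustration of the idea; it mishandles abbreviations such as "Dr." and "e.g.", which is exactly why trained tokenizers exist:

```python
import re

def split_sentences(paragraph):
    """Naively split a paragraph into sentences after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [s for s in parts if s]

text = "It was raining. The match was cancelled! Will it resume tomorrow?"
print(split_sentences(text))
# → ['It was raining.', 'The match was cancelled!', 'Will it resume tomorrow?']
```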
A phrase is a group of consecutive words within a sentence that conveys a specific meaning. For example, in the sentence Tomorrow is going to be a rainy day, the part going to be a rainy day expresses a specific thought. Some NLP tasks extract key phrases from sentences for search and retrieval applications. The next smallest unit of text is the word. Common tokenizers split sentences into words based on delimiters such as spaces and commas. One of the problems in NLP is the ambiguity in the meaning of the same word used in different contexts. We will later see how this is handled well when we discuss word embeddings.
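A word tokenizer in the spirit described above can be sketched as a regular expression that keeps runs of letters, digits, and apostrophes, treating everything else (spaces, commas, periods) as delimiters. This is a simplified stand-in for the library tokenizers used later in the book:

```python
import re

def tokenize_words(sentence):
    """Split a sentence into word tokens, treating punctuation and spaces as delimiters."""
    return re.findall(r"[A-Za-z0-9']+", sentence)

print(tokenize_words("Tomorrow is going to be a rainy day."))
# → ['Tomorrow', 'is', 'going', 'to', 'be', 'a', 'rainy', 'day']
```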
A sequence of characters or words forms an N-gram. For example, a character unigram consists of a single character, a bigram consists of a sequence of two characters, and so on. Similarly, word N-grams consist of sequences of n consecutive words. In NLP, N-grams are used as features for tasks such as text classification.
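The definition above translates directly into code: an N-gram generator slides a window of size n over a sequence, which can be a string (character N-grams) or a list of words (word N-grams). The following is a minimal sketch:

```python
def ngrams(tokens, n):
    """Return all n-grams (as tuples) over a sequence of characters or words."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Character bigrams over a string:
print(ngrams("rainy", 2))
# → [('r', 'a'), ('a', 'i'), ('i', 'n'), ('n', 'y')]

# Word bigrams over a tokenized sentence:
print(ngrams("a rainy day".split(), 2))
# → [('a', 'rainy'), ('rainy', 'day')]
```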
Bag-of-words, in contrast to N-grams, does not consider word order or sequence; it captures only the word-occurrence frequencies in the text corpus. Bag-of-words features are also used in tasks such as sentiment analysis and topic identification.
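Because the bag-of-words representation is just a word-to-count mapping, it can be sketched in a few lines with Python's standard library:

```python
from collections import Counter

def bag_of_words(tokens):
    """Map each word to its occurrence count, discarding word order."""
    return Counter(token.lower() for token in tokens)

bow = bag_of_words("the cat sat on the mat".split())
print(bow["the"], bow["cat"])
# → 2 1
```

Note that "the cat sat on the mat" and "the mat sat on the cat" produce identical bags: order is lost by design.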
In the following sections, we will look at an overview of the following applications of NLP:
Analyzing sentiment
Recognizing named entities
Linking entities
Translating text
Natural language interfaces
Semantic role labeling
Relation extraction
SQL query generation, or semantic parsing
Machine comprehension
Textual entailment
Coreference resolution
Searching
Question answering and chatbots
Converting text to voice
Converting voice to text
Speaker identification
Spoken dialog systems
Other applications
In this section, we will provide an overview of the major applications of NLP. While the topics listed here are not exhaustive, they will give you a sense of the wide range of applications in which NLP is used.
The sentiment of a sentence or text reflects the overall positive, negative, or neutral opinion of the person who produces or consumes it. It indicates whether a person is happy, unhappy, or neutral about the subject or context described in the text. Sentiment can be quantified as a discrete value, such as 1 for happy, -1 for unhappy, and 0 for neutral, or on a continuous scale from 0 to 1. Sentiment analysis, therefore, is the process of deriving this value from a piece of text, which can be obtained from different data sources, such as social networks, product reviews, news articles, and so on. One real-world application of sentiment analysis is mining social network data to derive actionable insights, such as customer satisfaction, product or brand popularity, fashion trends, and so on. The screenshot that follows shows one application of sentiment analysis: capturing the overall opinion of a particular news article about Google. The reader may refer to the application, or API, from Google Cloud at https://cloud.google.com/natural-language/:
The preceding screenshot indicates that sentiment data is captured for the whole document, as well as at the individual sentence level.
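To make the idea of a sentiment score concrete, here is a toy lexicon-based sketch: each word carries a hand-assigned score in [-1, 1], and the sentence score is the average over the words found in the lexicon. The tiny lexicon here is invented for illustration; real systems learn these weights from data or use curated lexicons, and services such as the Google Cloud API shown above use far more sophisticated models:

```python
# Toy lexicon, invented for this sketch; real lexicons contain thousands of entries.
LEXICON = {"happy": 1.0, "great": 1.0, "love": 1.0,
           "unhappy": -1.0, "bad": -1.0, "hate": -1.0}

def sentiment_score(sentence):
    """Average the lexicon scores of the words; 0.0 means neutral or unknown."""
    words = sentence.lower().split()
    scores = [LEXICON[w] for w in words if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

print(sentiment_score("I love this great product"))  # → 1.0
print(sentiment_score("the weather today"))          # → 0.0
```

Averaging per sentence and per document is one simple way to obtain both the sentence-level and document-level scores that the screenshot illustrates.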
