Foster your NLP applications with the help of deep learning, NLTK, and TensorFlow
Key Features
Book Description
Natural language processing (NLP) has found its application in various domains, such as web search, advertisements, and customer services, and with the help of deep learning, we can enhance its performance in these areas. Hands-On Natural Language Processing with Python teaches you how to leverage deep learning models for performing various NLP tasks, along with best practices in dealing with today's NLP challenges.
To begin with, you will understand the core concepts of NLP and deep learning, such as Convolutional Neural Networks (CNNs), recurrent neural networks (RNNs), semantic embedding, Word2vec, and more. You will learn how to perform a wide range of NLP tasks using neural networks, training and deploying them in your NLP applications. You will get accustomed to using RNNs and CNNs in various application areas, such as text classification and sequence labeling, which are essential for applications such as sentiment analysis, customer service chatbots, and anomaly detection. You will be equipped with the practical knowledge needed to implement deep learning in your linguistic applications using Python's popular deep learning library, TensorFlow.
By the end of this book, you will be well versed in building deep learning-backed NLP applications, along with overcoming NLP challenges with best practices developed by domain experts.
What you will learn
Who this book is for
Hands-On Natural Language Processing with Python is for you if you are a developer, machine learning engineer, or NLP engineer who wants to build deep learning applications that leverage NLP techniques. This comprehensive guide is also useful for deep learning users who want to extend their deep learning skills to building NLP applications. All you need is a basic knowledge of machine learning and Python to enjoy the book.
You can read this e-book in Legimi apps or in any app that supports the following format:
Page count: 306
Year of publication: 2018
Copyright © 2018 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Commissioning Editor: Pravin Dhandre
Acquisition Editor: Aman Singh
Content Development Editor: Snehal Kolte
Technical Editor: Sayli Nikalje
Copy Editor: Safis Editing
Project Coordinator: Manthan Patel
Proofreader: Safis Editing
Indexer: Pratik Shirodkar
Graphics: Jisha Chirayil
Production Coordinator: Nilesh Mohite
First published: July 2018
Production reference: 1160718
Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK.
ISBN 978-1-78913-949-5
www.packtpub.com
Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
Get a free eBook or video every month
Mapt is fully searchable
Copy and paste, print, and bookmark content
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
Intelligent digital assistants in the form of voice transcription, machine translation, conversational agents, and sentiment analysis are applied ubiquitously across various domains to facilitate human-computer interaction. Chatbots are becoming an integral part of many websites, while virtual assistants are gaining popularity in homes and offices. Consequently, given the numerous existing resources that cover these topics under the banner of natural language processing (NLP), a contribution that offers a comprehensive guide to both the fundamentals and the state of the art, and on top of that includes practical examples with the most popular frameworks and toolkits, is a rare find.
When I was first asked to write the foreword for this book, I was delighted to convey the level of passion that drove the authors to write it, yet uncertain how best to present an excellent source of up-to-date knowledge and a practical handbook of machine learning (ML) for NLP that truly stands out from the crowd.
The leading authors' reputation in ML needs no further explanation. Both were educated at world-class universities and have many years of leadership in ML development, confirming Rajesh and Rajalingappa's qualification to lead the authorship of this book. I have come to know them not only as knowledgeable individuals but also as passionate educators who convey the most sophisticated concepts in the simplest words. Raja's passion for helping start-ups get off the ground and offering his expertise to young companies with an open heart is admirable. I'm sure that, even as readers of this book, you can approach him with questions and be sure of a convincing response.
The book itself is very well organized and written to serve its purpose. From concrete examples that explain the fundamentals to code snippets that guide readers with different levels of deep learning background, the chapters are structured to retain the reader's full attention throughout. You will discover an exciting combination of the most popular techniques and state-of-the-art approaches to text processing and classification.
By reading this book, you can expect to learn how to perform common NLP tasks, such as preprocessing and exploratory analysis of text, using Python's Natural Language Toolkit. You will understand deep neural networks, Google's TensorFlow framework, and the building blocks of recurrent neural networks (RNNs), including Long Short-Term Memory. And you will grasp the notion of word embeddings, which allow for semantics in context.
Having taught the basics, the book further takes you through the development of architectures and deep neural network models for a variety of applications, including text classification, text generation and summarization, question-answering, language translation, speech recognition, and text-to-speech.
The book concludes by presenting various methods to deploy a trained model for NLP tasks, on a variety of platforms. By the end of your experience with the book, you will have learned the data science paradigm in NLP and can hopefully deploy deep learning models in commercial applications in a production environment as the authors envisioned.
Maryam Azh, PhDFounder of Overlay Technologies
Rajesh Arumugam is an ML developer at SAP, Singapore. Previously, he developed ML solutions for smart city development in areas such as passenger flow analysis in public transit systems and optimization of energy consumption in buildings when working with Centre for Social Innovation at Hitachi Asia, Singapore. He has published papers in conferences and has pending patents in storage and ML. He holds a PhD in computer engineering from Nanyang Technological University, Singapore.
Rajalingappaa Shanmugamani is a deep learning lead at SAP, Singapore. Previously, he worked and consulted at various start-ups for developing computer vision products. He has a masters from IIT Madras, where his thesis was based on applications of computer vision in manufacturing. He has published articles in peer-reviewed journals and conferences and applied for a few patents in ML. In his spare time, he teaches programming and machine learning to school students and engineers.
Karthik Muthusamy works for SAP, Singapore, as an ML researcher. He has designed and developed ML solutions for problems ranging from algorithms that guide autonomous vehicles to understanding semantic meanings of sentences in documents. He is currently a Google Developer Expert in ML. He gives talks and conducts workshops on ML for the developer community with an aim of reducing the entry barriers to developing ML applications. He graduated from Nanyang Technological University, Singapore, with a PhD in computer engineering.
Chaitanya Joshi is working toward a bachelor's in computer science at Nanyang Technological University, expected in 2019. He has experience in building deep learning solutions for automatic accounting at SAP, Singapore, and conversational chatbots at Evie.ai. He is also a research assistant with the dialog systems group at the Laboratory of Artificial Intelligence, Swiss Federal Institute of Technology, Lausanne (EPFL). His research at EPFL was recently published at the Conference on Neural Information Processing Systems (NIPS 2017) in Long Beach.
Auguste Byiringiro is an ML developer at SAP, Singapore, in the cash application team. Previously, he mainly worked in healthcare. At GE Healthcare, he built deep learning models to detect diseases in CT images. Then at Fibronostics, a start-up focused on non-invasive medical diagnosis, he heavily contributed to two products: LiverFASt, an ML-based tool to diagnose fatty liver disease, and HealthFACTR, a data-driven algorithm used by Gravity-Fitness First in Singapore to optimize the fitness, diet, and long-term health of its members.
Chintan Gajjar is an associate senior consultant at KNOWARTH Technologies. During his career, he has played a range of roles, developing ERP systems, search engines with Python, Single-Page Applications (SPAs), and mobile apps with Node.js, MongoDB, and AngularJS.
He received multiple awards in recognition of his valuable contributions to the team and the company. He has also contributed to the books Hadoop Backup and Recovery Solutions, MySQL 8 for Big Data, and MySQL 8 Administrator's Guide. He has a master's in computer applications from Ganpat University.
If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
Title Page
Copyright and Credits
Hands-On Natural Language Processing with Python
Packt Upsell
Why subscribe?
PacktPub.com
Foreword
Contributors
About the authors
About the reviewer
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Conventions used
Get in touch
Reviews
Getting Started
Basic concepts and terminologies in NLP
Text corpus or corpora
Paragraph
Sentences
Phrases and words
N-grams
Bag-of-words
Applications of NLP
Analyzing sentiment
Recognizing named entities
Linking entities
Translating text
Natural Language Inference
Semantic Role Labeling
Relation extraction
SQL query generation, or semantic parsing
Machine Comprehension
Textual Entailment
Coreference resolution
Searching
Question answering and chatbots
Converting text-to-voice
Converting voice-to-text
Speaker identification
Spoken dialog systems
Other applications
Summary
Text Classification and POS Tagging Using NLTK
Installing NLTK and its modules
Text preprocessing and exploratory analysis
Tokenization
Stemming
Removing stop words
Exploratory analysis of text
POS tagging
What is POS tagging?
Applications of POS tagging
Training a POS tagger
Training a sentiment classifier for movie reviews
Training a bag-of-words classifier
Summary
Deep Learning and TensorFlow
Deep learning
Perceptron
Activation functions
Sigmoid
Hyperbolic tangent
Rectified linear unit 
Neural network
One-hot encoding
Softmax
Cross-entropy
Training neural networks
Backpropagation
Gradient descent
Stochastic gradient descent
Regularization techniques
Dropout
Batch normalization
L1 and L2 normalization
Convolutional Neural Network
Kernel
Max pooling
Recurrent neural network
Long Short-Term Memory
TensorFlow
General Purpose – Graphics Processing Unit
CUDA
cuDNN
Installation
Hello world!
Adding two numbers
TensorBoard
The Keras library
Summary
Semantic Embedding Using Shallow Models
Word vectors
The classical approach
Word2vec
The CBOW model
The skip-gram model
A comparison of skip-gram and CBOW model architectures
Building a skip-gram model
Visualization of word embeddings
From word to document embeddings
Sentence2vec
Doc2vec
Visualization of document embeddings
Summary
Text Classification Using LSTM
Data for text classification
Topic modeling 
Topic modeling versus text classification
Deep learning meta architecture for text classification
Embedding layer
Deep representation
Fully connected part
Identifying spam in YouTube video comments using RNNs
Classifying news articles by topic using a CNN
Transfer learning using GloVe embeddings
Multi-label classification
Binary relevance
Deep learning for multi-label classification
Attention networks for document classification
Summary
Searching and Deduplicating Using CNNs
Data
Data description
Training the model
Encoding the text
Modeling with CNN
Training
Inference
Summary
Named Entity Recognition Using Character LSTM
NER with deep learning
Data
Model
Word embeddings
Walking through the code
Input
Word embedding
The effects of different pretrained word embeddings
Neural network architecture
Decoding predictions
The training step
Scope for improvement
Summary
Text Generation and Summarization Using GRUs
Generating text using RNNs
Generating Linux kernel code with a GRU
Text summarization
Extractive summarization
Summarization using gensim
Abstractive summarization
Encoder-decoder architecture
Encoder
Decoder
News summarization using GRU
Data preparation
Encoder network
Decoder network
Sequence to sequence
Building the graph
Training
Inference
TensorBoard visualization
State-of-the-art abstractive text summarization
Summary
Question-Answering and Chatbots Using Memory Networks
The Question-Answering task
Question-Answering datasets
Memory networks for Question-Answering
Memory network pipeline overview
Writing a memory network in TensorFlow
Class constructor
Input module
Question module
Memory module  
Output module
Putting it together
Extending memory networks for dialog modeling
Dialog datasets
The bAbI dialog dataset
Raw data format
Writing a chatbot in TensorFlow
Loading dialog datasets in the QA format
Vectorizing the data
Wrapping the memory network model in a chatbot class
Class constructor
Building a vocabulary for word embedding lookup
Training the chatbot model
Evaluating the chatbot on the testing set
Interacting with the chatbot
Putting it all together
Example of an interactive conversation
Literature on and related to memory networks
Summary
Machine Translation Using the Attention-Based Model
Overview of machine translation
Statistical machine translation
English to French using NLTK SMT models
Neural machine translation
Encoder-decoder network
Encoder-decoder with attention
NMT for French to English using attention
Data preparation
Encoder network
Decoder network
Sequence-to-sequence model
Building the graph
Training
Inference
TensorBoard visualization
Summary
Speech Recognition Using DeepSpeech
Overview of speech recognition
Building an RNN model for speech recognition
Audio signal representation
LSTM model for spoken digit recognition
TensorBoard visualization
Speech to text using the DeepSpeech architecture
Overview of the DeepSpeech model
Speech recordings dataset
Preprocessing the audio data
Creating the model
TensorBoard visualization
State-of-the-art in speech recognition
Summary
Text-to-Speech Using Tacotron
Overview of text to speech
Naturalness versus intelligibility 
How is the performance of a TTS system evaluated?
Traditional techniques – concatenative and parametric models
A few reminders on spectrograms and the mel scale
TTS in deep learning
WaveNet, in brief
Tacotron
The encoder
The attention-based decoder
The Griffin-Lim-based postprocessing module
Details of the architecture
Limitations
Implementation of Tacotron with Keras
The dataset
Data preparation
Preparation of text data
Preparation of audio data
Implementation of the architecture
Pre-net 
Encoder and postprocessing CBHG
Attention RNN
Decoder RNN
The attention mechanism
Full architecture, with attention
Training and testing
Summary
Deploying Trained Models
Increasing performance
Quantizing the weights
MobileNets
TensorFlow Serving
Exporting the trained model
Serving the exported model
Deploying in the cloud
Amazon Web Services
Google Cloud Platform
Deploying on mobile devices
iPhone
Android
Summary
Other Books You May Enjoy
Leave a review - let other readers know what you think
Before the advent of deep learning, traditional natural language processing (NLP) approaches had been widely used in tasks such as spam filtering, sentiment classification, and part of speech (POS) tagging. These classic approaches utilized statistical characteristics of sequences such as word count and co-occurrence, as well as simple linguistic features. However, the main disadvantage of these techniques was that they could not capture complex linguistic characteristics, such as context and intra-word dependencies.
Recent developments in neural networks and deep learning have given us powerful new tools to match human-level performance on NLP tasks and build products that deal with natural language. Deep learning for NLP is centered around the concept of word embeddings or vectors, popularized by models such as Word2vec, which encapsulate the meanings of words and phrases as dense vector representations. Word vectors, which are able to capture semantic information about words better than traditional one-hot representations, allow us to handle the temporal nature of language in an intuitive way when used in combination with a class of neural networks known as recurrent neural networks (RNNs). While RNNs can capture only local word dependencies, recently proposed vector-based operations for attention and alignment over word vector sequences allow neural networks to model global word dependencies, including context. Due to their capability to model the syntax and semantics of language, strong empirical performance, and ability to generalize to new data, neural networks have become the go-to model for building highly sophisticated commercial products, such as search engines, translation services, and dialog systems.
This book introduces the basic building blocks of deep learning models for NLP and explores cutting-edge techniques from recent literature. We take a problem-based approach, where we introduce new models as solutions to various NLP tasks. Our focus is on providing practical code implementations in Python that can be applied to your use cases to bring human capabilities into your applications.
This book is intended for developers who want to leverage NLP techniques to develop intelligent applications with rich human-centric interfaces. The book assumes introductory knowledge of machine learning (ML) or deep learning and intermediate Python programming skills. Our aim is to introduce cutting-edge techniques for NLP tasks, such as sentiment detection, conversational systems, language translation, speech-to-text, and much more, using the TensorFlow framework and Python.
The reader will go from the basic concepts of deep learning to state-of-the-art algorithms and best practices for dealing with natural language. Our focus is on implementing applications using real-world data and deploying deep learning models to add human capabilities to commercial applications in a production environment.
Chapter 1, Getting Started, explores the basic concepts of NLP and the various problems it tries to solve. We also look at some of the real-world applications to give the reader the feeling of the wide range of applications that leverage NLP.
Chapter 2, Text Classification and POS Tagging Using NLTK, introduces the popular NLTK Python library. We will be using NLTK to describe basic NLP tasks, such as tokenizing, stemming, tagging, and classic text classification. We also explore POS tagging with NLTK. We provide the reader with the tools and techniques necessary to prepare data for input into deep learning models.
Chapter 3, Deep Learning and TensorFlow, introduces the basic concepts of deep learning. This chapter will also help the reader to set up the environment and tools such as TensorFlow. At the end of the chapter, the reader will get an understanding of basic deep learning concepts, such as CNN, RNN, LSTM, attention-based models, and problems in NLP.
Chapter 4, Semantic Embedding Using Shallow Models, explores how to identify semantic relationships between words in a document, and in the process, we obtain a vector representation for the words in a corpus. The chapter describes developing word embedding models, such as CBOW, using neural networks. It also describes techniques for developing neural network models to obtain document vectors. At the end of this chapter, the reader will be familiar with training word, sentence, and document embeddings, and with visualizing them using simple networks.
Chapter 5, Text Classification Using LSTM, discusses various approaches for classifying text, a specific application of which is to classify the sentiments of words or phrases in a document. The chapter introduces the problem of text classification. Following this, we describe techniques for developing deep learning models using CNNs and LSTMs. The chapter also explains transfer learning for text classification using pretrained word embeddings. At the end, the reader will be familiar with implementing deep learning models for sentiment classification and spam detection, and with using pretrained word embeddings for their own classification tasks.
Chapter 6, Searching and Deduplicating Using CNNs, covers the problems of searching, matching and deduplicating documents and approaches used in solving them. The chapter describes developing deep learning models for searching text in a corpus. At the end of this chapter, you will learn to implement a CNN-based deep learning model for searching and deduplicating text.
Chapter 7, Named Entity Recognition Using Character LSTM, describes methods and approaches to perform Named Entity Recognition (NER), a sub-task of information extraction, to locate and classify entities in text of a document. The chapter introduces the problem of NER and the applications where it can be used. We then explain the implementation of a deep learning model using character-based LSTM for identifying named entities trained using labeled datasets.
Chapter 8, Text Generation and Summarization Using GRUs, covers the methods used for the task of generating text, an extension of which can be used to create summaries from text data. We then explain the implementation of a deep learning model for generating text. This is followed by a description of implementing GRU-based deep learning models to summarize text. At the end of this chapter, the reader will learn the techniques of implementing deep learning models for text generation and summarization.
Chapter 9, Question-Answering and Chatbots Using Memory Networks, describes how to train a deep learning model to answer questions and extend it to build a chatbot. The chapter introduces the problem of question answering and the approaches used in building an answering engine using deep learning models. We then describe how to leverage a question-answering engine to build a chatbot capable of answering questions like a conversation. At the end of this chapter, you will be able to implement an interactive chatbot.
Chapter 10, Machine Translation Using Attention-Based Models, covers various methods for translating text from one language to another, without the need to learn the grammatical structure of either language. The chapter introduces traditional machine translation approaches, such as Hidden Markov Model (HMM) based methods. We then explain the implementation of an encoder-decoder model with attention for translating text from French to English. At the end of this chapter, the reader will be able to implement deep learning models for translating text.
Chapter 11, Speech Recognition Using DeepSpeech, describes the problem of converting voice to text, as the beginning of a conversational interface. The chapter starts with feature extraction from speech data. This is followed by a brief introduction to the DeepSpeech architecture. We then explain the detailed implementation of the DeepSpeech architecture to transcribe speech to text. At the end of this chapter, the reader will be equipped with the knowledge to implement a speech-to-text deep learning model.
Chapter 12, Text to Speech Using Tacotron, describes the problem of converting text to speech. The chapter describes the implementation of the Tacotron model to convert text to voice. At the end, the reader will get familiar with the implementation of a text-to-speech model based on the Tacotron architecture.
Chapter 13, Deploying Trained Models, is the concluding chapter and describes model deployments in various cloud and mobile platforms.
The prerequisites for the book are basic knowledge of ML or deep learning and intermediate Python skills, although neither is mandatory. We have given a brief introduction to deep learning, touching upon topics such as multi-layer perceptrons, Convolutional Neural Networks (CNNs), and RNNs, in Chapter 3, Deep Learning and TensorFlow. It would be helpful if the reader knows general ML concepts, such as overfitting and model regularization, and classical models, such as linear regression and random forests. In the more advanced chapters, the reader might encounter in-depth code walkthroughs that expect at least a basic level of Python programming experience.
All the code examples in the book can be downloaded from the code book repository, as described in the next section. The examples mainly utilize open source tools and open data repositories, and were written for Python 3.5 or higher. The major libraries that are extensively used throughout the book are TensorFlow and NLTK. Detailed installation instructions for these packages can be found in Chapter 3, Deep Learning and TensorFlow, and Chapter 2, Text Classification and POS Tagging Using NLTK, respectively. Though a GPU is not required for the examples to run, it is advisable to have a system that has one. We recommend training the models from the second half of the book on a GPU, as the more complicated tasks involve bigger models and larger datasets.
You can download the example code files for this book from your account at www.packtpub.com. If you purchased this book elsewhere, you can visit www.packtpub.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:
1. Log in or register at www.packtpub.com.
2. Select the SUPPORT tab.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box and follow the onscreen instructions.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR/7-Zip for Windows
Zipeg/iZip/UnRarX for Mac
7-Zip/PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Hands-On-Natural-Language-Processing-with-Python. In case there's an update to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://www.packtpub.com/sites/default/files/downloads/HandsOnNaturalLanguageProcessingwithPython_ColorImages.pdf.
Feedback from our readers is always welcome.
General feedback: Email [email protected] and mention the book title in the subject of your message. If you have questions about any aspect of this book, please email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packtpub.com.
Natural language processing (NLP) is the field of understanding human language using computers. It involves the analysis and processing of large volumes of natural language data using computers to glean meaning and value for consumption in real-world applications. While NLP has been around since the 1950s, recent advances in machine learning (ML) and deep learning have driven tremendous growth in its practical applications. The majority of this book will focus on various real-world applications of NLP, such as text classification, and on sub-tasks of NLP, such as Named Entity Recognition (NER), with a particular emphasis on deep learning approaches. In this chapter, we will first introduce the basic concepts and terms in NLP. Following this, we will discuss some of the current applications that leverage NLP.
The following are some of the important terms and concepts in NLP, mostly related to language data. Getting familiar with them will help the reader get up to speed with the contents of the later chapters of the book:
Text corpus or corpora
Paragraph
Sentences
Phrases and words
N-grams
Bag-of-words
We will explain these in the following sections.
The language data that all NLP tasks depend upon is called the text corpus, or simply corpus. A corpus is a large body of text data in a language such as English or French. It can consist of a single document or a collection of documents. Sources of a text corpus include social network sites such as Twitter, blogs, open discussion forums such as Stack Overflow, books, and many others. Some tasks, such as machine translation, require a multilingual corpus. For example, we might need both the English and French translations of the same document content to develop a machine translation model. For speech tasks, we also need human voice recordings and the corresponding transcribed corpus.
In most of the later chapters, we will be using text corpora and speech recordings available from the internet or open source data repositories. For many NLP tasks, the corpus is split into chunks for further analysis. These chunks could be at the paragraph, sentence, or word level. We will touch upon these in the following sections.
A paragraph is the largest unit of text handled by an NLP task. Paragraph-level boundaries, by themselves, may not be of much use unless broken down into sentences, though a paragraph is sometimes treated as a context boundary. Tokenizers that can split a document into paragraphs are available in some Python libraries. We will look at such tokenizers in later chapters.
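As a minimal sketch (not one of the library tokenizers mentioned previously), a paragraph splitter can treat one or more blank lines as the boundary between paragraphs. This blank-line heuristic is an assumption that holds for most plain-text documents, but not for every format:

```python
import re

def split_paragraphs(document):
    """Split a document into paragraphs, using blank lines as boundaries."""
    # One or more blank lines (possibly containing spaces) ends a paragraph.
    parts = re.split(r"\n\s*\n", document)
    return [p.strip() for p in parts if p.strip()]

doc = "First paragraph line one.\nStill first paragraph.\n\nSecond paragraph."
print(split_paragraphs(doc))
# → ['First paragraph line one.\nStill first paragraph.', 'Second paragraph.']
```

Library tokenizers are preferable for real documents, where headings, lists, and hard-wrapped lines complicate the heuristic.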
Sentences are the next level of lexical unit in language data. A sentence encapsulates a complete thought along with its context, and is usually extracted from a paragraph based on boundaries determined by punctuation marks, such as the period. A sentence may also convey an opinion or sentiment. In general, sentences consist of parts of speech (POS) entities, such as nouns, verbs, and adjectives. There are tokenizers available to split paragraphs into sentences based on punctuation.
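A naive sentence splitter can be sketched with a regular expression that breaks on sentence-ending punctuation followed by whitespace. This is only an illustration of the idea; it mishandles abbreviations such as "Dr." and "e.g.", which is exactly why trained tokenizers exist:

```python
import re

def split_sentences(paragraph):
    """Naively split a paragraph into sentences after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [s for s in parts if s]

text = "It was raining. The match was cancelled! Will it resume tomorrow?"
print(split_sentences(text))
# → ['It was raining.', 'The match was cancelled!', 'Will it resume tomorrow?']
```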
A phrase is a group of consecutive words within a sentence that conveys a specific meaning. For example, in the sentence Tomorrow is going to be a rainy day, the part going to be a rainy day expresses a specific thought. Some NLP tasks extract key phrases from sentences for search and retrieval applications. The next smallest unit of text is the word. Common tokenizers split sentences into words based on delimiters such as spaces and commas. One of the problems in NLP is the ambiguity in the meaning of the same word used in different contexts. We will later see how this is handled well when we discuss word embeddings.
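A word tokenizer in the spirit described above can be sketched as a regular expression that keeps runs of letters, digits, and apostrophes, treating everything else (spaces, commas, periods) as delimiters. This is a simplified stand-in for the library tokenizers used later in the book:

```python
import re

def tokenize_words(sentence):
    """Split a sentence into word tokens, treating punctuation and spaces as delimiters."""
    return re.findall(r"[A-Za-z0-9']+", sentence)

print(tokenize_words("Tomorrow is going to be a rainy day."))
# → ['Tomorrow', 'is', 'going', 'to', 'be', 'a', 'rainy', 'day']
```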
A sequence of characters or words forms an N-gram. For example, a character unigram consists of a single character, a bigram consists of a sequence of two characters, and so on. Similarly, word N-grams consist of sequences of n consecutive words. In NLP, N-grams are used as features for tasks such as text classification.
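The definition above translates directly into code: an N-gram generator slides a window of size n over a sequence, which can be a string (character N-grams) or a list of words (word N-grams). The following is a minimal sketch:

```python
def ngrams(tokens, n):
    """Return all n-grams (as tuples) over a sequence of characters or words."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Character bigrams over a string:
print(ngrams("rainy", 2))
# → [('r', 'a'), ('a', 'i'), ('i', 'n'), ('n', 'y')]

# Word bigrams over a tokenized sentence:
print(ngrams("a rainy day".split(), 2))
# → [('a', 'rainy'), ('rainy', 'day')]
```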
Bag-of-words, in contrast to N-grams, does not consider word order or sequence; it captures only the word-occurrence frequencies in the text corpus. Bag-of-words features are also used in tasks such as sentiment analysis and topic identification.
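Because the bag-of-words representation is just a word-to-count mapping, it can be sketched in a few lines with Python's standard library:

```python
from collections import Counter

def bag_of_words(tokens):
    """Map each word to its occurrence count, discarding word order."""
    return Counter(token.lower() for token in tokens)

bow = bag_of_words("the cat sat on the mat".split())
print(bow["the"], bow["cat"])
# → 2 1
```

Note that "the cat sat on the mat" and "the mat sat on the cat" produce identical bags: order is lost by design.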
In the following sections, we will look at an overview of the following applications of NLP:
Analyzing sentiment
Recognizing named entities
Linking entities
Translating text
Natural language interfaces
Semantic role labeling
Relation extraction
SQL query generation, or semantic parsing
Machine comprehension
Textual entailment
Coreference resolution
Searching
Question answering and chatbots
Converting text to voice
Converting voice to text
Speaker identification
Spoken dialog systems
Other applications
In this section, we will provide an overview of the major applications of NLP. While the topics listed here are not exhaustive, they will give you a sense of the wide range of applications in which NLP is used.
The sentiment of a sentence or text reflects the overall positive, negative, or neutral opinion of the person who produces or consumes it. It indicates whether a person is happy, unhappy, or neutral about the subject or context described in the text. Sentiment can be quantified as a discrete value, such as 1 for happy, -1 for unhappy, and 0 for neutral, or on a continuous scale from 0 to 1. Sentiment analysis, therefore, is the process of deriving this value from a piece of text, which can be obtained from different data sources, such as social networks, product reviews, news articles, and so on. One real-world application of sentiment analysis is mining social network data to derive actionable insights, such as customer satisfaction, product or brand popularity, fashion trends, and so on. The screenshot that follows shows one application of sentiment analysis: capturing the overall opinion of a particular news article about Google. The reader may refer to the application, or API, from Google Cloud at https://cloud.google.com/natural-language/:
The preceding screenshot indicates that sentiment data is captured for the whole document, as well as at the individual sentence level.
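To make the idea of a sentiment score concrete, here is a toy lexicon-based sketch: each word carries a hand-assigned score in [-1, 1], and the sentence score is the average over the words found in the lexicon. The tiny lexicon here is invented for illustration; real systems learn these weights from data or use curated lexicons, and services such as the Google Cloud API shown above use far more sophisticated models:

```python
# Toy lexicon, invented for this sketch; real lexicons contain thousands of entries.
LEXICON = {"happy": 1.0, "great": 1.0, "love": 1.0,
           "unhappy": -1.0, "bad": -1.0, "hate": -1.0}

def sentiment_score(sentence):
    """Average the lexicon scores of the words; 0.0 means neutral or unknown."""
    words = sentence.lower().split()
    scores = [LEXICON[w] for w in words if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

print(sentiment_score("I love this great product"))  # → 1.0
print(sentiment_score("the weather today"))          # → 0.0
```

Averaging per sentence and per document is one simple way to obtain both the sentence-level and document-level scores that the screenshot illustrates.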
