Python Natural Language Processing

Jalaj Thanaki

Description

Leverage the power of machine learning and deep learning to extract information from text data

About This Book

  • Implement Machine Learning and Deep Learning techniques for efficient natural language processing
  • Get started with NLTK and implement NLP in your applications with ease
  • Understand and interpret human languages with the power of text analysis via Python

Who This Book Is For

This book is intended for Python developers who wish to start with natural language processing and want to make their applications smarter by implementing NLP in them.

What You Will Learn

  • Focus on Python programming paradigms, which are used to develop NLP applications
  • Understand corpus analysis and different types of data attributes
  • Learn NLP using Python libraries such as NLTK, Polyglot, spaCy, Stanford CoreNLP, and so on
  • Learn about feature extraction and feature selection as part of feature engineering
  • Explore the advantages of vectorization in deep learning
  • Get a better understanding of the architecture of a rule-based system
  • Optimize and fine-tune supervised and unsupervised machine learning algorithms for NLP problems
  • Identify deep learning techniques for natural language processing and natural language generation problems

In Detail

This book starts off by laying the foundation for natural language processing and explaining why Python is one of the best options for building an NLP-based expert system, with advantages such as community support, availability of frameworks, and so on. Later, it gives you a better understanding of freely available corpora and different types of datasets. After this, you will know how to choose a dataset for natural language processing applications and find the right NLP techniques to process sentences in datasets and understand their structure. You will also learn how to tokenize different parts of sentences and ways to analyze them.

During the course of the book, you will explore the semantic as well as syntactic analysis of text. You will understand how to solve various ambiguities in processing human language and will come across various scenarios while performing text analysis.

You will learn the very basics of getting the environment ready for natural language processing, move on to the initial setup, and then quickly understand sentences and language parts. You will learn the power of Machine Learning and Deep Learning to extract information from text data.

By the end of the book, you will have a clear understanding of natural language processing and will have worked on multiple examples that implement NLP in the real world.

Style and approach

This book teaches readers various aspects of natural language processing using NLTK. It takes the reader from the basic to the advanced level in a smooth way.




Python Natural Language Processing

Explore NLP with machine learning and deep learning techniques

Jalaj Thanaki

BIRMINGHAM - MUMBAI


Python Natural Language Processing

 

Copyright © 2017 Packt Publishing

 

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

 

First published: July 2017

Production reference: 1280717

 

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.

ISBN 978-1-78712-142-3

www.packtpub.com

Credits

Author

 

Jalaj Thanaki

Copy Editor

 

Safis Editing

Reviewers

Devesh Raj

Gayetri Thakur

Prabhanjan Tattar

Chirag Mahapatra

Project Coordinator

 

Manthan Patel

Commissioning Editor

 

Veena Pagare

Proofreader

 

Safis Editing

Acquisition Editor

 

Aman Singh

Indexer

 

Tejal Daruwale Soni

Content Development Editor

 

Jagruti Babaria

Production Coordinator

 

Deepika Naik

Technical Editor

 

Sayli Nikalje

 

Foreword

Data science is rapidly changing the world and the way we do business --be it retail, banking and financial services, publishing, pharmaceutical, manufacturing, and so on. Data of all forms is growing exponentially--quantitative, qualitative, structured, unstructured, speech, video, and so on. It is imperative to make use of this data to leverage all functions--avoid risk and fraud, enhance customer experience, increase revenues, and streamline operations.

Organizations are moving fast to embrace data science and are investing a lot in high-end data science teams. Having spent more than 12 years in the BFSI domain, I am amazed by the transition the BFSI industry has seen in embracing analytics as a business function rather than merely a support function. This holds especially true for the fintech and digital lending world, of which Jalaj and I are both a part.

I have known Jalaj since her college days and am impressed with her exuberance and self-motivation. Her research skills, perseverance, commitment, discipline, and quickness to grasp even the most difficult concepts have made her achieve success in a short span of 4 years on her corporate journey.

Jalaj is a gifted intellectual with a strong mathematical and statistical understanding and demonstrates a continuous passion for learning the new and complex analytical and statistical techniques that are emerging in the industry. She brings experience to the data science domain and I have seen her deliver impressive projects around NLP, machine learning, basic linguistic analysis, neural networks, and deep learning. The blistering pace of the work schedule that she sets for herself, coupled with the passion she puts into her work, leads to definite and measurable results for her organization.

One of her most special qualities is her readiness to solve the most basic to the most complex problem in the interest of the business. She is an excellent team player and ensures that the organization gains the maximum benefit of her exceptional talent.

In this book, Jalaj takes us on an exciting and insightful journey through the natural language processing domain. She starts with the basic concepts and moves on to the most advanced concepts, such as how machine learning and deep learning are used in NLP.

I wish Jalaj all the best in all her future endeavors.

Sarita Arora Chief Analytics Officer, SMECorner Mumbai, India

 

About the Author

Jalaj Thanaki is a data scientist by profession and data science researcher by practice. She likes to deal with data science related problems. She wants to make the world a better place using data science and artificial intelligence related technologies. Her research interest lies in natural language processing, machine learning, deep learning, and big data analytics. Besides being a data scientist, Jalaj is also a social activist, traveler, and nature-lover.

 

Acknowledgement

I would like to dedicate this book to my husband, Shetul Thanaki, for his constant support, encouragement, and creative suggestions.

I give deep thanks and gratitude to my parents, my in-laws, my family, and my friends, who have helped me at every stage of my life. I would also like to thank all the mentors that I've had over the years. I really appreciate the efforts of the technical reviewers in reviewing this book. I would also like to thank my current organization, SMECorner, for its support. I am a big fan of open source communities and education communities, so I really want to thank communities such as Kaggle, Udacity, and Coursera, which have helped me, in a direct or indirect manner, to understand the various concepts of data science. Without learning from these communities, there is not a chance I could be doing what I do today.

I would like to thank Packt Publishing and Aman Singh, who approached me to write this book. I really appreciate the effort put in by the entire Packt editorial team to make this book as good as possible. Special thanks to Aman Singh, Jagruti Babaria, Menka Bohra, Manthan Patel, Nidhi Joshi, Sayli Nikalje, Manisha Sinha, Safis, and Tania Dutta.

I would like to recognize the efforts of the technical editing, strategy and management, marketing, sales, graphics design, pre-production, post-production, layout coordination, and indexing teams for making my authoring journey so smooth.

I feel really compelled to pass my knowledge on to those willing to learn.

Thank you God for being kind to me!

Cheers and Happy Reading!

About the Reviewers

Devesh Raj is a data scientist with 10 years of experience in developing algorithms and solving problems in various domains--healthcare, manufacturing, automotive, production, and so on, applying machine learning (supervised and unsupervised machine learning techniques) and deep learning on structured and unstructured data (computer vision and NLP).

 

 

 

Gayetri Thakur is a linguist working in the area of natural language processing. She has worked on co-developing NLP tools such as an automatic grammar checker, a named entity recognizer, and text-to-speech and speech-to-text systems. She currently works for Google India Pvt. Ltd.

She is pursuing a PhD in linguistics and completed her master's in linguistics at Banaras Hindu University.

 

 

 

Prabhanjan Tattar has over 9 years of experience as a statistical analyst. Survival analysis and statistical inference are his main areas of research/interest, and he has published several research papers in peer-reviewed journals and authored three books on R: R Statistical Application Development by Example, Packt Publishing, A Course in Statistics with R, Wiley, and Practical Data Science Cookbook, Packt Publishing. He also maintains the R packages gpk, RSADBE, and ACSWR.

 

 

 

Chirag Mahapatra is a software engineer who works on applying machine learning and natural language processing to problems in trust and safety. He currently works at Trooly (acquired by Airbnb). In the past, he has worked at A9.com on the ads data platform.

www.PacktPub.com

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www.packtpub.com/mapt

Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

Why subscribe?

Fully searchable across every book published by Packt

Copy and paste, print, and bookmark content

On demand and accessible via a web browser

Customer Feedback

Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/1787121429.

If you'd like to join our team of regular reviewers, you can e-mail us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!

Table of Contents

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Downloading the color images of this book

Errata

Piracy

Questions

Introduction

Understanding natural language processing

Understanding basic applications

Understanding advanced applications

Advantages of togetherness - NLP and Python

Environment setup for NLTK

Tips for readers

Summary

Practical Understanding of a Corpus and Dataset

What is a corpus?

Why do we need a corpus?

Understanding corpus analysis

Exercise

Understanding types of data attributes

Categorical or qualitative data attributes

Numeric or quantitative data attributes

Exploring different file formats for corpora

Resources for accessing free corpora

Preparing a dataset for NLP applications

Selecting data

Preprocessing the dataset

Formatting

Cleaning

Sampling

Transforming data

Web scraping

Summary

Understanding the Structure of Sentences

Understanding components of NLP

Natural language understanding

Natural language generation

Differences between NLU and NLG

Branches of NLP

Defining context-free grammar

Exercise

Morphological analysis

What is morphology?

What are morphemes?

What is a stem?

What is morphological analysis?

What is a word?

Classification of morphemes

Free morphemes

Bound morphemes

Derivational morphemes

Inflectional morphemes

What is the difference between a stem and a root?

Exercise

Lexical analysis

What is a token?

What are part of speech tags?

Process of deriving tokens

Difference between stemming and lemmatization

Applications

Syntactic analysis

What is syntactic analysis?

Semantic analysis

What is semantic analysis?

Lexical semantics

Hyponymy and hyponyms

Homonymy

Polysemy

What is the difference between polysemy and homonymy?

Application of semantic analysis

Handling ambiguity

Lexical ambiguity

Syntactic ambiguity

Approach to handle syntactic ambiguity

Semantic ambiguity

Pragmatic ambiguity

Discourse integration

Applications

Pragmatic analysis

Summary

Preprocessing

Handling corpus-raw text

Getting raw text

Lowercase conversion

Sentence tokenization

Challenges of sentence tokenization

Stemming for raw text

Challenges of stemming for raw text

Lemmatization of raw text

Challenges of lemmatization of raw text

Stop word removal

Exercise

Handling corpus-raw sentences

Word tokenization

Challenges for word tokenization

Word lemmatization

Challenges for word lemmatization

Basic preprocessing

Regular expressions

Basic level regular expression

Basic flags

Advanced level regular expression

Positive lookahead

Positive lookbehind

Negative lookahead

Negative lookbehind

Practical and customized preprocessing

Decide by yourself

Is preprocessing required?

What kind of preprocessing is required?

Understanding case studies of preprocessing

Grammar correction system

Sentiment analysis

Machine translation

Spelling correction

Approach

Summary

Feature Engineering and NLP Algorithms

Understanding feature engineering

What is feature engineering?

What is the purpose of feature engineering?

Challenges

Basic features of NLP

Parsers and parsing

Understanding the basics of parsers

Understanding the concept of parsing

Developing a parser from scratch

Types of grammar

Context-free grammar

Probabilistic context-free grammar

Calculating the probability of a tree

Calculating the probability of a string

Grammar transformation

Developing a parser with the Cocke-Kasami-Younger Algorithm

Developing parsers step-by-step

Existing parser tools

The Stanford parser

The spaCy parser

Extracting and understanding the features

Customizing parser tools

Challenges

POS tagging and POS taggers

Understanding the concept of POS tagging and POS taggers

Developing POS taggers step-by-step

Plug and play with existing POS taggers

A Stanford POS tagger example

Using polyglot to generate POS tagging

Exercise

Using POS tags as features

Challenges

Named entity recognition

Classes of NER

Plug and play with existing NER tools

A Stanford NER example

A spaCy NER example

Extracting and understanding the features

Challenges

n-grams

Understanding n-grams using a practical example

Application

Bag of words

Understanding BOW

Understanding BOW using a practical example

Comparing n-grams and BOW

Applications

Semantic tools and resources

Basic statistical features for NLP

Basic mathematics

Basic concepts of linear algebra for NLP

Basic concepts of the probabilistic theory for NLP

Probability

Independent event and dependent event

Conditional probability

TF-IDF

Understanding TF-IDF

Understanding TF-IDF with a practical example

Using textblob

Using scikit-learn

Application

Vectorization

Encoders and decoders

One-hot encoding

Understanding a practical example for one-hot encoding

Application

Normalization

The linguistics aspect of normalization

The statistical aspect of normalization

Probabilistic models

Understanding probabilistic language modeling

Application of LM

Indexing

Application

Ranking

Advantages of feature engineering

Challenges of feature engineering

Summary

Advanced Feature Engineering and NLP Algorithms

Recall word embedding

Understanding the basics of word2vec

Distributional semantics

Defining word2vec

Necessity of the unsupervised distributional semantic model - word2vec

Challenges

Converting the word2vec model from black box to white box

Distributional similarity based representation

Understanding the components of the word2vec model

Input of the word2vec

Output of word2vec

Construction components of the word2vec model

Architectural component

Understanding the logic of the word2vec model

Vocabulary builder

Context builder

Neural network with two layers

Structural details of a word2vec neural network

Word2vec neural network layer's details

Softmax function

Main processing algorithms

Continuous bag of words

Skip-gram

Understanding algorithmic techniques and the mathematics behind the word2vec model

Understanding the basic mathematics for the word2vec algorithm

Techniques used at the vocabulary building stage

Lossy counting

Using it at the stage of vocabulary building

Applications

Techniques used at the context building stage

Dynamic window scaling

Understanding dynamic context window techniques

Subsampling

Pruning

Algorithms used by neural networks

Structure of the neurons

Basic neuron structure

Training a simple neuron

Define error function

Understanding gradient descent in word2vec

Single neuron application

Multi-layer neural networks

Backpropagation

Mathematics behind the word2vec model

Techniques used to generate final vectors and probability prediction stage

Hierarchical softmax

Negative sampling

Some of the facts related to word2vec

Applications of word2vec

Implementation of simple examples

Famous example (king - man + woman)

Advantages of word2vec

Challenges of word2vec

How is word2vec used in real-life applications?

When should you use word2vec?

Developing something interesting

Exercise

Extension of the word2vec concept

Para2Vec

Doc2Vec

Applications of Doc2vec

GloVe

Exercise

Importance of vectorization in deep learning

Summary

Rule-Based System for NLP

Understanding of the rule-based system

What does the RB system mean?

Purpose of having the rule-based system

Why do we need the rule-based system?

Which kind of applications can use the RB approach over the other approaches?

Exercise

What kind of resources do you need if you want to develop a rule-based system?

Architecture of the RB system

General architecture of the rule-based system as an expert system

Practical architecture of the rule-based system for NLP applications

Custom architecture - the RB system for NLP applications

Exercise

Apache UIMA - the RB system for NLP applications

Understanding the RB system development life cycle

Applications

NLP applications using the rule-based system

Generalized AI applications using the rule-based system

Developing NLP applications using the RB system

Thinking process for making rules

Start with simple rules

Scraping the text data

Defining the rule for our goal

Coding our rule and generating a prototype and result

Exercise

Python for pattern-matching rules for a proofreading application

Exercise

Grammar correction

Template-based chatbot application

Flow of code

Advantages of template-based chatbot

Disadvantages of template-based chatbot

Exercise

Comparing the rule-based approach with other approaches

Advantages of the rule-based system

Disadvantages of the rule-based system

Challenges for the rule-based system

Understanding word-sense disambiguation basics

Discussing recent trends for the rule-based system

Summary

Machine Learning for NLP Problems

Understanding the basics of machine learning

Types of ML

Supervised learning

Unsupervised learning

Reinforcement learning

Development steps for NLP applications

Development step for the first iteration

Development steps for the second to nth iteration

Understanding ML algorithms and other concepts

Supervised ML

Regression

Classification

ML algorithms

Exercise

Unsupervised ML

k-means clustering

Document clustering

Advantages of k-means clustering

Disadvantages of k-means clustering

Exercise

Semi-supervised ML

Other important concepts

Bias-variance trade-off

Underfitting

Overfitting

Evaluation matrix

Exercise

Feature selection

Curse of dimensionality

Feature selection techniques

Dimensionality reduction

Hybrid approaches for NLP applications

Post-processing

Summary

Deep Learning for NLU and NLG Problems

An overview of artificial intelligence

The basics of AI

Components of AI

Automation

Intelligence

Stages of AI

Machine learning

Machine intelligence

Machine consciousness

Types of artificial intelligence

Artificial narrow intelligence

Artificial general intelligence

Artificial superintelligence

Goals and applications of AI

AI-enabled applications

Comparing NLU and NLG

Natural language understanding

Natural language generation

A brief overview of deep learning

Basics of neural networks

The first computation model of the neuron

Perceptron

Understanding mathematical concepts for ANN

Gradient descent

Calculating error or loss

Calculating gradient descent

Activation functions

Sigmoid

TanH

ReLU and its variants

Loss functions

Implementation of ANN

Single-layer NN with backpropagation

Backpropagation

Exercise

Deep learning and deep neural networks

Revisiting DL

The basic architecture of DNN

Deep learning in NLP

Difference between classical NLP and deep learning NLP techniques

Deep learning techniques and NLU

Machine translation

Deep learning techniques and NLG

Exercise

Recipe summarizer and title generation

Gradient descent-based optimization

Artificial intelligence versus human intelligence

Summary

Advanced Tools

Apache Hadoop as a storage framework

Apache Spark as a processing framework

Apache Flink as a real-time processing framework

Visualization libraries in Python

Summary

How to Improve Your NLP Skills

Beginning a new career journey with NLP

Cheat sheets

Choose your area

Agile way of working to achieve success

Useful blogs for NLP and data science

Grab public datasets

Mathematics needed for data science

Summary

Installation Guide

Installing Python, pip, and NLTK

Installing the PyCharm IDE

Installing dependencies

Framework installation guides

Drop your queries

Summary

Preface

The book title, Python Natural Language Processing, gives you a broad idea about the book. As a reader, you will get the chance to learn about all the aspects of natural language processing (NLP) from scratch. In this book, I have specified NLP concepts in a very simple language, and there are some really cool practical examples that enhance your understanding of this domain. By implementing these examples, you can improve your NLP skills. Don't you think that sounds interesting?

Now let me answer some of the most common questions I have received from my friends and colleagues about the NLP domain. These questions really inspired me to write this book. For me, it's really important that all my readers understand why I am writing this book. Let's find out!

Here, I would like to answer some of the questions that I feel are critical for my readers. So, I'll begin with some of the questions, followed by the answers. The first question I usually get asked is--what is NLP? The second one is--why is Python mainly used for developing NLP applications? And last but not least, the most critical question is--what are the resources I can use for learning NLP? Now let's look at the answers!

The answer to the first question is this: natural language, simply put, is the language you speak, write, read, or understand as a human; natural language is, thus, a medium of communication. Using computer science algorithms, mathematical concepts, and statistical techniques, we try to process this language so that machines can also understand language the way humans do; this is called NLP.

Now let's answer the second question--why do people mainly use Python to develop NLP applications? There are some facts that I want to share with you. The very simple and straightforward thing is that Python has a lot of libraries that make your life easy when you develop NLP applications. The second reason is that if you are coming from a C or C++ coding background, you don't need to worry about memory leakage; the Python interpreter will handle this for you, so you can just focus on the main coding part. Besides, Python is a coder-friendly language. You can do much more by writing just a few lines of code, compared to other object-oriented languages. So all these facts drive people to use Python for developing NLP and other data science-related applications and for rapid prototyping.

The last question is critical to me because I used to explain the previous answers to my friends, but after hearing all these and other fascinating things, they would come to me and say that they want to learn NLP, so what are the resources available? I used to recommend books, blogs, YouTube videos, education platforms such as Udacity and Coursera, and a lot more, but after a few days, they would ask me if there is a single resource in the form of book, blog, or anything that they could use. Unfortunately, for them, my answer was no. At that stage, I really felt that juggling all these resources would always be difficult for them, and that painful realization became my inspiration to write this book.

So in this book, I have tried to cover most of the essential parts of NLP, which will be useful for everyone. The great news is that I have provided practical examples using Python, so readers can understand all the concepts theoretically as well as practically. Reading, understanding, and coding are the three main processes that I have followed in this book to make readers' lives easier.

What this book covers

Chapter 1, Introduction, provides an introduction to NLP and the various branches involved in the NLP domain. We will see the various stages of building NLP applications and discuss NLTK installation.

Chapter 2, Practical Understanding of Corpus and Dataset, shows all the aspects of corpus analysis. We will see the different types of corpus and the data attributes present in corpora. We will touch upon different corpus formats such as CSV, JSON, XML, LibSVM, and so on. We will see a web scraping example.

Chapter 3, Understanding the Structure of Sentences, helps you understand the most essential aspect of natural language, which is linguistics. We will see the concepts of lexical analysis, syntactic analysis, semantic analysis, handling ambiguities, and so on. We will use NLTK to understand all the concepts practically.

Chapter 4, Preprocessing, helps you get to know the various types of preprocessing techniques and how you can customize them. We will see the stages of preprocessing such as data preparation, data processing, and data transformation. Apart from this, you will understand the practical aspects of preprocessing.

Chapter 5, Feature Engineering and NLP Algorithms, is the core part of an NLP application. We will see how different algorithms and tools are used to generate input for machine learning algorithms, which we will be using to develop NLP applications. We will also understand the statistical concepts used in feature engineering, and we will get into the customization of tools and algorithms.

Chapter 6, Advanced Feature Engineering and NLP Algorithms, gives you an understanding of the most recent concepts in NLP, which are used to deal with semantic issues. We will see word2vec, doc2vec, GloVe, and so on, as well as some practical implementations of word2vec by generating vectors from a Game of Thrones dataset.

Chapter 7, Rule-Based System for NLP, details how we can build a rule-based system and all the aspects you need to keep in mind while developing the same for NLP. We will see the rule-making process and code the rules too. We will also see how we can develop a template-based chatbot.

Chapter 8, Machine Learning for NLP Problems, provides you with fresh aspects of machine learning techniques. We will see the various algorithms used to develop NLP applications. We will also implement some great NLP applications using machine learning.

Chapter 9, Deep Learning for NLU and NLG Problems, introduces you to various aspects of artificial intelligence. We will look at the basic concepts of artificial neural networks (ANNs) and how you can build your own ANN. We will understand hardcore deep learning, develop the mathematical aspect of deep learning, and see how deep learning is used for natural language understanding (NLU) and natural language generation (NLG). You can expect some cool practical examples here as well.

Appendix A, Advanced Tools, gives you a brief introduction to various frameworks such as Apache Hadoop, Apache Spark, and Apache Flink.

Appendix B, How to Improve Your NLP Skills, is about suggestions from my end on how to keep your NLP skills up to date and how constant learning will help you acquire new NLP skills.

Appendix C, Installation Guide, contains instructions for the required installations.

What you need for this book

Let's discuss some prerequisites for this book. Don't worry, it's not math or statistics; just basic Python coding syntax is all you need to know. Apart from that, you need Python 2.7.X or Python 3.5.X installed on your computer; I also recommend using a Linux operating system.

The list of Python dependencies can be found in the GitHub repository at https://github.com/jalajthanaki/NLPython/blob/master/pip-requirements.txt.
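If you have already cloned the repository, all the dependencies can be installed in one go with pip; a minimal sketch (the clone step is only needed if you have not downloaded the repository yet):

git clone https://github.com/jalajthanaki/NLPython.git
cd NLPython
pip install -r pip-requirements.txt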

Now let's look at the hardware required for this book. A computer with 4 GB RAM and at least a two-core CPU is good enough to execute the code, but for the machine learning and deep learning examples, you may need more RAM, perhaps 8 GB or 16 GB, and additional computational power in the form of GPU(s).

Who this book is for

This book is intended for Python developers who wish to start with NLP and want to make their applications smarter by implementing NLP in them.

Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, path names, dummy URLs, user input, and Twitter handles are shown as follows: "The nltk library provides some inbuilt corpora."

A block of code is set as follows:

import nltk
from nltk.corpus import brown as cb
from nltk.corpus import gutenberg as cg

Any command-line input or output is written as follows:

pip install nltk

or

sudo pip install nltk

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "This will open an additional dialog window, where you can choose specific libraries, but in our case, click on All packages, and you can choose the path where the packages reside. Wait till all the packages are downloaded."

Warnings or important notes appear like this.
Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book-what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply e-mail [email protected], and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files emailed directly to you. You can download the code files by following these steps:

1. Log in or register to our website using your e-mail address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR / 7-Zip for Windows

Zipeg / iZip / UnRarX for Mac

7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Python-Natural-Language-Processing. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Downloading the color images of this book

We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/PythonNaturalLanguageProcessing_ColorImages.pdf.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books-maybe a mistake in the text or the code-we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title. To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy. Please contact us at [email protected] with a link to the suspected pirated material. We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.

Introduction

In this chapter, we'll have a gentle introduction to natural language processing (NLP) and how natural language processing concepts are used in real-life artificial intelligence applications. We will focus mainly on Python programming paradigms, which are used to develop NLP applications. Later on, the chapter has a tips section for readers. If you are really interested in finding out about the comparison of various programming paradigms for NLP, and why Python is the best choice, then, as a reader, you should go through the Preface of this book. As an industry professional, I have tried most of the programming paradigms for NLP. I have used Java, R, and Python for NLP applications. Trust me, Python is quite easy and efficient for developing applications that use NLP concepts.

We will cover the following topics in this chapter:

Understanding NLP

Understanding basic applications

Understanding advanced applications

Advantages of togetherness - NLP and Python

Environment setup for NLTK

Tips for readers

Understanding natural language processing

In the last few years, branches of artificial intelligence (AI) have created a lot of buzz, and those branches are data science, data analytics, predictive analysis, NLP, and so on.

As mentioned in the Preface of this book, we are focusing on Python and natural language processing. Let me ask you some questions--Do you really know what natural language is? What is natural language processing? What are the other branches involved in building expert systems using various concepts of natural language processing? How can we build intelligent systems using the concept of NLP?

Let's begin our roller coaster ride of understanding NLP.

What is natural language?

As human beings, we express our thoughts and feelings via language

Whatever you speak, read, write, or listen to is mostly in the form of natural language, so it is commonly referred to as natural language

For example:

The content of this book is a source of natural language

Whatever you speak, listen to, and write in your daily life is also in the form of natural language

Movie dialogues are also a source of natural language

Your WhatsApp conversations are also considered a form of natural language

What is natural language processing?

Now you have an understanding of what natural language is. NLP is a sub-branch of AI. Let's consider an example and understand the concept of NLP. Say you want to build a machine that interacts with humans in the form of natural language. Building this kind of intelligent system requires computational technologies and computational linguistics, so that the system can process natural language the way humans do.

You can relate the aforementioned concept of NLP to the existing NLP products from the world's top tech companies, such as Google Assistant from Google, Siri speech assistance from Apple, and so on.

Now you will be able to understand the definitions of NLP, which are as follows:

Natural language processing is the ability of computational technologies and/or computational linguistics to process human natural language

Natural language processing is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages

Natural language processing can be defined as the automatic (or semi-automatic) processing of human natural language

What are the other branches involved in building expert systems using various concepts of NLP? Figure 1.1 is the best way to see how many other branches are involved when you are building an expert system using NLP concepts:

Figure 1.1: NLP concepts

Figures 1.2 and 1.3 convey all the subtopics that are included in every branch given in Figure 1.1:

Figure 1.2: Sub-branches of NLP concepts
Figure 1.3 depicts the rest of the sub-branches:
Figure 1.3: Sub-branches of NLP concepts

How can we build an intelligent system using concepts of NLP? Figure 1.4 is the basic model, which indicates how an expert system can be built for NLP applications. The development life cycle is defined in the following figure:

Figure 1.4: Development life cycle

Let's see some of the details of the development life cycle of NLP-related problems:

If you are solving an NLP problem, you first need to understand the problem statement.

Once you understand your problem statement, think about what kind of data or corpus you need to solve the problem. So, data collection is the basic activity toward solving the problem.

After you have collected a sufficient amount of data, you can start analyzing it. What is the quality and quantity of your corpus? According to the quality of the data and your problem statement, you need to do preprocessing.

Once you are done with preprocessing, you need to start with the process of feature engineering. Feature engineering is the most important aspect of NLP and data science related applications. We will be covering feature engineering related aspects in much more detail in Chapter 5, Feature Engineering and NLP Algorithms, and Chapter 6, Advanced Feature Engineering and NLP Algorithms.

Having decided on and extracted features from the raw preprocessed data, you need to decide which computational technique is useful to solve your problem statement; for example, do you want to apply machine learning techniques or rule-based techniques?

Now, depending on what techniques you are going to use, you should prepare the feature files that you are going to provide as input to your chosen algorithm.

Run your logic, then generate the output.

Test and evaluate your system's output.

Tune the parameters for optimization, and continue till you get satisfactory results.
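To make this loop concrete, here is a minimal sketch of one pass through the life cycle on a tiny, made-up sentiment dataset; it assumes scikit-learn is installed and is only an illustration of the flow described above, not an example from the book's code bundle.

# A minimal, illustrative pass through the NLP development life cycle,
# assuming scikit-learn is installed; the toy dataset below is made up.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Problem statement and data collection: a tiny sentiment corpus
documents = ["I love this movie", "What a wonderful story",
             "This film was great", "I hated this movie",
             "This film was terrible", "An awful plot"]
labels = ["pos", "pos", "pos", "neg", "neg", "neg"]

# Preprocessing and feature engineering: lowercasing plus bag-of-words counts
# Technique selection: a Naive Bayes classifier over those features
model = make_pipeline(CountVectorizer(lowercase=True), MultinomialNB())

# Run the logic, then test and evaluate the system's output
model.fit(documents[:4], labels[:4])     # train on the first four documents
print(model.predict(documents[4:]))      # check predictions on the held-out documents
# Tune parameters (for example, CountVectorizer(ngram_range=(1, 2))) and repeat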

We will be covering a lot of information very quickly in this chapter, so if you see something that doesn't immediately make sense, please do not feel lost and bear with me. We will explore all the details and examples from the next chapter onward, and that will definitely help you connect the dots.

Understanding basic applications

NLP is a sub-branch of AI. Concepts from NLP are used in the following expert systems:

Speech recognition system

Question answering system

Translation from one specific language to another specific language

Text summarization

Sentiment analysis

Template-based chatbots

Text classification

Topic segmentation

We will learn about most of the NLP concepts that are used in the preceding applications in the further chapters.

Understanding advanced applications

Advanced applications include the following:

Humanoid robots that understand natural language commands and interact with humans in natural language.

Building a universal machine translation system is the long-term goal in the NLP domain, because you could easily build a machine translation system that converts one specific language into another specific language, but that system may not help you translate other languages. With the help of deep learning, we can develop a universal machine translation system, and Google recently announced that it is very close to achieving this goal. We will build our own machine translation system using deep learning in Chapter 9, Deep Learning for NLU and NLG Problems.

An NLP system that generates a logical title for a given document is another advanced application. Also, with the help of deep learning, you can generate the title of a document and perform summarization on top of it. You will see this kind of application in Chapter 9, Deep Learning for NLU and NLG Problems.

An NLP system that generates text for a specific topic or for an image is also considered an advanced NLP application.

Advanced chatbots that generate personalized text for humans, and that ignore mistakes in human writing, are also a goal we are trying to achieve.

There are many other NLP applications, which you can see in Figure 1.5:

Figure 1.5: Applications In NLP domain

Advantages of togetherness - NLP and Python

The following points illustrate why Python is one of the best options to build an NLP-based expert system:

Developing prototypes for the NLP-based expert system using Python is very easy and efficient

A large variety of open source NLP libraries are available for Python programmers

Community support is very strong

Easy to use and less complex for beginners

Rapid development: testing and evaluation are easy and less complex

Many of the new frameworks, such as Apache Spark, Apache Flink, TensorFlow, and so on, provide APIs for Python

Optimization of the NLP-based system is less complex compared to other programming paradigms

Environment setup for NLTK

I would like to suggest that all my readers clone the NLPython repository from GitHub. The repository URL is https://github.com/jalajthanaki/NLPython

I'm using Linux (Ubuntu) as the operating system, so if you are not familiar with Linux, it's better for you to make yourself comfortable with it, because most of the advanced frameworks, such as Apache Hadoop, Apache Spark, Apache Flink, Google TensorFlow, and so on, require a Linux operating system.

The GitHub repository contains instructions on how to install Linux, as well as the basic Linux commands that we will use throughout this book. It also includes basic Git commands, in case you are new to Git. The URL is https://github.com/jalajthanaki/NLPython/tree/master/ch1/documentation

I'm providing an installation guide for readers to set up the environment for these chapters. The URL is https://github.com/jalajthanaki/NLPython/tree/master/ch1/installation_guide

The steps for installing nltk are as follows (or you can follow the URL: https://github.com/jalajthanaki/NLPython/blob/master/ch1/installation_guide/NLTK%2BSetup.md):

1. Install Python 2.7.x manually; on Linux Ubuntu 14.04, it is already installed. You can check your Python version using the python -V command.

2. Configure pip for installing Python libraries (https://github.com/jalajthanaki/NLPython/blob/master/ch1/installation_guide/NLTK%2BSetup.md).

3. Open the terminal, and execute the following command:

pip install nltk

or

sudo pip install nltk

4. Open the terminal, and execute the python command.

5. Inside the Python shell, execute the import nltk command. If the nltk module has been installed successfully on your system, the command will not throw any error message.

6. Inside the Python shell, execute the nltk.download() command. This will open an additional dialog window, where you can choose specific libraries, but in our case, click on All packages, and you can choose the path where the packages reside. Wait till all the packages are downloaded; it may take a long time. After completion of the download, you can find a folder named nltk_data at the path you specified earlier. Take a look at the NLTK Downloader in the following screenshot:

Figure 1.6: NLTK Downloader
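Once the download finishes, a quick sanity check from the Python shell confirms that everything is wired up. This is just a minimal sketch; nltk.download() can also fetch a single package non-interactively, as shown here:

import nltk

# Fetch a single package without opening the GUI downloader
nltk.download('punkt')

from nltk.tokenize import word_tokenize

# If nltk and its data are installed correctly, this prints the tokens
# without raising a LookupError
print(word_tokenize("NLTK is installed correctly."))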

This repository contains an installation guide, codes, wiki page, and so on. If readers have questions and queries, they can post their queries on the Gitter group. The Gitter group URL is https://gitter.im/NLPython/Lobby?utm_source=share-link&utm_medium=link&utm_campaign=share-link

Tips for readers

This book is a practical guide. As an industry professional, I strongly recommend that all my readers replicate the code that is already available on GitHub and perform the exercises given in the book. This will improve your understanding of NLP concepts. Without performing the practicals, it will be nearly impossible for you to grasp all the NLP concepts thoroughly. By the way, I promise that it will be fun to implement them.

The flow of upcoming chapters is as follows:

Explanation of the concepts

Application of the concepts

The need for the concepts

Possible ways to implement the concepts (code is on GitHub)

Challenges of the concepts

Tips to overcome challenges

Exercises

Summary

This chapter gave you an introduction to NLP. You now have a brief idea about what kind of branches are involved in NLP and the various stages for building an expert system using NLP concepts. Lastly, we set up the environment for NLTK. All the installation guidelines and codes are available on GitHub.

In the next chapter, we will see what kind of corpus is used in NLP-related applications and what critical points we should keep in mind when we analyze a corpus. We will deal with different types of file formats and datasets. Let's explore this together!

Practical Understanding of a Corpus and Dataset

In this chapter, we'll explore the first building block of natural language processing. We are going to cover the following topics to get a practical understanding of a corpus or dataset:

What is a corpus?

Why do we need a corpus?

Understanding corpus analysis

Understanding types of data attributes

Exploring different file formats for corpora

Resources for accessing free corpora

Preparing a dataset for NLP applications

Developing a web scraping application

What is a corpus?

Natural language processing related applications are built using a huge amount of data. In layman's terms, you can say that a large collection of data is called a corpus. So, more formally and technically, a corpus can be defined as follows:

A corpus is a collection of written or spoken natural language material, stored on a computer, and used to find out how language is used. More precisely, a corpus is a systematic, computerized collection of authentic language that is used for linguistic analysis as well as corpus analysis. If you have more than one corpus, they are called corpora.

In order to develop NLP applications, we need a corpus, that is, written or spoken natural language material. We use this material or data as input and try to find out the facts that can help us develop NLP applications. Sometimes, NLP applications use a single corpus as the input, and at other times, they use multiple corpora as input.

There are many reasons for using a corpus when developing NLP applications, some of which are as follows:

With the help of a corpus, we can perform statistical analysis such as frequency distribution, co-occurrence of words, and so on. Don't worry, we will see some basic statistical analysis of a corpus later in this chapter.

We can define and validate linguistic rules for various NLP applications. If you are building a grammar correction system, you will use the text corpus to find the grammatically incorrect instances, and then you will define the grammar rules that help correct those instances.

We can define some specific linguistic rules that depend on the usage of the language. With the help of a rule-based system, you can define the linguistic rules and validate them using the corpus.

The large collection of data in a corpus can be in the following formats:

Text data, meaning written material

Speech data, meaning spoken material

Let's see what exactly text data is and how we can collect it. Text data is a collection of written information. There are several resources that can be used for getting written information, such as news articles, books, digital libraries, email messages, web pages, blogs, and so on. Right now, we are all living in a digital world, so the amount of text information is growing rapidly. We can use all the given resources to get text data and then make our own corpus. Let's take an example: if you want to build a system that summarizes news articles, you will first gather various news articles present on the web and generate a collection of news articles, so that this collection is your corpus of news articles and contains text data. You can use web scraping tools to get information from raw HTML pages; in this chapter, we will develop one.
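As a small preview of such a tool, the following sketch collects the paragraph text of a single web page and saves it as one document of a text corpus; it assumes the requests and beautifulsoup4 packages are installed, and the URL below is only a placeholder.

import requests
from bs4 import BeautifulSoup

# Placeholder URL; replace it with the news article you want to collect
url = "http://example.com/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Keep only the visible paragraph text and join it into one document
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
document = "\n".join(paragraphs)

# Store the document as one file of our text corpus
with open("corpus_document_1.txt", "w") as corpus_file:
    corpus_file.write(document)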

Now we will see how speech data is collected. A speech data corpus generally has two things: one is an audio file, and the other one is its text transcription. Generally, we can obtain speech data from audio recordings. This audio recording may have dialogues or conversations of people. Let me give you an example: in India, when you call a bank customer care department, if you pay attention, you get to know that each and every call is recorded. This is the way you can generate speech data or speech corpus. For this book, we are concentrating just on text data and not on speech data.

A corpus is also referred to as a dataset in some cases.

There are three types of corpus:

Monolingual corpus: This type of corpus has one language

Bilingual corpus: This type of corpus has two languages

Multilingual corpus: This type of corpus has more than two languages

A few examples of the available corpora are given as follows:

Google Books Ngram corpus

Brown corpus

American National corpus

Why do we need a corpus?

In any NLP application, we need data or a corpus to build NLP tools and applications. A corpus is the most critical and basic building block of any NLP-related application. It provides us with quantitative data that is used to build NLP applications. We can also use some part of the data to test and challenge our ideas and intuitions about the language. A corpus plays a very big role in NLP applications. The challenges in creating a corpus for NLP applications are as follows:

Deciding the type of data we need in order to solve the problem statement

Availability of data

Quality of the data

Adequacy of the data in terms of amount

Now you may want to know the details of all the preceding questions; for that, I will take an example that can help you to understand all the previous points easily. Consider that you want to make an NLP tool that understands the medical state of a particular patient and can help generate a diagnosis after proper medical analysis.

Here, our perspective is biased toward the corpus level and kept general. If you look at the preceding example as an NLP learner, you should process the problem statement as stated here:

What kind of data do I need if I want to solve the problem statement?

Clinical notes or patient history

Audio recording of the conversation between doctor and patient

Do you have this kind of corpus or data with you?

If yes, great! You are in a good position, so you can proceed to the next question.

If not, OK! No worries. You need to process one more question, which is probably a difficult but interesting one.

Is there an open source corpus available?

If yes, download it, and continue to the next question.

If not, think of how you can access the data and build the corpus. Think of web scraping tools and techniques. But you have to explore the ethical as well as legal aspects of your web scraping tool.

What is the quality level of the corpus?

Go through the corpus, and try to figure out the following things:

If you can't understand the dataset at all, then what to do?

Spend more time with your dataset.

Think like a machine, and try to think of all the things you would process if you were fed with this kind of a dataset. Don't think that you will throw an error!

Find one thing that you feel you can begin with.

Suppose your NLP tool has to diagnose a human disease; think of what you would ask the patient if you were the machine in the doctor's place. Now you can start understanding your dataset and then think about the preprocessing part. Do not rush into it.

If you can understand the dataset, then what to do?

Do you need each and every thing that is in the corpus to build an NLP system?

If yes, then proceed to the next level, which we will look at in Chapter 5, Feature Engineering and NLP Algorithms.

If not, then proceed to the next level, which we will look at in Chapter 4, Preprocessing.

Will the amount of data be sufficient for solving the problem statement on at least a proof of concept (POC) basis?

According to my experience, I would prefer to have at least 500 MB to 1 GB of data for a small POC.

For startups, collecting 500 MB to 1 GB of data is also a challenge, for the following reasons:

Startups are new in business.

Sometimes they are very innovative, and there is no ready-made dataset available.

Even if they manage to build a POC, validating their product in real life is also challenging.

Refer to Figure 2.1 for a description of the preceding process:

Figure 2.1: Description of the process defined under Why do we need a corpus?

Understanding corpus analysis

In this section, we will first understand what corpus analysis is. After this, we will briefly touch upon speech analysis. We will also understand how we can analyze a text corpus for different NLP applications. At the end, we will do some practical corpus analysis on a text corpus. Let's begin!

Corpus analysis can be defined as a methodology for pursuing in-depth investigations of linguistic concepts as grounded in the context of authentic and communicative situations. Here, we are talking about the digitally stored language corpora, which is made available for access, retrieval, and analysis via computer.

Corpus analysis for speech data needs an analysis of the phonetic properties of each of the data instances. Apart from phonetic analysis, we also need to do conversation analysis, which gives us an idea of how social interaction happens in day-to-day life in a specific language. In real life, if you are doing conversational analysis for casual English, you may find a sentence such as What's up, dude? used more frequently in conversations than How are you, sir (or madam)?

Corpus analysis for text data consists of statistically probing, manipulating, and generalizing the dataset. For a text dataset, we generally analyze how many different words are present in the corpus and what the frequency of certain words in the corpus is. If the corpus contains any noise, we try to remove that noise. In almost every NLP application, we need to do some basic corpus analysis so that we can understand our corpus well. nltk provides us with some inbuilt corpora, so we will perform corpus analysis using these inbuilt corpora. Before jumping to the practical part, it is very important to know what types of corpora are present in nltk.

nltk has four types of corpora. Let's look at each of them:

Isolate corpus: This type of corpus is a collection of text or natural language. Examples of this kind of corpus are gutenberg, webtext, and so on.

Categorized corpus: This type of corpus is a collection of texts that are grouped into different types of categories. An example of this kind of corpus is the brown corpus, which contains data for different categories such as news, hobbies, humor, and so on.

Overlapping corpus: This type of corpus is a collection of texts that are categorized, but the categories overlap with each other. An example of this kind of corpus is the reuters corpus, which contains data that is categorized, but the defined categories overlap with each other. More explicitly, in the reuters corpus, if you consider different types of coconut products as one category, you will see a subcategory for coconut-oil, and you also have cotton-oil; so, in the reuters corpus, the various data categories overlap.

Temporal corpus: This type of corpus is a collection of the usages of natural language over a period of time. An example of this kind of corpus is the inaugural address corpus. Suppose you record the usage of a language in a city of India in 1950, then repeat the same activity to see the usage of the language in that particular city in 1980 and again in 2017; you will have recorded the various data attributes regarding how people used the language and what changed over that period of time.

Now enough of theory, let's jump to the practical stuff. You can access the following links to see the codes:

The code for this chapter is in the GitHub directory at https://github.com/jalajthanaki/NLPython/tree/master/ch2.

Follow the Python code on this URL: https://nbviewer.jupyter.org/github/jalajthanaki/NLPython/blob/master/ch2/2_1_Basic_corpus_analysis.html

The Python code contains basic commands showing how to access corpora using the nltk API. We are using the brown and gutenberg corpora, and we touch upon some of the basic corpus-related APIs.

A description of the basic API attributes is given below:

fileids(): This results in the files of the corpus
fileids([categories]): This results in the files of the corpus corresponding to these categories
categories(): This lists the categories of the corpus
categories([fileids]): This shows the categories of the corpus corresponding to these files
raw(): This shows the raw content of the corpus
raw(fileids=[f1,f2,f3]): This shows the raw content of the specified files
raw(categories=[c1,c2]): This shows the raw content of the specified categories
words(): This shows the words of the whole corpus
words(fileids=[f1,f2,f3]): This shows the words of the specified fileids
words(categories=[c1,c2]): This shows the words of the specified categories
sents(): This shows the sentences of the whole corpus
sents(fileids=[f1,f2,f3]): This shows the sentences of the specified fileids
sents(categories=[c1,c2]): This shows the sentences of the specified categories
abspath(fileid): This shows the location of the given file on disk
encoding(fileid): This shows the encoding of the file (if known)
open(fileid): This opens a stream for reading the given corpus file
root: This shows the path to the root of the locally installed corpus
readme(): This shows the contents of the README file of the corpus
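The following short sketch shows a few of these attributes in action on the inbuilt brown and gutenberg corpora; it assumes both corpora have already been downloaded via nltk.download().

from nltk.corpus import brown, gutenberg

# Categorized corpus: list its categories and read words from one category
print(brown.categories())
print(brown.words(categories='news')[:10])

# Isolate corpus: list its files and read the raw content of one file
print(gutenberg.fileids())
print(gutenberg.raw('austen-emma.txt')[:200])

# Sentence-level access works in the same way
print(brown.sents(categories='news')[0])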

We have seen the code for loading your customized corpus using nltk, as well as computing the frequency distribution for the available corpora and our custom corpus.

The FreqDist class is used to encode frequency distributions, which count the number of times each word occurs in a corpus.
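A small sketch of FreqDist on the brown corpus is shown below (again assuming the corpus has been downloaded); the same class works on any list of tokens, including those of your own custom corpus.

from nltk import FreqDist
from nltk.corpus import brown

# Count how often each word occurs in the news category of the brown corpus
fdist = FreqDist(word.lower() for word in brown.words(categories='news'))

print(fdist.most_common(10))   # the ten most frequent words
print(fdist['government'])     # the frequency of one specific word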

The nltk corpora are not all that noisy; only a basic level of preprocessing is required to generate features from them. Using the basic corpus-loading APIs of nltk helps you identify extreme levels of junk data. Suppose you have a biochemistry corpus; you may have a lot of equations and other complex names of chemicals that cannot be parsed accurately using the existing parsers. You can then, according to your problem statement, decide whether you should remove them in the preprocessing stage, or keep them and do some customization of parsing at the part-of-speech (POS) tagging level.