Leverage the power of machine learning and deep learning to extract information from text data
This book is intended for Python developers who wish to start with natural language processing and want to make their applications smarter by implementing NLP in them.
This book starts off by laying the foundation for natural language processing and explaining why Python is one of the best options for building an NLP-based expert system, with advantages such as community support, the availability of frameworks, and so on. Later, it gives you a better understanding of the available free forms of corpora and different types of datasets. After this, you will know how to choose a dataset for natural language processing applications and find the right NLP techniques to process sentences in datasets and understand their structure. You will also learn how to tokenize different parts of sentences and ways to analyze them.
During the course of the book, you will explore the semantic as well as syntactic analysis of text. You will understand how to solve various ambiguities in processing human language and will come across various scenarios while performing text analysis.
You will learn the very basics of getting the environment ready for natural language processing, move on to the initial setup, and then quickly understand sentences and language parts. You will learn the power of Machine Learning and Deep Learning to extract information from text data.
By the end of the book, you will have a clear understanding of natural language processing and will have worked on multiple examples that implement NLP in the real world.
This book teaches readers various aspects of natural language processing using NLTK. It takes the reader from the basic to the advanced level in a smooth way.
BIRMINGHAM - MUMBAI
Copyright © 2017 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: July 2017
Production reference: 1280717
ISBN 978-1-78712-142-3
www.packtpub.com
Author
Jalaj Thanaki
Copy Editor
Safis Editing
Reviewers
Devesh Raj
Gayetri Thakur
Prabhanjan Tattar
Chirag Mahapatra
Project Coordinator
Manthan Patel
Commissioning Editor
Veena Pagare
Proofreader
Safis Editing
Acquisition Editor
Aman Singh
Indexer
Tejal Daruwale Soni
Content Development Editor
Jagruti Babaria
Production Coordinator
Deepika Naik
Technical Editor
Sayli Nikalje
Data science is rapidly changing the world and the way we do business--be it retail, banking and financial services, publishing, pharmaceuticals, manufacturing, and so on. Data of all forms is growing exponentially--quantitative, qualitative, structured, unstructured, speech, video, and so on. It is imperative to make use of this data across all functions--to avoid risk and fraud, enhance the customer experience, increase revenues, and streamline operations.
Organizations are moving fast to embrace data science and are investing a lot in high-end data science teams. Having spent more than 12 years in the BFSI domain, I am overwhelmed by the transition the BFSI industry has seen in embracing analytics as a business function rather than merely a support function. This holds especially true for the fintech and digital lending world of which Jalaj and I are a part.
I have known Jalaj since her college days and am impressed with her exuberance and self-motivation. Her research skills, perseverance, commitment, discipline, and quickness to grasp even the most difficult concepts have made her achieve success in a short span of 4 years on her corporate journey.
Jalaj is a gifted intellectual with a strong mathematical and statistical understanding and demonstrates a continuous passion for learning the new and complex analytical and statistical techniques that are emerging in the industry. She brings experience to the data science domain and I have seen her deliver impressive projects around NLP, machine learning, basic linguistic analysis, neural networks, and deep learning. The blistering pace of the work schedule that she sets for herself, coupled with the passion she puts into her work, leads to definite and measurable results for her organization.
One of her most special qualities is her readiness to solve the most basic to the most complex problem in the interest of the business. She is an excellent team player and ensures that the organization gains the maximum benefit of her exceptional talent.
In this book, Jalaj takes us on an exciting and insightful journey through the natural language processing domain. She starts with the basic concepts and moves on to the most advanced concepts, such as how machine learning and deep learning are used in NLP.
I wish Jalaj all the best in all her future endeavors.
Sarita Arora Chief Analytics Officer, SMECorner Mumbai, India
Jalaj Thanaki is a data scientist by profession and data science researcher by practice. She likes to deal with data science related problems. She wants to make the world a better place using data science and artificial intelligence related technologies. Her research interest lies in natural language processing, machine learning, deep learning, and big data analytics. Besides being a data scientist, Jalaj is also a social activist, traveler, and nature-lover.
I would like to dedicate this book to my husband, Shetul Thanaki, for his constant support, encouragement, and creative suggestions.
I give deep thanks and gratitude to my parents, my in-laws, my family, and my friends, who have helped me at every stage of my life. I would also like to thank all the mentors that I've had over the years. I really appreciate the efforts by the technical reviewers in reviewing this book. I would also like to thank my current organization, SMECorner, for its support. I am a big fan of open source communities and education communities, so I really want to thank communities such as Kaggle, Udacity, and Coursera, which have helped me, in a direct or indirect manner, to understand the various concepts of data science. Without learning from these communities, there is not a chance I could be doing what I do today.
I would like to thank Packt Publishing and Aman Singh, who approached me to write this book. I really appreciate the effort put in by the entire Packt editorial team to make this book as good as possible. Special thanks to Aman Singh, Jagruti Babaria, Menka Bohra, Manthan Patel, Nidhi Joshi, Sayli Nikalje, Manisha Sinha, Safis, and Tania Dutta.
I would like to recognize the efforts of the technical editing, strategy and management, marketing, sales, graphic design, pre-production, post-production, layout coordination, and indexing teams for making my authoring journey so smooth.
I feel really compelled to pass my knowledge on to those willing to learn.
Thank you God for being kind to me!
Cheers and Happy Reading!
Devesh Raj is a data scientist with 10 years of experience in developing algorithms and solving problems in various domains--healthcare, manufacturing, automotive, production, and so on, applying machine learning (supervised and unsupervised machine learning techniques) and deep learning on structured and unstructured data (computer vision and NLP).
Gayetri Thakur is a linguist working in the area of natural language processing. She has worked on co-developing NLP tools such as an automatic grammar checker, a named entity recognizer, and text-to-speech and speech-to-text systems. She currently works for Google India Pvt. Ltd.
She is pursuing a PhD in linguistics and has completed her master's in linguistics from Banaras Hindu University.
Prabhanjan Tattar has over 9 years of experience as a statistical analyst. Survival analysis and statistical inference are his main areas of research/interest, and he has published several research papers in peer-reviewed journals and authored three books on R: R Statistical Application Development by Example (Packt Publishing), A Course in Statistics with R (Wiley), and Practical Data Science Cookbook (Packt Publishing). He also maintains the R packages gpk, RSADBE, and ACSWR.
Chirag Mahapatra is a software engineer who works on applying machine learning and natural language processing to problems in trust and safety. He currently works at Trooly (acquired by Airbnb). In the past, he has worked at A9.com on the ads data platform.
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www.packtpub.com/mapt
Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/1787121429.
If you'd like to join our team of regular reviewers, you can e-mail us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
Introduction
Understanding natural language processing
Understanding basic applications
Understanding advanced applications
Advantages of togetherness - NLP and Python
Environment setup for NLTK
Tips for readers
Summary
Practical Understanding of a Corpus and Dataset
What is a corpus?
Why do we need a corpus?
Understanding corpus analysis
Exercise
Understanding types of data attributes
Categorical or qualitative data attributes
Numeric or quantitative data attributes
Exploring different file formats for corpora
Resources for accessing free corpora
Preparing a dataset for NLP applications
Selecting data
Preprocessing the dataset
Formatting
Cleaning
Sampling
Transforming data
Web scraping
Summary
Understanding the Structure of Sentences
Understanding components of NLP
Natural language understanding
Natural language generation
Differences between NLU and NLG
Branches of NLP
Defining context-free grammar
Exercise
Morphological analysis
What is morphology?
What are morphemes?
What is a stem?
What is morphological analysis?
What is a word?
Classification of morphemes
Free morphemes
Bound morphemes
Derivational morphemes
Inflectional morphemes
What is the difference between a stem and a root?
Exercise
Lexical analysis
What is a token?
What are part of speech tags?
Process of deriving tokens
Difference between stemming and lemmatization
Applications
Syntactic analysis
What is syntactic analysis?
Semantic analysis
What is semantic analysis?
Lexical semantics
Hyponymy and hyponyms
Homonymy
Polysemy
What is the difference between polysemy and homonymy?
Application of semantic analysis
Handling ambiguity
Lexical ambiguity
Syntactic ambiguity
Approach to handle syntactic ambiguity
Semantic ambiguity
Pragmatic ambiguity
Discourse integration
Applications
Pragmatic analysis
Summary
Preprocessing
Handling corpus-raw text
Getting raw text
Lowercase conversion
Sentence tokenization
Challenges of sentence tokenization
Stemming for raw text
Challenges of stemming for raw text
Lemmatization of raw text
Challenges of lemmatization of raw text
Stop word removal
Exercise
Handling corpus-raw sentences
Word tokenization
Challenges for word tokenization
Word lemmatization
Challenges for word lemmatization
Basic preprocessing
Regular expressions
Basic level regular expression
Basic flags
Advanced level regular expression
Positive lookahead
Positive lookbehind
Negative lookahead
Negative lookbehind
Practical and customized preprocessing
Decide by yourself
Is preprocessing required?
What kind of preprocessing is required?
Understanding case studies of preprocessing
Grammar correction system
Sentiment analysis
Machine translation
Spelling correction
Approach
Summary
Feature Engineering and NLP Algorithms
Understanding feature engineering
What is feature engineering?
What is the purpose of feature engineering?
Challenges
Basic feature of NLP
Parsers and parsing
Understanding the basics of parsers
Understanding the concept of parsing
Developing a parser from scratch
Types of grammar
Context-free grammar
Probabilistic context-free grammar
Calculating the probability of a tree
Calculating the probability of a string
Grammar transformation
Developing a parser with the Cocke-Kasami-Younger Algorithm
Developing parsers step-by-step
Existing parser tools
The Stanford parser
The spaCy parser
Extracting and understanding the features
Customizing parser tools
Challenges
POS tagging and POS taggers
Understanding the concept of POS tagging and POS taggers
Developing POS taggers step-by-step
Plug and play with existing POS taggers
A Stanford POS tagger example
Using polyglot to generate POS tagging
Exercise
Using POS tags as features
Challenges
Named entity recognition
Classes of NER
Plug and play with existing NER tools
A Stanford NER example
A spaCy NER example
Extracting and understanding the features
Challenges
n-grams
Understanding n-grams using a practical example
Application
Bag of words
Understanding BOW
Understanding BOW using a practical example
Comparing n-grams and BOW
Applications
Semantic tools and resources
Basic statistical features for NLP
Basic mathematics
Basic concepts of linear algebra for NLP
Basic concepts of the probabilistic theory for NLP
Probability
Independent event and dependent event
Conditional probability
TF-IDF
Understanding TF-IDF
Understanding TF-IDF with a practical example
Using textblob
Using scikit-learn
Application
Vectorization
Encoders and decoders
One-hot encoding
Understanding a practical example for one-hot encoding
Application
Normalization
The linguistics aspect of normalization
The statistical aspect of normalization
Probabilistic models
Understanding probabilistic language modeling
Application of LM
Indexing
Application
Ranking
Advantages of feature engineering
Challenges of feature engineering
Summary
Advanced Feature Engineering and NLP Algorithms
Recall word embedding
Understanding the basics of word2vec
Distributional semantics
Defining word2vec
Necessity of the unsupervised distributional semantic model - word2vec
Challenges
Converting the word2vec model from black box to white box
Distributional similarity based representation
Understanding the components of the word2vec model
Input of the word2vec
Output of word2vec
Construction components of the word2vec model
Architectural component
Understanding the logic of the word2vec model
Vocabulary builder
Context builder
Neural network with two layers
Structural details of a word2vec neural network
Word2vec neural network layer's details
Softmax function
Main processing algorithms
Continuous bag of words
Skip-gram
Understanding algorithmic techniques and the mathematics behind the word2vec model
Understanding the basic mathematics for the word2vec algorithm
Techniques used at the vocabulary building stage
Lossy counting
Using it at the stage of vocabulary building
Applications
Techniques used at the context building stage
Dynamic window scaling
Understanding dynamic context window techniques
Subsampling
Pruning
Algorithms used by neural networks
Structure of the neurons
Basic neuron structure
Training a simple neuron
Define error function
Understanding gradient descent in word2vec
Single neuron application
Multi-layer neural networks
Backpropagation
Mathematics behind the word2vec model
Techniques used to generate final vectors and probability prediction stage
Hierarchical softmax
Negative sampling
Some of the facts related to word2vec
Applications of word2vec
Implementation of simple examples
Famous example (king - man + woman)
Advantages of word2vec
Challenges of word2vec
How is word2vec used in real-life applications?
When should you use word2vec?
Developing something interesting
Exercise
Extension of the word2vec concept
Para2Vec
Doc2Vec
Applications of Doc2vec
GloVe
Exercise
Importance of vectorization in deep learning
Summary
Rule-Based System for NLP
Understanding of the rule-based system
What does the RB system mean?
Purpose of having the rule-based system
Why do we need the rule-based system?
Which kind of applications can use the RB approach over the other approaches?
Exercise
What kind of resources do you need if you want to develop a rule-based system?
Architecture of the RB system
General architecture of the rule-based system as an expert system
Practical architecture of the rule-based system for NLP applications
Custom architecture - the RB system for NLP applications
Exercise
Apache UIMA - the RB system for NLP applications
Understanding the RB system development life cycle
Applications
NLP applications using the rule-based system
Generalized AI applications using the rule-based system
Developing NLP applications using the RB system
Thinking process for making rules
Start with simple rules
Scraping the text data
Defining the rule for our goal
Coding our rule and generating a prototype and result
Exercise
Python for pattern-matching rules for a proofreading application
Exercise
Grammar correction
Template-based chatbot application
Flow of code
Advantages of template-based chatbot
Disadvantages of template-based chatbot
Exercise
Comparing the rule-based approach with other approaches
Advantages of the rule-based system
Disadvantages of the rule-based system
Challenges for the rule-based system
Understanding word-sense disambiguation basics
Discussing recent trends for the rule-based system
Summary
Machine Learning for NLP Problems
Understanding the basics of machine learning
Types of ML
Supervised learning
Unsupervised learning
Reinforcement learning
Development steps for NLP applications
Development step for the first iteration
Development steps for the second to nth iteration
Understanding ML algorithms and other concepts
Supervised ML
Regression
Classification
ML algorithms
Exercise
Unsupervised ML
k-means clustering
Document clustering
Advantages of k-means clustering
Disadvantages of k-means clustering
Exercise
Semi-supervised ML
Other important concepts
Bias-variance trade-off
Underfitting
Overfitting
Evaluation metrics
Exercise
Feature selection
Curse of dimensionality
Feature selection techniques
Dimensionality reduction
Hybrid approaches for NLP applications
Post-processing
Summary
Deep Learning for NLU and NLG Problems
An overview of artificial intelligence
The basics of AI
Components of AI
Automation
Intelligence
Stages of AI
Machine learning
Machine intelligence
Machine consciousness
Types of artificial intelligence
Artificial narrow intelligence
Artificial general intelligence
Artificial superintelligence
Goals and applications of AI
AI-enabled applications
Comparing NLU and NLG
Natural language understanding
Natural language generation
A brief overview of deep learning
Basics of neural networks
The first computation model of the neuron
Perceptron
Understanding mathematical concepts for ANN
Gradient descent
Calculating error or loss
Calculating gradient descent
Activation functions
Sigmoid
TanH
ReLU and its variants
Loss functions
Implementation of ANN
Single-layer NN with backpropagation
Backpropagation
Exercise
Deep learning and deep neural networks
Revisiting DL
The basic architecture of DNN
Deep learning in NLP
Difference between classical NLP and deep learning NLP techniques
Deep learning techniques and NLU
Machine translation
Deep learning techniques and NLG
Exercise
Recipe summarizer and title generation
Gradient descent-based optimization
Artificial intelligence versus human intelligence
Summary
Advanced Tools
Apache Hadoop as a storage framework
Apache Spark as a processing framework
Apache Flink as a real-time processing framework
Visualization libraries in Python
Summary
How to Improve Your NLP Skills
Beginning a new career journey with NLP
Cheat sheets
Choose your area
Agile way of working to achieve success
Useful blogs for NLP and data science
Grab public datasets
Mathematics needed for data science
Summary
Installation Guide
Installing Python, pip, and NLTK
Installing the PyCharm IDE
Installing dependencies
Framework installation guides
Drop your queries
Summary
The book title, Python Natural Language Processing, gives you a broad idea about the book. As a reader, you will get the chance to learn about all the aspects of natural language processing (NLP) from scratch. In this book, I have specified NLP concepts in a very simple language, and there are some really cool practical examples that enhance your understanding of this domain. By implementing these examples, you can improve your NLP skills. Don't you think that sounds interesting?
Now let me answer some of the most common questions I have received from my friends and colleagues about the NLP domain. These questions really inspired me to write this book. For me, it's really important that all my readers understand why I am writing this book. Let's find out!
Here, I would like to answer some of the questions that I feel are critical to my readers. So, I'll begin with some of the questions, followed by the answers. The first question I usually get asked is--what is NLP? The second one is--why is Python mainly used for developing NLP applications? And last but not least, the most critical question is--what are the resources I can use for learning NLP? Now let's look at the answers!
The answer to the first question is this: natural language, simply put, is the language you speak, write, read, or understand as a human; natural language is, thus, a medium of communication. Using computer science algorithms, mathematical concepts, and statistical techniques, we try to process this language so that machines can also understand it the way humans do; this is called NLP.
Now let's answer the second question--why do people mainly use Python to develop NLP applications? There are some facts that I want to share with you. The very simple and straightforward one is that Python has a lot of libraries that make your life easy when you develop NLP applications. The second reason is that if you are coming from a C or C++ coding background, you don't need to worry about memory leakage; the Python interpreter will handle this for you, so you can just focus on the main coding part. Besides, Python is a coder-friendly language. You can do much more by writing just a few lines of code, compared to other object-oriented languages. So all these facts drive people to use Python for developing NLP and other data science-related applications for rapid prototyping.
The last question is critical to me because I used to explain the previous answers to my friends, but after hearing all these and other fascinating things, they would come to me and say that they want to learn NLP, so what are the resources available? I used to recommend books, blogs, YouTube videos, education platforms such as Udacity and Coursera, and a lot more, but after a few days, they would ask me if there is a single resource in the form of book, blog, or anything that they could use. Unfortunately, for them, my answer was no. At that stage, I really felt that juggling all these resources would always be difficult for them, and that painful realization became my inspiration to write this book.
So, in this book, I have tried to cover most of the essential parts of NLP, which will be useful for everyone. The great news is that I have provided practical examples using Python, so readers can understand all the concepts theoretically as well as practically. Reading, understanding, and coding are the three main processes that I have followed in this book to make readers' lives easier.
Chapter 1, Introduction, provides an introduction to NLP and the various branches involved in the NLP domain. We will see the various stages of building NLP applications and discuss NLTK installation.
Chapter 2, Practical Understanding of a Corpus and Dataset, shows all the aspects of corpus analysis. We will see the different types of corpora and the data attributes present in corpora. We will touch upon different corpus formats such as CSV, JSON, XML, LibSVM, and so on. We will also see a web scraping example.
Chapter 3, Understanding the Structure of Sentences, helps you understand the most essential aspect of natural language, which is linguistics. We will see the concepts of lexical analysis, syntactic analysis, semantic analysis, handling ambiguities, and so on. We will use NLTK to understand all the concepts practically.
Chapter 4, Preprocessing, helps you get to know the various types of preprocessing techniques and how you can customize them. We will see the stages of preprocessing such as data preparation, data processing, and data transformation. Apart from this, you will understand the practical aspects of preprocessing.
Chapter 5, Feature Engineering and NLP Algorithms, is the core part of an NLP application. We will see how different algorithms and tools are used to generate input for machine learning algorithms, which we will be using to develop NLP applications. We will also understand the statistical concepts used in feature engineering, and we will get into the customization of tools and algorithms.
Chapter 6, Advanced Feature Engineering and NLP Algorithms, gives you an understanding of the most recent concepts in NLP, which are used to deal with semantic issues. We will see word2vec, doc2vec, GloVe, and so on, as well as some practical implementations of word2vec by generating vectors from a Game of Thrones dataset.
Chapter 7, Rule-Based System for NLP, details how we can build a rule-based system and all the aspects you need to keep in mind while developing the same for NLP. We will see the rule-making process and code the rules too. We will also see how we can develop a template-based chatbot.
Chapter 8, Machine Learning for NLP Problems, provides you with fresh aspects of machine learning techniques. We will see the various algorithms used to develop NLP applications. We will also implement some great NLP applications using machine learning.
Chapter 9, Deep Learning for NLU and NLG Problems, introduces you to various aspects of artificial intelligence. We will look at the basic concepts of artificial neural networks (ANNs) and how you can build your own ANN. We will understand hardcore deep learning, develop the mathematical aspects of deep learning, and see how deep learning is used for natural language understanding (NLU) and natural language generation (NLG). You can expect some cool practical examples here as well.
Appendix A, Advanced Tools, gives you a brief introduction to various frameworks such as Apache Hadoop, Apache Spark, and Apache Flink.
Appendix B, How to Improve Your NLP Skills, is about suggestions from my end on how to keep your NLP skills up to date and how constant learning will help you acquire new NLP skills.
Appendix C, Installation Guide, has instructions for installations required.
Let's discuss some prerequisites for this book. Don't worry, it's not math or statistics; just basic Python coding syntax is all you need to know. Apart from that, you need Python 2.7.X or Python 3.5.X installed on your computer; I also recommend using a Linux operating system.
The list of Python dependencies can be found in the GitHub repository at https://github.com/jalajthanaki/NLPython/blob/master/pip-requirements.txt.
Now let's look at the hardware required for this book. A computer with 4 GB RAM and at least a two-core CPU is good enough to execute the code, but for the machine learning and deep learning examples, you may need more RAM, perhaps 8 GB or 16 GB, and additional computational power in the form of GPU(s).
This book is intended for Python developers who wish to start with NLP and want to make their applications smarter by implementing NLP in them.
In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.
Code words in text, database table names, folder names, filenames, file extensions, path names, dummy URLs, user input, and Twitter handles are shown as follows: "The nltk library provides some inbuilt corpuses."
A block of code is set as follows (the snippet shown here is simply a representative example that loads one of the inbuilt nltk corpora):
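import nltk
from nltk.corpus import brown

# Representative example only: it assumes nltk is installed and the brown
# corpus has been downloaded via nltk.download()
print(brown.categories())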
Any command-line input or output is written as follows:
pip install nltk
or
sudo pip install nltk
New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "This will open an additional dialog window, where you can choose specific libraries, but in our case, click on All packages, and you can choose the path where the packages reside. Wait till all the packages are downloaded."
Feedback from our readers is always welcome. Let us know what you think about this book-what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply e-mail [email protected], and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files emailed directly to you. You can download the code files by following these steps:
Log in or register to our website using your e-mail address and password.
Hover the mouse pointer on the
SUPPORT
tab at the top.
Click on
Code Downloads & Errata
.
Enter the name of the book in the
Search
box.
Select the book for which you're looking to download the code files.
Choose from the drop-down menu where you purchased this book from.
Click on
Code Download
.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Python-Natural-Language-Processing. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/PythonNaturalLanguageProcessing_ColorImages.pdf.
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books-maybe a mistake in the text or the code-we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title. To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy. Please contact us at [email protected] with a link to the suspected pirated material. We appreciate your help in protecting our authors and our ability to bring you valuable content.
If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.
In this chapter, we'll have a gentle introduction to natural language processing (NLP) and see how natural language processing concepts are used in real-life artificial intelligence applications. We will focus mainly on Python programming paradigms, which are used to develop NLP applications. Later on, the chapter has a tips section for readers. If you are really interested in finding out about the comparison of various programming paradigms for NLP and why Python is the best programming paradigm, then, as a reader, you should go through the Preface of this book. As an industry professional, I have tried most of the programming paradigms for NLP. I have used Java, R, and Python for NLP applications. Trust me, guys, Python is quite easy and efficient for developing applications that use NLP concepts.
We will cover the following topics in this chapter:
Understanding NLP
Understanding basic applications
Understanding advanced applications
Advantages of togetherness--NLP and Python
Environment setup for NLTK
Tips for readers
In the last few years, branches of artificial intelligence (AI) have created a lot of buzz, and those branches are data science, data analytics, predictive analysis, NLP, and so on.
As mentioned in the Preface of this book, we are focusing on Python and natural language processing. Let me ask you some questions--Do you really know what natural language is? What is natural language processing? What are the other branches involved in building expert systems using various concepts of natural language processing? How can we build intelligent systems using the concept of NLP?
Let's begin our roller coaster ride of understanding NLP.
What is natural language?
As human beings, we express our thoughts and feelings via a language
Whatever you speak, read, write, or listen to is mostly in the form of natural language, so it is commonly referred to as natural language
For example:
The content of this book is a source of natural language
Whatever you speak, listen, and write in your daily life is also in the form of natural language
Movie dialogues are also a source of natural language
Your WhatsApp conversations are also considered a form of natural language
What is natural language processing?
Now you have an understanding of what natural language is. NLP is a sub-branch of AI. Let's consider an example to understand the concept of NLP. Say you want to build a machine that interacts with humans in the form of natural language. Building this kind of intelligent system requires computational technologies and computational linguistics, so that the system can process natural language the way humans do.
You can relate the aforementioned concept of NLP to the existing NLP products from the world's top tech companies, such as Google Assistant from Google, Siri speech assistance from Apple, and so on.
Now you will be able to understand the definitions of NLP, which are as follows:
Natural language processing is the ability of computational technologies and/or computational linguistics to process human natural language
Natural language processing is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages
Natural language processing can be defined as the automatic (or semi-automatic) processing of human natural language
What are the other branches involved in building expert systems using various concepts of NLP? Figure 1.1 is the best way to know how many other branches are involved when you are building an expert system using NLP concepts:
Figures 1.2 and 1.3 convey all the subtopics that are included in every branch given in Figure 1.1:
How can we build an intelligent system using concepts of NLP? Figure 1.4 is the basic model, which indicates how an expert system can be built for NLP applications. The development life cycle is defined in the following figure:
Let's see some of the details of the development life cycle of NLP-related problems (a small code sketch of the overall flow follows the list):
If you are solving an NLP problem, you first need to understand the problem statement.
Once you understand your problem statement, think about what kind of data or corpus you need to solve the problem. So, data collection is the basic activity toward solving the problem.
After you have collected a sufficient amount of data, you can start analyzing your data. What is the quality and quantity of our corpus? According to the quality of the data and your problem statement, you need to do preprocessing.
Once you are done with preprocessing, you need to start with the process of feature engineering. Feature engineering is the most important aspect of NLP and data science related applications. We will be covering feature engineering related aspects in much more detail in Chapter 5, Feature Engineering and NLP Algorithms, and Chapter 6, Advanced Feature Engineering and NLP Algorithms.
Having decided on and extracted features from the raw preprocessed data, you need to decide which computational technique is useful to solve your problem statement; for example, do you want to apply machine learning techniques or rule-based techniques?
Now, depending on what techniques you are going to use, you should ready the feature files that you are going to provide as an input to your decided algorithm.
Run your logic, then generate the output.
Test and evaluate your system's output.
Tune the parameters for optimization, and continue till you get satisfactory results.
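To make these steps concrete, here is a minimal sketch of the flow, assuming scikit-learn is installed and using a tiny, made-up list of labeled sentences; it is only an illustration of how the stages connect, not a production pipeline:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Problem statement and data collection: a tiny, made-up labeled corpus
sentences = ["i love this movie", "this film was terrible",
             "what a great experience", "absolutely awful acting"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

# Preprocessing and feature engineering: convert raw text into TF-IDF vectors
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(sentences)

# Choose a computational technique (here, a simple ML classifier) and run it
x_train, x_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.5, stratify=labels, random_state=42)
model = LogisticRegression()
model.fit(x_train, y_train)

# Test, evaluate, and then tune parameters until the results are satisfactory
print(accuracy_score(y_test, model.predict(x_test)))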
We will be covering a lot of information very quickly in this chapter, so if you see something that doesn't immediately make sense, please do not feel lost and bear with me. We will explore all the details and examples from the next chapter onward, and that will definitely help you connect the dots.
NLP is a sub-branch of AI. Concepts from NLP are used in the following expert systems:
Speech recognition system
Question answering system
Translation from one specific language to another specific language
Text summarization
Sentiment analysis
Template-based chatbots
Text classification
Topic segmentation
We will learn about most of the NLP concepts that are used in the preceding applications in the further chapters.
Advanced applications include the following:
Humanoid robots that understand natural language commands and interact with humans in natural language.
Building a universal machine translation system is a long-term goal in the NLP domain, because you could easily build a machine translation system that converts one specific language into another specific language, but that system may not help you translate other languages. With the help of deep learning, we can develop a universal machine translation system, and Google recently announced that it is very close to achieving this goal. We will build our own machine translation system using deep learning in Chapter 9, Deep Learning for NLU and NLG Problems.
An NLP system that generates a logical title for a given document is another advanced application. Also, with the help of deep learning, you can generate the title of a document and perform summarization on top of that. You will see this kind of application in Chapter 9, Deep Learning for NLU and NLG Problems.
An NLP system that generates text for specific topics or for an image is also considered an advanced NLP application.
Advanced chatbots, which generate personalized text for humans and ignore mistakes in human writing, are also a goal we are trying to achieve.
There are many other NLP applications, which you can see in Figure 1.5:
The following points illustrate why Python is one of the best options for building an NLP-based expert system (a short code example follows the list):
Developing prototypes for the NLP-based expert system using Python is very easy and efficient
A large variety of open source NLP libraries are available for Python programmers
Community support is very strong
Easy to use and less complex for beginners
Rapid development: testing and evaluation are easy and less complex
Many of the new frameworks, such as Apache Spark, Apache Flink, TensorFlow, and so on, provide API for Python
Optimization of the NLP-based system is less complex compared to other programming paradigms
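To give a feel for how little code a basic NLP task needs in Python, the following small sketch tokenizes a sentence with nltk; it assumes nltk is installed and the punkt tokenizer models have been downloaded via nltk.download():

import nltk

# Splitting a raw sentence into word tokens takes only a couple of lines
# (requires the punkt tokenizer data to be downloaded beforehand)
sentence = "Python makes natural language processing approachable."
print(nltk.word_tokenize(sentence))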
I would like to suggest to all my readers that they clone the NLPython repository from GitHub. The repository URL is https://github.com/jalajthanaki/NLPython
I'm using Linux (Ubuntu) as the operating system, so if you are not familiar with Linux, it's better for you to make yourself comfortable with it, because most of the advanced frameworks, such as Apache Hadoop, Apache Spark, Apache Flink, Google TensorFlow, and so on, require a Linux operating system.
The GitHub repository contains instructions on how to install Linux, as well as basic Linux commands that we will use throughout this book. You can also find basic Git commands there if you are new to Git. The URL is https://github.com/jalajthanaki/NLPython/tree/master/ch1/documentation
I'm providing an installation guide for readers to set up the environment for these chapters. The URL is https://github.com/jalajthanaki/NLPython/tree/master/ch1/installation_guide
Steps for installing nltk are as follows (or you can follow the URL: https://github.com/jalajthanaki/NLPython/blob/master/ch1/installation_guide/NLTK%2BSetup.md):
Install Python 2.7.x manually, but on Linux Ubuntu 14.04, it has already been installed; otherwise, you can check your Python version using the python -V command.
Configure pip for installing Python libraries (https://github.com/jalajthanaki/NLPython/blob/master/ch1/installation_guide/NLTK%2BSetup.md).
Open the terminal, and execute the following command:
pip install nltk
or
sudo pip install nltk
Open the terminal, and execute the python command.
Inside the Python shell, execute the import nltk command.
If your nltk module is successfully installed on your system, the system will not throw any messages.
Inside the Python shell, execute the nltk.download() command.
This will open an additional dialog window, where you can choose specific libraries, but in our case, click on All packages, and you can choose the path where the packages reside. Wait till all the packages are downloaded. It may take a long time to download. After completion of the download, you can find the folder named nltk_data at the path you specified earlier. Take a look at the NLTK Downloader in the following screenshot:
This repository contains an installation guide, codes, wiki page, and so on. If readers have questions and queries, they can post their queries on the Gitter group. The Gitter group URL is https://gitter.im/NLPython/Lobby?utm_source=share-link&utm_medium=link&utm_campaign=share-link
This book is a practical guide. As an industry professional, I strongly recommend all my readers replicate the code that is already available on GitHub and perform the exercises given in the book. This will improve your understanding of NLP concepts. Without performing the practicals, it will be nearly impossible for you to grasp all the NLP concepts thoroughly. By the way, I promise that it will be fun to implement them.
The flow of upcoming chapters is as follows:
Explanation of the concepts
Application of the concepts
Need for the concepts
Possible ways to implement the concepts (code is on GitHub)
Challenges of the concepts
Tips to overcome challenges
Exercises
This chapter gave you an introduction to NLP. You now have a brief idea about what kind of branches are involved in NLP and the various stages for building an expert system using NLP concepts. Lastly, we set up the environment for NLTK. All the installation guidelines and codes are available on GitHub.
In the next chapter, we will see what kind of corpus is used in NLP-related applications and what critical points we should keep in mind when we analyze a corpus. We will deal with different types of file formats and datasets. Let's explore this together!
In this chapter, we'll explore the first building block of natural language processing. We are going to cover the following topics to get a practical understanding of a corpus or dataset:
What is a corpus?
Why do we need a corpus?
Understanding corpus analysis
Understanding types of data attributes
Exploring different file formats of datasets
Resources for accessing free corpora
Preparing datasets for NLP applications
Developing a web scraping application
Natural language processing related applications are built using a huge amount of data. In layman's terms, you can say that a large collection of data is called a corpus. So, more formally and technically, a corpus can be defined as follows:
A corpus is a collection of written or spoken natural language material, stored on a computer, and used to find out how language is used. More precisely, a corpus is a systematic, computerized collection of authentic language that is used for linguistic analysis as well as corpus analysis. If you have more than one corpus, they are called corpora.
In order to develop NLP applications, we need a corpus, that is, written or spoken natural language material. We use this material or data as input data and try to find out facts that can help us develop NLP applications. Sometimes, NLP applications use a single corpus as the input, and at other times, they use multiple corpora as input.
There are many reasons for using a corpus when developing NLP applications, some of which are as follows:
With the help of corpus, we can perform some statistical analysis such as frequency distribution, co-occurrences of words, and so on. Don't worry, we will see some basic statistical analysis for corpus later in this chapter.
We can define and validate linguistic rules for various NLP applications. If you are building a grammar correction system, you will use the text corpus and try to find out the grammatically incorrect instances, and then you will define the grammar rules that help correct those instances.
We can define some specific linguistic rules that depend on the usage of the language. With the help of the rule-based system, you can define the linguistic rules and validate the rules using corpus.
In a corpus, the large collection of data can be in the following formats:
Text data, meaning written material
Speech data, meaning spoken material
Let's see what exactly text data is and how we can collect it. Text data is a collection of written information. There are several resources that can be used for getting written information, such as news articles, books, digital libraries, email messages, web pages, blogs, and so on. Right now, we are all living in a digital world, so the amount of text information is growing rapidly. We can use all of these resources to get text data and then build our own corpus. Let's take an example: if you want to build a system that summarizes news articles, you will first gather the various news articles present on the web and generate a collection of news articles; that collection is your corpus of news articles, and it contains text data. You can use web scraping tools to get information from raw HTML pages. In this chapter, we will develop one.
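As a small preview of the web scraping example we will develop later in this chapter, one common approach uses the requests and beautifulsoup4 libraries; the sketch below is only an illustration, and the URL is a placeholder rather than a real target:

import requests
from bs4 import BeautifulSoup

# Fetch a page and extract the text of every paragraph tag
# (https://example.com is a placeholder; use a page you are allowed to scrape)
response = requests.get("https://example.com")
soup = BeautifulSoup(response.text, "html.parser")
article_text = [p.get_text() for p in soup.find_all("p")]
print(article_text)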
Now we will see how speech data is collected. A speech data corpus generally has two things: one is an audio file, and the other one is its text transcription. Generally, we can obtain speech data from audio recordings. This audio recording may have dialogues or conversations of people. Let me give you an example: in India, when you call a bank customer care department, if you pay attention, you get to know that each and every call is recorded. This is the way you can generate speech data or speech corpus. For this book, we are concentrating just on text data and not on speech data.
A corpus is also referred to as a dataset in some cases.
There are three types of corpus:
Monolingual corpus: This type of corpus has one language
Bilingual corpus: This type of corpus has two languages
Multilingual corpus: This type of corpus has more than two languages
A few examples of the available corpora are given as follows:
Google Books Ngram corpus
Brown corpus
American National corpus
In any NLP application, we need data or a corpus to build NLP tools and applications. A corpus is the most critical and basic building block of any NLP-related application. It provides us with quantitative data that is used to build NLP applications. We can also use some part of the data to test and challenge our ideas and intuitions about the language. A corpus plays a very big role in NLP applications. The challenges regarding creating a corpus for NLP applications are as follows:
Deciding the type of data we need in order to solve the problem statement
Availability of data
Quality of the data
Adequacy of the data in terms of amount
Now you may want to know the details of all the preceding questions; for that, I will take an example that can help you to understand all the previous points easily. Consider that you want to make an NLP tool that understands the medical state of a particular patient and can help generate a diagnosis after proper medical analysis.
Here, our perspective is biased more toward the corpus level and kept general. If you look at the preceding example as an NLP learner, you should process the problem statement as stated here:
What kind of data do I need if I want to solve the problem statement?
Clinical notes or patient history
Audio recording of the conversation between doctor and patient
Do you have this kind of corpus or data with you?
If yes, great! You are in a good position, so you can proceed to the next question.
If not, OK! No worries. You need to process one more question, which is probably a difficult but interesting one.
Is there an open source corpus available?
If yes, download it, and continue to the next question.
If not, think of how you can access the data and build the corpus. Think of web scraping tools and techniques. But you have to explore the ethical as well as legal aspects of your web scraping tool.
What is the quality level of the corpus?
Go through the corpus, and try to figure out the following things:
If you can't understand the dataset at all, then what to do?
Spend more time with your dataset.
Think like a machine, and try to think of all the things you would process if you were fed with this kind of a dataset. Don't think that you will throw an error!
Find one thing that you feel you can begin with.
Suppose your NLP tool has to diagnose a human disease; think of what you would ask the patient if you were the machine in the doctor's place. Now you can start understanding your dataset and then think about the preprocessing part. Do not rush into it.
If you can understand the dataset, then what to do?
Do you need each and every thing that is in the corpus to build an NLP system?
If yes, then proceed to the next level, which we will look at in Chapter 5, Feature Engineering and NLP Algorithms.
If not, then proceed to the next level, which we will look at in Chapter 4, Preprocessing.
Will the amount of data be sufficient for solving the problem statement on at least a proof of concept (POC) basis?
According to my experience, I would prefer to have at least 500 MB to 1 GB of data for a small POC.
For startups, collecting 500 MB to 1 GB of data is also a challenge, for the following reasons:
Startups are new in business.
Sometimes they are very innovative, and there is no ready-made dataset available.
Even if they manage to build a POC, to validate their product in real life is also challenging.
Refer to Figure 2.1 for a description of the preceding process:
In this section, we will first understand what corpus analysis is. After this, we will briefly touch upon speech analysis. We will also understand how we can analyze text corpus for different NLP applications. At the end, we will do some practical corpus analysis for text corpus. Let's begin!
Corpus analysis can be defined as a methodology for pursuing in-depth investigations of linguistic concepts as grounded in the context of authentic and communicative situations. Here, we are talking about digitally stored language corpora, which are made available for access, retrieval, and analysis via computer.
Corpus analysis for speech data requires analyzing the phonetics of each of the data instances. Apart from phonetic analysis, we also need to do conversation analysis, which gives us an idea of how social interaction happens in day-to-day life in a specific language. Suppose, in real life, you are doing conversation analysis for casual English; you may find a sentence such as What's up, dude? used more frequently in conversations than How are you, sir (or madam)?
Corpus analysis for text data consists of statistically probing, manipulating, and generalizing the dataset. For a text dataset, we generally analyze how many different words are present in the corpus and what the frequency of certain words in the corpus is. If the corpus contains any noise, we try to remove that noise. In almost every NLP application, we need to do some basic corpus analysis so that we can understand our corpus well. nltk provides us with some inbuilt corpora, so we will perform corpus analysis using these inbuilt corpora. Before jumping to the practical part, it is very important to know what types of corpora are present in nltk.
nltk has four types of corpora. Let's look at each of them:
Isolate corpus: This type of corpus is a collection of text or natural language. Examples of this kind of corpus are gutenberg, webtext, and so on.
Categorized corpus: This type of corpus is a collection of texts that are grouped into different categories. An example of this kind of corpus is the brown corpus, which contains data for different categories such as news, hobbies, humor, and so on.
Overlapping corpus: This type of corpus is a collection of texts that are categorized, but the categories overlap with each other. An example of this kind of corpus is the reuters corpus: a single news item can belong to more than one category at the same time (for example, both coconut and coconut-oil), so the categories in the reuters corpus overlap.
Temporal corpus: This type of corpus is a collection of the usages of natural language over a period of time. An example of this kind of corpus is the inaugural address corpus. Suppose you record the usage of a language in any city of India in 1950, then repeat the same activity in 1980 and again in 2017; you will have recorded various data attributes regarding how people used the language and what changed over that period of time. A snippet to peek at each of these corpus types follows.
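The snippet below uses the nltk.corpus module and assumes the gutenberg, brown, reuters, and inaugural corpora have already been downloaded with nltk.download(); it simply prints a small sample from one corpus of each type:

from nltk.corpus import gutenberg, brown, reuters, inaugural

# Isolate corpus: plain collections of text
print(gutenberg.fileids()[:3])

# Categorized corpus: texts grouped into categories
print(brown.categories()[:5])

# Overlapping corpus: a single file can carry several categories
print(reuters.categories(reuters.fileids()[0]))

# Temporal corpus: language use recorded over a period of time
print(inaugural.fileids()[:3])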
Now, enough of theory; let's jump to the practical stuff. You can access the following links to see the code:
The code for this chapter is in the GitHub directory at https://github.com/jalajthanaki/NLPython/tree/master/ch2.
Follow the Python code on this URL: https://nbviewer.jupyter.org/github/jalajthanaki/NLPython/blob/master/ch2/2_1_Basic_corpus_analysis.html
The Python code has basic commands showing how to access corpora using the nltk API. We are using the brown and gutenberg corpora. We touch upon some of the basic corpus-related APIs.
A description of the basic API attributes is given in the following list:
fileids(): This results in files of the corpus
fileids([categories]): This results in files of the corpus corresponding to these categories
categories(): This lists categories of the corpus
categories([fileids]): This shows categories of the corpus corresponding to these files
raw(): This shows the raw content of the corpus
raw(fileids=[f1,f2,f3]): This shows the raw content of the specified files
raw(categories=[c1,c2]): This shows the raw content of the specified categories
words(): This shows the words of the whole corpus
words(fileids=[f1,f2,f3]): This shows the words of the specified fileids
words(categories=[c1,c2]): This shows the words of the specified categories
sents(): This shows the sentences of the whole corpus
sents(fileids=[f1,f2,f3]): This shows the sentences of the specified fileids
sents(categories=[c1,c2]): This shows the sentences of the specified categories
abspath(fileid): This shows the location of the given file on disk
encoding(fileid): This shows the encoding of the file (if known)
open(fileid): This basically opens a stream for reading the given corpus file
root: This shows a path, if it is the path to the root of the locally installed corpus
readme(): This shows the contents of the README file of the corpus
We have seen the code for loading your customized corpus using nltk, and we have also generated frequency distributions for the available corpora and our custom corpus.
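For reference, the core of that workflow looks roughly like the following sketch; the directory path is a placeholder for wherever your own plain-text files live, and the inbuilt brown corpus is assumed to have been downloaded already:

import nltk
from nltk.corpus import brown, PlaintextCorpusReader

# Frequency distribution over an inbuilt corpus
news_words = brown.words(categories='news')
print(nltk.FreqDist(w.lower() for w in news_words).most_common(10))

# Loading a customized corpus from a local directory of .txt files
# ('/path/to/your/corpus' is a placeholder)
custom_corpus = PlaintextCorpusReader('/path/to/your/corpus', r'.*\.txt')
print(custom_corpus.fileids())
print(nltk.FreqDist(custom_corpus.words()).most_common(10))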
The nltk corpora are not that noisy, but a basic kind of preprocessing is still required to generate features out of them. Using the basic corpus-loading APIs of nltk helps you identify the extreme levels of junk data. Suppose you have a biochemistry corpus; then you may have a lot of equations and other complex names of chemicals that cannot be parsed accurately using the existing parsers. You can then, according to your problem statement, decide whether you should remove them in the preprocessing stage or keep them and do some customization of parsing at the part-of-speech (POS) tagging level.
