Natural Language Processing: Python and NLTK - Nitin Hardeniya - E-Book

Natural Language Processing: Python and NLTK E-Book

Nitin Hardeniya

Description

Learn to build expert NLP and machine learning projects using NLTK and other Python libraries

About This Book

  • Break text down into its component parts for spelling correction, feature extraction, and phrase transformation
  • Work through NLP concepts with simple and easy-to-follow programming recipes
  • Gain insights into the current and budding research topics of NLP

Who This Book Is For

If you are an NLP or machine learning enthusiast and an intermediate Python programmer who wants to quickly master NLTK for natural language processing, then this Learning Path will do you a lot of good. Students of linguistics and semantic/sentiment analysis professionals will find it invaluable.

What You Will Learn

  • The scope of natural language complexity and how it is processed by machines
  • Clean and wrangle text using tokenization and chunking to help you process data better
  • Tokenize text into sentences and sentences into words
  • Classify text and perform sentiment analysis
  • Implement string matching algorithms and normalization techniques
  • Understand and implement the concepts of information retrieval and text summarization
  • Find out how to implement various NLP tasks in Python

In Detail

Natural Language Processing is a field of computational linguistics and artificial intelligence that deals with human-computer interaction. It enables seamless interaction between computers and human beings and gives computers the ability to understand human speech with the help of machine learning. The number of human-computer interaction instances is increasing, so it's becoming imperative that computers comprehend all major natural languages.

The first module, NLTK Essentials, introduces how to build systems around NLP, with a focus on how to create a customized tokenizer and parser from scratch. You will learn the essential concepts of NLP, gain practical insight into the open source tools and libraries available in Python, see how to analyze social media sites, and acquire tools to deal with large-scale text. This module also shows how to make use of some of the amazing capabilities of Python libraries such as NLTK, scikit-learn, pandas, and NumPy.

The second module, Python 3 Text Processing with NLTK 3 Cookbook, teaches you the essential techniques of text and language processing with simple, straightforward examples. These include organizing text corpora, creating your own custom corpus, text classification with a focus on sentiment analysis, and distributed text processing methods.

The third module, Mastering Natural Language Processing with Python, will help you become an expert and assist you in creating your own NLP projects using NLTK. You will be guided through model development with machine learning tools, shown how to create training data, and given insight into the best practices for designing and building NLP-based applications using Python.

This Learning Path combines some of the best that Packt has to offer in one complete, curated package and is designed to help you quickly learn text processing with Python and NLTK. It includes content from the following Packt products:

  • NLTK Essentials by Nitin Hardeniya
  • Python 3 Text Processing with NLTK 3 Cookbook by Jacob Perkins
  • Mastering Natural Language Processing with Python by Deepti Chopra, Nisheeth Joshi, and Iti Mathur

Style and approach

This comprehensive course creates a smooth learning path that teaches you how to get started with Natural Language Processing using Python and NLTK. You'll learn to create effective NLP and machine learning projects using Python and NLTK.


Page count: 788

Publication year: 2016




Table of Contents

Natural Language Processing: Python and NLTK
Natural Language Processing: Python and NLTK
Credits
Preface
What this learning path covers
What you need for this learning path
Who this learning path is for
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Module 1
1. Introduction to Natural Language Processing
Why learn NLP?
Let's start playing with Python!
Lists
Helping yourself
Regular expressions
Dictionaries
Writing functions
Diving into NLTK
Your turn
Summary
2. Text Wrangling and Cleansing
What is text wrangling?
Text cleansing
Sentence splitter
Tokenization
Stemming
Lemmatization
Stop word removal
Rare word removal
Spell correction
Your turn
Summary
3. Part of Speech Tagging
What is Part of speech tagging
Stanford tagger
Diving deep into a tagger
Sequential tagger
N-gram tagger
Regex tagger
Brill tagger
Machine learning based tagger
Named Entity Recognition (NER)
NER tagger
Your Turn
Summary
4. Parsing Structure in Text
Shallow versus deep parsing
The two approaches in parsing
Why we need parsing
Different types of parsers
A recursive descent parser
A shift-reduce parser
A chart parser
A regex parser
Dependency parsing
Chunking
Information extraction
Named-entity recognition (NER)
Relation extraction
Summary
5. NLP Applications
Building your first NLP application
Other NLP applications
Machine translation
Statistical machine translation
Information retrieval
Boolean retrieval
Vector space model
The probabilistic model
Speech recognition
Text classification
Information extraction
Question answering systems
Dialog systems
Word sense disambiguation
Topic modeling
Language detection
Optical character recognition
Summary
6. Text Classification
Machine learning
Text classification
Sampling
Naive Bayes
Decision trees
Stochastic gradient descent
Logistic regression
Support vector machines
The Random forest algorithm
Text clustering
K-means
Topic modeling in text
Installing gensim
References
Summary
7. Web Crawling
Web crawlers
Writing your first crawler
Data flow in Scrapy
The Scrapy shell
Items
The Sitemap spider
The item pipeline
External references
Summary
8. Using NLTK with Other Python Libraries
NumPy
ndarray
Indexing
Basic operations
Extracting data from an array
Complex matrix operations
Reshaping and stacking
Random numbers
SciPy
Linear algebra
eigenvalues and eigenvectors
The sparse matrix
Optimization
pandas
Reading data
Series data
Column transformation
Noisy data
matplotlib
Subplot
Adding an axis
A scatter plot
A bar plot
3D plots
External references
Summary
9. Social Media Mining in Python
Data collection
Twitter
Data extraction
Trending topics
Geovisualization
Influencers detection
Facebook
Influencer friends
Summary
10. Text Mining at Scale
Different ways of using Python on Hadoop
Python streaming
Hive/Pig UDF
Streaming wrappers
NLTK on Hadoop
A UDF
Python streaming
Scikit-learn on Hadoop
PySpark
Summary
2. Module 2
1. Tokenizing Text and WordNet Basics
Introduction
Tokenizing text into sentences
Getting ready
How to do it...
How it works...
There's more...
Tokenizing sentences in other languages
See also
Tokenizing sentences into words
How to do it...
How it works...
There's more...
Separating contractions
PunktWordTokenizer
WordPunctTokenizer
See also
Tokenizing sentences using regular expressions
Getting ready
How to do it...
How it works...
There's more...
Simple whitespace tokenizer
See also
Training a sentence tokenizer
Getting ready
How to do it...
How it works...
There's more...
See also
Filtering stopwords in a tokenized sentence
Getting ready
How to do it...
How it works...
There's more...
See also
Looking up Synsets for a word in WordNet
Getting ready
How to do it...
How it works...
There's more...
Working with hypernyms
Part of speech (POS)
See also
Looking up lemmas and synonyms in WordNet
How to do it...
How it works...
There's more...
All possible synonyms
Antonyms
See also
Calculating WordNet Synset similarity
How to do it...
How it works...
There's more...
Comparing verbs
Path and Leacock Chodorow (LCH) similarity
See also
Discovering word collocations
Getting ready
How to do it...
How it works...
There's more...
Scoring functions
Scoring ngrams
See also
2. Replacing and Correcting Words
Introduction
Stemming words
How to do it...
How it works...
There's more...
The LancasterStemmer class
The RegexpStemmer class
The SnowballStemmer class
See also
Lemmatizing words with WordNet
Getting ready
How to do it...
How it works...
There's more...
Combining stemming with lemmatization
See also
Replacing words matching regular expressions
Getting ready
How to do it...
How it works...
There's more...
Replacement before tokenization
See also
Removing repeating characters
Getting ready
How to do it...
How it works...
There's more...
See also
Spelling correction with Enchant
Getting ready
How to do it...
How it works...
There's more...
The en_GB dictionary
Personal word lists
See also
Replacing synonyms
Getting ready
How to do it...
How it works...
There's more...
CSV synonym replacement
YAML synonym replacement
See also
Replacing negations with antonyms
How to do it...
How it works...
There's more...
See also
3. Creating Custom Corpora
Introduction
Setting up a custom corpus
Getting ready
How to do it...
How it works...
There's more...
Loading a YAML file
See also
Creating a wordlist corpus
Getting ready
How to do it...
How it works...
There's more...
Names wordlist corpus
English words corpus
See also
Creating a part-of-speech tagged word corpus
Getting ready
How to do it...
How it works...
There's more...
Customizing the word tokenizer
Customizing the sentence tokenizer
Customizing the paragraph block reader
Customizing the tag separator
Converting tags to a universal tagset
See also
Creating a chunked phrase corpus
Getting ready
How to do it...
How it works...
There's more...
Tree leaves
Treebank chunk corpus
CoNLL2000 corpus
See also
Creating a categorized text corpus
Getting ready
How to do it...
How it works...
There's more...
Category file
Categorized tagged corpus reader
Categorized corpora
See also
Creating a categorized chunk corpus reader
Getting ready
How to do it...
How it works...
There's more...
Categorized CoNLL chunk corpus reader
See also
Lazy corpus loading
How to do it...
How it works...
There's more...
Creating a custom corpus view
How to do it...
How it works...
There's more...
Block reader functions
Pickle corpus view
Concatenated corpus view
See also
Creating a MongoDB-backed corpus reader
Getting ready
How to do it...
How it works...
There's more...
See also
Corpus editing with file locking
Getting ready
How to do it...
How it works...
4. Part-of-speech Tagging
Introduction
Default tagging
Getting ready
How to do it...
How it works...
There's more...
Evaluating accuracy
Tagging sentences
Untagging a tagged sentence
See also
Training a unigram part-of-speech tagger
How to do it...
How it works...
There's more...
Overriding the context model
Minimum frequency cutoff
See also
Combining taggers with backoff tagging
How to do it...
How it works...
There's more...
Saving and loading a trained tagger with pickle
See also
Training and combining ngram taggers
Getting ready
How to do it...
How it works...
There's more...
Quadgram tagger
See also
Creating a model of likely word tags
How to do it...
How it works...
There's more...
See also
Tagging with regular expressions
Getting ready
How to do it...
How it works...
There's more...
See also
Affix tagging
How to do it...
How it works...
There's more...
Working with min_stem_length
See also
Training a Brill tagger
How to do it...
How it works...
There's more...
Tracing
See also
Training the TnT tagger
How to do it...
How it works...
There's more...
Controlling the beam search
Significance of capitalization
See also
Using WordNet for tagging
Getting ready
How to do it...
How it works...
See also
Tagging proper names
How to do it...
How it works...
See also
Classifier-based tagging
How to do it...
How it works...
There's more...
Detecting features with a custom feature detector
Setting a cutoff probability
Using a pre-trained classifier
See also
Training a tagger with NLTK-Trainer
How to do it...
How it works...
There's more...
Saving a pickled tagger
Training on a custom corpus
Training with universal tags
Analyzing a tagger against a tagged corpus
Analyzing a tagged corpus
See also
5. Extracting Chunks
Introduction
Chunking and chinking with regular expressions
Getting ready
How to do it...
How it works...
There's more...
Parsing different chunk types
Parsing alternative patterns
Chunk rule with context
See also
Merging and splitting chunks with regular expressions
How to do it...
How it works...
There's more...
Specifying rule descriptions
See also
Expanding and removing chunks with regular expressions
How to do it...
How it works...
There's more...
See also
Partial parsing with regular expressions
How to do it...
How it works...
There's more...
The ChunkScore metrics
Looping and tracing chunk rules
See also
Training a tagger-based chunker
How to do it...
How it works...
There's more...
Using different taggers
See also
Classification-based chunking
How to do it...
How it works...
There's more...
Using a different classifier builder
See also
Extracting named entities
How to do it...
How it works...
There's more...
Binary named entity extraction
See also
Extracting proper noun chunks
How to do it...
How it works...
There's more...
See also
Extracting location chunks
How to do it...
How it works...
There's more...
See also
Training a named entity chunker
How to do it...
How it works...
There's more...
See also
Training a chunker with NLTK-Trainer
How to do it...
How it works...
There's more...
Saving a pickled chunker
Training a named entity chunker
Training on a custom corpus
Training on parse trees
Analyzing a chunker against a chunked corpus
Analyzing a chunked corpus
See also
6. Transforming Chunks and Trees
Introduction
Filtering insignificant words from a sentence
Getting ready
How to do it...
How it works...
There's more...
See also
Correcting verb forms
Getting ready
How to do it...
How it works...
See also
Swapping verb phrases
How to do it...
How it works...
There's more...
See also
Swapping noun cardinals
How to do it...
How it works...
See also
Swapping infinitive phrases
How to do it...
How it works...
There's more...
See also
Singularizing plural nouns
How to do it...
How it works...
See also
Chaining chunk transformations
How to do it...
How it works...
There's more...
See also
Converting a chunk tree to text
How to do it...
How it works...
There's more...
See also
Flattening a deep tree
Getting ready
How to do it...
How it works...
There's more...
The cess_esp and cess_cat treebank
See also
Creating a shallow tree
How to do it...
How it works...
See also
Converting tree labels
Getting ready
How to do it...
How it works...
See also
7. Text Classification
Introduction
Bag of words feature extraction
How to do it...
How it works...
There's more...
Filtering stopwords
Including significant bigrams
See also
Training a Naive Bayes classifier
Getting ready
How to do it...
How it works...
There's more...
Classification probability
Most informative features
Training estimator
Manual training
See also
Training a decision tree classifier
How to do it...
How it works...
There's more...
Controlling uncertainty with entropy_cutoff
Controlling tree depth with depth_cutoff
Controlling decisions with support_cutoff
See also
Training a maximum entropy classifier
Getting ready
How to do it...
How it works...
There's more...
Megam algorithm
See also
Training scikit-learn classifiers
Getting ready
How to do it...
How it works...
There's more...
Comparing Naive Bayes algorithms
Training with logistic regression
Training with LinearSVC
See also
Measuring precision and recall of a classifier
How to do it...
How it works...
There's more...
F-measure
See also
Calculating high information words
How to do it...
How it works...
There's more...
The MaxentClassifier class with high information words
The DecisionTreeClassifier class with high information words
The SklearnClassifier class with high information words
See also
Combining classifiers with voting
Getting ready
How to do it...
How it works...
See also
Classifying with multiple binary classifiers
Getting ready
How to do it...
How it works...
There's more...
See also
Training a classifier with NLTK-Trainer
How to do it...
How it works...
There's more...
Saving a pickled classifier
Using different training instances
The most informative features
The Maxent and LogisticRegression classifiers
SVMs
Combining classifiers
High information words and bigrams
Cross-fold validation
Analyzing a classifier
See also
8. Distributed Processing and Handling Large Datasets
Introduction
Distributed tagging with execnet
Getting ready
How to do it...
How it works...
There's more...
Creating multiple channels
Local versus remote gateways
See also
Distributed chunking with execnet
Getting ready
How to do it...
How it works...
There's more...
Python subprocesses
See also
Parallel list processing with execnet
How to do it...
How it works...
There's more...
See also
Storing a frequency distribution in Redis
Getting ready
How to do it...
How it works...
There's more...
See also
Storing a conditional frequency distribution in Redis
Getting ready
How to do it...
How it works...
There's more...
See also
Storing an ordered dictionary in Redis
Getting ready
How to do it...
How it works...
There's more...
See also
Distributed word scoring with Redis and execnet
Getting ready
How to do it...
How it works...
There's more...
See also
9. Parsing Specific Data Types
Introduction
Parsing dates and times with dateutil
Getting ready
How to do it...
How it works...
There's more...
See also
Timezone lookup and conversion
Getting ready
How to do it...
How it works...
There's more...
Local timezone
Custom offsets
See also
Extracting URLs from HTML with lxml
Getting ready
How to do it...
How it works...
There's more...
Extracting links directly
Parsing HTML from URLs or files
Extracting links with XPaths
See also
Cleaning and stripping HTML
Getting ready
How to do it...
How it works...
There's more...
See also
Converting HTML entities with BeautifulSoup
Getting ready
How to do it...
How it works...
There's more...
Extracting URLs with BeautifulSoup
See also
Detecting and converting character encodings
Getting ready
How to do it...
How it works...
There's more...
Converting to ASCII
UnicodeDammit conversion
See also
A. Penn Treebank Part-of-speech Tags
3. Module 3
1. Working with Strings
Tokenization
Tokenization of text into sentences
Tokenization of text in other languages
Tokenization of sentences into words
Tokenization using TreebankWordTokenizer
Tokenization using regular expressions
Normalization
Eliminating punctuation
Conversion into lowercase and uppercase
Dealing with stop words
Calculate stopwords in English
Substituting and correcting tokens
Replacing words using regular expressions
Example of the replacement of a text with another text
Performing substitution before tokenization
Dealing with repeating characters
Example of deleting repeating characters
Replacing a word with its synonym
Example of substituting a word with its synonym
Applying Zipf's law to text
Similarity measures
Applying similarity measures using the edit distance algorithm
Applying similarity measures using Jaccard's Coefficient
Applying similarity measures using the Smith Waterman distance
Other string similarity metrics
Summary
2. Statistical Language Modeling
Understanding word frequency
Develop MLE for a given text
Hidden Markov Model estimation
Applying smoothing on the MLE model
Add-one smoothing
Good Turing
Kneser Ney estimation
Witten Bell estimation
Develop a back-off mechanism for MLE
Applying interpolation on data to get mix and match
Evaluate a language model through perplexity
Applying metropolis hastings in modeling languages
Applying Gibbs sampling in language processing
Summary
3. Morphology – Getting Our Feet Wet
Introducing morphology
Understanding stemmer
Understanding lemmatization
Developing a stemmer for non-English language
Morphological analyzer
Morphological generator
Search engine
Summary
4. Parts-of-Speech Tagging – Identifying Words
Introducing parts-of-speech tagging
Default tagging
Creating POS-tagged corpora
Selecting a machine learning algorithm
Statistical modeling involving the n-gram approach
Developing a chunker using pos-tagged corpora
Summary
5. Parsing – Analyzing Training Data
Introducing parsing
Treebank construction
Extracting Context Free Grammar (CFG) rules from Treebank
Creating a probabilistic Context Free Grammar from CFG
CYK chart parsing algorithm
Earley chart parsing algorithm
Summary
6. Semantic Analysis – Meaning Matters
Introducing semantic analysis
Introducing NER
A NER system using Hidden Markov Model
Training NER using Machine Learning Toolkits
NER using POS tagging
Generation of the synset id from Wordnet
Disambiguating senses using Wordnet
Summary
7. Sentiment Analysis – I Am Happy
Introducing sentiment analysis
Sentiment analysis using NER
Sentiment analysis using machine learning
Evaluation of the NER system
Summary
8. Information Retrieval – Accessing Information
Introducing information retrieval
Stop word removal
Information retrieval using a vector space model
Vector space scoring and query operator interaction
Developing an IR system using latent semantic indexing
Text summarization
Question-answering system
Summary
9. Discourse Analysis – Knowing Is Believing
Introducing discourse analysis
Discourse analysis using Centering Theory
Anaphora resolution
Summary
10. Evaluation of NLP Systems – Analyzing Performance
The need for evaluation of NLP systems
Evaluation of NLP tools (POS taggers, stemmers, and morphological analyzers)
Parser evaluation using gold data
Evaluation of IR system
Metrics for error identification
Metrics based on lexical matching
Metrics based on syntactic matching
Metrics using shallow semantic matching
Summary
B. Bibliography
Index

Natural Language Processing: Python and NLTK

Natural Language Processing: Python and NLTK

Learn to build expert NLP and machine learning projects using NLTK and other Python libraries

A course in three modules

BIRMINGHAM - MUMBAI

Natural Language Processing: Python and NLTK

Copyright © 2016 Packt Publishing

All rights reserved. No part of this course may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this course to ensure the accuracy of the information presented. However, the information contained in this course is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this course.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this course by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Published on: November 2016

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78728-510-1

www.packtpub.com

Credits

Authors

Nitin Hardeniya

Jacob Perkins

Deepti Chopra

Nisheeth Joshi

Iti Mathur

Reviewers

Afroz Hussain

Sujit Pal

Kumar Raj

Patrick Chan

Mohit Goenka

Lihang Li

Maurice HT Ling

Jing (Dave) Tian

Arturo Argueta

Content Development Editor

Aishwarya Pandere

Production Coordinator

Arvindkumar Gupta

Preface

NLTK is one of the most popular and widely used libraries in the natural language processing (NLP) community. The beauty of NLTK lies in its simplicity, where most complex NLP tasks can be implemented in a few lines of code. Start off by learning how to tokenize text into component words. Explore and make use of the WordNet language dictionary. Learn how and when to stem or lemmatize words. Discover various ways to replace words and perform spelling correction. Create your own custom text corpora and corpus readers, including a MongoDB-backed corpus. Use part-of-speech taggers to annotate words with their parts of speech. Create and transform chunked phrase trees using partial parsing. Dig into feature extraction for text classification and sentiment analysis. Learn how to do parallel and distributed text processing, and to store word distributions in Redis.

This learning path will teach you all that and more, in a hands-on learn-by-doing manner. Become an expert in using NLTK for Natural Language Processing with this useful companion.
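To give a taste of how compact these tasks are in NLTK, here is a minimal sketch of my own (not an excerpt from the book), assuming NLTK plus its punkt and wordnet data packages are installed:

>>> import nltk
>>> from nltk.stem import PorterStemmer, WordNetLemmatizer
>>> nltk.word_tokenize("The cats are running quickly")
['The', 'cats', 'are', 'running', 'quickly']
>>> PorterStemmer().stem('running')         # rule-based suffix stripping
'run'
>>> WordNetLemmatizer().lemmatize('cats')   # dictionary-based normalization
'cat'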

What this learning path covers

Module 1, NLTK Essentials, talks about all the preprocessing steps required in any text mining/NLP task. In this module, we discuss tokenization, stemming, stop word removal, and other text cleansing processes in detail and how easy it is to implement these in NLTK.

Module 2, Python 3 Text Processing with NLTK 3 Cookbook, explains how to use corpus readers and create custom corpora. It also covers how to use some of the corpora that come with NLTK. It covers the chunking process, also known as partial parsing, which can identify phrases and named entities in a sentence. It also explains how to train your own custom chunker and create specific named entity recognizers.

Module 3, Mastering Natural Language Processing with Python, covers how to calculate word frequencies and perform various language modeling techniques. It also talks about the concept and application of Shallow Semantic Analysis (that is, NER) and WSD using Wordnet.

It will help you understand and apply the concepts of Information Retrieval and text summarization.

What you need for this learning path

Module 1:

We need the following software for this module:

  • Chapters 1-5: Python/Anaconda and NLTK (Free); download from https://www.python.org/, http://continuum.io/downloads, and http://www.nltk.org/; hardware specifications: Common Unix Printing System; OS required: any
  • Chapter 6: scikit-learn and gensim (Free); download from http://scikit-learn.org/stable/ and https://radimrehurek.com/gensim/; hardware specifications: Common Unix Printing System; OS required: any
  • Chapter 7: Scrapy (Free); download from http://scrapy.org/; hardware specifications: Common Unix Printing System; OS required: any
  • Chapter 8: NumPy, SciPy, pandas, and matplotlib (Free); download from http://www.numpy.org/, http://www.scipy.org/, http://pandas.pydata.org/, and http://matplotlib.org/; hardware specifications: Common Unix Printing System; OS required: any
  • Chapter 9: Twitter and Facebook Python APIs (Free); download from https://dev.twitter.com/overview/api/twitter-libraries and https://developers.facebook.com; hardware specifications: Common Unix Printing System; OS required: any

Module 2:

You will need Python 3 and the listed Python packages. For this learning path, the author used Python 3.3.5. To install the packages, you can use pip (https://pypi.python.org/pypi/pip/). The following is the list of the packages in requirements format with the version number used while writing this learning path:

NLTK>=3.0a4
pyenchant>=1.6.5
lockfile>=0.9.1
numpy>=1.8.0
scipy>=0.13.0
scikit-learn>=0.14.1
execnet>=1.1
pymongo>=2.6.3
redis>=2.8.0
lxml>=3.2.3
beautifulsoup4>=4.3.2
python-dateutil>=2.0
charade>=1.0.3

You will also need NLTK-Trainer, which is available at https://github.com/japerk/nltk-trainer.

Beyond Python, there are a couple of recipes that use MongoDB and Redis, both NoSQL databases. They can be downloaded from http://www.mongodb.org/ and http://redis.io/, respectively.

Module 3:

For all the chapters, Python 2.7 or 3.2+ is used. NLTK 3.0 must be installed on either a 32-bit or 64-bit machine. The required operating system is Windows, Mac, or Unix.

Who this learning path is for

If you are an NLP or machine learning enthusiast and an intermediate Python programmer who wants to quickly master NLTK for natural language processing, then this Learning Path will do you a lot of good. Students of linguistics and semantic/sentiment analysis professionals will find it invaluable.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this course—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail <[email protected]>, and mention the course's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a course, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt course, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this course from your account at http://www.packtpub.com. If you purchased this course elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

1. Log in or register to our website using your e-mail address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the course in the Search box.
5. Select the course for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this course from.
7. Click on Code Download.

You can also download the code files by clicking on the Code Files button on the course's webpage at the Packt Publishing website. This page can be accessed by entering the course's name in the Search box. Please note that you need to be logged in to your Packt account.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

  • WinRAR / 7-Zip for Windows
  • Zipeg / iZip / UnRarX for Mac
  • 7-Zip / PeaZip for Linux

The code bundle for the course is also hosted on GitHub at https://github.com/PacktPublishing/Natural-Language-Processing-Python-and-NLTK. We also have other code bundles from our rich catalog of books, videos and courses available at https://github.com/PacktPublishing/. Check them out!

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this course. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your course, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this course, you can contact us at <[email protected]>, and we will do our best to address the problem.

Part 1. Module 1

NLTK Essentials

Build cool NLP and machine learning applications using NLTK and other Python libraries

Chapter 1. Introduction to Natural Language Processing

I will start with an introduction to Natural Language Processing (NLP). Language is a central part of our day-to-day life, and it's fascinating to work on any problem related to languages. I hope this book will give you a flavor of NLP, motivate you to learn some amazing NLP concepts, and inspire you to work on some of the challenging NLP applications.

In my own words, the study of language processing is called NLP. People who are deeply involved in the study of language are linguists, while the term computational linguist applies to those who study the processing of languages with the application of computation. Essentially, a computational linguist is a computer scientist who has enough understanding of languages to apply computational skills to model different aspects of language. While computational linguists address the theoretical aspects of language, NLP is nothing but the application of computational linguistics.

NLP is more about applying computers to the nuances of different languages and building real-world applications using NLP techniques. In a practical context, NLP is analogous to teaching a language to a child. Some of the most common tasks, like understanding words and sentences and forming grammatically and structurally correct sentences, are very natural to humans. In NLP, some of these tasks translate to tokenization, chunking, part-of-speech tagging, parsing, machine translation, and speech recognition, and most of them remain among the toughest challenges for computers. I will be talking more about the practical side of NLP, assuming that we all have some background in NLP. The reader is expected to have a minimal understanding of a programming language and an interest in NLP and language.

By the end of this chapter, we want readers to:

  • Have a brief introduction to NLP and related concepts
  • Install Python, NLTK, and other libraries
  • Write some very basic Python and NLTK code snippets

If you have never heard the term NLP, then please take some time to read any of the books mentioned here, at least the initial few chapters. A quick reading of at least the Wikipedia page on NLP is a must:

  • Speech and Language Processing by Daniel Jurafsky and James H. Martin
  • Foundations of Statistical Natural Language Processing by Christopher D. Manning and Hinrich Schütze

Why learn NLP?

I start my discussion with Gartner's latest hype cycle, where you can clearly see NLP near the top of the cycle. Currently, NLP is one of the rarest skill sets required in the industry. After the advent of big data, the major challenge is that we need more people who are good not just with structured data, but also with semi-structured and unstructured data. We are generating petabytes of web logs, tweets, Facebook feeds, chats, e-mails, and reviews. Companies are collecting all these different kinds of data for better customer targeting and meaningful insights. To process all these unstructured data sources, we need people who understand NLP.

We are in the age of information; we can't even imagine our life without Google. We use Siri for most of our basic tasks. We use spam filters to filter spam e-mails. We need a spell checker in our Word documents. There are many examples of real-world NLP applications around us.

Image is taken from http://www.gartner.com/newsroom/id/2819918

Let me also give you some examples of the amazing NLP applications that you can use, but are not aware that they are built on NLP:

  • Spell correction (MS Word/any other editor)
  • Search engines (Google, Bing, Yahoo, Wolfram Alpha)
  • Speech engines (Siri, Google Voice)
  • Spam classifiers (all e-mail services)
  • News feeds (Google, Yahoo!, and so on)
  • Machine translation (Google Translate, and so on)
  • IBM Watson

Building these applications requires a very specific skill set: a great understanding of language and of the tools needed to process language efficiently. So it's not just hype that makes NLP one of the most niche areas; it's the kinds of applications that can be created using NLP that make it one of the most unique skills to have.

To achieve some of the above applications, and for other basic NLP preprocessing, there are many open source tools available. Some of them were developed by organizations to build their own NLP applications and later open-sourced. Here is a small list of available NLP tools:

  • GATE
  • Mallet
  • OpenNLP
  • UIMA
  • Stanford toolkit
  • Gensim
  • Natural Language Toolkit (NLTK)

Most of these tools are written in Java and offer similar functionality. Some of them are robust and provide a wide variety of NLP tools. However, when it comes to ease of use and explanation of the concepts, NLTK scores really high. NLTK is also a very good learning kit because Python (in which NLTK is written) has a very gentle learning curve. NLTK covers most NLP tasks and is elegant and easy to work with. For all these reasons, NLTK has become one of the most popular libraries in the NLP community.

I am assuming all of you know Python. If not, I urge you to learn it. There are many basic tutorials on Python available online, and there are also plenty of books that give you a quick overview of the language. We will also look into some of the features of Python while going through the different topics. But for now, even if you only know the basics of Python, such as lists, strings, regular expressions, and basic I/O, you should be good to go.
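As a quick illustration (my own sketch, not an example from the book), here are those basics in action:

>>> words = "NLP with Python".split()          # strings and lists
>>> words
['NLP', 'with', 'Python']
>>> import re
>>> re.findall(r'\w+', "Hello, NLP world!")    # regular expressions
['Hello', 'NLP', 'world']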

Note

Python can be installed from the following website:

https://www.python.org/downloads/

http://continuum.io/downloads

https://store.enthought.com/downloads/

I would recommend using the Anaconda or Canopy Python distributions. The reason is that these distributions come with bundled libraries, such as SciPy, NumPy, scikit-learn, and so on, which are used for data analysis and other applications related to NLP and related fields. Even NLTK is part of these distributions.

Note

Please follow the instructions and install NLTK and NLTK data:

http://www.nltk.org/install.html

Let's test everything.

Open the terminal on your respective operating systems. Then run:

$ python

This should open the Python interpreter:

Python 2.6.6 (r266:84292, Oct 15 2013, 07:32:41)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>

I hope you got a similar-looking output here. There is a chance that you will see a different output, but ideally you will get the latest version of Python (I recommend 2.7), the compiler GCC, and the operating system details. I know the latest version of Python is in the 3.0+ range, but, as with any other open source system, we should try to stay on a more stable version rather than jumping to the latest one. If you have moved to Python 3.0+, please have a look at the link below to see what new features have been added:

https://docs.python.org/3/whatsnew/3.4.html.

UNIX based systems will have Python as a default program. Windows users can set the path to get Python working. Let's check whether we have installed NLTK correctly:

>>> import nltk
>>> print "Python and NLTK installed successfully"
Python and NLTK installed successfully

Hey, we are good to go!

Your turn

  • Please try the same exercise with different URLs.
  • Try to recreate the word cloud for them.
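For the word cloud exercise, here is one hedged sketch using the third-party wordcloud package together with matplotlib; this is an assumption on my part (the chapter may use a different tool), and the input file name is hypothetical:

>>> from wordcloud import WordCloud
>>> import matplotlib.pyplot as plt
>>> text = open('scraped_page.txt').read()              # hypothetical input file
>>> wc = WordCloud(background_color='white').generate(text)
>>> plt.imshow(wc, interpolation='bilinear')            # render the cloud image
>>> plt.axis('off')
>>> plt.show()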

Summary

To summarize, this chapter was intended to give you a brief introduction to Natural Language Processing. The book does assume some background in NLP and programming in Python, but we have tried to give you a very quick head start with Python and NLP. We have installed all the related packages that are required to work with NLTK. We wanted to give you, with a few simple lines of code, an idea of how to use NLTK. We were able to deliver an amazing word cloud, which is a great way of visualizing the topics in a large amount of unstructured text and is quite popular in the industry for text analytics. I think the goal was to set up everything around NLTK and to get Python working smoothly on your system. You should also be able to write and run basic Python programs. I wanted the reader to feel the power of the NLTK library and build a small running example involving a basic word cloud application. If you are able to generate the word cloud, I think we were successful.

In the next few chapters, we will learn more about Python as a language and its features related to processing natural language. We will explore some of the basic NLP preprocessing steps and learn about some of the basic concepts related to NLP.

Text cleansing

Once we have parsed text from a variety of data sources, the challenge is to make sense of this raw data. Text cleansing is the term loosely used for most of the cleaning to be done on text, depending on the data source, parsing performance, external noise, and so on. In that sense, what we did in Chapter 1, Introduction to Natural Language Processing, to clean the HTML using html_clean can be labeled as text cleansing. In another case, where we are parsing a PDF, there could be unwanted noisy characters or non-ASCII characters to be removed. Before going on to the next steps, we want to remove these to get clean text to process further. With a data source like XML, we might only be interested in some specific elements of the tree; with databases, we may have to manipulate splitters; and sometimes we are only interested in specific columns. In summary, any process done with the aim of making the text cleaner and removing the noise surrounding it can be termed text cleansing. There are no clear boundaries between the terms data munging, text cleansing, and data wrangling; they can be used interchangeably in similar contexts. In the next few sections, we will talk about some of the most common preprocessing steps for any NLP task.
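As an illustrative sketch of this kind of cleanup (a helper of my own, not the book's html_clean), residual HTML tags and non-ASCII noise can be stripped with the standard re module alone:

>>> import re
>>> def clean_text(raw):
...     text = re.sub(r'<[^>]+>', ' ', raw)           # strip leftover HTML tags
...     text = re.sub(r'[^\x00-\x7F]+', ' ', text)    # drop non-ASCII noise
...     return re.sub(r'\s+', ' ', text).strip()      # collapse extra whitespace
...
>>> clean_text('<p>Hello,\xc2\xa0 NLP   world!</p>')
'Hello, NLP world!'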