Natural Language Processing: Python and NLTK - Nitin Hardeniya - E-Book

Natural Language Processing: Python and NLTK E-Book

Nitin Hardeniya

Description

Learn to build expert NLP and machine learning projects using NLTK and other Python libraries

About This Book

  • Break text down into its component parts for spelling correction, feature extraction, and phrase transformation
  • Work through NLP concepts with simple and easy-to-follow programming recipes
  • Gain insights into the current and budding research topics of NLP

Who This Book Is For

If you are an NLP or machine learning enthusiast and an intermediate Python programmer who wants to quickly master NLTK for natural language processing, then this Learning Path will do you a lot of good. Students of linguistics and semantic/sentiment analysis professionals will find it invaluable.

What You Will Learn

  • The scope of natural language complexity and how it is processed by machines
  • Clean and wrangle text using tokenization and chunking to help you process data better
  • Tokenize text into sentences and sentences into words
  • Classify text and perform sentiment analysis
  • Implement string matching algorithms and normalization techniques
  • Understand and implement the concepts of information retrieval and text summarization
  • Find out how to implement various NLP tasks in Python

In Detail

Natural Language Processing is a field of computational linguistics and artificial intelligence that deals with human-computer interaction. It enables seamless interaction between computers and human beings and gives computers the ability to understand human speech with the help of machine learning. The number of human-computer interaction instances is increasing, so it's becoming imperative that computers comprehend all major natural languages.

The first module, NLTK Essentials, introduces how to build systems around NLP, with a focus on how to create a customized tokenizer and parser from scratch. You will learn the essential concepts of NLP, gain practical insight into the open source tools and libraries available in Python, see how to analyze social media sites, and acquire tools to deal with large-scale text. This module also shows how to make use of some of the amazing capabilities of Python libraries such as NLTK, scikit-learn, pandas, and NumPy.

The second module, Python 3 Text Processing with NLTK 3 Cookbook, teaches you the essential techniques of text and language processing with simple, straightforward examples. These include organizing text corpora, creating your own custom corpus, text classification with a focus on sentiment analysis, and distributed text processing methods.

The third module, Mastering Natural Language Processing with Python, will help you become an expert and assist you in creating your own NLP projects using NLTK. You will be guided through model development with machine learning tools, shown how to create training data, and given insight into the best practices for designing and building NLP-based applications using Python.

This Learning Path combines some of the best that Packt has to offer in one complete, curated package and is designed to help you quickly learn text processing with Python and NLTK. It includes content from the following Packt products:

  • NLTK Essentials by Nitin Hardeniya
  • Python 3 Text Processing with NLTK 3 Cookbook by Jacob Perkins
  • Mastering Natural Language Processing with Python by Deepti Chopra, Nisheeth Joshi, and Iti Mathur

Style and approach

This comprehensive course creates a smooth learning path that teaches you how to get started with Natural Language Processing using Python and NLTK. You'll learn to create effective NLP and machine learning projects using Python and NLTK.


Page count: 788

Publication year: 2016




Table of Contents

Natural Language Processing: Python and NLTK
Natural Language Processing: Python and NLTK
Credits
Preface
What this learning path covers
What you need for this learning path
Who this learning path is for
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Module 1
1. Introduction to Natural Language Processing
Why learn NLP?
Let's start playing with Python!
Lists
Helping yourself
Regular expressions
Dictionaries
Writing functions
Diving into NLTK
Your turn
Summary
2. Text Wrangling and Cleansing
What is text wrangling?
Text cleansing
Sentence splitter
Tokenization
Stemming
Lemmatization
Stop word removal
Rare word removal
Spell correction
Your turn
Summary
3. Part of Speech Tagging
What is Part of speech tagging
Stanford tagger
Diving deep into a tagger
Sequential tagger
N-gram tagger
Regex tagger
Brill tagger
Machine learning based tagger
Named Entity Recognition (NER)
NER tagger
Your Turn
Summary
4. Parsing Structure in Text
Shallow versus deep parsing
The two approaches in parsing
Why we need parsing
Different types of parsers
A recursive descent parser
A shift-reduce parser
A chart parser
A regex parser
Dependency parsing
Chunking
Information extraction
Named-entity recognition (NER)
Relation extraction
Summary
5. NLP Applications
Building your first NLP application
Other NLP applications
Machine translation
Statistical machine translation
Information retrieval
Boolean retrieval
Vector space model
The probabilistic model
Speech recognition
Text classification
Information extraction
Question answering systems
Dialog systems
Word sense disambiguation
Topic modeling
Language detection
Optical character recognition
Summary
6. Text Classification
Machine learning
Text classification
Sampling
Naive Bayes
Decision trees
Stochastic gradient descent
Logistic regression
Support vector machines
The Random forest algorithm
Text clustering
K-means
Topic modeling in text
Installing gensim
References
Summary
7. Web Crawling
Web crawlers
Writing your first crawler
Data flow in Scrapy
The Scrapy shell
Items
The Sitemap spider
The item pipeline
External references
Summary
8. Using NLTK with Other Python Libraries
NumPy
ndarray
Indexing
Basic operations
Extracting data from an array
Complex matrix operations
Reshaping and stacking
Random numbers
SciPy
Linear algebra
eigenvalues and eigenvectors
The sparse matrix
Optimization
pandas
Reading data
Series data
Column transformation
Noisy data
matplotlib
Subplot
Adding an axis
A scatter plot
A bar plot
3D plots
External references
Summary
9. Social Media Mining in Python
Data collection
Twitter
Data extraction
Trending topics
Geovisualization
Influencers detection
Facebook
Influencer friends
Summary
10. Text Mining at Scale
Different ways of using Python on Hadoop
Python streaming
Hive/Pig UDF
Streaming wrappers
NLTK on Hadoop
A UDF
Python streaming
Scikit-learn on Hadoop
PySpark
Summary
2. Module 2
1. Tokenizing Text and WordNet Basics
Introduction
Tokenizing text into sentences
Getting ready
How to do it...
How it works...
There's more...
Tokenizing sentences in other languages
See also
Tokenizing sentences into words
How to do it...
How it works...
There's more...
Separating contractions
PunktWordTokenizer
WordPunctTokenizer
See also
Tokenizing sentences using regular expressions
Getting ready
How to do it...
How it works...
There's more...
Simple whitespace tokenizer
See also
Training a sentence tokenizer
Getting ready
How to do it...
How it works...
There's more...
See also
Filtering stopwords in a tokenized sentence
Getting ready
How to do it...
How it works...
There's more...
See also
Looking up Synsets for a word in WordNet
Getting ready
How to do it...
How it works...
There's more...
Working with hypernyms
Part of speech (POS)
See also
Looking up lemmas and synonyms in WordNet
How to do it...
How it works...
There's more...
All possible synonyms
Antonyms
See also
Calculating WordNet Synset similarity
How to do it...
How it works...
There's more...
Comparing verbs
Path and Leacock Chodorow (LCH) similarity
See also
Discovering word collocations
Getting ready
How to do it...
How it works...
There's more...
Scoring functions
Scoring ngrams
See also
2. Replacing and Correcting Words
Introduction
Stemming words
How to do it...
How it works...
There's more...
The LancasterStemmer class
The RegexpStemmer class
The SnowballStemmer class
See also
Lemmatizing words with WordNet
Getting ready
How to do it...
How it works...
There's more...
Combining stemming with lemmatization
See also
Replacing words matching regular expressions
Getting ready
How to do it...
How it works...
There's more...
Replacement before tokenization
See also
Removing repeating characters
Getting ready
How to do it...
How it works...
There's more...
See also
Spelling correction with Enchant
Getting ready
How to do it...
How it works...
There's more...
The en_GB dictionary
Personal word lists
See also
Replacing synonyms
Getting ready
How to do it...
How it works...
There's more...
CSV synonym replacement
YAML synonym replacement
See also
Replacing negations with antonyms
How to do it...
How it works...
There's more...
See also
3. Creating Custom Corpora
Introduction
Setting up a custom corpus
Getting ready
How to do it...
How it works...
There's more...
Loading a YAML file
See also
Creating a wordlist corpus
Getting ready
How to do it...
How it works...
There's more...
Names wordlist corpus
English words corpus
See also
Creating a part-of-speech tagged word corpus
Getting ready
How to do it...
How it works...
There's more...
Customizing the word tokenizer
Customizing the sentence tokenizer
Customizing the paragraph block reader
Customizing the tag separator
Converting tags to a universal tagset
See also
Creating a chunked phrase corpus
Getting ready
How to do it...
How it works...
There's more...
Tree leaves
Treebank chunk corpus
CoNLL2000 corpus
See also
Creating a categorized text corpus
Getting ready
How to do it...
How it works...
There's more...
Category file
Categorized tagged corpus reader
Categorized corpora
See also
Creating a categorized chunk corpus reader
Getting ready
How to do it...
How it works...
There's more...
Categorized CoNLL chunk corpus reader
See also
Lazy corpus loading
How to do it...
How it works...
There's more...
Creating a custom corpus view
How to do it...
How it works...
There's more...
Block reader functions
Pickle corpus view
Concatenated corpus view
See also
Creating a MongoDB-backed corpus reader
Getting ready
How to do it...
How it works...
There's more...
See also
Corpus editing with file locking
Getting ready
How to do it...
How it works...
4. Part-of-speech Tagging
Introduction
Default tagging
Getting ready
How to do it...
How it works...
There's more...
Evaluating accuracy
Tagging sentences
Untagging a tagged sentence
See also
Training a unigram part-of-speech tagger
How to do it...
How it works...
There's more...
Overriding the context model
Minimum frequency cutoff
See also
Combining taggers with backoff tagging
How to do it...
How it works...
There's more...
Saving and loading a trained tagger with pickle
See also
Training and combining ngram taggers
Getting ready
How to do it...
How it works...
There's more...
Quadgram tagger
See also
Creating a model of likely word tags
How to do it...
How it works...
There's more...
See also
Tagging with regular expressions
Getting ready
How to do it...
How it works...
There's more...
See also
Affix tagging
How to do it...
How it works...
There's more...
Working with min_stem_length
See also
Training a Brill tagger
How to do it...
How it works...
There's more...
Tracing
See also
Training the TnT tagger
How to do it...
How it works...
There's more...
Controlling the beam search
Significance of capitalization
See also
Using WordNet for tagging
Getting ready
How to do it...
How it works...
See also
Tagging proper names
How to do it...
How it works...
See also
Classifier-based tagging
How to do it...
How it works...
There's more...
Detecting features with a custom feature detector
Setting a cutoff probability
Using a pre-trained classifier
See also
Training a tagger with NLTK-Trainer
How to do it...
How it works...
There's more...
Saving a pickled tagger
Training on a custom corpus
Training with universal tags
Analyzing a tagger against a tagged corpus
Analyzing a tagged corpus
See also
5. Extracting Chunks
Introduction
Chunking and chinking with regular expressions
Getting ready
How to do it...
How it works...
There's more...
Parsing different chunk types
Parsing alternative patterns
Chunk rule with context
See also
Merging and splitting chunks with regular expressions
How to do it...
How it works...
There's more...
Specifying rule descriptions
See also
Expanding and removing chunks with regular expressions
How to do it...
How it works...
There's more...
See also
Partial parsing with regular expressions
How to do it...
How it works...
There's more...
The ChunkScore metrics
Looping and tracing chunk rules
See also
Training a tagger-based chunker
How to do it...
How it works...
There's more...
Using different taggers
See also
Classification-based chunking
How to do it...
How it works...
There's more...
Using a different classifier builder
See also
Extracting named entities
How to do it...
How it works...
There's more...
Binary named entity extraction
See also
Extracting proper noun chunks
How to do it...
How it works...
There's more...
See also
Extracting location chunks
How to do it...
How it works...
There's more...
See also
Training a named entity chunker
How to do it...
How it works...
There's more...
See also
Training a chunker with NLTK-Trainer
How to do it...
How it works...
There's more...
Saving a pickled chunker
Training a named entity chunker
Training on a custom corpus
Training on parse trees
Analyzing a chunker against a chunked corpus
Analyzing a chunked corpus
See also
6. Transforming Chunks and Trees
Introduction
Filtering insignificant words from a sentence
Getting ready
How to do it...
How it works...
There's more...
See also
Correcting verb forms
Getting ready
How to do it...
How it works...
See also
Swapping verb phrases
How to do it...
How it works...
There's more...
See also
Swapping noun cardinals
How to do it...
How it works...
See also
Swapping infinitive phrases
How to do it...
How it works...
There's more...
See also
Singularizing plural nouns
How to do it...
How it works...
See also
Chaining chunk transformations
How to do it...
How it works...
There's more...
See also
Converting a chunk tree to text
How to do it...
How it works...
There's more...
See also
Flattening a deep tree
Getting ready
How to do it...
How it works...
There's more...
The cess_esp and cess_cat treebank
See also
Creating a shallow tree
How to do it...
How it works...
See also
Converting tree labels
Getting ready
How to do it...
How it works...
See also
7. Text Classification
Introduction
Bag of words feature extraction
How to do it...
How it works...
There's more...
Filtering stopwords
Including significant bigrams
See also
Training a Naive Bayes classifier
Getting ready
How to do it...
How it works...
There's more...
Classification probability
Most informative features
Training estimator
Manual training
See also
Training a decision tree classifier
How to do it...
How it works...
There's more...
Controlling uncertainty with entropy_cutoff
Controlling tree depth with depth_cutoff
Controlling decisions with support_cutoff
See also
Training a maximum entropy classifier
Getting ready
How to do it...
How it works...
There's more...
Megam algorithm
See also
Training scikit-learn classifiers
Getting ready
How to do it...
How it works...
There's more...
Comparing Naive Bayes algorithms
Training with logistic regression
Training with LinearSVC
See also
Measuring precision and recall of a classifier
How to do it...
How it works...
There's more...
F-measure
See also
Calculating high information words
How to do it...
How it works...
There's more...
The MaxentClassifier class with high information words
The DecisionTreeClassifier class with high information words
The SklearnClassifier class with high information words
See also
Combining classifiers with voting
Getting ready
How to do it...
How it works...
See also
Classifying with multiple binary classifiers
Getting ready
How to do it...
How it works...
There's more...
See also
Training a classifier with NLTK-Trainer
How to do it...
How it works...
There's more...
Saving a pickled classifier
Using different training instances
The most informative features
The Maxent and LogisticRegression classifiers
SVMs
Combining classifiers
High information words and bigrams
Cross-fold validation
Analyzing a classifier
See also
8. Distributed Processing and Handling Large Datasets
Introduction
Distributed tagging with execnet
Getting ready
How to do it...
How it works...
There's more...
Creating multiple channels
Local versus remote gateways
See also
Distributed chunking with execnet
Getting ready
How to do it...
How it works...
There's more...
Python subprocesses
See also
Parallel list processing with execnet
How to do it...
How it works...
There's more...
See also
Storing a frequency distribution in Redis
Getting ready
How to do it...
How it works...
There's more...
See also
Storing a conditional frequency distribution in Redis
Getting ready
How to do it...
How it works...
There's more...
See also
Storing an ordered dictionary in Redis
Getting ready
How to do it...
How it works...
There's more...
See also
Distributed word scoring with Redis and execnet
Getting ready
How to do it...
How it works...
There's more...
See also
9. Parsing Specific Data Types
Introduction
Parsing dates and times with dateutil
Getting ready
How to do it...
How it works...
There's more...
See also
Timezone lookup and conversion
Getting ready
How to do it...
How it works...
There's more...
Local timezone
Custom offsets
See also
Extracting URLs from HTML with lxml
Getting ready
How to do it...
How it works...
There's more...
Extracting links directly
Parsing HTML from URLs or files
Extracting links with XPaths
See also
Cleaning and stripping HTML
Getting ready
How to do it...
How it works...
There's more...
See also
Converting HTML entities with BeautifulSoup
Getting ready
How to do it...
How it works...
There's more...
Extracting URLs with BeautifulSoup
See also
Detecting and converting character encodings
Getting ready
How to do it...
How it works...
There's more...
Converting to ASCII
UnicodeDammit conversion
See also
A. Penn Treebank Part-of-speech Tags
3. Module 3
1. Working with Strings
Tokenization
Tokenization of text into sentences
Tokenization of text in other languages
Tokenization of sentences into words
Tokenization using TreebankWordTokenizer
Tokenization using regular expressions
Normalization
Eliminating punctuation
Conversion into lowercase and uppercase
Dealing with stop words
Calculate stopwords in English
Substituting and correcting tokens
Replacing words using regular expressions
Example of the replacement of a text with another text
Performing substitution before tokenization
Dealing with repeating characters
Example of deleting repeating characters
Replacing a word with its synonym
Example of substituting a word with its synonym
Applying Zipf's law to text
Similarity measures
Applying similarity measures using the edit distance algorithm
Applying similarity measures using Jaccard's Coefficient
Applying similarity measures using the Smith Waterman distance
Other string similarity metrics
Summary
2. Statistical Language Modeling
Understanding word frequency
Develop MLE for a given text
Hidden Markov Model estimation
Applying smoothing on the MLE model
Add-one smoothing
Good Turing
Kneser Ney estimation
Witten Bell estimation
Develop a back-off mechanism for MLE
Applying interpolation on data to get mix and match
Evaluate a language model through perplexity
Applying metropolis hastings in modeling languages
Applying Gibbs sampling in language processing
Summary
3. Morphology – Getting Our Feet Wet
Introducing morphology
Understanding stemmer
Understanding lemmatization
Developing a stemmer for non-English language
Morphological analyzer
Morphological generator
Search engine
Summary
4. Parts-of-Speech Tagging – Identifying Words
Introducing parts-of-speech tagging
Default tagging
Creating POS-tagged corpora
Selecting a machine learning algorithm
Statistical modeling involving the n-gram approach
Developing a chunker using pos-tagged corpora
Summary
5. Parsing – Analyzing Training Data
Introducing parsing
Treebank construction
Extracting Context Free Grammar (CFG) rules from Treebank
Creating a probabilistic Context Free Grammar from CFG
CYK chart parsing algorithm
Earley chart parsing algorithm
Summary
6. Semantic Analysis – Meaning Matters
Introducing semantic analysis
Introducing NER
A NER system using Hidden Markov Model
Training NER using Machine Learning Toolkits
NER using POS tagging
Generation of the synset id from Wordnet
Disambiguating senses using Wordnet
Summary
7. Sentiment Analysis – I Am Happy
Introducing sentiment analysis
Sentiment analysis using NER
Sentiment analysis using machine learning
Evaluation of the NER system
Summary
8. Information Retrieval – Accessing Information
Introducing information retrieval
Stop word removal
Information retrieval using a vector space model
Vector space scoring and query operator interaction
Developing an IR system using latent semantic indexing
Text summarization
Question-answering system
Summary
9. Discourse Analysis – Knowing Is Believing
Introducing discourse analysis
Discourse analysis using Centering Theory
Anaphora resolution
Summary
10. Evaluation of NLP Systems – Analyzing Performance
The need for evaluation of NLP systems
Evaluation of NLP tools (POS taggers, stemmers, and morphological analyzers)
Parser evaluation using gold data
Evaluation of IR system
Metrics for error identification
Metrics based on lexical matching
Metrics based on syntactic matching
Metrics using shallow semantic matching
Summary
B. Bibliography
Index

Natural Language Processing: Python and NLTK

Natural Language Processing: Python and NLTK

Learn to build expert NLP and machine learning projects using NLTK and other Python libraries

A course in three modules

BIRMINGHAM - MUMBAI

Natural Language Processing: Python and NLTK

Copyright © 2016 Packt Publishing

All rights reserved. No part of this course may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this course to ensure the accuracy of the information presented. However, the information contained in this course is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this course.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this course by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Published on: November 2016

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78728-510-1

www.packtpub.com

Credits

Authors

Nitin Hardeniya

Jacob Perkins

Deepti Chopra

Nisheeth Joshi

Iti Mathur

Reviewers

Afroz Hussain

Sujit Pal

Kumar Raj

Patrick Chan

Mohit Goenka

Lihang Li

Maurice HT Ling

Jing (Dave) Tian

Arturo Argueta

Content Development Editor

Aishwarya Pandere

Production Coordinator

Arvindkumar Gupta

Preface

NLTK is one of the most popular and widely used libraries in the natural language processing (NLP) community. The beauty of NLTK lies in its simplicity, where most complex NLP tasks can be implemented in a few lines of code. Start off by learning how to tokenize text into component words. Explore and make use of the WordNet language dictionary. Learn how and when to stem or lemmatize words. Discover various ways to replace words and perform spelling correction. Create your own custom text corpora and corpus readers, including a MongoDB-backed corpus. Use part-of-speech taggers to annotate words with their parts of speech. Create and transform chunked phrase trees using partial parsing. Dig into feature extraction for text classification and sentiment analysis. Learn how to do parallel and distributed text processing, and to store word distributions in Redis.

This learning path will teach you all that and more, in a hands-on learn-by-doing manner. Become an expert in using NLTK for Natural Language Processing with this useful companion.
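To give a taste of how compact these tasks are in NLTK, here is a minimal sketch of my own (not an excerpt from the book), assuming NLTK plus its punkt and wordnet data packages are installed:

>>> import nltk
>>> from nltk.stem import PorterStemmer, WordNetLemmatizer
>>> nltk.word_tokenize("The cats are running quickly")
['The', 'cats', 'are', 'running', 'quickly']
>>> PorterStemmer().stem('running')         # rule-based suffix stripping
'run'
>>> WordNetLemmatizer().lemmatize('cats')   # dictionary-based normalization
'cat'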

What this learning path covers

Module 1, NLTK Essentials, talks about all the preprocessing steps required in any text mining/NLP task. In this module, we discuss tokenization, stemming, stop word removal, and other text cleansing processes in detail and how easy it is to implement these in NLTK.

Module 2, Python 3 Text Processing with NLTK 3 Cookbook, explains how to use corpus readers and create custom corpora. It also covers how to use some of the corpora that come with NLTK. It covers the chunking process, also known as partial parsing, which can identify phrases and named entities in a sentence. It also explains how to train your own custom chunker and create specific named entity recognizers.

Module 3, Mastering Natural Language Processing with Python, covers how to calculate word frequencies and perform various language modeling techniques. It also talks about the concept and application of Shallow Semantic Analysis (that is, NER) and WSD using Wordnet.

It will help you understand and apply the concepts of Information Retrieval and text summarization.

What you need for this learning path

Module 1:

We need the following software for this module:

  • Chapters 1-5: Python/Anaconda and NLTK (Free); download from https://www.python.org/, http://continuum.io/downloads, and http://www.nltk.org/; hardware specifications: Common Unix Printing System; OS required: any
  • Chapter 6: scikit-learn and gensim (Free); download from http://scikit-learn.org/stable/ and https://radimrehurek.com/gensim/; hardware specifications: Common Unix Printing System; OS required: any
  • Chapter 7: Scrapy (Free); download from http://scrapy.org/; hardware specifications: Common Unix Printing System; OS required: any
  • Chapter 8: NumPy, SciPy, pandas, and matplotlib (Free); download from http://www.numpy.org/, http://www.scipy.org/, http://pandas.pydata.org/, and http://matplotlib.org/; hardware specifications: Common Unix Printing System; OS required: any
  • Chapter 9: Twitter and Facebook Python APIs (Free); download from https://dev.twitter.com/overview/api/twitter-libraries and https://developers.facebook.com; hardware specifications: Common Unix Printing System; OS required: any

Module 2:

You will need Python 3 and the listed Python packages. For this learning path, the author used Python 3.3.5. To install the packages, you can use pip (https://pypi.python.org/pypi/pip/). The following is the list of the packages in requirements format with the version number used while writing this learning path:

NLTK>=3.0a4
pyenchant>=1.6.5
lockfile>=0.9.1
numpy>=1.8.0
scipy>=0.13.0
scikit-learn>=0.14.1
execnet>=1.1
pymongo>=2.6.3
redis>=2.8.0
lxml>=3.2.3
beautifulsoup4>=4.3.2
python-dateutil>=2.0
charade>=1.0.3

You will also need NLTK-Trainer, which is available at https://github.com/japerk/nltk-trainer.

Beyond Python, there are a couple of recipes that use MongoDB and Redis, both NoSQL databases. They can be downloaded from http://www.mongodb.org/ and http://redis.io/, respectively.

Module 3:

For all the chapters, Python 2.7 or 3.2+ is used. NLTK 3.0 must be installed on either a 32-bit or 64-bit machine. The required operating system is Windows, Mac, or Unix.

Who this learning path is for

If you are an NLP or machine learning enthusiast and an intermediate Python programmer who wants to quickly master NLTK for natural language processing, then this Learning Path will do you a lot of good. Students of linguistics and semantic/sentiment analysis professionals will find it invaluable.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this course—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail <[email protected]>, and mention the course's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a course, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt course, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this course from your account at http://www.packtpub.com. If you purchased this course elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

1. Log in or register to our website using your e-mail address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the course in the Search box.
5. Select the course for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this course from.
7. Click on Code Download.

You can also download the code files by clicking on the Code Files button on the course's webpage at the Packt Publishing website. This page can be accessed by entering the course's name in the Search box. Please note that you need to be logged in to your Packt account.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

  • WinRAR / 7-Zip for Windows
  • Zipeg / iZip / UnRarX for Mac
  • 7-Zip / PeaZip for Linux

The code bundle for the course is also hosted on GitHub at https://github.com/PacktPublishing/Natural-Language-Processing-Python-and-NLTK. We also have other code bundles from our rich catalog of books, videos and courses available at https://github.com/PacktPublishing/. Check them out!

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this course. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your course, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this course, you can contact us at <[email protected]>, and we will do our best to address the problem.

Part 1. Module 1

NLTK Essentials

Build cool NLP and machine learning applications using NLTK and other Python libraries

Chapter 1. Introduction to Natural Language Processing

I will start with an introduction to Natural Language Processing (NLP). Language is a central part of our day-to-day life, and it's fascinating to work on any problem related to languages. I hope this book will give you a flavor of NLP, motivate you to learn some amazing NLP concepts, and inspire you to work on some of the challenging NLP applications.

In my own words, the study of language processing is called NLP. People who are deeply involved in the study of language are linguists, while the term computational linguist applies to those who study the processing of languages with the application of computation. Essentially, a computational linguist is a computer scientist who has enough understanding of languages to apply computational skills to model different aspects of language. While computational linguists address the theoretical aspects of language, NLP is nothing but the application of computational linguistics.

NLP is more about applying computers to the nuances of different languages and building real-world applications using NLP techniques. In a practical context, NLP is analogous to teaching a language to a child. Some of the most common tasks, like understanding words and sentences and forming grammatically and structurally correct sentences, are very natural to humans. In NLP, some of these tasks translate to tokenization, chunking, part-of-speech tagging, parsing, machine translation, and speech recognition, and most of them remain among the toughest challenges for computers. I will be talking more about the practical side of NLP, assuming that we all have some background in NLP. The reader is expected to have a minimal understanding of a programming language and an interest in NLP and language.

By the end of this chapter, we want readers to:

  • Have a brief introduction to NLP and related concepts
  • Install Python, NLTK, and other libraries
  • Write some very basic Python and NLTK code snippets

If you have never heard the term NLP, then please take some time to read any of the books mentioned here, at least the initial few chapters. A quick reading of at least the Wikipedia page on NLP is a must:

  • Speech and Language Processing by Daniel Jurafsky and James H. Martin
  • Foundations of Statistical Natural Language Processing by Christopher D. Manning and Hinrich Schütze

Why learn NLP?

I start my discussion with Gartner's latest hype cycle, where you can clearly see NLP near the top of the cycle. Currently, NLP is one of the rarest skill sets required in the industry. After the advent of big data, the major challenge is that we need more people who are good not just with structured data, but also with semi-structured and unstructured data. We are generating petabytes of web logs, tweets, Facebook feeds, chats, e-mails, and reviews. Companies are collecting all these different kinds of data for better customer targeting and meaningful insights. To process all these unstructured data sources, we need people who understand NLP.

We are in the age of information; we can't even imagine our life without Google. We use Siri for most of our basic tasks. We use spam filters to filter spam e-mails. We need a spell checker in our Word documents. There are many examples of real-world NLP applications around us.

Image is taken from http://www.gartner.com/newsroom/id/2819918

Let me also give you some examples of the amazing NLP applications that you can use, but are not aware that they are built on NLP:

  • Spell correction (MS Word/any other editor)
  • Search engines (Google, Bing, Yahoo, Wolfram Alpha)
  • Speech engines (Siri, Google Voice)
  • Spam classifiers (all e-mail services)
  • News feeds (Google, Yahoo!, and so on)
  • Machine translation (Google Translate, and so on)
  • IBM Watson

Building these applications requires a very specific skill set: a great understanding of language and of the tools needed to process language efficiently. So it's not just hype that makes NLP one of the most niche areas; it's the kinds of applications that can be created using NLP that make it one of the most unique skills to have.

To achieve some of the above applications, and for other basic NLP preprocessing, there are many open source tools available. Some of them were developed by organizations to build their own NLP applications and later open-sourced. Here is a small list of available NLP tools:

  • GATE
  • Mallet
  • OpenNLP
  • UIMA
  • Stanford toolkit
  • Gensim
  • Natural Language Toolkit (NLTK)

Most of these tools are written in Java and offer similar functionality. Some of them are robust and provide a wide variety of NLP tools. However, when it comes to ease of use and explanation of the concepts, NLTK scores really high. NLTK is also a very good learning kit because Python (in which NLTK is written) has a very gentle learning curve. NLTK covers most NLP tasks and is elegant and easy to work with. For all these reasons, NLTK has become one of the most popular libraries in the NLP community.

I am assuming all of you know Python. If not, I urge you to learn it. There are many basic tutorials on Python available online, and there are also plenty of books that give you a quick overview of the language. We will also look into some of the features of Python while going through the different topics. But for now, even if you only know the basics of Python, such as lists, strings, regular expressions, and basic I/O, you should be good to go.
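As a quick illustration (my own sketch, not an example from the book), here are those basics in action:

>>> words = "NLP with Python".split()          # strings and lists
>>> words
['NLP', 'with', 'Python']
>>> import re
>>> re.findall(r'\w+', "Hello, NLP world!")    # regular expressions
['Hello', 'NLP', 'world']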

Note

Python can be installed from the following website:

https://www.python.org/downloads/

http://continuum.io/downloads

https://store.enthought.com/downloads/

I would recommend using the Anaconda or Canopy Python distributions. The reason is that these distributions come with bundled libraries, such as SciPy, NumPy, scikit-learn, and so on, which are used for data analysis and other applications related to NLP and related fields. Even NLTK is part of these distributions.

Note

Please follow the instructions and install NLTK and NLTK data:

http://www.nltk.org/install.html

Let's test everything.

Open the terminal on your respective operating systems. Then run:

$ python

This should open the Python interpreter:

Python 2.6.6 (r266:84292, Oct 15 2013, 07:32:41)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>

I hope you got a similar-looking output here. There is a chance that you will see a different output, but ideally you will get the latest version of Python (I recommend 2.7), the compiler GCC, and the operating system details. I know the latest version of Python is in the 3.0+ range, but, as with any other open source system, we should try to stay on a more stable version rather than jumping to the latest one. If you have moved to Python 3.0+, please have a look at the link below to see what new features have been added:

https://docs.python.org/3/whatsnew/3.4.html.

UNIX based systems will have Python as a default program. Windows users can set the path to get Python working. Let's check whether we have installed NLTK correctly:

>>> import nltk
>>> print "Python and NLTK installed successfully"
Python and NLTK installed successfully

Hey, we are good to go!

Your turn

  • Please try the same exercise with different URLs.
  • Try to recreate the word cloud for them.
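For the word cloud exercise, here is one hedged sketch using the third-party wordcloud package together with matplotlib; this is an assumption on my part (the chapter may use a different tool), and the input file name is hypothetical:

>>> from wordcloud import WordCloud
>>> import matplotlib.pyplot as plt
>>> text = open('scraped_page.txt').read()              # hypothetical input file
>>> wc = WordCloud(background_color='white').generate(text)
>>> plt.imshow(wc, interpolation='bilinear')            # render the cloud image
>>> plt.axis('off')
>>> plt.show()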

Summary

To summarize, this chapter was intended to give you a brief introduction to Natural Language Processing. The book does assume some background in NLP and programming in Python, but we have tried to give you a very quick head start with Python and NLP. We have installed all the related packages that are required to work with NLTK. We wanted to give you, with a few simple lines of code, an idea of how to use NLTK. We were able to deliver an amazing word cloud, which is a great way of visualizing the topics in a large amount of unstructured text and is quite popular in the industry for text analytics. I think the goal was to set up everything around NLTK and to get Python working smoothly on your system. You should also be able to write and run basic Python programs. I wanted the reader to feel the power of the NLTK library and build a small running example involving a basic word cloud application. If you are able to generate the word cloud, I think we were successful.

In the next few chapters, we will learn more about Python as a language and its features related to processing natural language. We will explore some of the basic NLP preprocessing steps and learn about some of the basic concepts related to NLP.

Text cleansing

Once we have parsed text from a variety of data sources, the challenge is to make sense of this raw data. Text cleansing is the term loosely used for most of the cleaning to be done on text, depending on the data source, parsing performance, external noise, and so on. In that sense, what we did in Chapter 1, Introduction to Natural Language Processing, to clean the HTML using html_clean can be labeled as text cleansing. In another case, where we are parsing a PDF, there could be unwanted noisy characters or non-ASCII characters to be removed. Before going on to the next steps, we want to remove these to get clean text to process further. With a data source like XML, we might only be interested in some specific elements of the tree; with databases, we may have to manipulate splitters; and sometimes we are only interested in specific columns. In summary, any process done with the aim of making the text cleaner and removing the noise surrounding it can be termed text cleansing. There are no clear boundaries between the terms data munging, text cleansing, and data wrangling; they can be used interchangeably in similar contexts. In the next few sections, we will talk about some of the most common preprocessing steps for any NLP task.
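As an illustrative sketch of this kind of cleanup (a helper of my own, not the book's html_clean), residual HTML tags and non-ASCII noise can be stripped with the standard re module alone:

>>> import re
>>> def clean_text(raw):
...     text = re.sub(r'<[^>]+>', ' ', raw)           # strip leftover HTML tags
...     text = re.sub(r'[^\x00-\x7F]+', ' ', text)    # drop non-ASCII noise
...     return re.sub(r'\s+', ' ', text).strip()      # collapse extra whitespace
...
>>> clean_text('<p>Hello,\xc2\xa0 NLP   world!</p>')
'Hello, NLP world!'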