Natural Language Processing and Computational Linguistics - Bhargav Srinivasa-Desikan - E-Book

Natural Language Processing and Computational Linguistics E-Book

Bhargav Srinivasa-Desikan

Description

Modern text analysis is now very accessible using Python and open source tools, so discover how you can now perform modern text analysis in this era of textual data.

This book shows you how to use natural language processing, and computational linguistics algorithms, to make inferences and gain insights about data you have. These algorithms are based on statistical machine learning and artificial intelligence techniques. The tools to work with these algorithms are available to you right now - with Python, and tools like Gensim and spaCy.

You'll start by learning about data cleaning, and then how to perform computational linguistics from first concepts. You're then ready to explore the more sophisticated areas of statistical NLP and deep learning using Python, with realistic language and text samples. You'll learn to tag, parse, and model text using the best tools. You'll gain hands-on knowledge of the best frameworks to use, and you'll know when to choose a tool like Gensim for topic models, and when to work with Keras for deep learning.

This book balances theory and practical hands-on examples, so you can learn about and conduct your own natural language processing projects and computational linguistics. You'll discover the rich ecosystem of Python tools you have available to conduct NLP - and enter the interesting world of modern text analysis.

You can read the e-book in Legimi apps or in any app that supports the following formats:

EPUB
MOBI

Page count: 400

Publication year: 2018


Natural Language Processing and Computational Linguistics
A practical guide to text analysis with Python, Gensim, spaCy, and Keras
Bhargav Srinivasa-Desikan
BIRMINGHAM - MUMBAI

Natural Language Processing and Computational Linguistics

Copyright © 2018 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Acquisition Editors: Frank Pohlmann, Suresh Jain
Project Editor: Suzanne Coutinho
Content Development Editor: Alex Sorentinho
Technical Editor: Gaurav Gavas
Proofreader: Tom Jacob
Indexer: Tejal Daruwale Soni
Graphics: Tom Scaria
Production Coordinator: Sandip Tadge

First published: June 2018

Production reference: 2270718

Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK.

ISBN 978-1-78883-853-5

www.packtpub.com

mapt.io

Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry leading tools to help you plan your personal development and advance your career. For more information, please visit our website.

Why subscribe?

Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals

Improve your learning with Skill Plans built especially for you

Get a free eBook or video every month

Mapt is fully searchable

Copy and paste, print, and bookmark content

PacktPub.com

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

Contributors

About the author

Bhargav Srinivasa-Desikan is a research engineer working for INRIA in Lille, France. He is part of the MODAL (Models of Data Analysis and Learning) team, and he works on metric learning, predictor aggregation, and data visualization. He is a regular contributor to the Python open source community, and he completed Google Summer of Code in 2016 with Gensim where he implemented Dynamic Topic Models. Bhargav is a regular speaker at PyCons and PyDatas across Europe and Asia, and conducts tutorials on text analysis using Python. He is the maintainer of the Python machine learning package pycobra, and has published in the Journal of Machine Learning Research.

I would like to thank the Python community for all their help, and for building such incredible packages for text analysis. I would also like to thank Lev Konstantinovskiy for introducing me to the world of open source scientific computing and Dr. Benjamin Guedj for always helping me with writing technical articles and material. I would also like to thank my parents, brother and friends for their constant support throughout the process of writing the book.

About the reviewers

Brian Sacash is a data scientist and Python developer in the Washington, DC area. He helps various organizations discover the best ways to extract value from data. His interests are in the areas of Natural Language Processing, Machine Learning, Big Data, and Statistical Methods. Brian holds a Master of Science in Quantitative Analysis from the University of Cincinnati and a Bachelor of Science in Physics from the Ohio Northern University.

Reddy Anil Kumar is a data scientist working at Imaginea technologies Inc. He has over 4 years of experience in the field of data science which includes 2 years of freelance experience. He is experienced in implementing Artificial Intelligence solutions in various domains using Machine Learning / Deep Learning, Natural Language Processing, and Big Data Analytics. In his free time, he loves to participate in data science competitions and he is also a Kaggle expert.

Packt is searching for authors like you

If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.

Table of Contents

Title Page

Copyright and Credits

Natural Language Processing and Computational Linguistics

Packt Upsell

Why subscribe?

PacktPub.com

Contributors

About the author

About the reviewers

Packt is searching for authors like you

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Download the color images

Conventions used

Get in touch

Reviews

What is Text Analysis?

What is text analysis?

Where's the data at?

Garbage in, garbage out

Why should you do text analysis?

Summary

References

Python Tips for Text Analysis

Why Python?

Text manipulation in Python

Summary

References

spaCy's Language Models

spaCy

Installation

Troubleshooting

Language models

Installing language models

Installation – how and why?

Basic preprocessing with language models

Tokenizing text

Part-of-speech (POS) – tagging

Named entity recognition

Rule-based matching

Preprocessing

Summary

References

Gensim – Vectorizing Text and Transformations and n-grams

Introducing Gensim

Vectors and why we need them

Bag-of-words

TF-IDF

Other representations

Vector transformations in Gensim

n-grams and some more preprocessing

Summary

References

POS-Tagging and Its Applications

What is POS-tagging?

POS-tagging in Python

POS-tagging with spaCy

Training our own POS-taggers

POS-tagging code examples

Summary

References

NER-Tagging and Its Applications

What is NER-tagging?

NER-tagging in Python

NER-tagging with spaCy

Training our own NER-taggers

NER-tagging examples and visualization

Summary

References

Dependency Parsing

Dependency parsing

Dependency parsing in Python

Dependency parsing with spaCy

Training our dependency parsers

Summary

References

Topic Models

What are topic models?

Topic models in Gensim

Latent Dirichlet allocation

Latent semantic indexing

Hierarchical Dirichlet process

Dynamic topic models

Topic models in scikit-learn

Summary

References

Advanced Topic Modeling

Advanced training tips

Exploring documents

Topic coherence and evaluating topic models

Visualizing topic models

Summary

References

Clustering and Classifying Text

Clustering text

Starting clustering

K-means

Hierarchical clustering

Classifying text

Summary

References

Similarity Queries and Summarization

Similarity metrics

Similarity queries

Summarizing text

Summary

References

Word2Vec, Doc2Vec, and Gensim

Word2Vec

Using Word2Vec with Gensim

Doc2Vec

Other word embeddings

GloVe

FastText

WordRank

Varembed

Poincare

Summary

References

Deep Learning for Text

Deep learning

Deep learning for text (and more)

Generating text

Summary

References

Keras and spaCy for Deep Learning

Keras and spaCy

Classification with Keras

Classification with spaCy

Summary

References

Sentiment Analysis and ChatBots

Sentiment analysis

Reddit for mining data

Twitter for mining data

ChatBots

Summary

References

Other Books You May Enjoy

Leave a review - let other readers know what you think

Preface

Modern text analysis is now very accessible using Python and open source tools, so discover how you can now perform modern text analysis in this era of textual data.

This book shows you how to use natural language processing, and computational linguistics algorithms, to make inferences and gain insights about data you have. These algorithms are based on statistical machine learning and artificial intelligence techniques. The tools to work with these algorithms are available to you right now - with Python, and tools like Gensim and spaCy.

You'll start by learning about data cleaning, and then how to perform computational linguistics from first concepts. You're then ready to explore the more sophisticated areas of statistical NLP and deep learning using Python, using realistic language and text samples. You'll learn to tag, parse, and model text using the best tools. You'll gain hands-on knowledge of the best frameworks to use, and you'll know when to choose a tool like Gensim for topic models, and when to work with Keras for deep learning.

This book balances theory and practical hands-on examples, so you can learn about and conduct your own natural language processing projects and computational linguistics. You'll discover the rich ecosystem of Python tools you have available to conduct NLP - and enter the interesting world of modern text analysis.

Who this book is for

Fluency in Python is assumed, but the book attempts to be accessible to even Python beginners. Basic statistics is helpful. Given that this book introduces Natural Language Processing from first principles, it helps, although it is not a requirement, to be familiar with basic linguistics.

What this book covers

Chapter 1, What is Text Analysis? There is no time like now to do text analysis - we have an abundance of easily available data, powerful and free open source tools to conduct our analysis, and research on Machine Learning, Computational Linguistics, and computing with text is progressing at a pace we have not seen before. In this chapter, we will go into details about what exactly text analysis is, and the motivations for studying and understanding text analysis.

Chapter 2, Python Tips for Text Analysis. We mentioned in Chapter 1, What is Text Analysis, that we will be using Python throughout the book because it is an easy-to-use and powerful language. In this chapter, we will substantiate these claims, while also providing a revision course in basic Python for text analysis. Why is this important? While we expect readers of the book to have a background in Python and high-school math, it is still possible that it has been a while since you’ve written Python code - and even if you have, Python code you write during text analysis and string manipulation is quite different from, say, building a website using the web framework Django.

Chapter 3, spaCy’s Language Models. While we introduced text analysis in the previous chapter, we did not discuss any of the technical details behind building a text analysis pipeline. In this chapter, we will introduce you to spaCy’s language models - these will serve as the first step in text analysis, and are the first building block in our pipelines. Also, we will introduce the reader to spaCy and how we can use spaCy to help us in our text analysis tasks, as well as talk about some of its more powerful functionalities, such as POS-tagging and NER-tagging. We will finish up with an example of how we can preprocess data quickly and efficiently using spaCy.

Chapter 4, Gensim – Vectorizing Text and Transformations and n-grams. While we have worked with raw textual data so far, Machine Learning and information retrieval algorithms will not accept data like this - which is why we use mathematical constructs called vectors to help the algorithms make sense of the text. We will introduce gensim as the tool to conduct this transformation, as well as scikit-learn, which will be used before we plug the text into any sort of further analysis. A large part of preprocessing carries over into vectorization - building bi-grams, tri-grams, and n-grams, as well as using term frequencies to get rid of words which we deem not to be useful.

Chapter 5, POS-Tagging and Its Applications. Chapters 1 and 2 introduced text analysis and Python, and chapters 3 and 4 helped us set up our code for more advanced text analysis. This chapter discusses the first of these advanced techniques - part-of-speech tagging, popularly called POS-tagging. We will study what parts of speech exist, how to identify them in our documents, and what possible uses these POS-tags have.

Chapter 6, NER-Tagging and Its Applications. In the previous chapter, we saw how we can use spaCy’s language pipeline - POS-tagging is a very powerful tool, and we will now explore another interesting usage, NER-tagging. We will discuss what exactly this is from both a linguistic and a text analysis point of view, as well as detailing examples of its usage, and how to train our own NER-tagger with spaCy.

Chapter 7, Dependency Parsing. We saw in Chapters 5 and 6 how spaCy’s language pipeline performs a variety of complex Computational Linguistics algorithms, such as POS-tagging and NER-tagging. This isn’t all spaCy packs though, and in this chapter we will explore the power of dependency parsing and how it can be used in a variety of contexts and applications. We will have a look at the theory of dependency parsing before moving on to using it with spaCy, as well as training our own dependency parsers.

Chapter 8, Topic Models. Until now, we dealt with Computational Linguistics algorithms and spaCy, and understood how to use these computational linguistic algorithms to annotate our data, as well as understand sentence structure. While these algorithms helped us understand the finer details of our text, we still didn’t get a big picture of our data - what kind of words appear more often than others in our corpus? Can we group our data or find underlying themes? We will be attempting to answer these questions and more in this chapter.

Chapter 9, Advanced Topic Modeling. We saw in the previous chapter the power of topic modeling, and how intuitive a way it can be to understand our data, as well as explore it. In this chapter, we will further explore the utility of these topic models, and also how to create more useful topic models which better encapsulate the topics that may be present in a corpus. Since topic modeling is a way to understand the documents of a corpus, it also means that we can analyze documents in ways we have not done before.

Chapter 10, Clustering and Classifying Text. In the previous chapter we studied topic models and how they can help us in organizing and better understanding our documents and their sub-structure. We will now move on to our next set of Machine Learning algorithms, for two particular tasks - clustering and classification. We will learn the intuitive reasoning behind these two tasks, as well as how to perform them using the popular Python Machine Learning library, scikit-learn.

Chapter 11, Similarity Queries and Summarization. Once we have begun to represent text documents in the form of vector representations, it is possible to start finding the similarity or distance between documents - and that is exactly what we will learn about in this chapter. We are now aware of a variety of different vector representations, from standard bag-of-words or TF-IDF to topic model representations of text documents. We will also learn about a very useful feature implemented in gensim and how to use it - summarization and keyword extraction.

Chapter 12, Word2Vec, Doc2Vec and Gensim. We previously talked about vectors a lot throughout the book - they are used to understand and represent our textual data in a mathematical form, and all the Machine Learning methods we use rely on these representations. We will be taking this one step further, and using Machine Learning techniques to generate vector representations of words which better encapsulate the meaning of a word. This technique is generally referred to as word embeddings, and Word2Vec and Doc2Vec are two popular variations of these.

Chapter 13, Deep Learning for Text. Until now, we have explored the usage of Machine Learning for text in a variety of contexts - topic modelling, clustering, classification, text summarisation, and even our POS-taggers and NER-taggers were trained using Machine Learning. In this chapter, we will begin to explore one of the most cutting-edge forms of Machine Learning - Deep Learning. Deep Learning is a form of ML where we use biologically inspired structures to generate algorithms and architectures to perform various tasks on text. Some of these tasks are text generation, classification, and word embeddings. In this chapter, we will discuss some of the underpinnings of Deep Learning as well as how to implement our own Deep Learning models for text.

Chapter 14, Keras and spaCy for Deep Learning. In the previous chapter, we introduced Deep Learning techniques for text, and to get a taste of using Neural Networks, we attempted to generate text using an RNN. In this chapter, we will take a closer look at Deep Learning for text, and in particular, how to set up a Keras model which can perform classification, as well as how to incorporate Deep Learning into spaCy pipelines.

Chapter 15, Sentiment Analysis and ChatBots. By now, we are equipped with the skills needed to get started on text analysis projects, and to also take a shot at more complicated, meatier projects. Two common text analysis projects which encapsulate a lot of the concepts we have explored throughout the book are sentiment analysis and chatbots. In fact, we’ve already touched upon all the methods we will be using for these projects, and this chapter will serve as a guide to how one can put up such an application on their own. In this chapter, we will not be providing the code to build a chatbot or sentiment analysis pipeline from the first step to the last, but will rather introduce the reader to a variety of techniques that will help when setting up such a project.

To get the most out of this book

Follow the listed steps and commands to prepare the system environment:

Python:

Many operating systems come with Python preinstalled, including Ubuntu 14.04 onwards and macOS; on Windows, it may need to be installed separately.

If it is not already installed, please follow the official wiki documentation: https://wiki.python.org/moin/BeginnersGuide/Download.

This is a good time to start migrating all of the code to Python 3.6 (http://python3statement.org/). By 2020, a lot of scientific computing packages (such as NumPy) will be dropping support for Python 2.

spaCy:

pip install spacy

Gensim:

pip install gensim

Keras:

pip install keras

scikit-learn:

pip install scikit-learn
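Once the commands above have run, a quick check can confirm that the packages are importable. The following is a minimal sketch rather than part of the book's code bundle: the helper name check_packages is hypothetical, and it relies only on the standard library, so it runs even when some packages are missing.

```python
from importlib.util import find_spec

# Import names for the packages installed above; note that scikit-learn
# is imported as "sklearn", not "scikit-learn".
REQUIRED = {
    "spaCy": "spacy",
    "Gensim": "gensim",
    "Keras": "keras",
    "scikit-learn": "sklearn",
}


def check_packages(required=REQUIRED):
    """Return a dict mapping each package name to True if it is importable."""
    return {name: find_spec(module) is not None for name, module in required.items()}


if __name__ == "__main__":
    for name, available in check_packages().items():
        print(f"{name}: {'found' if available else 'missing'}")
```

Any package reported as missing can be installed again with the corresponding pip command.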

Download the example code files

You can download the example code files for this book from your account at www.packtpub.com. If you purchased this book elsewhere, you can visit www.packtpub.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

Log in or register at www.packtpub.com.

Select the SUPPORT tab.

Click on Code Downloads & Errata.

Enter the name of the book in the Search box and follow the onscreen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR/7-Zip for Windows

Zipeg/iZip/UnRarX for Mac

7-Zip/PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Natural-Language-Processing-and-Computational-Linguistics. The code and the PDF versions of all Jupyter notebooks are hosted at https://github.com/PacktPublishing/Natural-Language-Processing-and-Computational-Linguistics/tree/master/notebooks. In case there's an update to the code, it will be updated on the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://www.packtpub.com/sites/default/files/downloads/NaturalLanguageProcessingandComputationalLinguistics_ColorImages.pdf.

Get in touch

Feedback from our readers is always welcome.

General feedback: Email [email protected] and mention the book title in the subject of your message. If you have questions about any aspect of this book, please email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details.

Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packtpub.com.

What is Text Analysis?

There is no time like now to do text analysis – we have an abundance of easily available data, powerful and free open source tools to conduct our analysis, and research on machine learning, computational linguistics and computing with text is progressing at a pace we have not seen before.

In this chapter, we will go into details about what exactly text analysis is and look at the motivations for studying and understanding text analysis. Following are the topics we will cover in this chapter:

What is text analysis?

Where's the data at?

Garbage in, garbage out

Why should YOU be interested?

References

A note about the references: they will appear throughout the PDF version of the book as links, and if it is an academic reference it will link to the PDF of the reference or the journal page. All of these links and references are then displayed as the final section of the chapter, so offline readers can also visit the websites or research papers.

What is text analysis?

If there's one medium we are exposed to every single day, it's text. Whether it's our morning paper or the messages we receive, it's likely that we receive our information in the form of text.

Let's put things into a little more perspective – consider the amount of text data handled by companies such as Google (1+ trillion queries per year), Twitter (1.6 billion queries per day), and WhatsApp (30+ billion messages per day). That's an incredible resource, and the sheer ubiquity of text is reason enough for us to take it seriously. Textual data also has huge business value, and companies can use this data to help profile customers and understand customer trends. This can be used either to offer a more personalized experience for users or as information for targeted marketing. Facebook, for example, uses textual data heavily, and one of the algorithms we will learn later in this book was developed by Facebook's AI research team.

Fig 1.1 Rate of data growth from 2006 – 2018 with predicted rates of data in 2019 and 2020. Source: Patrick Cheeseman, https://www.eetimes.com/author.asp?section_id=36&doc_id=1330462

Text analysis can be understood as the technique of gleaning useful information from text. This can be done through various techniques, and we use Natural Language Processing (NLP), Computational Linguistics (CL), and numerical tools to get this information. These numerical tools are machine learning algorithms or information retrieval algorithms. We'll briefly, informally explain these terms as they will be coming up throughout the book.

Natural language processing (NLP) refers to the use of a computer to process natural language. Removing all occurrences of the word thereby from a body of text is one example, albeit a basic one.

Computational linguistics (CL), as the name suggests, is the study of linguistics from a computational perspective. This means using computers and algorithms to perform linguistics tasks such as marking your text as a part of speech (such as noun or verb), instead of performing this task manually.

Machine Learning (ML) is the field of study where we use statistical algorithms to teach machines to perform a particular task. This learning occurs with data, and our task is often to predict a new value based on previously observed data.

Information Retrieval (IR) is the task of looking up or retrieving information based on a query by the user. The algorithms that aid in performing this task are called information retrieval algorithms, and we will be encountering them throughout the book.
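To make the basic NLP example above concrete, here is a minimal sketch of removing every occurrence of the word thereby from a piece of text. The function name remove_word and the sample sentence are made up for illustration; the code uses only Python's built-in re module.

```python
import re


def remove_word(text, word):
    """Remove every whole-word occurrence of `word`, ignoring case."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", flags=re.IGNORECASE)
    without_word = pattern.sub("", text)
    # Collapse the double spaces left behind by the removal.
    return re.sub(r"\s{2,}", " ", without_word).strip()


sample = "The contract was signed, thereby ending the dispute."
print(remove_word(sample, "thereby"))
# prints: The contract was signed, ending the dispute.
```

The word-boundary markers (\b) ensure that only the standalone word is removed, not longer words that happen to contain it.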

Text analysis itself has been around for a long time – one of the first definitions of Business Intelligence (BI) itself, in an October 1958 IBM Journal article by H. P. Luhn, A Business Intelligence System [1], describes a system that will do the following:

"...utilize data-processing machines for auto-abstracting and auto-encoding of documents and for creating interest profiles for each of the 'action points' in an organization. Both incoming and internally generated documents are automatically abstracted, characterized by a word pattern, and sent automatically to appropriate action points."

It's interesting to see talk about documents, instead of numbers – that the first ideas of business intelligence were about understanding text and documents is a testament to the long history of text analysis. But even outside the realm of text analysis for business, using computers to better understand text and language has been around since the beginning of ideas of artificial intelligence. The 1999 review on text analysis by John Hutchins, Retrospect and prospect in computer-based translation [2], talks about efforts to do machine translation as early as the 1950s by the United States military, in order to translate Russian scientific journals into English.

Efforts to make an intelligent machine started with text as well – the ELIZA program developed in 1966 at MIT by Joseph Weizenbaum is one example. Even though the program had no real understanding of language, by basic pattern matching it could attempt to hold a conversation. These are just some of the earliest attempts to analyze text – computers (and human beings!) have come a long way since, and we now have incredible tools at our disposal.
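The idea of holding a conversation through basic pattern matching can be illustrated with a short sketch. The rules below are hypothetical and written only for this example; they are not taken from Weizenbaum's original ELIZA script, which was considerably more elaborate.

```python
import re

# Hypothetical rules for illustration only; these are not taken from
# Weizenbaum's original ELIZA script.
RULES = [
    (re.compile(r"\bI need (.+)", re.IGNORECASE), "Why do you need {0}?"),
    (re.compile(r"\bI am (.+)", re.IGNORECASE), "How long have you been {0}?"),
    (re.compile(r"\bbecause\b", re.IGNORECASE), "Is that the real reason?"),
]
DEFAULT_REPLY = "Please tell me more."


def respond(message):
    """Return the reply for the first rule whose pattern matches the message."""
    for pattern, template in RULES:
        match = pattern.search(message)
        if match:
            # Strip trailing punctuation from captured text before reusing it.
            captured = [group.rstrip(".!?") for group in match.groups()]
            return template.format(*captured)
    return DEFAULT_REPLY


print(respond("I need a holiday."))     # prints: Why do you need a holiday?
print(respond("The weather is nice."))  # prints: Please tell me more.
```

The program reuses fragments of the user's own message without any understanding of their meaning, which is why such a conversation only appears intelligent.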

Machine translation itself has come a long way, and we can now use our smartphones to effectively translate between languages, and with cutting-edge techniques such as Google's Neural Machine Translation, the gap between academia and industry is reducing – allowing us to actually experience the magic of natural language processing first hand.

Fig 1.2 An example of a Neural Translation model, working on French to English

Advances in this subject have helped advance the way we approach speech as well – closed captioning in videos and personal assistants such as Apple's Siri or Amazon's Alexa benefit greatly from superior text processing. Understanding structure in conversations and extracting information were key problems in early NLP, and the fruits of that research are very apparent in the 21st century.

Search engines such as Google or Bing also stand on the shoulders of the research done in NLP and CL and affect our lives in an unprecedented way. Information retrieval (IR) builds on statistical approaches in text processing and allows us to classify, cluster, and retrieve documents. Methods such as topic modeling can help us identify key topics in large, unstructured bodies of text. Identifying these topics goes beyond searching for keywords, and we use statistical models to further understand the underlying nature of bodies of text. Without the power of computers, we could not perform this kind of large-scale statistical analysis on text. We will be exploring topic modeling in detail later on in the book.

Fig 1.3 Techniques such as topic modeling use probabilistic modeling methods to identify key topics from the text. We will be studying this in detail later in the book
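As a rough illustration of the statistical approach behind document retrieval, the following sketch ranks a toy collection against a query using TF-IDF weights. It is a simplified, standard-library-only example under stated assumptions: the documents, the whitespace tokenizer, and the function names are made up here, and real search engines use far more sophisticated methods.

```python
import math
from collections import Counter

# A made-up toy collection, used only for illustration.
DOCUMENTS = [
    "the cat sat on the mat",
    "dogs and cats make good pets",
    "stock markets fell sharply today",
]


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def rank(query, documents):
    """Return document indices sorted by descending TF-IDF score for the query."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term in counts:
                tf = counts[term] / len(tokens)
                idf = math.log(n_docs / doc_freq[term])
                score += tf * idf
        scores.append(score)
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)


print(rank("stock markets", DOCUMENTS))  # prints: [2, 0, 1]
```

Terms that appear in few documents receive a higher weight, so a document matching rare query terms is ranked above documents that share only common words with the query.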

Going one step beyond simply experiencing the wonders of modern computing on our mobile phones, recent developments in both Python and NLP mean that we can now develop such systems on our own!

Not only have the techniques used in NLP and text analysis evolved, they have also become very accessible to us – open source packages are becoming state-of-the-art, performing as well as commercial tools. An example of a commercial tool would be Microsoft's Text Analytics API (https://azure.microsoft.com/en-us/services/cognitive-services/text-analytics/).

MATLAB is another example of a popular commercial tool used for scientific computing. While historically such commercial tools performed better than free, open source software, an increase in people contributing to open source libraries, as well as funding from industry has helped the open source community immensely. Now, the tables appear to have turned and many software giants use open source packages for their internal systems – such as Google using TensorFlow and Apple using scikit-learn! TensorFlow and scikit-learn are two open source Python machine learning packages.

It can be argued that the sheer number of packages offered by the Python ecosystem means it leads the pack when it comes to doing text analysis, and we will focus our efforts here. A very strong and active open source community adds to the appeal.

Throughout the course of the book, we will discuss modern natural language processing and computational linguistics techniques and the best open source tools available to us which we can use to apply these techniques.

Where's the data at?

While it is important to be aware of the techniques and the tools involved in NLP and CL, it is, of course, pointless without any data. Luckily for us, we have access to an abundance of data if we look in the right places. The easiest way to find textual data to work on is to look for a corpus.

A text corpus is a large and structured set of texts and is a great way to start off with text analysis. Examples of such corpora that are free are the Open American National Corpus [5] or the British National Corpus [6]. Wikipedia has a useful list of the largest corpora available in its article on text corpora [7]. These are not limited to the English language: there also exist various corpora in European and Asian languages, and there are constant efforts worldwide to create corpora for the majority of languages. University research labs are another valuable source for obtaining corpora – indeed, one of the most iconic English language corpora, the Brown Corpus, was put together at Brown University.

Different corpora tend to have varying levels of information present, usually dependent on the primary purpose of that corpus – for example, corpora whose primary function is to aid during translation would have the same sentence present in multiple languages. Another way corpora have extra information is through annotation. Examples of annotation in text usually include Part-Of-Speech (POS) tagging or Named-Entity-Recognition (NER). POS-tagging refers to marking each word in a sentence with its part of speech (noun, verb, adverb, and so on), and a corpus annotated for NER would have all named entities recognized, such as places, people, and times. We'll be going further into the details of both POS-tagging and NER later on in the book, in Chapter 5, POS-Tagging and its Applications and Chapter 6, NER-Tagging and its Applications.

A corpus serves a different purpose depending on its structure and the level of information present in it. Some corpora are also built to evaluate clustering or classification tasks, where rather than annotation being important, the label or class would be. This means that some corpora are designed to aid with machine learning tasks such as clustering or classification by providing text with labels tagged by humans. Clustering refers to the task of grouping similar objects together, and classification is the process of deciding which predefined class an object belongs to. Identifying what exactly your dataset is going to be used for is a crucial part of text analysis and an important first step.

Apart from downloading datasets or scraping data off the internet, there are still some rich sources for gathering our textual data – in particular, literature. One example of this is the research done at the University of Pennsylvania, where Alejandro Ribeiro, Santiago Segarra, Mark Eisen, and Gabriel Egan discovered possible collaborators of Shakespeare, a literary history problem that had stumped many researchers [14]. They approached the problem by identifying literary styles – an upcoming field of study in computational linguistics called style analysis.

The increased use of computational tools to perform research in the humanities has also led to the growth of Digital Humanities labs in universities, where traditional research approaches are either aided or overtaken by computer science, and in particular machine learning and, by extension, natural language processing. Speeches of politicians, or proceedings in parliament, are another example of a data source used often in this community. TheyWorkForYou [17] is a UK parliament tracking system which collects speeches and uploads them, and is an example of the many sites doing this kind of work.

Project Gutenberg is likely the best resource to download books and contains over 50,000 free eBooks and many literary classics. Personal PDFs and eBooks also remain a resource, but again, it is important to know the legal nature of your text before analyzing it. Downloading a pirated copy of, say, Harry Potter off the internet and publishing text analysis results might not be the best idea if you cannot explain where you got the text from! Similarly, text analysis on private text messages might not only annoy your friends but also could be infringing on privacy laws.

Fig 1.4 An example of a text dataset list – here, it is of reviews datasets found on

So where else, apart from downloading a structured dataset straight off the internet, do we get our textual data? Well, the internet, of course. Even if it isn't labelled, the sheer amount of text on the internet means that we can access large parts of it – the [7] is one such example, and the media dump of all the content on Wikipedia, after unzipping, is about 58 GB (as of April 2018) – more than enough text to play around with. The popular news aggregation website reddit.com [9] allows for easy web-scraping and is another great resource for text analysis.

Python again remains a great choice for any such web-scraping, and libraries such as BeautifulSoup [10], urllib [11] and scrapy [12] are well suited to this. It is important to remain careful about the legal side of things here, and to check the terms and conditions of the website you are scraping data from – a number of websites will not allow you to use the information on the website for commercial purposes.
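As a rough sketch of what the scraping step involves, the snippet below pulls the visible text out of an HTML page using only Python's standard library (html.parser). The page contents are a made-up inline string so that the example runs offline; in practice the HTML would be fetched first (for instance with urllib.request.urlopen), and a library such as BeautifulSoup would cope far better with messy real-world markup. The names TextExtractor, extract_text and sample_html are illustrative assumptions and do not come from any particular website or library.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the visible text of an HTML string as one space-joined string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page contents, inlined so the example needs no network access.
sample_html = """
<html><head><title>Example</title><style>p { color: red; }</style></head>
<body><h1>Text analysis</h1><p>Scraping gives us <b>raw</b> text.</p>
<script>console.log("ignored");</script></body></html>
"""

page_text = extract_text(sample_html)
print(page_text)
```

The result is a plain string with the markup, styling and script code stripped away, which is the form the later preprocessing steps expect.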

Twitter is another website that is fast becoming a very important part of text analysis – you even have academia taking this resource very seriously (What is Twitter, a social network or a news media? [13] has over 5000 citations!), with multiple papers being written on text analysis of tweets, and even full-fledged tools [15] to do sentiment analysis have been built! The Twitter-streaming API allows us to easily mine for textual data from Twitter as well, and the Python interface [16] is straightforward. Most world leaders are users of Twitter, as well as celebrities and major news corporations – there are a lot of interesting insights Twitter can offer us.

Fig 1.5 An example of the rich text resource Twitter has become, with multiple structured datasets available [7]. These datasets, all mined from Twitter, are each built for a particular task and fall under the category of labeled datasets which we discussed before.

Other examples of textual information you can get off the internet include research articles, medical reports, restaurant reviews (the Yelp! dataset comes to mind), and other social media websites. Sentiment analysis is usually the prime objective in these cases. As the name suggests, sentiment analysis refers to the task of identifying sentiment in text. These sentiments can be basic, such as positive or negative sentiment, but we could have more complex sentiment analysis tasks where we analyze whether a sentence contains happy, sad, or angry sentiments.
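To make the idea of basic positive/negative sentiment concrete, here is a deliberately simple sketch that counts hits against two tiny word lists. The word lists and the example reviews are made up for illustration, and this is not how production sentiment tools work: it ignores negation ("not good"), sarcasm and context entirely.

```python
import re

# Tiny, made-up sentiment lexicon for illustration only.
POSITIVE = {"good", "great", "delicious", "friendly", "love"}
NEGATIVE = {"bad", "terrible", "slow", "rude", "hate"}


def sentiment(text):
    """Return 'positive', 'negative' or 'neutral' by counting lexicon hits."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Hypothetical restaurant reviews.
reviews = [
    "The food was delicious and the staff were friendly.",
    "Terrible service, and the kitchen was slow.",
    "We ate there on Tuesday.",
]
labels = [sentiment(r) for r in reviews]
print(labels)
```

A more complex task, such as telling happy from sad or angry text, would need more than two word lists, and in practice is usually handled with a trained classifier rather than hand-written rules.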

It's clear that if we look hard enough, it's more than easy to find data to play around with. But let's take a small step back from downloading data off the internet – where else can we try and find information?

Right in our hands, as it may seem – we send and receive text messages and emails every day, and we can use this text for text analysis. Most text messaging applications have interfaces to download chats. WhatsApp, for example, will mail the data to you [18], with both media and text. Most mail clients have the same option, and the advantage in both these cases is that this kind of data is often well organized, allowing for easy cleaning and preprocessing before we dive into the data.

One aspect we've ignored so far whilst talking about data is the noise which is often present in text – in tweets, for example, short forms and emoticons are often used, and in some cases we have multi-lingual data where a simple analysis might fail. This brings us to arguably the most important aspect of text analysis – preprocessing.

Garbage in, garbage out

Garbage in, garbage out (or GIGO) is an adage of computer science which is even more important when dealing with machine learning and possibly even more so when dealing with textual data. Garbage in, garbage out means that if we have poorly formatted data, it is likely we will have poor results.

Fig 1.6 XKCD hits the hammer on the nail once again (https://xkcd.com/1838/)

While more data usually leads to a better prediction, this isn't always the case with text analysis, where more data can result in nonsense results or results which we don't always want. An intuitive example: articles, a part of speech that includes words such as a or the, tend to appear a lot in text but do not add any information to it, and their role is usually limited to grammar or structure.

Words such as these which don't provide useful information are called stop words, and these words are often removed from the text before applying text analysis techniques to it. Similarly, sometimes we remove words with very high frequency in the body of text, and words which only appear once or twice – it is highly likely these words will not be useful to our analysis. That being said, this depends heavily on the kind of task being performed – if, for example, we want to replicate human writing styles, stop words are important because humans use many such words when writing. An example of how stop words can also carry useful information is the article Pastiche detection based on stopword rankings. Exposing impersonators of a Romanian writer [20], a study which identified a certain author using the frequency of stop words.
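A minimal sketch of the two filtering steps described here, using only the standard library, might look as follows. The stop word list is a small hand-picked assumption (real lists, such as those shipped with NLP libraries, are much longer), and the min_count threshold is an arbitrary choice for this toy sentence.

```python
import re
from collections import Counter

# A small hand-picked stop word list; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "of", "and", "is", "in", "to", "on"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop every token that appears in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def drop_rare(tokens, min_count=2):
    """Keep only tokens that occur at least min_count times."""
    counts = Counter(tokens)
    return [t for t in tokens if counts[t] >= min_count]


text = "The cat sat on the mat and the cat chased a mouse in the garden"
tokens = tokenize(text)
content = remove_stop_words(tokens)
frequent = drop_rare(content)

print(Counter(tokens).most_common(1))  # the most frequent raw token is a stop word
print(content)
print(frequent)
```

Before filtering, the most frequent token is the article "the"; after both steps only the repeated content word "cat" remains. Whether filtering this aggressively is appropriate depends on the task, as the paragraph above points out.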

Let's consider another example where we might be dealing with useless data – if searching for influential words or topics in the text, would it make sense to have both the words reading and read in the results? Here, shortening the word reading to read would not lead to any loss of information. But on a similar note, it would make sense to have the words information and inform exist separately in the same body of text, because they could mean different things based on the context. We would then need techniques to shorten words appropriately. Lemmatizing and stemming are two methods we use to tackle this problem and remain two of the core concepts in natural language processing. We will be exploring these two techniques in more detail in Chapter 3, spaCy's Language models.
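The difference between the two approaches can be illustrated with a toy example. The sketch below is not a real stemmer or lemmatizer: naive_stem is a crude suffix stripper written for this illustration (established algorithms such as the Porter stemmer use many more rules), and LEMMAS is a hypothetical four-entry lookup table standing in for the vocabulary and morphological analysis a real lemmatizer relies on.

```python
SUFFIXES = ("ing", "ed", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical lookup table standing in for a real lemmatizer's dictionary.
LEMMAS = {"reading": "read", "reads": "read", "ran": "run", "better": "good"}


def lookup_lemma(word):
    """Return the dictionary form of a word if known, else the word itself."""
    return LEMMAS.get(word, word)


for word in ["reading", "reads", "information", "inform", "ran", "news"]:
    print(word, "->", naive_stem(word), "|", lookup_lemma(word))
```

Both functions map reading and reads to read while leaving information and inform distinct. The rule-based stemmer, however, cannot relate ran to run, and it wrongly shortens news to new, which is the kind of error that motivates dictionary-aware lemmatization.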

Even after basic text-processing, our data is still a collection of words. Since machines do not inherently understand the concepts tied to words, we can instead use numbers that represent individual words. The next important step in text analysis is converting words into numbers, whether it is bag-of-words (BOW), or term frequency-inverse document frequency (TF-IDF), which are different ways to count the number of words in each document or sentence. There are also more advanced techniques to represent words such as Word2Vec and GloVe.
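As a small worked example of turning words into numbers, the sketch below builds bag-of-words and TF-IDF vectors for three toy documents using only the standard library. The documents are made up, and the weighting used is one simple textbook variant (term frequency divided by document length, times log(N / document frequency)); libraries such as Gensim and scikit-learn use their own, slightly different, smoothing and normalization schemes.

```python
import math
from collections import Counter

# Three toy documents, already lowercased and free of punctuation.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
tokenized = [d.split() for d in docs]

# The vocabulary fixes the position of each word in every vector.
vocab = sorted({w for doc in tokenized for w in doc})


def bow_vector(tokens):
    """Raw count of each vocabulary word in one document."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]


def idf(word):
    """Inverse document frequency: log(N / number of docs containing word)."""
    df = sum(1 for doc in tokenized if word in doc)
    return math.log(len(tokenized) / df)


def tfidf_vector(tokens):
    """Length-normalized term frequency weighted by IDF."""
    counts = Counter(tokens)
    return [counts[w] / len(tokens) * idf(w) for w in vocab]


bow = [bow_vector(doc) for doc in tokenized]
tfidf = [tfidf_vector(doc) for doc in tokenized]

print(vocab)
print(bow[0])
print([round(x, 3) for x in tfidf[0]])
```

In the first document the word the has the highest raw count, but because it also appears in the second document its TF-IDF weight ends up lower than that of cat, which occurs in only one document. This is the sense in which TF-IDF goes beyond simple counting.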

We will cover these techniques in more detail in the chapter on preprocessing techniques – it is especially important to understand the motivation behind these techniques, and that a computer's output is only as good as the input you feed it.

Why should you do text analysis?

We've talked about what text analysis is, where we can find the data, and some of the things to keep in mind before diving into text analysis. But after all, what motivation do you, the reader, have to actually go about doing text analysis?

For starters, it's the sheer abundance of easily available data that we can use. In the big data age, there really is no excuse to not have a look at what all our data really means. In fact, apart from the massive datasets we can download off the internet, we also have access to small data – text messages, emails, and a collection of poems are such examples. You could even do a meta-analysis and run an analysis on this very book! Textual data is even easier to get a hold of, but far more importantly, it's easy to interpret and understand the results of the analysis. Numbers might not always make sense and are not always appealing to look at, but words are easier for us human beings to appreciate.

Text analysis remains exciting also because we can use data which directly involves the user – our own text conversations, our favorite childhood book, or tweets by our favorite celebrity. The personal nature of text data always adds an extra bit of motivation, and it also likely means we are aware of the nature of the data, and what kind of results to expect.

NLP techniques can also help us construct tools that can assist personal businesses or enterprises – chatbots, for example, are becoming increasingly common on major websites, and with the right approach, it is possible to have a personal chatbot. This is largely due to a subfield of machine learning called Deep Learning, where we use algorithms and structures that are inspired by the structure of the human brain. These algorithms and structures are also referred to as neural networks. Advances in deep learning have introduced us to powerful neural networks such as Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs). Now, even with minimal knowledge of the mathematical functioning of these algorithms, high-level APIs allow us to use these tools. Integrating this into our daily life is no longer reserved for computer science researchers or full-time engineers – with the right collection of data and open source packages, this is well within our capabilities.

Open source packages have become industry standard – Google has released and maintains TensorFlow [21], packages such as scikit-learn [22] are used by Apple and Spotify, and spaCy [23], which we will discuss extensively throughout this book, is used by Quora, a popular question-answer website.

We are no longer limited by either data or the tools – the only two things we would need to do text analysis.

The programming language Python will be our friend throughout the book, and all the tools we will use are free open-source software. While we move towards open science, we also move towards open source code, and this will remain a key philosophy throughout the book. In the world of research, open source code means academic results are reproducible and available to all those interested. Python remains an easy-to-use and powerful language and serves as a great way to enter the world of natural language processing.

One could argue that the last thing needed is the knowledge of how to apply these tools and to wrangle with the data – but that is precisely the purpose of the book, which hopes to let the reader build their own natural language processing pipelines and models by the end of the journey.

Summary

We've had a look at the incredible power of text analysis, and the kind of things we can do with it – as well as the kind of tools we would be using to take advantage of this. Data has become increasingly easy for us to access, and with the growth of social media, we have continuous access to both new data and standardized annotated datasets.

This book aims to walk the reader through the tools and knowledge required to conduct textual analysis on their own personal data or on standardized datasets. We will discuss methods to access and clean data to make it ready for preprocessing, as well as how to explore and organize our textual data. Classification and clustering are two other commonly conducted text processing tasks, and we will figure out how to perform these as well, before finishing up with how to use deep learning for text.