Practical Text Mining with Perl

Roger Bilisoly
Description

Provides readers with the methods, algorithms, and means to perform text mining tasks.

This book is devoted to the fundamentals of text mining using Perl, an open-source programming tool that is freely available via the Internet (www.perl.org). It covers mining ideas from several perspectives: statistics, data mining, linguistics, and information retrieval. It provides readers with the means to successfully complete text mining tasks on their own. The book begins with an introduction to regular expressions, a text pattern methodology, and quantitative text summaries, all of which are fundamental tools for analyzing text. It then builds upon this foundation to explore:

* Probability and texts, including the bag-of-words model
* Information retrieval techniques such as the TF-IDF similarity measure
* Concordance lines and corpus linguistics
* Multivariate techniques such as correlation, principal components analysis, and clustering
* Perl modules, German, and permutation tests

Each chapter is devoted to a single key topic, and the author carefully and thoughtfully introduces mathematical concepts as they arise, allowing readers to learn as they go without having to refer to additional books. The inclusion of numerous exercises and worked-out examples further complements the book's student-friendly format. Practical Text Mining with Perl is ideal as a textbook for undergraduate and graduate courses in text mining and as a reference for a variety of professionals who are interested in extracting information from text documents.


Contents

List of Figures

List of Tables

Preface

Acknowledgments

1 Introduction

1.1 Overview of this Book

1.2 Text Mining and Related Fields

1.3 Advice for Reading this Book

2 Text Patterns

2.1 Introduction

2.2 Regular Expressions

2.3 Finding Words in a Text

2.4 Decomposing Poe’s “The Tell-Tale Heart” into Words

2.5 A Simple Concordance

2.6 First Attempt at Extracting Sentences

2.7 Regex Odds and Ends

2.8 References

Problems

3 Quantitative Text Summaries

3.1 Introduction

3.2 Scalars, Interpolation, and Context in Perl

3.3 Arrays and Context in Perl

3.4 Word Lengths in Poe’s “The Tell-Tale Heart”

3.5 Arrays and Functions

3.6 Hashes

3.7 Two Text Applications

3.8 Complex Data Structures

3.9 References

3.10 First Transition

Problems

4 Probability and Text Sampling

4.1 Introduction

4.2 Probability

4.3 Conditional Probability

4.4 Mean and Variance of Random Variables

4.5 The Bag-of-Words Model for Poe’s “The Black Cat”

4.6 The Effect of Sample Size

4.7 References

Problems

5 Applying Information Retrieval to Text Mining

5.1 Introduction

5.2 Counting Letters and Words

5.3 Text Counts and Vectors

5.4 The Term-Document Matrix Applied to Poe

5.5 Matrix Multiplication

5.6 Functions of Counts

5.7 Document Similarity

5.8 References

Problems

6 Concordance Lines and Corpus Linguistics

6.1 Introduction

6.2 Sampling

6.3 Corpus as Baseline

6.4 Concordancing

6.5 Collocations and Concordance Lines

6.6 Applications with References

6.7 Second Transition

Problems

7 Multivariate Techniques with Text

7.1 Introduction

7.2 Basic Statistics

7.3 Basic Linear Algebra

7.4 Principal Components Analysis

7.5 Text Applications

7.6 Applications and References

Problems

8 Text Clustering

8.1 Introduction

8.2 Clustering

8.3 A Note on Classification

8.4 References

8.5 Last Transition

Problems

9 A Sample of Additional Topics

9.1 Introduction

9.2 Perl Modules

9.3 Other Languages: Analyzing Goethe in German

9.4 Permutation Tests

9.5 References

Appendix A: Overview of Perl for Text Mining

A.1 Basic Data Structures

A.2 Operators

A.3 Branching and Looping

A.4 A Few Perl Functions

A.5 Introduction to Regular Expressions

Appendix B: Summary of R Used in this Book

B.1 Basics of R

B.2 This Book’s R Code

References

Index

WILEY SERIES ON METHODS AND APPLICATIONS IN DATA MINING

Series Editor: Daniel T. Larose

Discovering Knowledge in Data: An Introduction to Data Mining • Daniel T. Larose

Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage • Zdravko Markov and Daniel T. Larose

Data Mining Methods and Models • Daniel T. Larose

Practical Text Mining with Perl • Roger Bilisoly

Copyright © 2008 by John Wiley & Sons, Inc. All rights reserved.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey. Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic format. For information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging-in-Publication Data:

Bilisoly, Roger, 1963–
Practical text mining with Perl / Roger Bilisoly.
p. cm.
Includes bibliographical references and index.
ISBN 978-0-470-17643-6 (cloth)
1. Data mining. 2. Text processing (Computer science) 3. Perl (Computer program language) I. Title.
QA76.9.D343 B45 2008
005.74—dc22
2008008144

To my Mom and Dad & all their cats.

List of Figures

3.1 Log(Frequency) vs. Log(Rank) for the words in Dickens’s A Christmas Carol.

4.1 Plot of the running estimate of the probability of heads for 50 flips.

4.2 Plot of the running estimate of the probability of heads for 5000 flips.

4.3 Histogram of the proportions of the letter e in 68 Poe short stories based on table 4.1.

4.4 Histogram and best fitting normal curve for the proportions of the letter e in 68 Poe short stories.

4.5 Plot of the number of types versus the number of tokens for “The Unparalleled Adventures of One Hans Pfaall.” Data is from program 4.5. Figure adapted from figure 1.1 of Baayen [6] with kind permission from Springer Science and Business Media and the author.

4.6 Plot of the mean word frequency against the number of tokens for “The Unparalleled Adventures of One Hans Pfaall.” Data is from program 4.5. Figure adapted from figure 1.1 of Baayen [6] with kind permission from Springer Science and Business Media and the author.

4.7 Plot of the mean word frequency against the number of tokens for “The Unparalleled Adventures of One Hans Pfaall” and “The Black Cat.” Figure adapted from figure 1.1 of Baayen [6] with kind permission from Springer Science and Business Media and the author.

5.1 The vector (4,3) makes a right triangle if a line segment perpendicular to the x-axis is drawn to the x-axis.

5.2 Comparing the frequencies of the word the (on the x-axis) against city (on the y-axis). Note that the y-axis is not to scale: it should be more compressed.

5.3 Comparing the logarithms of the frequencies for the words the (on the x-axis) and city (on the y-axis).

7.1 Plotting pairs of word counts for the 68 Poe short stories.

7.2 Plots of the word counts for the versus of using the 68 Poe short stories.

8.1 A two variable data set that has two obvious clusters.

8.2 The perpendicular bisector of the line segment from (0,1) to (1,1) divides this plot into two half-planes. The points in each form the two clusters.

8.3 The next iteration of k-means after figure 8.2. The line splits the data into two groups, and the two centroids are given by the asterisks.

8.4 Scatterplot of heRate against sheRate for Poe’s 68 short stories.

8.5 Plot of two short story clusters fitted to the heRate and sheRate data.

8.6 Plots of three, four, five, and six short story clusters fitted to the heRate and sheRate data.

8.7 Plots of two short story clusters based on eight variables, but only plotted for the two variables heRate and sheRate.

8.8 Four more plots showing projections of the two short story clusters found in output 8.7 onto two pronoun rate axes.

8.9 Eight principal components split into two short story clusters and projected onto the first two PCs.

8.10 A portion of the dendrogram computed in output 8.11, which shows hierarchical clusters for Poe’s 68 short stories.

8.11 The plot of the Voronoi diagram computed in output 8.12.

8.12 All four plots have uniform marginal distributions for both the x- and y-axes. For problem 8.4.

8.13 The dendrogram for the distances between pronouns based on Poe’s 68 short stories. For problem 8.5.

9.1 Histogram of the numbers of runs in 100,000 random permutations of digits in equation 9.1.

9.2 Histogram of the runs of the 10,000 permutations of the names Scrooge and Marley as they appear in A Christmas Carol.

9.3 Histogram of the runs of the 10,000 permutations of the names Francois and Perrault as they appear in The Call of the Wild.

List of Tables

2.1 Telephone number formats we wish to find with a regex. Here d stands for a digit 0 through 9.

2.2 Telephone number input to test regular expression 2.2.

2.3 Summary of some of the special characters used by regular expressions with examples of strings that match.

2.4 Removing punctuation: a sample of five mistakes made by program 2.4.

2.5 Some values of the Perl variable $1 and their effects.

2.6 A variety of ways of combining two short sentences.

2.7 Sentence segmentation by program 2.8 fails for this sentence.

2.8 Defining true and false in Perl.

3.1 Comparison of arrays and hashes in Perl.

4.1 Proportions of the letter e for 68 Poe short stories, sorted smallest to largest.

4.2 Two intervals for the proportion of e’s in Poe’s short stories using table 4.1.

4.3 Counts of four-letter words satisfying each pair of conditions. For problem 4.5.

Preface

What This Book Covers

This book introduces the basic ideas of text mining, which is a group of techniques that extract useful information from one or more texts. This is a practical book, one that focuses on applications and examples. Although some statistics and mathematics are required, both are kept to a minimum, and what is used is explained.

This book, however, does make one demand: it assumes that you are willing to learn to write simple programs using Perl. This programming language is explicitly designed to work with text. In addition, it is open-source software that is available over the Web for free. That is, you can download the latest full-featured version of Perl right now, and install it on all the computers you want without paying a cent.

Chapters 2 and 3 give the basics of Perl, including a detailed introduction to regular expressions, which is a text pattern matching methodology used in a variety of programming languages, not just Perl. For each concept there are several examples of how to use it to analyze texts. Initial examples analyze short strings, for example, a few words or a sentence. Later examples use text from a variety of literary works, for example, the short stories of Edgar Allan Poe, Charles Dickens’s A Christmas Carol, Jack London’s The Call of the Wild, and Mary Shelley’s Frankenstein. All the texts used here are part of the public domain, so you can download these for free, too. Finally, if you are interested in word games, Perl plus extensive word lists are a great combination, which is covered in chapter 3.

Chapters 4 through 8 each introduce a core idea used in text mining. For example, chapter 4 explains the basics of probability, and chapter 5 discusses the term-document matrix, which is an important tool from information retrieval.

This book assumes that you want to analyze one or more texts, so the focus is on the practical. All the techniques in this book have immediate applications. Moreover, learning a minimal amount of Perl enables you to modify the code in this book to analyze the texts that interest you.

The level of mathematical knowledge assumed is minimal: you need to know how to count. Mathematics that arises for text applications is explained as needed and is kept to the minimum to do the job at hand. Although most of the techniques used in this book were created by researchers knowledgeable in math, a few basic ideas are all that are needed to read this book.

Although I am a statistician by training, the level of statistical knowledge assumed is also minimal. The core tools of statistics, for example, variability and correlations, are explained. It turns out that a few techniques are applicable in many ways.

The level of prior programming experience assumed is again minimal: Perl is explained from the beginning, and the focus is on working with text. The emphasis is on creating short programs that do a specific task, not general-purpose text mining tools. However, it is assumed that you are willing to put effort into learning Perl. If you have never programmed in any computer language at all, then doing this is a challenge. Nonetheless, the payoff is big if you rise to this challenge.

Finally, all the code, output, and figures in this book are produced with software that is available from the Web at no cost to you, which is also true of all the texts analyzed. Consequently, you can work through all the computer examples with no additional costs.

What Is Text Mining?

The text in text mining refers to written language that has some informational content. For example, newspaper stories, magazine articles, fiction and nonfiction books, manuals, blogs, email, and online articles are all texts. The amount of text that exists today is vast, and it is ever growing.

Although there are numerous techniques and approaches to text mining, the overall goal is simple: to discover new and useful information contained in one or more text documents. In practice, text mining is done by running computer programs that read in documents and process them in a variety of ways. The results are then interpreted by humans.

Text mining combines the expertise of several disciplines: mathematics, statistics, probability, artificial intelligence, information retrieval, and databases, among others. Some of its methods are conceptually simple, for example, concordancing, where all instances of a word are listed in context (like a Bible concordance). There are also sophisticated algorithms such as hidden Markov models (used for identifying parts of speech). This book focuses on the simpler techniques, which are useful and practical nonetheless and serve as a good introduction to more advanced text mining books.

This Book’s Approach to Text Mining

This book has three broad themes. First, text mining is built upon counting and text pattern matching. Second, although language is complex, some aspects of it can be studied by considering its simpler properties. Third, combining computer and human strengths is a powerful way to study language. We briefly consider each of these.

First, text pattern matching means identifying a pattern of letters in a document. For example, finding all instances of the word cat requires using a variety of patterns, some of which are below.

cat Cat cats Cats cat’s Cat’s cats’ cat, cat. cat!

It also requires rejecting words like catastrophe or scatter, which contain the string cat, but are not otherwise related. Using regular expressions, this can be explained to a computer, which is not daunted by the prospect of searching through millions of words. See section 2.2.1 for further discussion of this example and chapter 2 for text patterns in general.
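As a taste of what chapter 2 develops, here is a minimal sketch of such a pattern in Perl. This is not the book's own solution from section 2.2.1: the word-boundary anchor \b does the work of rejecting catastrophe and scatter, and the /i flag handles capitalization.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A sketch, not the book's solution: match "cat" as a word, allowing
# plural and possessive forms, but reject words that merely contain
# the string "cat". The \b anchors require a word boundary on each
# side; /i ignores case.
my $cat = qr/\bcats?(?:'s?)?\b/i;

foreach my $text ("The cat sat.", "Two cats' toys.", "That Cat's meow!",
                  "A catastrophe.", "Scatter the seeds.") {
    printf "%-9s %s\n", ($text =~ $cat ? "match" : "no match"), $text;
}
```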

It turns out that counting the number of matches to a text pattern occurs again and again in text mining, even in sophisticated techniques. For example, one way to compute the similarity of two text documents is by counting how many times each word appears in both documents. Chapter 5 considers this problem in detail.

Second, while it is true that the complexity of language is immense, some information about language is obtainable by simple techniques. For example, recent language reference books are often checked against large text collections (called corpora). Language patterns have been both discovered and verified by examining how words are used in writing and speech samples. For instance, big, large, and great are similar in meaning, but the examination of corpora shows that they are not used interchangeably: the sentences “he has big feet,” “she has large feet,” and “she has great insight” sound good, but “he has big insight” or “she has large insight” are less fluent. In this type of analysis, the computer finds the examples of usage among vast amounts of text, and a human examines these to discover patterns of meanings. See section 6.4.2 for an example.

Third, as noted above, computers follow directions well, and they are untiring, while humans are experts at using and interpreting language. However, computers have limited understanding of language, and humans have limited endurance. These facts suggest an iterative and collaborative strategy: the results of a program are interpreted by a human who, in turn, decides what further computer analyses are needed, if any. This back and forth process is repeated as many times as is necessary. This is analogous to exploratory data analysis, which exploits the interplay between computer analyses and human understanding of what the data means.

Why Use Perl?

This section title is really three questions. First, why use Perl as opposed to an existing text mining package? Second, why use Perl as opposed to other programming languages? Third, why use Perl instead of so-called pseudo-code? Here are three answers, respectively.

First, if you have a text mining package that can do everything you want with all the texts that interest you, and if this package works exactly the way you want it to, and if you believe that your future processing needs will be met by this package, then keep using it. However, it has been my experience that the process of analyzing texts suggests new ideas requiring new analyses, and that the boundaries of existing tools are reached too soon in any package that does not allow the user to program. So at the very least, I prefer packages that allow the user to add new features, which requires a programming language. Finally, learning how to use a package also takes time and effort, so why not invest that time in learning a flexible tool like Perl?

Second, Perl is a programming language that has text pattern matching (called regular expressions or regexes), and these are easy to use with a variety of commands. It also has a vast number of free add-ons available on the Web, many of which are for text processing. Additionally, there are numerous books, tutorials, and online resources for Perl, so it is easy to find out how to make it do what you want. Finally, you can get on the Web and download full-strength Perl right now, for free: no hidden charges!

Larry Wall built Perl as a text processing computer language. Moreover, he studied linguistics in graduate school, so he is knowledgeable about natural languages, which influenced his design of Perl. Although many programming languages support text pattern matching, Perl is designed to make it easy to use this feature.

Third, many books use pseudo-code, which excels at showing the programming logic. In my experience, this has one big disadvantage. Students without a solid programming background often find it hard to convert pseudo-code to running code. However, once Perl is installed on a computer, accurate typing is all that is required to run a program. In fact, one way to learn programming is by taking existing code and modifying it to see what happens, and this can only be done with examples written in a specific programming language.

Finally, personally, I enjoy using Perl, and it has helped me finish numerous text processing tasks. It is easy to learn a little Perl and then apply it, which leads to learning more, and then trying more complex applications. I use Perl for a text mining class I teach at Central Connecticut State University, and the students generally like the language. Hence, even if you are unfamiliar with it, you are likely to enjoy applying it to analyzing texts.

Organization of This Book

After an overview of this book in chapter 1, chapter 2 covers regular expressions in detail. This methodology is quite powerful and useful, and the time spent learning it pays off in the later chapters. Chapter 3 covers the data structures of Perl. Often a large number of linguistic items are considered all at once, and to work with all of them requires knowing how to use arrays and hashes as well as more complex data structures.

With the basics of Perl in hand, chapter 4 introduces probability. This lays the foundation for the more complex techniques in later chapters, but it also provides an opportunity to study some of the properties of language. For example, the distribution of the letters of the alphabet in a Poe story is analyzed in section 4.2.2.1.

Chapter 5 introduces the basics of vectors and arrays. These are put to good use as term-document matrices, a fundamental tool of information retrieval. Because it is possible to represent a text as a vector, the similarity of two texts can be measured by the angle between the two vectors representing the texts.

Corpus linguistics is the study of language using large samples of texts. Obviously this field of knowledge overlaps with text mining, and chapter 6 introduces the fundamental idea of creating a text concordance. This takes the text pattern matching ability of regular expressions, and allows a researcher to compare the matches in a variety of ways.

Text can be measured in numerous ways, which produces a data set that has many variables. Chapter 7 introduces the statistical technique of principal components analysis (PCA), which is one way to reduce a large set of variables to a smaller, hopefully easier to interpret, set. PCA is a popular tool among researchers, and this chapter teaches you the basic idea of how it works.

Given a set of texts, it is often useful to find out if these can be split into groups such that (1) each group has texts that are similar to each other and (2) texts from two different groups are dissimilar. This is called clustering. A related technique is to classify texts into existing categories, which is called classification. These topics are introduced in chapter 8.

Chapter 9 has three shorter sections, each of which discusses an idea that did not fit in one of the other chapters. Each of these is illustrated with an example, and each one has ties to earlier work in this book.

Finally, the first appendix gives an overview of the basics of Perl, while the second appendix lists the R commands used at the end of chapter 5 as well as chapters 7 and 8. R is a statistical software package that is also available for free from the Web. This book uses it for some examples, and references for documentation and tutorials are given so that an interested reader can learn more about it.

ROGER BILISOLY

New Britain, Connecticut

May 2008

Acknowledgments

Thanks to the Department of Mathematical Sciences of Central Connecticut State University (CCSU) for an environment that provided me the time and resources to write this book. Thanks to Dr. Daniel Larose, Director of the Data Mining Program at CCSU, for encouraging me to develop Stat 527, an introductory course on text mining. He also first suggested that I write a data mining book, which eventually became this text.

Some of the ideas in chapters 2, 3, and 5 arose as I developed and taught text mining examples for Stat 527. Thanks to Kathy Albers, Judy Spomer, and Don Wedding for taking independent studies on text mining, which helped to develop this class. Thanks again to Judy Spomer for comments on a draft of chapter 2.

Thanks to Gary Buckles and Gina Patacca for their hospitality over the years. In particular, my visits to The Ohio State University’s libraries would have been much less enjoyable if not for them.

Thanks to Dr. Edward Force for reading the section on text mining German. Thanks to Dr. Krishna Saha for reading over my R code and giving suggestions for improvement. Thanks to Dr. Nell Smith and David LaPierre for reading the entire manuscript and making valuable suggestions on it.

Thanks to Paul Petralia, senior editor at Wiley Interscience, who let me write the book that I wanted to write.

The notation and figures in my section 4.6.1 are based on section 1.1 and figure 1.1 of Word Frequency Distributions by R. Harald Baayen, which is volume 18 of the “Text, Speech and Language Technology” series, published in 2001. This is possible with the kind permission of Springer Science and Business Media as well as the author himself.

Thanks to everyone who has contributed their time and effort in creating the wonderful assortment of public domain texts on the Web. Thanks to programmers everywhere who have contributed open-source software to the world.

I would never have gotten to where I am now without the support of my family. This book is dedicated to my parents who raised me to believe in following my interests wherever they may lead. To my cousins Phyllis and Phil whose challenges in 2007 made writing a book seem not so bad after all. In memory of Sam, who did not live to see his name in print. And thanks to the fun crowd at the West Virginia family reunions each year. See you this summer!

Finally, thanks to my wife for all the good times and for all the support in 2007 as I spent countless hours on the computer. Love you!

R. B.

CHAPTER 1

INTRODUCTION

1.1 OVERVIEW OF THIS BOOK

This is a practical book that introduces the key ideas of text mining. It assumes that you have electronic texts to analyze and are willing to write programs using the programming language Perl. Although programming takes effort, it allows a researcher to do exactly what he or she wants to do. Interesting texts often have many idiosyncrasies that defy a software package approach.

Numerous, detailed examples are given throughout this book that explain how to write short programs to perform various text analyses. Most of these easily fit on one page, and none are longer than two pages. In addition, it takes little skill to copy and run code shown in this book, so even a novice programmer can get results quickly.

The first programs illustrating a new idea use only a line or two of text. However, most of the programs in this book analyze works of literature, which include the 68 short stories of Edgar Allan Poe, Charles Dickens’s A Christmas Carol, Jack London’s The Call of the Wild, Mary Shelley’s Frankenstein, and Johann Wolfgang von Goethe’s Die Leiden des jungen Werthers. All of these are in the public domain and are available from the Web for free. Since all the software to write the programs is also free, you can reproduce all the analyses of this book on your computer without any additional cost.

This book is built around the programming language Perl for several reasons. First, Perl is free. There are no trial or student versions, and anyone with access to the Web can download it as many times and on as many computers as desired. Second, Larry Wall created Perl to excel in processing computer text files. In addition, he has a background in linguistics, and this influenced the look and feel of this computer language. Third, there are numerous additions to Perl (called modules) that are also free to download and use. Many of these process or manipulate text. Fourth, Perl is popular and there are numerous online resources as well as books on how to program in Perl. To get the most out of this book, download Perl to your computer and, starting in chapter 2, try writing and running the programs listed in this book.

This book does not assume that you have used Perl before. If you have never written any program in any computer language, then obtaining a book that introduces programming with Perl is advised. If you have never worked with Perl before, then using the free online documentation on Perl is useful. See sections 2.8 and 3.9 for some Perl references.

Note that this book is not on Perl programming for its own sake. It is devoted to how to analyze text with Perl. Hence, some parts of Perl are ignored, while others are discussed in great detail. For example, process management is ignored, but regular expressions (a text pattern methodology) are extensively discussed in chapter 2.

As this book progresses, some mathematics is introduced as needed. However, it is kept to a minimum, for example, knowing how to count suffices for the first four chapters. Starting with chapter 5, more of it is used, but the focus is always on the analysis of text while minimizing the required mathematics.

As noted in the preface, there are three underlying ideas behind this book. First, much text mining is built upon counting and text pattern matching. Second, although language is complex, there is useful information gained by considering the simpler properties of it. Third, combining a computer’s ability to follow instructions without tiring and a human’s skill with language creates a powerful team that can discover interesting properties of text. Someday, computers may understand and use a natural language to communicate, but for the present, the above ideas are a profitable approach to text mining.

1.2 TEXT MINING AND RELATED FIELDS

The core goal of text mining is to extract useful information from one or more texts. However, many researchers from many fields have been doing this for a long time. Hence the ideas in this book come from several areas of research.

Chapters 2 through 8 each focus on one idea that is important in text mining. Each chapter has many examples of how to implement this in computer code, which is then used to analyze one or more texts. That is, the focus is on analyzing text with techniques that require at most a modest knowledge of mathematics or statistics.

The sections below describe each chapter’s highlights in terms of what useful information is produced by the programs in each chapter. This gives you an idea of what this book covers.

1.2.1 Chapter 2: Pattern Matching

To analyze text, language patterns must be detected. These include punctuation marks, characters, syllables, words, phrases, and so forth. Finding string patterns is so important that a pattern matching language has been developed, which is used in numerous programming languages and software applications. This language is called regular expressions.

Literally every chapter in this book relies on finding string patterns, and some tasks developed in this chapter demonstrate the power of regular expressions. However, many tasks that are easy for a human require attention to detail when they are made into programs.

For example, section 2.4 shows how to decompose Poe’s short story, “The Tell-Tale Heart,” into words. This is easy for someone who can read English, but dealing with hyphenated words, apostrophes, conventions of using single and double quotes, and so forth all require the programmer’s attention.

Section 2.5 uses the skills gained in finding words to build a concordance program that is able to find and print all instances of a text pattern. The power of Perl is shown by the fact that the result, program 2.7, fits within one page (including comments and blank lines for readability).
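To make the flavor of such a program concrete, here is a minimal concordance sketch. It is not the book's program 2.7, and the filename and target word are placeholders: it slurps a story, then prints every match of a pattern with a fixed amount of context on each side.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A concordance sketch (not the book's program 2.7): print each match
# of $target with up to $width characters of context on either side.
# The filename and target word are placeholders.
my $target = qr/\bheart\b/i;
my $width  = 30;

open my $fh, '<', 'tell-tale-heart.txt' or die "Cannot open file: $!";
my $text = do { local $/; <$fh> };   # slurp the whole story
close $fh;
$text =~ s/\s+/ /g;                  # flatten line breaks for display

while ($text =~ /($target)/g) {
    my $pre   = substr($text, 0, $-[0]);   # text before the match
    my $post  = substr($text, $+[0]);      # text after the match
    my $left  = length($pre) > $width ? substr($pre, -$width) : $pre;
    my $right = substr($post, 0, $width);
    printf "%${width}s[%s]%s\n", $left, $1, $right;
}
```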

Finally, a program for detecting sentences is written. This, too, is a key task, and one that is trickier than it might seem. This also serves as an excellent way to show several of the more advanced features of regular expressions as implemented in Perl. Consequently, this program is written more than once in order to illustrate several approaches. The results are programs 2.8 and 2.9, which are applied to Dickens’s A Christmas Carol.

1.2.2 Chapter 3: Data Structures

Chapter 2 discusses text patterns, while chapter 3 shows how to record the results in a convenient fashion. This requires learning about how to store information using indices (either numerical or string).

The first application is to tally all the word lengths in Poe’s “The Tell-Tale Heart,” the results of which are shown in output 3.4. The second application is finding out how often each word in Dickens’s A Christmas Carol appears. These results are graphed in figure 3.1, which shows a connection between word frequency and word rank.
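The word-length tally is a natural first exercise with hashes. The following is an independent sketch of that idea, not the program behind output 3.4, and the filename is a placeholder.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Tally word lengths with a hash; the filename is a placeholder.
my %lengths;

open my $fh, '<', 'tell-tale-heart.txt' or die "Cannot open file: $!";
while (my $line = <$fh>) {
    # A word here is letters with optional internal apostrophes, so
    # "don't" counts once, as a five-character word.
    while ($line =~ /([a-zA-Z]+(?:'[a-zA-Z]+)*)/g) {
        $lengths{ length $1 }++;
    }
}
close $fh;

foreach my $len (sort { $a <=> $b } keys %lengths) {
    print "$len: $lengths{$len}\n";
}
```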

Section 3.7.2 shows how to combine Perl with a public domain word list to solve certain types of word games, for example, finding potential words in an incomplete crossword puzzle. Here is a chance to impress your friends with your superior knowledge of lexemes.

Finally, the material in this chapter is used to compare the words in the two Poe stories, “Mesmeric Revelations” and “The Facts in the Case of M. Valdemar.” The plots of these stories are quite similar, but is this reflected in the language used?

1.2.3 Chapter 4: Probability

Language has both structure and unpredictability. One way to model the latter is by using probability. This chapter introduces this topic using language for its examples, and the level of mathematics is kept to a minimum. For example, Dickens’s A Christmas Carol and Poe’s “The Black Cat” are used to show how to estimate letter probabilities (see output 4.2).
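Since a letter probability is estimated by a relative frequency, a short hash-based script suffices. The sketch below is not the program behind output 4.2, and the filename is a placeholder.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Estimate letter probabilities as relative frequencies;
# the filename is a placeholder.
my (%count, $total);

open my $fh, '<', 'black-cat.txt' or die "Cannot open file: $!";
while (my $line = <$fh>) {
    foreach my $letter (lc($line) =~ /([a-z])/g) {
        $count{$letter}++;
        $total++;
    }
}
close $fh;

printf "%s %.4f\n", $_, $count{$_} / $total for sort keys %count;
```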

One way to quantify variability is with the standard deviation. This is illustrated by comparing the frequencies of the letter e in 68 of Poe’s short stories, which is given in table 4.1, and plotted in figures 4.3 and 4.4.

Finally, Poe’s “The Unparalleled Adventures of One Hans Pfaall” is used to show one way that text samples behave differently from simpler random models such as coin flipping. It turns out that it is hard to untangle the effect of sample size on the amount of variability in a text. This is graphically illustrated in figures 4.5, 4.6, and 4.7 in section 4.6.1.

1.2.4 Chapter 5: Information Retrieval

One major task in information retrieval is to find documents that are the most similar to a query. For instance, search engines do exactly this. However, queries are short strings of text, so even this application compares two texts: the query and a longer document. It turns out that these methods can be used to measure the similarity of two long texts.

The focus of this chapter is the comparison of the following four Poe short stories: “Hop Frog,” “A Predicament,” “The Facts in the Case of M. Valdemar,” and “The Man of the Crowd.” One way to quantify the similarity of any pair of stories is to represent each story as a vector. The more similar the stories, the smaller the angle between their vectors. See output 5.2 for a table of these angles.

At first, it is surprising that geometry is one way to compare literary works. But as soon as a text is represented by a vector, and because vectors are geometric objects, it follows that geometry can be used in a literary analysis. Note that much of this chapter explains these geometric ideas in detail, and this discussion is kept as simple as possible so that it is easy to follow.
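As a sketch of that geometry, the following computes the cosine and angle between two term-count vectors. The counts are made up for illustration, not taken from the four stories; POSIX supplies acos.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX 'acos';

# Angle between two term-count vectors. The counts are invented
# purely for illustration.
my @story1 = (15, 4, 0, 7);
my @story2 = (12, 5, 2, 3);

my ($dot, $len1, $len2) = (0, 0, 0);
for my $i (0 .. $#story1) {
    $dot  += $story1[$i] * $story2[$i];
    $len1 += $story1[$i] ** 2;
    $len2 += $story2[$i] ** 2;
}
my $cosine = $dot / (sqrt($len1) * sqrt($len2));
my $pi     = 4 * atan2(1, 1);
printf "cosine = %.4f, angle = %.1f degrees\n",
       $cosine, acos($cosine) * 180 / $pi;
```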

1.2.5 Chapter 6: Corpus Linguistics

Corpus linguistics is empirical: it studies language through the analysis of texts. At present, the largest of these collections contain about a billion words (an average-size paperback novel has about 100,000 words, so this is equivalent to approximately 10,000 novels). One simple but powerful technique is using a concordance program, which is created in chapter 2. This chapter adds sorting capabilities to it.

Even something as simple as examining word counts can show differences between texts. For example, table 6.2 shows differences in the following texts: a collection of business emails from Enron, Dickens’s A Christmas Carol, London’s The Call of the Wild, and Shelley’s Frankenstein. Some of these differences arise from narrative structure.

One application of sorted concordance lines is comparing how words are used. For example, the word body in The Call of the Wild is used for live, active bodies, but in Frankenstein it is often used to denote a dead, lifeless body. See tables 6.4 and 6.5 for evidence of this.

Sorted concordance lines are also useful for studying word morphology (see section 6.4.3) and collocations (see section 6.5). An example of the latter is phrasal verbs (verbs that change their meaning with the addition of a word, for example, throw versus throw up), which is discussed in section 6.5.2.

1.2.6 Chapter 7: Multivariate Statistics

Chapter 4 introduces some useful, core ideas of probability, and this chapter builds on this foundation. First, the correlation between two variables is defined, and then the connection between correlations and angles is discussed, which links a key tool of information retrieval (discussed in chapter 5) and a key technique of statistics.

This leads to an introduction of a few essential tools from linear algebra, which is a field of mathematics that works with vectors and matrices, a topic introduced in chapter 5. With this background, the statistical technique of principal components analysis (PCA) is introduced and is used to analyze the pronoun use in 68 of Poe’s short stories. See output 7.13 and the surrounding discussion for the conclusions drawn from this analysis.

This chapter is more technical than the earlier ones, but the few mathematical topics introduced are essential to understanding PCA, and all these are explained with concrete examples. The payoff is high because PCA is used by linguists and others to analyze many measurements of a text at once. Further evidence of this payoff is given by the references in section 7.6, which apply these techniques to specific texts.

1.2.7 Chapter 8: Clustering

Chapter 7 gives an example of a collection of texts, namely, all the short stories of Poe published in a certain edition of his works. One natural question to ask is whether or not they form groups. Literary critics often do this, for example, some of Poe’s stories are considered early examples of detective fiction. The question is how a computer might find groups.

To group texts, a measure of similarity is needed, and many of these have been developed by researchers in information retrieval (the topic of chapter 5). One popular method uses the PCA technique introduced in chapter 7, which is applied to the 68 Poe short stories, and the results are illustrated graphically. For example, see figures 8.6, 8.7, and 8.8.

Clustering is a popular technique in both statistics and data mining, and successes in these areas have made it popular in text mining as well. This chapter introduces just one of many approaches to clustering, which is explained with Poe’s short stories, and the emphasis is on the application, not the theory. However, after reading this chapter, the reader is ready to tackle other works on the topic, some of which are listed in section 8.4.

1.2.8 Chapter 9: Three Additional Topics

All books have to stop somewhere. Chapters 2 through 8 introduce a collection of key ideas in text mining, which are illustrated using literary texts. This chapter introduces three shorter topics.

First, Perl is popular in linguistics and text processing not just because of its regular expressions, but also because many programs already exist in Perl and are freely available online. Many of these exist as modules, which are groups of additional functions that are bundled together. Section 9.2 demonstrates some of these. For example, there is one that breaks text into sentences, a task also discussed in detail in chapter 2.

Second, this book focuses on texts in English, but any language expressed in electronic form is fair game. Section 9.3 compares Goethe’s novel Die Leiden des jungen Werthers (written in German) with some of the analyses of English texts computed earlier in this book.

Third, one popular model of language in information retrieval is the so-called bag-of-words model, which ignores word order. Because word order does make a difference, how does one quantify this? Section 9.4 shows one statistical approach to answer this question. It analyzes the order that character names appear in Dickens’s A Christmas Carol and London’s The Call of the Wild.
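A minimal sketch of that permutation idea follows. The sequence of name labels is invented for illustration (A and B standing in for two character names); List::Util's shuffle generates the random orderings, and a run is counted as a maximal block of consecutive identical labels.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util 'shuffle';

# Invented label sequence: A and B stand in for two character names
# in their order of appearance in a text.
my @labels = qw(A A B A B B A A A B B A B A A B A B B A);

# Count runs: maximal blocks of consecutive identical labels.
sub count_runs {
    my @seq  = @_;
    my $runs = 1;
    for my $i (1 .. $#seq) {
        $runs++ if $seq[$i] ne $seq[$i - 1];
    }
    return $runs;
}

my $observed = count_runs(@labels);
my $trials   = 10_000;
my $at_least = 0;
for (1 .. $trials) {
    $at_least++ if count_runs(shuffle @labels) >= $observed;
}
printf "observed runs: %d, P(runs >= observed) ~ %.3f\n",
       $observed, $at_least / $trials;
```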

1.3 ADVICE FOR READING THIS BOOK

As noted above, to get the most out of this book, download Perl to your computer. As you read the chapters, try writing and running the programs given in the text. Once a program runs, watching the computer print out results of an analysis is fun, so do not deprive yourself of this experience.

How to read this book depends on your background in programming. If you have never used any computer language, then the subsequent chapters will require time and effort. In this case, buying one or more texts on how to program in Perl is helpful because, when starting out, programming errors are hard to detect, so the more examples you see, the better. Although learning to program is difficult, it allows you to do exactly what you want to do, which is critical when dealing with something as complex as language.

If you have programmed in a computer language other than Perl, try reading this book with the help of the online documentation and tutorials. Because this book focuses on a subset of Perl that is most useful for text mining, there are commands and functions that you might want to use but are not discussed here.

If you already program in Perl, then peruse the listings in chapters 2 and 3 to see if there is anything that is new to you. These two chapters contain the core Perl knowledge needed for the rest of the book, and once this is learned, the other chapters are understandable.

After chapters 2 and 3, each chapter focuses on a topic of text mining. All the later chapters make use of these two chapters, so read or peruse these first. Although each of the later chapters has its own topic, there are the following interconnections. First, chapter 7 relies on chapters 4 and 5. Second, chapter 8 uses the idea of PCA introduced in chapter 7. Third, there are many examples of later chapters referring to the computer programs or output of earlier chapters, but these are listed by section to make them easy to check.

The Perl programs in this book are divided into code samples and programs. The former are often intermediate results or short pieces of code that are useful later. The latter are typically longer and perform a useful task; these are also boxed instead of ruled. The results of Perl programs are generally called outputs, and the same term is used for R sessions since R is interactive.

Finally, I enjoy analyzing text and believe that programming in Perl is a great way to do it. My hope is that this book shares my enjoyment with both students and researchers.

CHAPTER 2

TEXT PATTERNS

2.1 INTRODUCTION

Did you ever remember a certain passage in a book but forgot where it was? With the advent of electronic texts, this unpleasant experience has been replaced by the joy of using a search utility. Computers have limitations, but their ability to do what they are told without tiring is invaluable when it comes to combing through large electronic documents. Many of the more sophisticated techniques later in this book rely on an initial analysis that starts with one or more searches.
