Language and Computers introduces students to the fundamentals of how computers are used to represent, process, and organize textual and spoken information. Concepts are grounded in real-world examples familiar to students’ experiences of using language and computers in everyday life.
Page count: 480
Year of publication: 2012
Contents
What This Book Is About
Overview for Instructors
Acknowledgments
1 Prologue
1.1 Where do we start?
1.2 Writing systems used for human languages
1.3 Encoding written language
1.4 Encoding spoken language
2 Writers’ Aids
2.1 Introduction
2.2 Kinds of spelling errors
2.3 Spell checkers
2.4 Word correction in context
2.5 Style checkers
3 Language Tutoring Systems
3.1 Learning a language
3.2 Computer-assisted language learning
3.3 Why make CALL tools aware of language?
3.4 What is involved in adding linguistic analysis?
3.5 An example ICALL system: TAGARELA
3.6 Modeling the learner
4 Searching
4.1 Introduction
4.2 Searching through structured data
4.3 Searching through unstructured data
4.4 Searching semi-structured data with regular expressions
4.5 Searching text corpora
5 Classifying Documents
5.1 Automatic document classification
5.2 How computers “learn”
5.3 Features and evidence
5.4 Application: Spam filtering
5.5 Some types of document classifiers
5.6 From classification algorithms to context of use
6 Dialog Systems
6.1 Computers that “converse”?
6.2 Why dialogs happen
6.3 Automating dialog
6.4 Conventions and framing expectations
6.5 Properties of dialog
6.6 Dialog systems and their tasks
6.7 Eliza
6.8 Spoken dialogs
6.9 How to evaluate a dialog system
6.10 Why is dialog important?
7 Machine Translation Systems
7.1 Computers that “translate”?
7.2 Applications of translation
7.3 Translating Shakespeare
7.4 The translation triangle
7.5 Translation and meaning
7.6 Words and meanings
7.7 Word alignment
7.8 IBM Model 1
7.9 Commercial automatic translation
8 Epilogue
References
Concept Index
This edition first published 2013
© 2013 Markus Dickinson, Chris Brew, Detmar Meurers
Blackwell Publishing was acquired by John Wiley & Sons, in February 2007. Blackwell’s publishing program has been merged with Wiley’s global Scientific, Technical, and Medical business to form Wiley-Blackwell.
Registered Office: John Wiley & Sons, Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK
Editorial Offices: 350 Main Street, Malden, MA 02148-5020, USA; 9600 Garsington Road, Oxford, OX4 2DQ, UK; The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK
For details of our global editorial offices, for customer services, and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com/wiley-blackwell.
The right of Markus Dickinson, Chris Brew, Detmar Meurers to be identified as the authors of this work has been asserted in accordance with the UK Copyright, Designs and Patents Act 1988.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.
Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book. This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold on the understanding that the publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional should be sought.
Library of Congress Cataloging-in-Publication Data
Dickinson, Markus. Language and computers / Markus Dickinson, Chris Brew, Detmar Meurers. p. cm. Includes index.
ISBN 978-1-4051-8306-2 (cloth) – ISBN 978-1-4051-8305-5 (pbk.) 1. Computational linguistics. 2. Natural language processing (Computer science) I. Brew, Chris. II. Meurers, Detmar. III. Language and computers
P98.D495 2013
410.285–dc23
2012010324
A catalogue record for this book is available from the British Library.
Cover design by www.cyandesign.co.uk
The computer has become the medium of choice through which much of our language use is channeled. Modern computer systems therefore spend a good part of their time working on human language. This is a positive development: not only does it give everyone on the internet access to a world of information well beyond the scope of even the best research libraries of the 1960s and 1970s, it also creates new capabilities for creation, exploitation, and management of information. These include tools that support nonfiction, creative writing, blogs and diaries, citizen journalism and social interactions, web search and online booking systems, smart library catalogs, knowledge discovery, spoken language dialogs, and foreign language learning.
This book takes you on a tour of different real-world tasks and applications where computers deal with language. During this tour, you will encounter essential concepts relating to language, representation, and processing, so that by the end of the book you will have a good grasp of key concepts in the field of computational linguistics. The only background you need to read this book is some curiosity about language and some everyday experience with computers.
This is indeed why the book is organized around real-world tasks and applications. We assume that most of you will be familiar with many of the applications and may wonder how they work or why they don’t work. What you may not realize is how similar the underlying processing is. For example, there is a great deal in common between how grammar checkers and automatic speech-recognition systems work. We hope that demonstrating how these concepts recur – in this case, in something called n-grams – will reinforce the importance of applying general techniques to new applications.
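To give a first impression of what an n-gram is, here is a minimal sketch in Python. The function name `ngrams` and the example sentence are our own illustrations, not taken from the book; the sketch simply collects every run of n consecutive words and counts how often each run occurs.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Illustrative sentence (an assumption for this sketch, not from the book).
words = "the cat sat on the mat because the cat was tired".split()

# Count the two-word sequences (bigrams) in the sentence.
bigram_counts = Counter(ngrams(words, 2))
print(bigram_counts[("the", "cat")])  # "the cat" occurs twice
```

Counts of this kind are one plausible ingredient for estimating which word sequences are likely, which is the sort of information both a grammar checker and a speech recognizer can draw on.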
The book is designed to make you aware of how technology works and how language works. We focus on a few applications of language technology (LT), computational linguistics (CL), and natural language processing (NLP). LT, CL, and NLP are essentially names for the same thing, seen from the perspectives of industry, linguistics, and computer science, respectively. The tasks and applications were chosen because: (i) they are representative of techniques used throughout the field; (ii) they represent a significant body of work in and of themselves; (iii) they connect directly to linguistic modeling; and (iv) they are the ones the authors know best. We hope that you will be able to use these examples as an introduction to general concepts that you can apply to learning about other applications and areas of inquiry.
There are a number of features in this textbook that allow you to structure what you learn, explore more about the topics, and reinforce what you are learning. As a start, the relevant concepts being covered are typeset in bold and shown in the margins of each page. You can also look those up in the Concept Index at the end of the book.
The Under the Hood sections included in many of the chapters are intended to give you more detail on selected advanced topics. For those interested in learning more about language and computers, we hope that you find these sections enjoyable and enlightening, though the gist of each chapter can be understood without reading them.
At the end of each chapter there is a Checklist indicating what you should have learned. The Exercises also found at the end of each chapter review the material and give you opportunities to go beyond it. Our hope is that the checklist and exercises help you to get a good grasp of each of the topics and concepts involved. We recognize, however, that students from different backgrounds have different skills, so we have marked each question with an indication of who the question is for. There are four designations: most questions are appropriate for all students and thus are marked with ALL; LING questions assume some background and interest in linguistics; CS questions are appropriate for those with a background in computer science; and MATH is appropriate for those wanting to tackle more mathematical challenges. Of course, you should not feel limited by these markers, as a strong enough desire will generally allow you to tackle most questions.
If you enjoy the topic of a particular chapter, we encourage you to make use of the Further reading recommendations. You can also follow the page numbers under each entry in the References at the end of the book to the place where it is discussed in the book.
Finally, on the book’s companion website, http://purl.org/lang-and-comp, we have collected resources and links to other materials that could be of interest to you when exploring topics around language and computers.
Everyday natural language processing tools, such as those mentioned in the previous section, provide new educational opportunities. The goal of our courses is to show students the capabilities of these tools, and especially to encourage them to take a reflective and analytic approach to their use.
The aim of this book is to provide insight into how computers support language-related tasks, and to provide a framework for thinking about these tasks. There are two major running themes. The first is an emphasis on the relation between the aspects of language that need to be made explicit to obtain working language technology and how they can be represented. We introduce and motivate the need for explicit representations, which arises from the fact that computers cannot directly work with language but require a commitment to linguistic modeling and data structures that can be expressed in bits and bytes. We emphasize the representational choices that are involved in modeling linguistic abstractions in a concrete form computers can use.
The second running theme is the means needed in order to obtain the knowledge about language that the computer requires. There are two main options here: either we arrange for the computer to learn from examples, or we arrange for experts to create rules that encode what we know about language. This is an important distinction, both for students whose primary training is in formal linguistics, to whom the idea that such knowledge can be learned may be unfamiliar, and for the increasing number of students who are exposed to the “new AI” tradition of machine learning, to whom the idea of creating and using hand-coded knowledge representations may be surprising. Our view is that the typical real-world system uses a synthesis of both approaches, so we want students to understand and respect the strengths and weaknesses of both data-driven and theory-driven traditions.
This chapter lays the groundwork for understanding how natural language is used on a computer, by outlining how language can be represented. There are two halves to this chapter, focusing on the two ways in which language is transmitted: text and speech. The text portion outlines the range of writing systems in use and then turns to how information is encoded on the computer, specifically how all writing systems can be encoded effectively. The speech portion offers an overview of both the articulatory and the acoustic properties of speech. This provides a platform for talking about automatic speech recognition and text-to-speech synthesis. The chapter closes with a discussion of language modeling in the context of speech recognition.
This chapter sets out to (i) explain what is currently known about the causes of and reasons for spelling errors; (ii) introduce the main techniques for the separate but related tasks of nonword error detection, isolated-word spelling correction, and real-word spelling correction; and (iii) introduce the linguistic representations that are currently used to perform grammar correction – including a lengthy discussion of syntax – and explain the techniques employed. The chapter describes classical computational techniques, such as dynamic programming for calculating edit distance between words. It concludes with a discussion of advances in technology for applying spelling correction to newer contexts, such as web queries.
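As a rough illustration of the dynamic programming idea mentioned above, the following sketch computes the standard Levenshtein edit distance, in which each insertion, deletion, or substitution of a single character costs 1. The function name and the unit costs are assumptions made for this sketch; the formulation used in the chapter itself may differ (for example, in how transposed letters are treated).

```python
def edit_distance(source, target):
    """Levenshtein distance, computed row by row with dynamic programming."""
    # previous[j] holds the distance between the source prefix processed
    # so far and the first j characters of the target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning i source characters into "" takes i deletions
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


print(edit_distance("kitten", "sitting"))   # 3
print(edit_distance("recieve", "receive"))  # 2: the swapped letters count as two substitutions
```

A spell checker could use such a distance to rank candidate corrections, preferring dictionary words that are only a few edits away from the misspelling.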
In this chapter, we seek to (i) introduce some fundamentals of first and second language acquisition and the relevance of language awareness for the latter; (ii) explain how computer-assisted language learning (CALL) tools can offer feedback for exercises without encoding anything about language in general; (iii) motivate and exemplify that the space of well-formed and ill-formed variation arising in language use often is well beyond what can be captured in a CALL tool; (iv) introduce the idea that the need for linguistic abstraction and generalization is addressed in tokenization and part-of-speech tagging as two fundamental NLP processing steps, that even such basic steps can be surprisingly difficult, and how part-of-speech classes are informed by different types of empirical evidence; (v) motivate the need for analysis beyond the word level and syntactic generalizations; and (vi) showcase what a real-life intelligent language tutoring system looks like and how it makes use of linguistic analysis for both analysis and feedback. The chapter ends with a section discussing how in addition to the context of language use and linguistic analysis, learner modeling can play an important role in tutoring systems.
To cover the task of searching, the goals of this chapter are (i) to outline the contexts and types of data in which people search for information (structured, unstructured, and semi-structured data), emphasizing the concept of one’s information need; (ii) to provide ways to think about the evaluation and improvement of one’s search results; (iii) to cover the important concept of regular expressions and the corresponding machinery of finite-state automata; and (iv) to delve into linguistic corpora, illustrating a search for linguistic forms instead of for content. The middle of the chapter provides more in-depth discussion of web searching, including how webpages are indexed and how the PageRank algorithm is used to determine relevance for a query.
This chapter aims to (i) explain the idea of classifiers and machine learning; (ii) introduce the Naive Bayes and Perceptron classifiers; (iii) give basic information about how to evaluate the success of a machine learning system; and (iv) explain the applications of machine learning to junk-mail filtering and to sentiment analysis. The chapter concludes with advice on how to select a machine learning algorithm and a discussion of how this plays out for a consulting company employing sentiment analysis as part of an opinion-tracking application designed to be used by corporate customers.
The goals of this chapter are (i) to introduce the idea of dialog systems; (ii) to describe some of the ways in which researchers have conceptualized dialog, including dialog moves, speech acts, and conversational maxims; (iii) to show some of the ways of classifying dialog systems according to their purpose and design; and (iv) to illustrate how to measure the performance of dialog systems. We spend some time discussing the difficulties in automating dialog and illustrate this with the example of the early dialog system Eliza.
Starting from the general idea of what it means to translate, in this chapter we (i) introduce the idea of machine translation (MT) and explain its capabilities and limits; (ii) indicate the differences between direct MT systems, transfer systems, and modern statistical methods; and (iii) set machine translation in its business context, emphasizing the idea of a translation need. The chapter includes extended discussion of IBM’s Model 1 translation model and of the Noisy Channel Model as applied to translation. It also discusses the translation needs of the European Union and those of the Canadian Meteorological Service, and contrasts them with the very difficult requirements for a satisfactory translation of a Shakespeare sonnet. The chapter concludes with a discussion of the likely consequences of automated translation for the career prospects and training choices of human translators.
The final chapter takes a look at the impact of language technology on society and human self-perception, as well as some of the ethical issues involved. We raise questions about how computers and language technology change the way information can be accessed and what this means for a democratic society in terms of control of information and privacy, how this changes learning and teaching as well as our jobs through upskilling and deskilling, and the impact on human self-perception when faced with machines capable of communicating using language. The goal of the chapter is to raise awareness of such issues arising through the use of computers and language technology in the context of real life.
A typical way to use the material in this book is in a quarter-length course assuming no mathematical or linguistic background beyond normal high-school experience. For this kind of course, instructors may choose to cover only a subset of the chapters. Each chapter is a stand-alone package, with no strict dependencies between the topics. We have found this material to be accessible to the general student population at the Ohio State University (OSU), where we originally developed the course.
To support the use of the book for longer or more advanced courses, the book also includes Under the Hood sections providing more detail on selected advanced topics, along with development of related analytical and practical skills. This kind of use is more appropriate as part of a Linguistics, Computer Science, or Communications major, or as an overview course at the nonspecialist graduate level. The Under the Hood topics have been useful in semester-length courses at Georgetown University and Indiana University, as well as honors versions at OSU.
Accompanying the book there is a website containing course materials, such as presentation slides, at http://purl.org/lang-and-comp/teaching.
This book grew out of a course that was offered for the first time in the winter quarter of 2004 at the Ohio State University (OSU): Linguistics 384, Language and Computers. A special thanks to the chair of the Department of Linguistics at the time, Peter Culicover, as well as the Director of Undergraduate Studies in Linguistics, Beth Hume, for having the foresight to recognize the potential for such a course and supporting its development and approval as a general education requirement course at OSU.
There would not be a book were it not for Danielle Descoteaux, who heard about the course and our ideas to turn it into a book and encouraged us to realize this with Wiley-Blackwell. Her sustained enthusiasm for the project made sure we stayed on the ball. We are also very grateful to our project editor Julia Kirk, who continued to be supportive and friendly despite our slow progress.
Drafts of this book benefited significantly from feedback on the course at OSU, as well as similar courses taught at Georgetown University (GU) and Indiana University (IU). The instructors shared many good ideas and pointers to relevant materials, for which we would like to thank particularly Stacey Bailey, Xiaofei Lu (who also provided the Chinese characters for this book), Anna Feldman, DJ Hovermale, Jon Dehdari, Rajakrishnan Rajkumar, Michael White, Sandra Kübler, Ross Israel, and the computational linguistics community at our universities for encouragement and neat ideas for the course and book. Thanks also to Lwin Moe for providing figures of the Burmese writing system; to Tony Meyer and Ayelet Weiss for help on the Hebrew examples; and to Wes Collins for providing Mam examples in chapter 7.
We wish to specially acknowledge Jason Baldridge at the University of Texas, who has continually tested book chapters, provided insightful suggestions for the book and associated courses, and diligently encouraged us to get the book completed.
While it would take too long to name them individually, the students at OSU, GU, and IU who took these courses have been a joy to teach, and their feedback on particular exercises and requests for clarifications on material have definitely made the book better.
A number of people read drafts or partial drafts and provided useful comments. Thanks to Amber Smith for her comments and discussion on integrating the book material into a real course, to Johannes Widmann for his comments on the Language Tutoring Systems chapter, and Keelan Evanini and Michael Heilman at Educational Testing Service (ETS) for extensive and extremely helpful comments on every aspect of the book, from the structure of the chapters through the best way to talk about neural networks to correcting typographical mistakes (including the especially awkward ones in the sentences advocating the use of spell checkers). Sheena Phillips, Jason Quinley, and Christl Glauder also helped us improve the book by carefully proofreading the final version - thanks!
Speaking individually, Markus Dickinson would like to thank Stephanie Dickinson for her encouragement and support during the last few years of this project, and also for her willingness to discuss specificity, sensitivity, and other classification metrics at the dinner table. Lynn Weddle deserves credit for responding to a Facebook post and suggesting Bart and Lisa Simpson for an example in the Searching chapter – although she had no idea it had to do with a book or would lead to an acknowledgment.
Chris Brew thanks Sheena Phillips for everything, and specifically for being exactly the right person to answer the question: “Is this too British?” Matthew and Jamie Brew gave helpful advice on the design of the cover, pointing out things that the older generation just did not see.
Detmar Meurers would like to thank Markus and Chris for the excellent collaboration and being such reliable colleagues and friends, Walter Meurers for emphasizing the importance of connecting research issues with the real world, and Kordula De Kuthy, Marius, Dario, and Cora for being around to remind him that life has a meaning beyond deadlines.
One of the aims of this book is to introduce you to different ways that computers are able to process natural language. To appreciate this task, consider how difficult it is to describe what happens when we use language. As but one example, think about what happens when a friend points at a book and says: “He’s no Shakespeare!”. First of all, there is the difficulty of determining who is meant by “he”. Your friend is pointing at a book, not at a person, and although we can figure out that “he” most likely refers to the author of the book, this is not obvious to a computer (and sometimes not obvious to the person you are talking with). Secondly, there is the difficulty of knowing who “Shakespeare” is. Shakespeare is the most famous writer in the English language, but how does a computer know that? Or, what if your friend had said “He’s no Lessing!”? English majors with an interest in science-fiction or progressive politics might take this as a reference to Doris Lessing; students of German literature might suspect a comparison to G.E. Lessing, the elegant Enlightenment stylist of German theater; but in the absence of background knowledge, it is hard to know what to make of this remark.
Finally, even if we unpack everything else, consider what your friend’s statement literally means: the author of this book is not William Shakespeare. Unless there is a serious possibility that the book was written by Shakespeare, this literal meaning is such a crushingly obvious truth that it is difficult to see why anyone would bother to express it. In context, however, we can tell that your friend is not intending to express the literal meaning, but rather to provide a negative evaluation of the author relative to Shakespeare, who is the standard benchmark for good writing in English. You could do the same thing for a slim book of mystical poetry by saying “She’s no Dickinson!”, provided the hearer was going to understand that the reference was to American poet Emily Dickinson.
Or consider a different kind of statement: “I’m going to the bank with a fishing pole.” Most likely, this means that the speaker is going to a river bank and is carrying a fishing pole. But it could also mean that the speaker is going to a financial institution, carrying a fishing pole, or it could mean that the speaker is going to a financial institution known for its fishing pole – or even that the river bank the speaker is going to has some sort of notable fishing pole on it. We reason out a preferred meaning based on what we know about the world, but a computer does not know much about the world. How, then, can it process natural language?
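One way to see why this sentence is hard for a computer is to count its readings. The four interpretations listed above arise from two independent choices: which sense of “bank” is meant, and whether the fishing pole belongs with the speaker or with the bank. The sketch below merely enumerates those combinations; the labels are our own paraphrases, and a real system would of course need far richer representations.

```python
from itertools import product

# Two senses of the word "bank" (labels are illustrative paraphrases).
BANK_SENSES = ["river bank", "financial institution"]

# Two ways to attach "with a fishing pole".
POLE_ATTACHMENTS = [
    "the speaker is carrying a fishing pole",
    "the bank is the one with a fishing pole",
]


def readings():
    """Combine every word sense with every attachment choice."""
    return [
        f"going to a {sense}; {attachment}"
        for sense, attachment in product(BANK_SENSES, POLE_ATTACHMENTS)
    ]


for reading in readings():
    print(reading)
```

Each additional ambiguous word or phrase multiplies the number of candidate readings, which is why a computer cannot simply list them all and needs some way of preferring one.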
From the other side of things, let us think for a moment about what you may have observed a computer doing with natural language. When you get a spam message, your email client often is intelligent enough to mark it as spam. Search for a page in a foreign language on the internet, and you can get an automatic translation, which usually does a decent job informing you as to what the site is about. Your grammar checker, although not unproblematic, is correct a surprising amount of the time. Look at a book’s listing on a site that sells books, like Amazon, and you may find automatically generated lists of keywords; amazingly, many of these words and phrases seem to give a good indication of what the book is about.
If language is so difficult, how is it that a computer can “understand” what spam is, or how could it possibly translate between two languages, for example from Chinese to English? A computer does not have understanding, at least in the sense that humans do, so we have to wonder what technology underlies these applications. It is these very issues that we delve into in this book.
There is a fundamental issue that must be addressed here before we can move on to talking about various applications. When a computer looks at language, what is it looking at? Is it simply a variety of strokes on a piece of paper, or something else? If we want to do anything with language, we need a way to represent it.
This chapter outlines the ways in which language is represented on a computer; that is, how language is encoded. It thus provides a starting point for understanding the material in the rest of the chapters.
If we think about language, there are two main ways in which we communicate – and this is true of our interactions with a computer, too. We can interact with the computer by writing or reading text or by speaking or listening to speech. In this chapter, we focus on the representations for text and speech, while throughout the rest of the book we focus mainly on processing text.
If we only wanted to represent the 26 letters of the English alphabet, our task would be fairly straightforward. But we want to be able to represent any language in any writing system, where a writing system is “a system of more or less permanent marks used to represent an utterance in such a way that it can be recovered more or less exactly without the intervention of the utterer” (Daniels and Bright, 1996).
And those permanent marks can vary quite a bit in what they represent. We will look at a basic classification of writing systems into three types: alphabetic, syllabic, and logographic systems. There are other ways to categorize the world’s writing systems, but this classification is useful in that it will allow us to look at how writing systems represent different types of properties of a language by means of a set of characters. Seeing these differences should illustrate how distinct a language is from its written representation and how the written representation is then distinct from the computer’s internal representation (see Section 1.3).
For writing English, the idea is that each letter should correspond to an individual sound, more or less, but this need not be so (and it is not entirely true in English). Each character could correspond to a series of sounds (e.g., a single character for str), but we could also go in a different direction and have characters refer to meanings. Thus, we could have a character that stands for the meaning of “dog”. Types of writing systems vary in how many sounds a character represents or to what extent a meaning is captured by a character. Furthermore, writing systems differ in whether they even indicate what a word is, as English mostly does by including spaces; we will return to this issue of distinguishing words in Section 3.4.
One important point to remember is that these are systems for writing down a language; they are not the language itself. The same writing system can be used for different languages, and the same language in principle could be written down in different writing systems (as is the case with Japanese, for example).
We start our tour of writing systems with what should be familiar to any reader of English: alphabets. In alphabetic systems, a single character refers to a single sound. As any English reader knows, this is not entirely true, but it gives a good working definition.
We will look at two types of alphabetic systems. First, there are the alphabets, or phonemic alphabets, which represent all sounds with their characters; that is, both consonants and vowels are represented. Many common writing systems are alphabets: Etruscan, Latin, Cyrillic, Runic, and so forth. Note that English is standardly written in the Latin, or Roman, alphabet, although we do not use the entire repertoire of available characters, such as those with accents (e.g., è) or ligatures, combinations of two or more characters, such as the German ß, which was formed from two previous versions of s.
As an example of an alphabet other than Latin, we can look at Cyrillic, shown in Figure 1.1. This version of the alphabet is used to write Russian, and slight variants are used for other languages (e.g., Serbo-Croatian). Although some characters correspond well to English letters, others do not (e.g., the letter for [n]). The characters within brackets specify how each letter is said – that is, pronounced; we will return to these in the discussion of phonetic alphabets later on.
Figure 1.1 The Cyrillic alphabet used for Russian
Some alphabets, such as the Fraser alphabet used for the Lisu language spoken in Myanmar, China, and India, also include diacritics to indicate properties such as a word’s tone (how high or low pitched a sound is). A diacritic is added to a regular character, for example a vowel, indicating in more detail how that sound is supposed to be realized. In the case of Fraser, for example, M: refers to an [m] sound (written as M), which has a low tone (written as :).
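The idea of a diacritic as a mark added to a regular character can also be observed on a computer. The sketch below uses Python’s standard `unicodedata` module to split the accented Latin letter è, mentioned earlier, into a base letter plus a separate combining mark. This relies on Unicode, which has not been introduced at this point, so treat it as a preview under that assumption; it says nothing about how the Fraser tone marks themselves are encoded.

```python
import unicodedata


def split_diacritics(text):
    """Decompose text and report each resulting character.

    Returns (character, Unicode name, is_combining_mark) triples.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return [
        (ch, unicodedata.name(ch), unicodedata.combining(ch) != 0)
        for ch in decomposed
    ]


# U+00E8 is the single precomposed character "è".
for ch, name, is_mark in split_diacritics("\u00e8"):
    kind = "diacritic" if is_mark else "base character"
    print(f"{name}: {kind}")
```

Here the single visible character turns out to be representable as two parts, a plain e and a grave accent, mirroring the way the diacritic refines how the base sound is to be realized.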
Our second type of alphabetic system also often employs diacritics. Abjads, or consonant alphabets, represent consonants only; some prime examples are Arabic, Aramaic, and Hebrew. In abjads, vowels generally need to be deduced from context, as is illustrated by the Hebrew word for “computer”, shown on the left-hand side of Figure 1.2.
Figure 1.2 Example of Hebrew (abjad) text
The Hebrew word in its character-by-character transliteration bšxm contains no vowels, but context may indicate the [a] and [e] sounds shown in the pronunciation of the word [maxʃev]. (Note that Hebrew is written right to left, so the m as the rightmost character of the written word is the first letter pronounced.) As shown in the middle and right-hand side of Figure 1.2, the context could also indicate different pronunciations with different meanings.
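The difference between the order in which letters are displayed and the order in which they are pronounced carries over to the computer. In the sketch below, which assumes the Unicode encoding of Hebrew and uses Python’s standard `unicodedata` module, the word is stored with the first letter pronounced (m) at the start of the string, and each letter carries a property marking it as right-to-left so that display software can lay it out accordingly. The escape codes are used only to keep the code listing readable in left-to-right text.

```python
import unicodedata

# The four consonants of the Hebrew word for "computer", in the order
# they are pronounced: m, x, sh, b (mem, het, shin, bet).
word = "\u05de\u05d7\u05e9\u05d1"


def describe(text):
    """Return (Unicode name, bidirectional class) for each character."""
    return [(unicodedata.name(ch), unicodedata.bidirectional(ch)) for ch in text]


for name, direction in describe(word):
    # "R" is the Unicode class for strongly right-to-left characters.
    print(name, direction)
```

Note that the stored word consists of four consonant letters only; as in the written form, no vowels are present.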
The situation with abjads often is a little more complicated than the one we just described, in that characters sometimes represent selected vowels, and often vowel diacritics are available.
As we have discussed, alphabets use letters to encode sounds. However, there is not always a simple correspondence between a word’s spelling and its pronunciation. To see this, we need look no further than English.
English has a variety of non-letter–sound correspondences, which you probably labored through in first grade. First of all, there are words with the same spellings representing different sounds. The string ough, for instance, can be pronounced at least five different ways: “cough”, “tough”, “through”, “though”, and “hiccough”. Letters are not consistently pronounced, and, in fact, sometimes they are not pronounced at all; this is the phenomenon of silent letters. We can readily see these in “knee”, “debt”, “psychology”, and “mortgage”, among others. There are historical reasons for these silent letters, which were by and large pronounced at one time, but the effect is that we now have letters we do not speak.
Aside from inconsistencies of pronunciation, another barrier to the letter–sound correspondence is that English has certain conventions where one letter and one sound do not cleanly map to one another. In this case, the mapping is consistent across words; it just uses more or fewer letters to represent sounds. Single letters can represent multiple sounds, such as the x in “tax”, which corresponds to a k sound followed by an s sound. And multiple letters can consistently be used to represent one sound, as in the th in “the” or the ti in “revolution”.
Finally, we can have alternate spellings for the same word, such as “doughnut” and “donut”, and homophones show us different words that are spelled differently but spoken the same, such as “colonel” and “kernel”.
Of course, English is not the only language with quirks in the letter–sound correspondences in its writing system. Looking at the examples in Figure 1.3 for Irish, we can easily see that each letter does not have an exact correspondent in the pronunciation.
Figure 1.3 Some Irish expressions
The issue we are dealing with here is that of ambiguity in natural language, in this case a letter potentially representing multiple possible sounds. Ambiguity is a recurring issue in dealing with human language that you will see throughout this book. For example, words can have multiple meanings (see Chapter 2); search queries can have different, often unintended meanings (see Chapter 4); and questions take on different interpretations in different contexts (see Chapter 6). In this case, writing systems can be designed that are unambiguous; phonetic alphabets, described next, have precisely this property.
You have hopefully noticed the notation used within the brackets ([]). The characters used there are a part of the International Phonetic Alphabet (IPA). Several special alphabets for representing sounds have been developed, and probably the best known among linguists is the IPA. We have been discussing problems with letter–sound correspondences, and phonetic alphabets help us discuss these problems, as they allow for a way to represent all languages unambiguously using the same alphabet.
Each phonetic symbol in a phonetic alphabet is unambiguous: the alphabet is designed so that each speech sound (from any language) has its own symbol. This eliminates the need for multiple symbols being used to represent simple sounds and one symbol being used for multiple sounds. The problem for English is that the Latin alphabet, as we use it, only has 26 letters, but English has more sounds than that. So, it is no surprise that we find multiple letters like th or sh being used for individual sounds.
The IPA, like most phonetic alphabets, is organized according to the articulatory properties of each sound, an issue to which we return in Section 1.4.2. As an example of the IPA in use, we list some words in Figure 1.4 that illustrate the different vowels in English.
Figure 1.4 Example words for English vowels (varies by dialect)
At http://purl.org/lang-and-comp/ipa you can view an interactive IPA chart, provided by the University of Victoria’s Department of Linguistics. Most of the English consonants are easy to figure out, e.g., [b] in “boy”, but some are not obvious. For example, [θ] stands for the th in “thigh”; [ð] for the th in “thy”; and [∫] for the sh in “shy”.
Syllabic systems are like alphabetic systems in that they involve a mapping between characters and sounds, but the units of sound are larger. The unit in question is called the syllable. All human languages have syllables as basic building blocks of speech, but the rules for forming syllables differ from language to language. For example, in Japanese a syllable consists of a single vowel, optionally preceded by at most one consonant, and optionally followed by [m], [n], or [ŋ]. Most of the world’s languages, like Japanese, have relatively simple syllables. This means that the total number of possible syllables in the language is quite small, and that syllabic writing systems work well. But in English, the vowel can also be preceded by a sequence of several consonants (a so-called consonant cluster), and there can also be a consonant cluster after the vowel. This greatly expands the number of possible syllables. You could design a syllabic writing system for English, but it would be unwieldy and difficult to learn, because there are so many different possible syllables.
There are two main variants of syllabic systems, the first being abugidas (or alphasyllabaries). In these writing systems, the symbols are organized into families. All the members of a family represent the same consonant, but they correspond to different vowels. The members of a family also look similar, but have extra components that are added in order to represent the different vowels. What is distinctive about an abugida is that this process is systematic, with more or less the same vowel components being used in each family.
To write a syllable consisting of a consonant and a vowel, you go to the family for the relevant consonant, then select the family member corresponding to the vowel that you want. This works best for languages in which almost all syllables consist of exactly one consonant and exactly one vowel. Of course, since writing is a powerful technology, this has not stopped abugidas from being used, with modifications, to represent languages that do not fall into this pattern. One of the earliest abugidas was the Brahmi script, which was in wide use in the third century BCE and which forms the basis of many writing systems used on the Indian subcontinent and its vicinity.
As an example, let us look at the writing system for Burmese (or Myanmar), a Sino-Tibetan language spoken in Burma (or Myanmar). In Figure 1.5, we see a table displaying the base syllables.
Figure 1.5 Base syllables of the Burmese abugida
As you can see in the table, every syllable has a default vowel. This default vowel can be changed by adding diacritics, as shown in Figure 1.6 for syllables that start with [k]. We can see that the base character remains the same in all cases, while diacritics indicate the vowel change. Even though there is some regularity, the combination of the base character plus a diacritic results in a single character, which distinguishes abugidas from the alphabets in Section 1.2.1. Characters are written from left to right in Burmese, but the diacritics appear on any side of the base character.
Figure 1.6 Vowel diacritics of the Burmese abugida
The second kind of syllabic system is the syllabary. These systems use distinct symbols for each syllable of a language. An example syllabary for Vai, a Niger-Congo language spoken in Liberia, is given in Figure 1.7 (http://commons.wikimedia.org/wiki/Category:Vai-script).
An abugida is a kind of syllabary, but what is distinctive about a general syllabary is that the syllables need not be organized in any systematic way. For example, in Vai, it is hard to see a connection between the symbols for [pi] and [pa], or any connection between the symbols for [pi] and [di].
The final kind of writing system to examine involves logographs, or logograms. A logograph is a symbol that represents a unit of meaning, as opposed to a unit of sound. It is hard to speak of a true logographic writing system because, as we will see, a language like Chinese that uses logographs often also includes phonetic information in the writing system.
To start, we can consider some non-linguistic symbols that you may have encountered before. Figure 1.8, for example, shows symbols found on US National Park Service signs (http://commons.wikimedia.org/wiki/File:National-Park-Service-sample-pictographs.svg). These are referred to as pictographs, or pictograms, because they essentially are pictures of the items to which they refer. In some sense, this is the simplest way of encoding semantic meaning in a symbol. The upper left symbol, for instance, refers to camping by means of displaying a tent.
Some modern systems evolved from a more pictographic representation into a more abstract symbol. To see an example of such character change, we can look at the development of the Chinese character for “horse”, as in Figure 1.9 (http://commons.wikimedia.org/wiki/Category:Ancient-Chinese-characters).
Figure 1.7 The Vai syllabary
Originally, the character very much resembled a horse, but after evolving over the centuries, the character we see now only bears a faint resemblance to anything horse-like.
Figure 1.8 US National Park Service symbols (pictographs)
Figure 1.9 The Chinese character for “horse”
Figure 1.10 Semantic–Phonetic Compounds used in writing Chinese
There are characters in Chinese that prevent us from calling the writing system a fully meaning-based system. Semantic-phonetic compounds are symbols with a meaning element and a phonetic element. An example is given in Figure 1.10, where we can see that, although both words are pronounced the same, they have different meanings depending on the semantic component. Of course, it is not a simple matter of adding the phonetic and semantic components together: knowing that the meaning component of a semantic-phonetic compound is “wood” by itself does not tell you that the meaning of the compound is “timber”.
In addition to writing systems making use of characters differentiated by the shape and size of different marks, there are other writing systems in existence that exploit different sensory characteristics.
Perhaps best known is the tactile system of Braille. Braille is a writing system that makes it possible to read and write through touch, and as such it is primarily used by the blind or partially blind. We can see the basic alphabet in Figure 1.11 (http://commons.wikimedia.org/wiki/File:Braille-alfabet.jpg). The Braille system works by using patterns of raised dots arranged in cells of up to six dots, in a 3 × 2 configuration. Each pattern represents a character, but some frequent words and letter combinations have their own pattern. For instance, the pattern for f also indicates the number 6 and the word “from”. So, even though it is at core an alphabet, it has some logographic properties.
Figure 1.11 The Braille alphabet
An interesting case is the chromatographic writing system supposedly used by the Benin and Edo people in southern Nigeria (http://purl.org/lang-and-comp/chroma). This system is based on different color combinations and symbols. We have some reservations in mentioning this system, as details are difficult to obtain, but in principle both color and shape can encode pronunciation.
As we mentioned before, there is no simple correspondence between a writing system and a language. We will look at two examples, Korean and Azeri, which will highlight different aspects of the unique ways languages are written.
The writing system for Korean is a hybrid system, employing both alphabetic and syllabic concepts. The writing system is actually referred to as Hangul (or Hangeul) and was developed in 1444 during the reign of King Sejong. The Hangul system contains 24 letter characters, 14 consonants and 10 vowels. But when the language is written down, the letters are grouped together into syllables to form new characters. The letters in a syllable are not written separately as in the English system, but together form a single character. We can see an example in Figure 1.12 (http://commons.wikimedia.org/wiki/File:Hangeul.png), which shows how individual alphabetic characters together form the syllabic characters for “han” and “geul”. The letters are not in a strictly left-to-right or top-to-bottom pattern, but together form a unique syllabic character. Additionally, in South Korea, hanja (logographic Chinese characters) are also used.
Figure 1.12 Composition of the characters for “Hangeul”
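The way letters combine into syllable characters can also be observed on a computer. The following is a minimal sketch in Python, assuming the standard `unicodedata` module and the Unicode encoding that is introduced later in this chapter; it is an illustration only, not a description of how Korean text is typically typed or stored. It splits each syllable character of “Hangeul” into its component letters and then puts them back together.

```python
import unicodedata

word = "한글"  # the two syllable characters "han" and "geul"

for syllable in word:
    # Canonical decomposition (NFD) splits a syllable character into its letters.
    letters = unicodedata.normalize("NFD", syllable)
    print(syllable, "is built from", len(letters), "letters:")
    for letter in letters:
        print("   ", unicodedata.name(letter))

# Canonical composition (NFC) groups the letters back into syllable characters.
decomposed = unicodedata.normalize("NFD", word)
recomposed = unicodedata.normalize("NFC", decomposed)
print(len(word), len(decomposed), recomposed == word)
```

Each of the two syllables here consists of three letters, so the decomposed form contains six letters in total, even though a reader sees only two characters.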
Azeri is a language whose history illustrates the distinction between a language and its written encoding. Azeri is spoken in Azerbaijan, northwest Iran, and Georgia, and up until the 1920s it was written in different Arabic scripts. In 1929, however, speakers were forced to switch to the Latin alphabet for political reasons. In 1939, it was decided to change to the Cyrillic alphabet, to bring Azeri more in line with the rest of the Soviet Union. After the fall of the USSR in 1991, speakers went back to the Latin alphabet, although with some minor differences from when they had used it before. Azeri is thus a single language that has been written in many ways.
Given the range of writing systems, we now turn to the question of how to encode them on a computer. But to address that, we have a more fundamental question: How do we encode anything on a computer?
To answer that, we need to know that information on a computer is stored in bits. We can think of the memory of a computer as, at its core, a large number of on–off switches. A bit has two possible values, 1 (yes) or 0 (no), allowing us to flip the switches on or off. A single bit on its own does not convey much information, but multiple bits can come together to make meaningful patterns. It is thus often more convenient to speak of a byte, or a sequence of 8 bits, e.g., 01001010.
These sequences of bits tell the computer which switches are on and which are off, and – in the context of writing systems – a particular character will have a unique pattern of on–off switches. Before we fully spell that out, though, let us consider a better way to think of sequences of bits, other than just a sequence of mindless 0s and 1s.
Bit sequences are useful because they can represent numbers, in so-called binary notation. They are called binary because there are only two digits to work with. The base ten numbers we normally use have columns for ones, tens, hundreds, and so on; likewise, binary numbers have their own columns, for ones, twos, fours, eights, and so on. In addition to base two and base ten, there are encodings such as hexadecimal, where there are 16 digits (0–9 and then the letters A–F).
In Big Endian notation, the most significant bit is the leftmost one; this is the standard way of encoding and is parallel to decimal (base ten) numbers. The positions in a byte thus encode the top row of Figure 1.13. As we can see in the second row of the figure, the positions for 64, 8, and 2 are “on”, and 64 + 8 + 2 equals 74. The binary (base two) number 01001010 therefore corresponds to the decimal number 74.
Figure 1.13 Example of Big Endian notation for binary numbers
Little Endian notation is just the opposite, where the most significant bit is the rightmost one, but it is less common. In both cases, the columns are all powers of two. This is just like with decimal numbers, where the columns are all powers of ten. As each digit is here limited to either 0 or 1 (two choices), we have to use powers of two.
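To make the column idea concrete, here is a minimal sketch in Python (the function name `binary_to_decimal` is our own, chosen for illustration). It reads a bit string with the most significant bit on the left and adds up the column values of the bits that are switched on.

```python
def binary_to_decimal(bits: str) -> int:
    """Add up the column values (powers of two) of every bit that is 'on'."""
    total = 0
    for position, bit in enumerate(bits):
        # The leftmost bit is the most significant one, so in an 8-bit
        # string the columns are worth 128, 64, 32, 16, 8, 4, 2, 1.
        column_value = 2 ** (len(bits) - 1 - position)
        if bit == "1":
            total += column_value
    return total


print(binary_to_decimal("01001010"))  # 64 + 8 + 2 = 74
print(int("01001010", 2))             # Python's built-in conversion agrees: 74
print(hex(74))                        # the same number in hexadecimal: 0x4a
```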
Although many of you are likely already familiar with binary numbers, it is instructive to see how to convert from decimal to binary notation. We will consider the division method of conversion and walk through an example, converting the decimal number 9 into a 4-bit binary number.
The division method is easy to calculate and moves from the least significant to the most significant bit. Because every column has a value that is a power of 2, we divide by 2 with every step. In Figure 1.14, for example, we divide 9 by 2 and find that we have a remainder. A remainder after dividing by 2 means that we started with an odd number. Since 9 is odd, the rightmost bit should be 1.
Figure 1.14 The division method
The trick now is to take the resulting value, in this case 4, and divide it by 2. The same principle is at work here: if there is no remainder, it means that the starting number (4) was even, and this bit needs to be switched off for that to happen.
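The whole procedure can be sketched in a few lines of Python. This is one possible rendering of the division method, with a function name and a `width` parameter of our own choosing: it repeatedly divides by 2, records each remainder as the next bit, and pads the result with leading zeros.

```python
def decimal_to_binary(number: int, width: int = 4) -> str:
    """Convert a non-negative integer to binary using the division method."""
    bits = []
    while number > 0:
        # The remainder tells us whether the current number is odd (1)
        # or even (0); the quotient is carried on to the next step.
        number, remainder = divmod(number, 2)
        bits.append(str(remainder))
    # Bits were collected from least to most significant, so reverse them,
    # then pad on the left with zeros up to the requested width.
    return "".join(reversed(bits)).rjust(width, "0")


print(decimal_to_binary(9))            # 1001
print(decimal_to_binary(74, width=8))  # 01001010
```

For 9, the successive steps are 9 ÷ 2 = 4 remainder 1, 4 ÷ 2 = 2 remainder 0, 2 ÷ 2 = 1 remainder 0, and 1 ÷ 2 = 0 remainder 1, which read from last to first gives 1001.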
With 8 bits (a single byte) and each byte storing a separate character, we can represent 256 different characters (= 2⁸). This is sufficient for many applications and more than enough for anyone wishing simply to type in Latin characters for English. With 256 possible characters, we can store every single letter used in English, plus all the auxiliary characters such as the comma, the space, the percent sign, and so on.
One of the first encodings for storing English text used only 7 bits, thus allowing for 128 possible characters. This is the ASCII encoding, the American Standard Code for Information Interchange. We can see most of the ASCII chart in Figure 1.15.
Figure 1.15 The ASCII chart
Omitted from the chart are codes 1–31, since these are used for control characters, such as a backspace, line feed, or tab. A nice property is that the numeric order reflects alphabetic ordering (e.g., 65 through 90 for uppercase letters). Thus, we can easily alphabetize the letters by comparing numbers. Although we have written the base ten number, for ease of reading, the binary number is what is used internally by the computer.
You might already be familiar with ASCII or other character-encoding systems, as many communications over email and the internet inform you of different encodings. Emails come with lots of information about themselves. Specifically, Multipurpose Internet Mail Extensions (MIME) provide meta-information on the text, or information that is part of the regular message but also tells us something about that message. MIME information tells us, among other things, what the character set is; an example can be seen in Figure 1.16.
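As a small illustration, the following Python sketch uses the built-in functions `ord` and `chr` to move between characters and their numeric codes. The sample characters and words are our own; for these characters, the codes Python reports coincide with the ASCII codes.

```python
# Look up the numeric code behind a few characters and show it as 7 bits.
for character in ["A", "Z", "a", "z", "%"]:
    code = ord(character)
    print(character, code, format(code, "07b"))

# Going the other way: the bit pattern 01001010 is the number 74,
# which is the code for the letter J.
decoded = chr(0b01001010)
print(decoded)

# Sorting strings compares the underlying codes, character by character,
# which is why lowercase words come out in alphabetical order.
words = ["pear", "apple", "fig"]
alphabetized = sorted(words)
print(alphabetized)
```

One consequence worth keeping in mind is that all uppercase letters (65 through 90) have smaller codes than all lowercase letters (97 through 122), so a purely numeric comparison places “Zebra” before “apple”.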
Figure 1.16 MIME example
We have just mentioned ASCII and that there are other encoding systems, and, as you may recall, one of our goals is to be able to encode any language. With only 128 possible characters, ASCII clearly is insufficient for encoding the world’s writing systems. How, then, do we go about encoding writing systems other than the Latin alphabet?
One approach is simply to extend the ASCII system with various other systems. For example, ISO-8859-1 is an 8-bit encoding that in addition to ASCII includes extra letters needed for French, German, Spanish, and related languages; ISO-8859-7 is for the Greek alphabet; ISO-8859-8 for the Hebrew alphabet; and JIS-X-0208 encodes Japanese characters. While multiple encoding systems make it possible to specify only the writing systems one wants to use, there are potential problems. First, there is always the possibility of misidentification. Two different encodings can use the same number for two different characters or, conversely, different numbers for the same character. If an encoding is not clearly identified and needs to be guessed, for example by a web browser displaying a web page that does not specify the encoding explicitly, the wrong characters will be displayed. Secondly, it is a hassle to install and maintain many different systems in order to deal with various languages.
At this point, we should consider the situation: Unicode allows for over four billion characters, yet only needs about 100,000. If we use 32 bits to encode every character, that will take up a lot of space. It seems as if ASCII is better, at least for English, as it only takes 7 bits to encode a character. Is there any way we can allow for many characters, while at the same time only encoding what we really need to encode?
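The misidentification problem is easy to reproduce. The sketch below, in Python, takes one and the same byte and interprets it under two of the encodings mentioned above; the particular byte value is our own choice for illustration.

```python
# A single byte with the numeric value 225 (hexadecimal E1).
raw = bytes([0xE1])

# The same number stands for different characters in different encodings.
as_latin1 = raw.decode("iso-8859-1")  # Western European: á
as_greek = raw.decode("iso-8859-7")   # Greek: α
print(as_latin1, as_greek)

# If Greek text is saved as ISO-8859-7 but read back as ISO-8859-1,
# the reader sees the wrong character.
misread = "α".encode("iso-8859-7").decode("iso-8859-1")
print(misread)
```

Nothing in the byte itself says which interpretation is intended, which is why the encoding has to be declared somewhere outside the text, for example in the MIME information of an email.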
This raises the question: How is it possible to encode 2³² possibilities in 8 bits, as UTF-8 does? The answer is that UTF-8 can use several bytes to represent a single character if it has to, but it encodes characters with as few bytes as possible by using the highest (leftmost) bit as a flag. If the highest bit is 0, then the byte is a complete single-byte character. For example, 01000001 is the single-byte code for A (i.e., 65). If the highest bit is 1, then the byte is part of a multi-byte character. In this way, sequences of bytes can unambiguously denote sequences of Unicode characters. One nice consequence of this set-up is that ASCII text is already valid UTF-8.
More details on the encoding mechanism for UTF-8 are given in Figure 1.17. An important property here is that the first byte unambiguously tells you how many bytes to expect after it. If the first byte starts with 11110xxx, for example, we know that with four 1s, it has a total of four bytes; that is, there are three more bytes to expect. Note also that all nonstarting bytes begin with 10, indicating that they are not the initial byte.
Figure 1.17 UTF–8 encoding scheme
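The property that the first byte announces the length of the sequence can be sketched as follows. This Python function is our own illustration, based on the bit patterns described above (a leading 0 for one byte, 110 for two, 1110 for three, 11110 for four); it is not a complete UTF-8 validator.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return how many bytes a UTF-8 character occupies, given its first byte."""
    if first_byte >> 7 == 0b0:      # 0xxxxxxx: a single-byte character
        return 1
    if first_byte >> 5 == 0b110:    # 110xxxxx: one more byte follows
        return 2
    if first_byte >> 4 == 0b1110:   # 1110xxxx: two more bytes follow
        return 3
    if first_byte >> 3 == 0b11110:  # 11110xxx: three more bytes follow
        return 4
    # Bytes of the form 10xxxxxx are non-starting bytes.
    raise ValueError("not a valid first byte of a UTF-8 sequence")


print(utf8_sequence_length(0b01000001))  # the letter A: 1
print(utf8_sequence_length(0b11001110))  # 2
print(utf8_sequence_length(0b11110000))  # 4
```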
To take one example, the Greek character α (“alpha”) has a Unicode code value of 945, which in binary representation is 11 10110001. With 32 bits, then, it would be represented as 00000000 00000000 00000011 10110001. The conversion to UTF-8 works as follows: if we look at the second row of Figure 1.17, we see that there are 11 slots (x’s), and we have 10 binary digits. The 10-digit number 11 10110001 is the same as the 11-digit 011 10110001, and we can rearrange this as 01110 110001, so what we can do is insert these numbers into x’s in the second row: 110 01110 10 110001. This is thus the UTF-8 representation.
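The steps of this worked example can be retraced in Python. The sketch below handles only the two-byte case needed for α and uses variable names of our own; at the end it compares the hand-built bytes with the result of Python's built-in UTF-8 encoder.

```python
code_point = ord("α")  # 945

# Write the code value with 11 binary digits, padding with a leading zero.
bits = format(code_point, "011b")  # '01110110001'

# Split into the 5 bits for the first byte and the 6 bits for the second,
# then put the markers 110 and 10 in front of them.
first_byte = int("110" + bits[:5], 2)
second_byte = int("10" + bits[5:], 2)
encoded = bytes([first_byte, second_byte])

print(format(first_byte, "08b"), format(second_byte, "08b"))  # 11001110 10110001
print(encoded == "α".encode("utf-8"))                         # True
```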
We now know that we can encode every language, as long as it has been written down. But many languages have no written form: of the 6,912 known spoken languages listed in the Ethnologue (http://www.ethnologue.com), approximately half have never been written down. These unwritten languages appear all over the world: Salar (China); Gugu Badhun (Australia); Southeastern Pomo (California); and so on.
If we want to work with an unwritten language, we need to think about dealing with spoken language. Or, more practically, even if a language has a written form, there are many situations in which we want to deal with speech. Picture yourself talking to an airline reservation system on the phone, for example; this system must have some way of encoding the spoken language that you give to it. The rest of this chapter thus gives a glimpse into how computers can work with speech. Even though the book mainly focuses on written text, it is instructive to see how spoken and written data are connected.
In order to deal with speech, we have to figure out what it looks like. It is very easy to visualize spoken language if we think of it as phonetically transcribed into individual characters, but to transcribe, or write down, the speech into a phonetic alphabet (such as the IPA we saw before) is extremely expensive and time-consuming. To better visualize speech and thus encode it on a computer, we need to know more about how speech works and how to measure the various properties of speech. Then, we can start to talk about how these measurements correspond to the sounds we hear.
Representing speech, however, is difficult. As discussed more fully below, speech is a continuous stream of sound, but we hear it as individual sounds. Sounds run together, and it is hard for a computer to tell where one ends and another begins. Additionally, people have different dialects and different sizes of vocal tracts and thus say things differently. Two people can say the same word and it will come out differently because their vocal tracts are unique.
Furthermore, the way a particular sound is realized is not consistent across utterances, even for one person. What we think of as one sound is not always said the same. For example, there is the phenomenon known as coarticulation, in which neighboring sounds affect the way a sound is uttered. The sound for k in “key” is said differently from the first sound in “kookaburra”. (If you do not believe this, stick one finger in your mouth when you say “key” and when you say “koo”; for “key” the tongue touches the finger, but not for “koo”.) On the flipside, what we think of as two sounds are not always very different. For instance, the s in “see” is acoustically very similar to the sh in “shoe”, yet we hear them as different sounds. This becomes clear when learning another language that makes a distinction you find difficult to discern. So both articulatory and acoustic properties of speech are relevant here; let’s now take a closer look at both of these.
Before we get into what sounds look like on a computer, we need to know how sounds are produced in the vocal tract. This is studied in a branch of linguistics known as articulatory phonetics. Generally, there are three components to a sound, at least for consonants: the place of articulation, the manner of articulation, and the voicing.
The place of articulation refers to where in the mouth the sound is uttered. Consider where your tongue makes contact with your mouth when you say [t] (t in tip) as opposed to when you say [k] (k in key, c in cool
