Natural Language Processing and Computational Linguistics

Mohamed Zakaria Kurdi

Description

Natural language processing (NLP) is a scientific discipline which is found at the interface of computer science, artificial intelligence and cognitive psychology. Providing an overview of international work in this interdisciplinary field, this book gives the reader a panoramic view of both early and current research in NLP. Carefully chosen multilingual examples present the state of the art of a mature field which is in a constant state of evolution.

In four chapters, this book presents the fundamental concepts of phonetics and phonology and the two most important applications in the field of speech processing: recognition and synthesis. Also presented are the fundamental concepts of corpus linguistics and the basic concepts of morphology and its NLP applications such as stemming and part of speech tagging. The fundamental notions and the most important syntactic theories are presented, as well as the different approaches to syntactic parsing with reference to cognitive models, algorithms and computer applications.


Number of pages: 394

Year of publication: 2016




Table of Contents

Cover

Title

Copyright

Introduction

I.1. The definition of NLP

I.2. The structure of this book

1 Linguistic Resources for NLP

1.1. The concept of a corpus

1.2. Corpus taxonomy

1.3. Who collects and distributes corpora?

1.4. The lifecycle of a corpus

1.5. Examples of existing corpora

2 The Sphere of Speech

2.1. Linguistic studies of speech

2.2. Speech processing

3 Morphology Sphere

3.1. Elements of morphology

3.2. Automatic morphological analysis

4 Syntax Sphere

4.1. Basic syntactic concepts

4.2. Elements of formal syntax

4.3. Syntactic formalisms

4.4. Automatic parsing

Bibliography

Index

End User License Agreement

Guide

Cover

Table of Contents

Begin Reading

List of Illustrations

1 Linguistic Resources for NLP

Figure 1.1.

Extract from a parallel corpus [MCE 96]

Figure 1.2.

Lifecycle of a corpus

Figure 1.3.

Data collection system using the Wizard of Oz method

Figure 1.4.

Diagram of a corpus data collection system using a prototype

Figure 1.5.

Transcription example using the software Transcriber

Figure 1.6.

Segment of a corpus analyzed using parts of speech

Figure 1.7.

Extract from the Penn Treebank

Figure 1.8.

Extract from a tree corpus for French

Figure 1.9.

Semantic annotation with a has_target relationship

2 The Sphere of Speech

Figure 2.1.

Communication system

Figure 2.2.

Speech organs

Figure 2.3.

Position of the soft palate during the production of French vowels

Figure 2.4.

Parts and aperture of the tongue

Figure 2.5.

Degree of aperture

Figure 2.6.

Displacement of air molecules by the vibrations of a tuning fork

Figure 2.7.

Frequency and amplitude of a simple wave

Figure 2.8.

An aperiodic wave

Figure 2.9.

Analysis of a complex wave

Figure 2.10.

A collection of tuning forks plays the role of a spectrograph

Figure 2.11.

Spectrogram of a French speaker saying “la rose est rouge” generated using the Praat software

Figure 2.12.

Spectrograms of the French vowels: [a], [i] and [u]

Figure 2.13.

Spectrograms of several nonsense words with consonants in the center

Figure 2.14.

Physiology of the ear

Figure 2.15.

Lip rounding

Figure 2.16.

Front vowels and back vowels

Figure 2.17.

French vowel trapezium

Figure 2.18.

Nasal and oral consonants

Figure 2.19.

Examples of some possible syllabic structures in French

Figure 2.20.

Examples of how double consonants are dealt with by the timing tier

Figure 2.21.

Propagation of nasality in Warao

Figure 2.22.

General architecture of speech recognition systems

Figure 2.23.

Markovian model of Xavier’s moods

Figure 2.24.

HMM diagram of Xavier’s behavior and his moods

Figure 2.25.

Markov chain for the word “ouvre” (open)

Figure 2.26.

Markov chain for the recognition of vocal commands

Figure 2.27.

HMM for the word “ouvre” (open)

Figure 2.28.

Trellis with three possible paths

Figure 2.29.

Typical architecture of an SS system

Figure 2.30.

General architecture of a concatenation synthesis system

Figure 2.31.

Serial and parallel architecture of formant speech synthesis systems

3 Morphology Sphere

Figure 3.1.

FSM for expressions of encouragement

Figure 3.2.

Examples of regular expressions with their FSM equivalence

Figure 3.3.

Conjugation of the verbs poser and porter in the present indicative tense

Figure 3.4.

Correspondence pair for the word houses

Figure 3.5.

FST for some words in French with the prefix “anti–”

Figure 3.6.

Partial FST for the derivation of some French words

Figure 3.7.

Kay and Kaplan diagram

Figure 3.8.

Xerox approach to the use of FST in morphological analysis

Figure 3.9.

A micro-text tagged with POS

Figure 3.10.

Tag sequences for “The written history of the Gauls is known”

Figure 3.11.

Architecture of the Brill tagger [BRI 95]

Figure 3.12.

Example of transformation-based learning

4 Syntax Sphere

Figure 4.1.

The role of grammar according to Chomsky

Figure 4.2.

Relationships in the framework of formalism, WG [HUD 10]

Figure 4.3.

Analysis of a simple sentence by the formalism of WG

Figure 4.4.

Example of an analysis by chunks [ABN 91a]

Figure 4.5.

Example of attachment ambiguity of a prepositional phrase

Figure 4.6.

Syntax trees of some noun phrases

Figure 4.7.

Grammar for the structures as shown in Figure 4.6

Figure 4.8.

Syntax trees and rewrite rules of an adjective phrase

Figure 4.9.

Grammar for the structures presented in Figure 4.8

Figure 4.10.

Grammar for the noun phrase with a recursion

Figure 4.11.

Examples of VP with different complement types

Figure 4.12.

Analysis of two types of sentences with two types of complements

Figure 4.13.

Example of analysis of two relative sentences

Figure 4.14.

Examples of the coordination of two phrases and two sentences

Figure 4.15.

Two syntax trees for a syntactically ambiguous sentence

Figure 4.16.

Hierarchy of formal grammars

Figure 4.17.

Grammar for the language a^n b^n c^n

Figure 4.18.

Syntax tree for the strings: abc and aabbcc

Figure 4.19.

The derivation of strings: ab, aabb, aaabbb

Figure 4.20.

Example of a grammar in Chomsky normal form with examples of syntax trees

Figure 4.21.

Syntax tree of an NP in Chomsky normal form

Figure 4.22.

Example of grammar in Greibach normal form

Figure 4.23.

Regular grammar that generates the language a^n b^m

Figure 4.24.

Types of branching in complex sentences

Figure 4.25.

Type-2 grammar modified to account for the agreement

Figure 4.26.

Feature structures of the noun “house” and of the verb “love”

Figure 4.27.

CFS of a simple sentence

Figure 4.28.

Feature graphs for the agreement feature for the words “house” and “love”

Figure 4.29.

Example of structures of shared value and of a reentrant structure

Figure 4.30.

Example of structures of shared value and of a reentrant structure

Figure 4.31.

Examples of feature structures with subsumption relationships

Figure 4.32.

Examples of unifications

Figure 4.33.

DCG Grammar

Figure 4.34.

DCG enriched with FS

Figure 4.35.

Rewrite rule and syntax tree of a complex noun phrase

Figure 4.36.

Examples of phrases with their heads

Figure 4.37.

Diagrams of the two basic rules

Figure 4.38.

Examples of noun phrases

Figure 4.39.

Diagram and example of a determiner phrase according to [ABN 87]

Figure 4.40.

Example of the processing of a verb phrase with the X-bar theory

Figure 4.41.

Diagram and example of analysis of entire sentences

Figure 4.42.

Analysis of a completive subordinate

Figure 4.43.

Diagram of a typed FS in HPSG

Figure 4.44.

Simplified lexical entry of “house”

Figure 4.45.

Some abbreviations of FS in HPSG

Figure 4.46.

Enriched FS of the words “house” and “John”

Figure 4.47.

Some simplified FS of verbs

Figure 4.48.

FS of the verb “sees”

Figure 4.49.

General diagram of l-rules

Figure 4.50.

Rule of plural

Figure 4.51.

Rule of derivation of an agent noun from the verb

Figure 4.52.

Head-Complement Rule

Figure 4.53.

Head-Complement Rule applied to a transitive verb

Figure 4.54.

Head-Modifier Rule

Figure 4.55.

Head-Specifier Rule

Figure 4.56.

Lexical entry of the determiner “the”

Figure 4.57.

Feature structures of the noun phrase: the house

Figure 4.58.

Analysis of the verb phrase: sees the house

Figure 4.59.

The FS of the pronoun “the”

Figure 4.60.

The analysis of the sentence: he sees the house

Figure 4.61.

Examples of initial and auxiliary elementary trees

Figure 4.62.

Diagram and example of substitution in LTAG

Figure 4.63.

General diagram and example of adjunction

Figure 4.64.

An example of a derived tree and a corresponding derivation tree

Figure 4.65.

Examples of feature structures associated with elementary trees

Figure 4.66.

An example of a substitution with unification

Figure 4.67.

Diagram of an addition with unification

Figure 4.68.

Example of a recursive transition network

Figure 4.69.

A DCG and the corresponding RTNs

Figure 4.70.

Context-free grammars for the parsing of a fragment

Figure 4.71.

Example of parsing with a top-down algorithm

Figure 4.72.

Basic top-down algorithms

Figure 4.73.

Micro-grammar with a left recursion

Figure 4.74.

Left recursion with a top-down algorithm

Figure 4.75.

Example of parsing with a bottom-up algorithm

Figure 4.76.

Basic bottom-up algorithm

Figure 4.77.

CFG Grammar

Figure 4.78.

Repeated backtracking with a top-down algorithm

Figure 4.79.

Left-corner algorithm

Figure 4.80.

Example of parsing with the left-corner algorithm

Figure 4.81.

Table of an incomplete parsing

Figure 4.82.

Table of a complete parsing of a sentence

Figure 4.83.

Partial active chart

Figure 4.84.

Diagram of the first fundamental rule

Figure 4.85.

Example of application of the fundamental rule

Figure 4.86.

Tabular parsing algorithm with a bottom-up approach

Figure 4.87.

Example of a probabilistic context-free grammar for a fragment of French

Figure 4.88.

Parsing tree for a sentence from the PCFG in Figure 4.87

Figure 4.89.

Supervised learning of a PCFG

Figure 4.90.

General structure of the parse table of the CYK algorithm

Figure 4.91.

The first step in the execution of the CYK algorithm

Figure 4.92.

The second step in the execution of the CYK algorithm

Figure 4.93.

The third step in the execution of the CYK algorithm

Figure 4.94.

The fourth step in the execution of the CYK algorithm

Figure 4.95.

Architecture of a neural network for handwritten digit recognition [NIE 14]

Figure 4.96.

Example of a recurrent network

List of Tables

2 The Sphere of Speech

Table 2.1.

Examples of IPA transcriptions from French and English

Table 2.2.

The first three formants of the vowels [a], [i] and [u]

Table 2.3.

Examples of rounded and unrounded vowels in French

Table 2.4.

Nasal vowels in French

Table 2.5.

Oral vowels in French

Table 2.6.

Places of articulation of French consonants

Table 2.7.

French semi-vowels

Table 2.8.

Examples of distinctive features according to the taxonomy by Chomsky and Halle [CHO 68]

Table 2.9.

Constraint forbidding three successive consonants in Egyptian Arabic

Table 2.10.

Constraints involved in the case of joining (liaison) in French

Table 2.11.

Classification parameters of speech recognition systems

Table 2.12.

Probabilities of Xavier’s moods tomorrow, with the knowledge of his mood today

Table 2.13.

Probability of Xavier’s behavior, knowing his mood

Table 2.14.

Micro-corpus unigrams

Table 2.15.

Bigrams in the micro-corpus with their frequencies

Table 2.16.

Abbreviations to be normalized before synthesis

Table 2.17.

Examples of transcriptions with the Arpabet format

3 Morphology Sphere

Table 3.1.

Examples of Arabic words derived from the stem k-t-b

Table 3.2.

Examples of words in Turkish

Table 3.3.

Examples of prefixes commonly used in English

Table 3.4.

Examples of suffixes commonly used in English

Table 3.5.

Examples of collocations in three French literary corpora [LEG 12]

Table 3.6.

Examples of colligation

Table 3.7.

Successors of the word read [FRA 92]

Table 3.8.

Bigrams of the words bonbon and bonbonne

Table 3.9.

Some regular expressions with simple sequences

Table 3.10.

Regular expressions with character categories

Table 3.11.

Priority of operators in regular expressions

Table 3.12.

FSM transition table for expressions of encouragement

Table 3.13.

A minimal list of tags

4 Syntax Sphere

Table 4.1.

Clefting patterns

Table 4.2.

Examples of restrictive negation

Table 4.3.

A few examples of word order variation in spoken language

Table 4.4.

Examples of noun phrases and their morphological sequences

Table 4.5.

Summary of formal grammars

Table 4.6.

Adopted notation and variants in the literature

Table 4.7.

Types in HPSG formalism [POL 97]

Table 4.8.

Labels adopted for the annotation of RTN

Table 4.10.

Table of left-corners of the grammar in Figure 4.70

Table 4.11.

Summary of spaces required by the three parsing approaches [RES 92a]


Series Editor: Patrick Paroubek

Natural Language Processing and Computational Linguistics 1

Speech, Morphology and Syntax

Mohamed Zakaria Kurdi

First published 2016 in Great Britain and the United States by ISTE Ltd and John Wiley & Sons, Inc.

Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms and licenses issued by the CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the undermentioned address:

ISTE Ltd, 27-37 St George’s Road, London SW19 4EU, UK

www.iste.co.uk

John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA

www.wiley.com

© ISTE Ltd 2016

The rights of Mohamed Zakaria Kurdi to be identified as the author of this work have been asserted by him in accordance with the Copyright, Designs and Patents Act 1988.

Library of Congress Control Number: 2016945024

British Library Cataloguing-in-Publication Data

A CIP record for this book is available from the British Library.

ISBN 978-1-84821-848-2

Introduction

Language is one of the central tools in our social and professional lives. Among other things, it acts as a medium for transmitting ideas, information, opinions and feelings, as well as for persuading, asking for information, giving orders, etc. Computer science began to take an interest in language as soon as the field itself emerged, notably within the field of Artificial Intelligence (AI). The Turing test, one of the first tests developed to judge whether a machine is intelligent, stipulates that to be considered intelligent, a machine must possess conversational abilities comparable to those of a human being [TUR 50]. This implies that an intelligent machine must possess comprehension and production abilities, in the broadest sense of these terms. Historically, natural language processing (NLP) focused very early on real-world applications, particularly machine translation (MT) during the Cold War. This began with the first machine translation system, developed as a joint project between Georgetown University and IBM in the United States [DOS 55, HUT 04]. This work was not crowned with the expected success, as researchers soon realized that a deep understanding of the linguistic system is a prerequisite for any comprehensive application of this kind. This finding, presented in the famous report by the Automatic Language Processing Advisory Committee (ALPAC), had a considerable impact on machine translation work and on the field of NLP in general. Today, even though NLP is largely industrialized, interest in basic language processing has not waned. In fact, whatever the application of modern NLP, a basic language processing unit such as a morphological analyzer, a syntactic parser, or a speech recognition or synthesis module is almost always indispensable (see [JON 11] for a more complete review of the history of NLP).

I.1. The definition of NLP

Firstly, what is NLP? It is a discipline found at the intersection of several other fields, such as computer science, artificial intelligence and cognitive psychology. In English, several terms exist for fields that are very close to one another. Even though the boundaries between these fields are not always clear, we are going to attempt a definition, without claiming that it is unanimously accepted in the community. The terms formal linguistics and computational linguistics relate more to the models and linguistic formalisms developed for computational implementation. The terms Human Language Technology and Natural Language Processing, on the other hand, refer to software tools equipped with features related to language processing. Furthermore, speech processing designates a range of techniques extending from signal processing to the recognition or production of linguistic units such as phonemes, syllables or words. Apart from the signal processing dimension, there is no major difference between speech processing and NLP, and many techniques initially applied to speech processing have found their way into NLP applications, an example being Hidden Markov Models (HMMs). This encouraged us to follow, in this book, the unifying path already taken by other colleagues, such as [JUR 00], which involves grouping NLP and speech processing into the same discipline. Finally, it is worth mentioning the term corpus linguistics, which refers to the methods of collecting, annotating and using corpora, both in linguistic research and in NLP. Since corpora play a very important role in the construction of NLP systems, notably those that adopt a machine learning approach, we saw fit to consider corpus linguistics as a branch of NLP.

In the following sections, we will present and discuss the relationships between NLP and related disciplines such as linguistics, AI and cognitive science.

I.1.1. NLP and linguistics

Today, with the democratization of NLP tools, such tools form part of the toolkit of many linguists conducting empirical work on corpora. Part-Of-Speech (POS) taggers, morphological analyzers and syntactic parsers of different types are therefore often used in quantitative studies.

They may also be used to provide the necessary data for a psycholinguistics experiment. Furthermore, NLP offers linguists and cognitive scientists a new perspective by adding a new dimension to research carried out within these fields. This new dimension is testability. Indeed, many theoretical models have been tested empirically with the help of NLP applications.

I.1.2. NLP and AI

AI is the study, design and creation of intelligent agents. An intelligent agent is a natural or artificial system with perceptual abilities that allow it to act in a given environment to satisfy its desires or successfully achieve planned objectives (see [MAR 14a] and [RUS 10] for a general introduction). Work in AI is generally classified into several sub-disciplines or branches, such as knowledge representation, planning, perception and learning. All of these branches are directly related to NLP, which gives the relationship between AI and NLP a particular importance. Many consider NLP to be a branch of AI, while some prefer to consider it a more independent discipline.

In the field of AI, planning involves finding the steps to follow to achieve a given goal. This is achieved based on a description of the initial states and possible actions. In the case of an NLP system, planning is necessary to perform complex tasks involving several sources of knowledge that must cooperate to achieve the final goal.

Knowledge representation is important for an NLP system at two levels. On the one hand, it can provide a framework to represent the linguistic knowledge necessary for the smooth functioning of the whole NLP system, even if the size and the quantity of the declarative pieces of information in the system vary considerably according to the approach chosen. On the other hand, some NLP systems require extralinguistic information to make decisions, especially in ambiguous cases. Therefore, certain NLP systems are paired with ontologies or with knowledge bases in the form of a semantic network, a frame or conceptual graphs.

In theory, perception and language seem far removed from one another, but in reality this is not the case, especially for spoken language, where the linguistic message is conveyed by sound waves produced by the vocal folds. Making the connection between perception and speech recognition (the equivalent of perception with a comprehension component) is crucial, not only for comprehension, but also for improving the quality of speech recognition. Furthermore, some current research projects are looking at the connection between the perception of spoken language and the perception of visual information.

Machine learning involves building a representation after having examined data which may or may not have been previously analyzed. Since the 2000s, machine learning has gained particular attention within AI, thanks to the opportunities it offers: it allows intelligent systems to be built with minimal effort, compared to rule-based symbolic systems, which require more work from human experts. In NLP, the extent to which machine learning is used depends highly on the targeted linguistic level: it ranges from almost total domination in speech recognition to limited usage in high-level processing such as discourse analysis and pragmatics, where the symbolic paradigm is still dominant.

I.1.3. NLP and cognitive science

As with linguistics, the relationship between cognitive science and NLP goes in two directions. On the one hand, cognitive models can serve as a source of inspiration for an NLP system; on the other hand, constructing an NLP system according to a cognitive model can be a way of testing that model. The practical benefit of an approach which mimics cognitive processes remains an open question, because in many fields constructing a system inspired by biological models does not prove productive. It should also be noted that certain tasks carried out by NLP systems have no parallel in humans, such as searching for information with search engines or mining large volumes of text data to extract useful information: here NLP can be seen as an extension of human cognitive capabilities, as part of a decision support system, for example. Other NLP tasks are very close to human ones, such as comprehension and production.

I.1.4. NLP and data science

With the availability of more and more digital data, a new discipline has recently emerged: data science. It involves extracting, quantifying and visualizing knowledge, primarily from textual and spoken data. Since these data are found in natural language in many cases, the role of NLP in the extraction and treatment process is obvious. Currently, given the countless industrial uses for this kind of knowledge, especially within the fields of marketing and decision-making, data science has become extremely important, even reminiscent of the beginning of the Internet in the 1990s. This shows that NLP is as useful when applied as it is when considered as a research field.

I.2. The structure of this book

The aim of this book is to give a panoramic overview of both early and modern research in the field of NLP. It aims to give a unified vision of fields that are often considered separate, such as speech processing, computational linguistics, NLP and knowledge engineering. It aims to be profoundly interdisciplinary and tries to consider the various linguistic and cognitive models, as well as the algorithms and computational applications, on an equal footing. The main postulate adopted in this book is that the best results can only be the outcome of a solid theoretical backbone and a well thought-out empirical approach. Of course, we are not claiming that this book covers the entirety of the work that has been done, but we have tried to strike a balance between North American, European and international work. Our approach is thus based on a dual perspective: aiming to be accessible and informative on the one hand, while presenting the state of the art of a mature field in a constant state of evolution on the other.

As a result, this work uses an approach that consists of making linguistic and computer science concepts accessible by using carefully chosen examples. Furthermore, even though this book seeks to give the maximum amount of detail possible about the approaches presented, it nevertheless remains neutral about implementation details to leave each individual some freedom regarding the choice of a programming language. This must be chosen according to personal preference as well as the specific objective needs of individual projects.

Besides the introduction, this book is made up of four chapters. The first chapter looks at the linguistic resources used in NLP. It presents the different types of corpora that exist, their collection, as well as their methods of annotation. The second chapter discusses speech and speech processing. Firstly, we will present the fundamental concepts in phonetics and phonology and then we will move to the two most important applications in the field of speech processing: recognition and synthesis. The third chapter looks at the word level and it focuses particularly on morphological analysis. Finally, the fourth chapter covers the field of syntax. The fundamental concepts and the most important syntactic theories are presented, as well as the different approaches to syntactic analysis.

1 Linguistic Resources for NLP

Today, the use of good linguistic resources for the development of NLP systems seems indispensable. These resources are essential for creating grammars, in the framework of symbolic approaches or to carry out the training of modules based on machine learning. However, collecting, transcribing, annotating and analyzing these resources is far from being trivial. This is why it seems sensible for us to approach these questions in an introduction to NLP. To find out more about the matter of linguistic data and corpus linguistics, a number of works and articles can be consulted, including [HAB 97, MEY 04, WIL 06a, WIL 06b] and [MEG 03].

1.1. The concept of a corpus

At this point, a definition of the term corpus is necessary, given that it is central to the subject of this section. It is important to note that research related to written and spoken language data is not limited to corpus linguistics: it is entirely possible to use individual texts for various forms of literary, linguistic and stylistic analysis. In Latin, the word corpus means body, but when used as a source of data in linguistics, it can be interpreted as a collection of texts. To be more specific, we will quote scholarly definitions of the term corpus from the point of view of modern linguistics:

– A collection of linguistic data, either written texts or a transcription of recorded speech, which can be used as a starting point of linguistic description or as a means of verifying hypotheses about a language [CRY 91].

– A collection of naturally occurring language text, chosen to characterize a state or variety of a language [SIN 91].

– The corpus itself cannot be considered as a constituent of the language: it reflects the character of the artificial situation in which it has been produced and recorded [DUB 94].

From these definitions, it is clear that a corpus is a collection of data selected with a descriptive or applicative purpose. However, what exactly are these collections? What are their fundamental properties? It is generally thought that a corpus must possess a common set of fundamental properties, including representativeness, a finite size and availability in electronic format.

The problem of the representativeness of a corpus was highlighted by Chomsky. According to him, there exist entirely valid linguistic phenomena that might never be observed due to their rarity. Given the infinite nature of language, due to the possibility of generating an infinite number of different sentences from a finite number of rules and the constant addition of neologisms in living languages, it is clear that, whatever the size of a corpus, it would be impossible to include all linguistically valid phenomena. In practice, researchers construct corpora whose size is geared to the needs of the individual research project. The phenomena Chomsky refers to are certainly valid from a theoretical point of view, but are almost never used in everyday life. A sentence that is ten thousand words long and formed in accordance with the rules of the English language is of no interest to a researcher trying to construct a machine translation system from English to Arabic, for example. Furthermore, we often talk about task-oriented applications, where the goal is to cover the linguistic forms used in a restricted applied context, such as hotel reservations or requests for tourist information. In this sort of application, even though exhaustiveness is impossible, it is possible (albeit with a lot of work) to reach a satisfactory level of coverage.
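The point about a finite number of rules generating an infinite number of sentences can be made concrete with a tiny recursive grammar. The sketch below is an illustration only: the rules and vocabulary are invented for the example and are not taken from the grammars discussed later in the book.

```python
import random

# A micro-grammar with a recursive NP rule: a finite set of rules can
# generate an unbounded number of distinct sentences, because NP may
# rewrite to "the N of NP", which contains another NP.
grammar = {
    "S":  [["NP", "VP"]],
    "NP": [["the", "N"], ["the", "N", "PP"]],
    "PP": [["of", "NP"]],            # PP contains an NP: recursion
    "VP": [["sleeps"]],
    "N":  [["friend"], ["neighbor"]],
}

def generate(symbol):
    """Expand a symbol into a list of words using randomly chosen rules."""
    if symbol not in grammar:        # terminal: an actual word
        return [symbol]
    rule = random.choice(grammar[symbol])
    return [word for part in rule for word in generate(part)]

# Each call may produce a longer or shorter sentence, e.g.
# "the friend sleeps" or "the neighbor of the friend of the friend sleeps".
print(" ".join(generate("S")))
```

Because the recursion can be applied any number of times, no finite corpus can contain every sentence this five-rule grammar licenses, which is exactly Chomsky's point about rare but valid phenomena.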

The size of a corpus is often limited to a given number of words (a million words, for example) and is generally determined in advance, during the design phase. Sometimes teams, such as Professor John Sinclair’s team at the University of Birmingham in England, update their corpus continuously (in this case, the term text collection is preferred). This continuous updating is necessary to guarantee the representativeness of a corpus across time: the openness and unbounded size of the corpus are a means of guaranteeing diachronic representativeness. Open-ended corpora are particularly useful for lexicographers who are looking to include neologisms in new editions of their dictionaries.

Today, the word corpus is almost automatically associated with the word digital. Historically, the term referred mainly to printed texts or even manuscripts. The advantages of digitization are undeniable. On the one hand, searching has become much easier and results are obtained more quickly; on the other hand, annotation can be done much more flexibly. Moreover, long-distance teamwork has become much easier. Furthermore, given the extreme popularity of digital technology, having data in an electronic format allows such data to be exchanged and allows paper usage to be reduced (which is a good thing given the impact of paper usage on the environment). However, electronic corpora also raise long-term issues such as portability. As operating systems and text analysis software evolve, it sometimes becomes difficult to access documents that were encoded with old versions of software in a format that has become obsolete. To get around this problem, researchers try to perpetuate their data using formats that are independent of particular platforms and text processing software. The XML markup language is one of the main languages used for the annotation of data. More specialized standards, such as the EAGLES Corpus Encoding Standard and XCES, are also available and are under continuous development to allow researchers to describe linguistic phenomena in a precise and reliable way.
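To make the idea of XML corpus annotation concrete, the sketch below parses a small hypothetical fragment, loosely inspired by XCES-style encoding (the tag and attribute names here are invented for illustration and are not taken from the actual standard):

```python
import xml.etree.ElementTree as ET

# A hypothetical fragment: sentence-level (<s>) and word-level (<w>)
# annotation, with part-of-speech tags as attributes.
sample = """
<corpus lang="en">
  <s id="s1">
    <w pos="DET">The</w>
    <w pos="NOUN">corpus</w>
    <w pos="VERB">grows</w>
  </s>
</corpus>
"""

root = ET.fromstring(sample)
for s in root.iter("s"):
    # Recover each word together with its annotation.
    words = [(w.text, w.get("pos")) for w in s.iter("w")]
    print(s.get("id"), words)
```

Because the markup is plain text, such a file remains readable independently of any particular platform or software version, which is precisely the portability argument made above.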

In the field of NLP, the use of corpora is uncontested. Of course, there is a debate surrounding the place of corpora in the approaches used to build NLP systems, but to our knowledge, everyone agrees that linguistic data play a very important role in this process. Corpora are also very useful within linguistics itself, especially for those who wish to study a specific linguistic phenomenon such as collocations, fixed expressions or lexical ambiguity. Furthermore, corpora are used more and more in disciplines such as cognitive science and foreign language teaching [NES 05, GRI 06, ATW 08].

1.2. Corpus taxonomy

To establish a corpus taxonomy, many criteria can be used, such as the distinction between spoken and written corpora, between modern corpora and corpora of an older form of a language or dialect, as well as the number of languages in a given corpus.

1.2.1.Written versus spoken

Written corpora are made up of collections of written texts. Often, corpora such as these contain newspaper articles, webpages, blogs, literary or religious texts, etc. Another source of data from the Internet is written dialogue between two people communicating online (such as in a chat) or between a person and a computer program designed specifically for this kind of activity. Newspaper archives such as The Guardian (for English), Le Monde (for French) and Al-Hayat (for Arabic) are also a very popular source of written texts. They are especially useful within the fields of information retrieval and lexicography. More sophisticated corpora also exist, such as the British National Corpus (BNC), the Brown Corpus and the Susanne Corpus, which consists of 130,000 words of the Brown Corpus that have been analyzed syntactically. Spoken corpora can appear in many forms. These forms differ in their structures and linguistic functions as well as in their collection methods.

– Verbal dictations: these are often texts read aloud by users of office dictation software, collected in order to obtain speech data paired with digital texts. Speakers vary in age, and it is necessary to record speakers of different genders to guarantee phonetic variation. Sometimes, geographical variation is also covered, for example (in the case of American English) New York English versus Midwest English.

– Spoken commands: this kind of corpus is made up of a collection of commands whose purpose is to control a machine such as a television or a robot. The structures of utterances used are often quite limited because short imperative sentences are naturally quite frequently used. Performance phenomena such as hesitation, self-correction or incompleteness are not very common.

– Human–machine dialogues: in this kind of corpus, we try to capture a spoken or written exchange between a human user and a computer. The diversity of linguistic phenomena that we are able to observe is quite limited. The main limitations come from the fact that machines are far from being as capable as humans. Therefore, humans adapt to the level of the machine by simplifying their utterances [LUZ 95].

– Human–human dialogues mediated by machines: here, we have an exchange (spoken or written) between two human users. The mediating role of the machine could quite simply involve transmitting written sequences or sound waves (often with some loss in sound quality). Machines could also be more directly involved, especially in the case of translation systems. An example of such a situation is a speaker “A” speaking French who tries to reserve a hotel room in Tokyo by talking to a Japanese agent (speaker “B”) who does not speak French.

– Multimodal dialogues: whether they are between a human and a machine or mediated by a machine, these dialogues combine gestures and words. For example, in a drawing task, the user could ask the machine to move a blue square from one place to another: “Put this square <pointing gesture towards the blue square> here <pointing gesture towards the desired location>”.

1.2.2.The historical point of view

The period that a corpus represents can be considered as a criterion for distinguishing between corpora. There are corpora representing linguistic usage at a specific period in the history of a given language. The data covered by ancient texts often consist of a collection of literary and official texts (political speeches, state archives). Given the fleeting nature of speech, it is virtually impossible to accurately reconstruct all the subtleties of the spoken language of a distant period.

1.2.3.The language of corpora

A corpus must be expressed in one or several languages, which leads us to distinguish between monolingual corpora, multilingual corpora and parallel corpora.

Monolingual corpora are corpora whose content is formulated with the help of a single language. The majority of corpora that are available today are of this type. Thus, examples of corpora of this type are very common: the Brown Corpus and the Switchboard Corpus for written and spoken English, respectively, and the Frantext corpus, as well as the OTG corpus for written and spoken French, respectively.

Parallel corpora, furthermore, include a collection of texts in which versions of the same text in several languages are connected to one another. Such a corpus can be represented as a graph or as a two-dimensional matrix n x m, where n is the number of texts (Tx) in the source language and m is the number of languages. News reports from press agencies such as Agence France-Presse (AFP) or Reuters are classic sources of such corpora: each report is translated into several languages. Furthermore, several international organizations and companies, such as the United Nations, the Canadian Parliament and Caterpillar, maintain parallel corpora for various purposes. Some research laboratories have also collected this type of corpus, such as the European corpus CRATER by the University of Lancaster, which is a parallel corpus in English, French and Spanish. For a parallel corpus to really be useful, fine alignments must be made at the sentence or word level. Thus, each sentence from text “T1” in language “L1” must be connected to a sentence in text “T2” in language “L2”. An extract from a parallel corpus with aligned sentences is shown in Figure 1.1.
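The n x m organization described above can be sketched as a simple data structure mapping a (text, language) pair to its list of sentences, with the alignment implied by sentence position (a minimal sketch; the example sentences are invented):

```python
# A minimal sentence-aligned parallel corpus: sentence i of the source
# text corresponds to sentence i of each translated version.
parallel = {
    ("T1", "en"): ["The hotel is full.", "Please try tomorrow."],
    ("T1", "fr"): ["L'hôtel est complet.", "Veuillez réessayer demain."],
}

def aligned_pairs(corpus, text_id, src, tgt):
    """Return (source sentence, target sentence) pairs for one text."""
    return list(zip(corpus[(text_id, src)], corpus[(text_id, tgt)]))

for en, fr in aligned_pairs(parallel, "T1", "en", "fr"):
    print(en, "<->", fr)
```

Real parallel corpora also need to handle 1-to-2 and 2-to-1 sentence correspondences, which is why alignment is a research problem in its own right rather than a matter of simple positional pairing.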

Figure 1.1.Extract from a parallel corpus [MCE 96]

Note that a multitude of multilingual corpora exist which are not parallel corpora. For example, the corpus CALLFRIEND Collection is a corpus of telephone conversations available in 12 languages and three dialects, and the corpus CALLHOME is made up of telephone conversations available in six languages. In these two corpora, the dialogues, which are not identical from one language to another, are not connected in the same way as in the format presented above.

Parallel corpora are a fundamental resource used to build and test machine translation software (see [KOE 05]). An important question, once multilingual data have been identified, is how to align their content. A number of approaches have been proposed to solve this problem, which is fundamental for exploiting multilingual corpora. Some approaches are based on comparing the lengths of sentences in terms of the number of characters they contain [GAL 93] or the number of words [BRO 91], while others adopt the criterion of vectorial distance between segments of the corpora considered [FUN 94]. Furthermore, there are approaches which make use of lexical information to establish links between two aligned texts [CHE 93]. Other approaches combine sentence length with lexical information [MEL 99, MOO 02]. Note that the GIZA++ toolbox is particularly popular for aligning multilingual corpora.
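The length-based idea of [GAL 93] can be sketched as a dynamic program that pairs sentences of proportional character lengths; the cost function below (absolute length difference, with a flat penalty for skipping a sentence) is a crude stand-in for the probabilistic model of the original paper, and the example sentences are invented:

```python
def align_by_length(src, tgt, skip_penalty=10.0):
    """Align two sentence lists by character length (simplified Gale-Church).

    Allowed moves: 1-1 match, skip a source sentence, skip a target
    sentence. A 1-1 match costs the absolute length difference; the
    original paper uses a Gaussian model of length ratios instead.
    """
    n, m = len(src), len(tgt)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            if i < n and j < m:  # 1-1 match
                c = cost[i][j] + abs(len(src[i]) - len(tgt[j]))
                if c < cost[i + 1][j + 1]:
                    cost[i + 1][j + 1], back[i + 1][j + 1] = c, (i, j, "match")
            if i < n:  # leave a source sentence unaligned
                c = cost[i][j] + skip_penalty
                if c < cost[i + 1][j]:
                    cost[i + 1][j], back[i + 1][j] = c, (i, j, "skip_src")
            if j < m:  # leave a target sentence unaligned
                c = cost[i][j] + skip_penalty
                if c < cost[i][j + 1]:
                    cost[i][j + 1], back[i][j + 1] = c, (i, j, "skip_tgt")
    # Trace back the cheapest path to recover the 1-1 sentence pairs.
    pairs, i, j = [], n, m
    while back[i][j] is not None:
        pi, pj, move = back[i][j]
        if move == "match":
            pairs.append((src[pi], tgt[pj]))
        i, j = pi, pj
    return list(reversed(pairs))

src = ["Good morning.", "The hotel is full tonight."]
tgt = ["Bonjour.", "L'hôtel est complet ce soir."]
print(align_by_length(src, tgt))
```

The full Gale-Church algorithm also allows 2-1 and 1-2 merges to handle sentences split or joined in translation; toolkits such as GIZA++ go further and align at the word level.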

1.2.4.Thematic representativeness

This criterion concerns written corpora which aim for the representativeness of an entire language, or at least a large proportion of it. To achieve representativeness at such a broad level, a selection of texts coming from a variety of domains is essential. Three types of designs can be cited:

– Balanced corpora: to guarantee thematic representativeness, texts are collected according to their topics, so as to ensure that each topic is represented equally.

– Pyramidal corpora: in these cases, corpora are constructed using large collections for topics considered central and small collections for topics considered less important.

– Opportunistic corpora: this kind of corpora is used in cases where there are not enough linguistic resources for a given language or for a given application. Therefore, it is indispensable to make the most of all available resources, even if they are not sufficient to guarantee the representativeness aimed for.

Note that guaranteeing the thematic representativeness of a corpus is often complicated. In most cases, texts address several topics at once and it is difficult (especially in the case of automatic collection of a corpus, with the help of a web crawler, for example) to decide exactly what topic a given text covers. Moreover, as [DEW 98] underlines, there is no commonly accepted typology for the classification of texts. Finally, it may be useful to mention that lexicography and information retrieval are among the application areas that are the most sensitive to thematic representativeness.

1.2.5.Age range of speakers

The application or scientific domain often imposes constraints on the age range of speakers. Certain corpora are only made up of linguistic productions uttered by adult speakers, such as the Air Travel Information System (ATIS) corpus, distributed by the LDC. Other corpora, used to research first language acquisition, are made up of child utterances. The most well-known example of this is the Child Language Data Exchange System (CHILDES) corpus, collected and distributed at Carnegie Mellon University in the United States. Finally, there are corpora which cover the linguistic productions of adolescents, such as the spoken conversation corpora collected at the University of Southern Denmark (SDU) as part of the European project NICE.

1.3. Who collects and distributes corpora?

The increasingly central role of corpora in the process of creating AI applications has led to the emergence of numerous organizations and projects with a mission to create, transcribe, annotate and distribute corpora.

1.3.1.The Gutenberg project1

This is a multilingual library which distributes approximately 45,000 free books. The project makes an extensive choice of books available to Internet users, both in terms of languages and of topics covered, since it distributes literary, scientific and historical works, etc. Nevertheless, since they are not specifically designed to be used as a corpus, the works distributed by this project need some preprocessing to make them usable as such.

1.3.2.The linguistic data consortium

Founded in 1992 and based at the University of Pennsylvania in the United States, this research and development center is financed primarily by the National Science Foundation (NSF). Its main activities consist of collecting, distributing and annotating linguistic resources which correspond to the needs of research centers and American companies which work in the field of language technology. The linguistic data consortium (LDC) owns an extensive catalog of written and spoken corpora which covers a fairly large number of different languages.

1.3.3.European language resource agency

This is a centralized not-for-profit organization at the European level. Since its creation in 1995, the European language resource agency (ELRA2) has been collecting, distributing and validating spoken, written and terminological linguistic resources, as well as software tools. Although it is based in Paris, this organization does not only deal with European languages. Indeed, many corpora of non-European languages, including Arabic, feature in its catalog. Among its scientific activities, ELRA organizes a biennial conference: the language resources and evaluation conference (LREC).

1.3.4.Open language archives community

Open language archives community (OLAC3) is a consortium of institutions and individuals which is creating a virtual library of linguistic resources on a global scale and developing a consensus on best practices for the digital archiving of linguistic resources, by creating a network of storage services for these resources.

1.3.5.Miscellaneous

Given the considerable cost of a quality corpus and the commercial character of most existing organizations, it is often difficult for researchers who do not have a sufficient budget to get hold of the corpora they need for their studies. Moreover, many manufacturers and research laboratories jealously guard the linguistic resources they own, even after the projects for which the corpora were collected have finished.

To confront this problem of accessibility, many centers and laboratories have begun to adopt a logic similar to that of free software. Laboratories such as CLIPS-IMAG and Valoria have, for example, taken the initiative of collecting and distributing two corpora of oral dialogues for free: the Grenoble Tourism Office corpus and the Massy School corpus4 [ANT 02]. In the United States, there are examples such as the Trains Corpus collected by the University of Rochester, whose transcriptions have been made readily available to the community [HEE 95]. In addition, the Google Books n-grams5 constitute a corpus which is used more and more for various purposes.

1.4. The lifecycle of a corpus

As artificial objects, corpora only very rarely exist ready-made in the natural world. Corpus collection often requires significant resources. From this point of view, the lifecycle of a corpus in some ways resembles the lifecycle of a piece of software. To get a closer look at the lifecycle of a corpus, let us examine the flowchart shown in Figure 1.2. As we can see, there are four main steps involved in this process: preparation/planning, acquisition and preparation of the data, use of the data and evaluation of the data. It is a cyclical process, and certain steps are repeated to deal with a lack of linguistic representativeness (often diachronic, geographical or empirical in nature) or to improve the results of an NLP module.

Figure 1.2.Lifecycle of a corpus

Three main steps stand out within a lifecycle: