Natural language processing (NLP) is a scientific discipline which is found at the interface of computer science, artificial intelligence and cognitive psychology. Providing an overview of international work in this interdisciplinary field, this book gives the reader a panoramic view of both early and current research in NLP. Carefully chosen multilingual examples present the state of the art of a mature field which is in a constant state of evolution.
In four chapters, this book presents the fundamental concepts of phonetics and phonology and the two most important applications in the field of speech processing: recognition and synthesis. Also presented are the fundamental concepts of corpus linguistics and the basic concepts of morphology and its NLP applications such as stemming and part of speech tagging. The fundamental notions and the most important syntactic theories are presented, as well as the different approaches to syntactic parsing with reference to cognitive models, algorithms and computer applications.
Number of pages: 394
Year of publication: 2016
Cover
Title
Copyright
Introduction
I.1. The definition of NLP
I.2. The structure of this book
1 Linguistic Resources for NLP
1.1. The concept of a corpus
1.2. Corpus taxonomy
1.3. Who collects and distributes corpora?
1.4. The lifecycle of a corpus
1.5. Examples of existing corpora
2 The Sphere of Speech
2.1. Linguistic studies of speech
2.2. Speech processing
3 Morphology Sphere
3.1. Elements of morphology
3.2. Automatic morphological analysis
4 Syntax Sphere
4.1. Basic syntactic concepts
4.2. Elements of formal syntax
4.3. Syntactic formalisms
4.4. Automatic parsing
Bibliography
Index
End User License Agreement
1 Linguistic Resources for NLP
Figure 1.1.
Extract from a parallel corpus [MCE 96]
Figure 1.2.
Lifecycle of a corpus
Figure 1.3.
Data collection system using the Wizard of Oz method
Figure 1.4.
Diagram of a corpus data collection system using a prototype
Figure 1.5.
Transcription example using the software Transcriber
Figure 1.6.
Segment of a corpus analyzed using parts of speech
Figure 1.7.
Extract from the Penn Treebank
Figure 1.8.
Extract from a tree corpus for French
Figure 1.9.
Semantic annotation with a has_target relationship
2 The Sphere of Speech
Figure 2.1.
Communication system
Figure 2.2.
Speech organs
Figure 2.3.
Position of the soft palate during the production of French vowels
Figure 2.4.
Parts and aperture of the tongue
Figure 2.5.
Degree of aperture
Figure 2.6.
Displacement of air molecules by the vibrations of a tuning fork
Figure 2.7.
Frequency and amplitude of a simple wave
Figure 2.8.
An aperiodic wave
Figure 2.9.
Analysis of a complex wave
Figure 2.10.
A collection of tuning forks plays the role of a spectrograph
Figure 2.11.
Spectrogram of a French speaker saying “la rose est rouge” generated using the Praat software
Figure 2.12.
Spectrograms of the French vowels: [a], [i] and [u]
Figure 2.13.
Spectrograms of several nonsense words with consonants in the center
Figure 2.14.
Physiology of the ear
Figure 2.15.
Lip rounding
Figure 2.16.
Front vowels and back vowels
Figure 2.17.
French vowel trapezium
Figure 2.18.
Nasal and oral consonants
Figure 2.19.
Examples of some possible syllabic structures in French
Figure 2.20.
Examples of how double consonants are dealt with by the timing tier
Figure 2.21.
Propagation of nasality in Warao
Figure 2.22.
General architecture of speech recognition systems
Figure 2.23.
Markovian model of Xavier’s moods
Figure 2.24.
HMM diagram of Xavier’s behavior and his moods
Figure 2.25.
Markov chain for the word “ouvre” (open)
Figure 2.26.
Markov chain for the recognition of vocal commands
Figure 2.27.
HMM for the word “ouvre” (open)
Figure 2.28.
Trellis with three possible paths
Figure 2.29.
Typical architecture of an SS system
Figure 2.30.
General architecture of a concatenation synthesis system
Figure 2.31.
Serial and parallel architecture of formant speech synthesis systems
3 Morphology Sphere
Figure 3.1.
FSM for expressions of encouragement
Figure 3.2.
Examples of regular expressions with their FSM equivalence
Figure 3.3.
Conjugation of the verbs poser and porter in the present indicative tense
Figure 3.4.
Correspondence pair for the word houses
Figure 3.5.
FST for some words in French with the prefix “anti–”
Figure 3.6.
Partial FST for the derivation of some French words
Figure 3.7.
Kay and Kaplan diagram
Figure 3.8.
Xerox approach to the use of FST in morphological analysis
Figure 3.9.
A micro-text tagged with POS
Figure 3.10.
Tag sequences for “The written history of the Gauls is known”
Figure 3.11.
Architecture of the Brill tagger [BRI 95]
Figure 3.12.
Example of transformation-based learning
4 Syntax Sphere
Figure 4.1.
The role of grammar according to Chomsky
Figure 4.2.
Relationships in the framework of formalism, WG [HUD 10]
Figure 4.3.
Analysis of a simple sentence by the formalism of WG
Figure 4.4.
Example of an analysis by chunks [ABN 91a]
Figure 4.5.
Example of attachment ambiguity of a prepositional phrase
Figure 4.6.
Syntax trees of some noun phrases
Figure 4.7.
Grammar for the structures as shown in Figure 4.6
Figure 4.8.
Syntax trees and rewrite rules of an adjective phrase
Figure 4.9.
Grammar for the structures presented in Figure 4.8
Figure 4.10.
Grammar for the noun phrase with a recursion
Figure 4.11.
Examples of VP with different complement types
Figure 4.12.
Analysis of two types of sentences with two types of complements
Figure 4.13.
Example of analysis of two relative sentences
Figure 4.14.
Examples of the coordination of two phrases and two sentences
Figure 4.15.
Two syntax trees for a syntactically ambiguous sentence
Figure 4.16.
Hierarchy of formal grammars
Figure 4.17.
Grammar for the language aⁿbⁿcⁿ
Figure 4.18.
Syntax tree for the strings: abc and aabbcc
Figure 4.19.
The derivation of strings: ab, aabb, aaabbb
Figure 4.20.
Example of a grammar in Chomsky normal form with examples of syntax trees
Figure 4.21.
Syntax tree of an NP in Chomsky normal form
Figure 4.22.
Example of grammar in Greibach normal form
Figure 4.23.
Regular grammar that generates the language aⁿbᵐ
Figure 4.24.
Types of branching in complex sentences
Figure 4.25.
Type-2 grammar modified to account for the agreement
Figure 4.26.
Feature structures of the noun “house” and of the verb “love”
Figure 4.27.
CFS of a simple sentence
Figure 4.28.
Feature graphs for the agreement feature for the words “house” and “love”
Figure 4.29.
Example of structures of shared value and of a reentrant structure
Figure 4.30.
Example of structures of shared value and of a reentrant structure
Figure 4.31.
Examples of feature structures with subsumption relationships
Figure 4.32.
Examples of unifications
Figure 4.33.
DCG Grammar
Figure 4.34.
DCG enriched with FS
Figure 4.35.
Rewrite rule and syntax tree of a complex noun phrase
Figure 4.36.
Examples of phrases with their heads
Figure 4.37.
Diagrams of the two basic rules
Figure 4.38.
Examples of noun phrases
Figure 4.39.
Diagram and example of a determiner phrase according to [ABN 87]
Figure 4.40.
Example of the processing of a verb phrase with the X-bar theory
Figure 4.41.
Diagram and example of analysis of entire sentences
Figure 4.42.
Analysis of a completive subordinate
Figure 4.43.
Diagram of a typed FS in HPSG
Figure 4.44.
Simplified lexical entry of “house”
Figure 4.45.
Some abbreviations of FS in HPSG
Figure 4.46.
Enriched FS of the words “house” and “John”
Figure 4.47.
Some simplified FS of verbs
Figure 4.48.
FS of the verb “sees”
Figure 4.49.
General diagram of l-rules
Figure 4.50.
Rule of plural
Figure 4.51.
Rule of derivation of an agent noun from the verb
Figure 4.52.
Head-Complement Rule
Figure 4.53.
Head-Complement Rule applied to a transitive verb
Figure 4.54.
Head-Modifier Rule
Figure 4.55.
Head-Specifier Rule
Figure 4.56.
Lexical entry of the determiner “the”
Figure 4.57.
Feature structures of the noun phrase: the house
Figure 4.58.
Analysis of the verb phrase: sees the house
Figure 4.59.
The FS of the pronoun “he”
Figure 4.60.
The analysis of the sentence: he sees the house
Figure 4.61.
Examples of initial and auxiliary elementary trees
Figure 4.62.
Diagram and example of substitution in LTAG
Figure 4.63.
General diagram and example of adjunction
Figure 4.64.
An example of a derived tree and a corresponding derivation tree
Figure 4.65.
Examples of feature structures associated with elementary trees
Figure 4.66.
An example of a substitution with unification
Figure 4.67.
Diagram of an addition with unification
Figure 4.68.
Example of a recursive transition network
Figure 4.69.
A DCG and the corresponding RTNs
Figure 4.70.
Context-free grammars for the parsing of a fragment
Figure 4.71.
Example of parsing with a top-down algorithm
Figure 4.72.
Basic top-down algorithms
Figure 4.73.
Micro-grammar with a left recursion
Figure 4.74.
Left recursion with a top-down algorithm
Figure 4.75.
Example of parsing with a bottom-up algorithm
Figure 4.76.
Basic bottom-up algorithms
Figure 4.77.
CFG Grammar
Figure 4.78.
Repeated backtracking with a top-down algorithm
Figure 4.79.
Left-corner algorithm
Figure 4.80.
Example of parsing with the left-corner algorithm
Figure 4.81.
Table of an incomplete parsing
Figure 4.82.
Table of a complete parsing of a sentence
Figure 4.83.
Partial active chart
Figure 4.84.
Diagram of the first fundamental rule
Figure 4.85.
Example of application of the fundamental rule
Figure 4.86.
Tabular parsing algorithm with a bottom-up approach
Figure 4.87.
Example of a probabilistic context-free grammar for a fragment of French
Figure 4.88.
Parse tree for a sentence from the PCFG in Figure 4.87
Figure 4.89.
Supervised learning of a PCFG
Figure 4.90.
General structure of the parse table of the CYK algorithm
Figure 4.91.
The first step in the execution of the CYK algorithm
Figure 4.92.
The second step in the execution of the CYK algorithm
Figure 4.93.
The third step in the execution of the CYK algorithm
Figure 4.94.
The fourth step in the execution of the CYK algorithm
Figure 4.95.
Architecture of a neural network for handwritten digit recognition [NIE 14]
Figure 4.96.
Example of a recurrent network
2 The Sphere of Speech
Table 2.1.
Examples of IPA transcriptions from French and English
Table 2.2.
The three first formants of the vowels [a], [i] and [u]
Table 2.3.
Examples of rounded and unrounded vowels in French
Table 2.4.
Nasal vowels in French
Table 2.5.
Oral vowels in French
Table 2.6.
Places of articulation of French consonants
Table 2.7.
French semi-vowels
Table 2.8.
Examples of distinctive features according to the taxonomy by Chomsky and Halle [CHO 68]
Table 2.9.
Constraint forbidding three successive consonants in Egyptian Arabic
Table 2.10.
Constraints involved in the case of joining (liaison) in French
Table 2.11.
Classification parameters of speech recognition systems
Table 2.12.
Probabilities of Xavier’s moods tomorrow, with the knowledge of his mood today
Table 2.13.
Probability of Xavier’s behavior, knowing his mood
Table 2.14.
Micro-corpus unigrams
Table 2.15.
Bigrams in the micro-corpus with their frequencies
Table 2.16.
Abbreviations to be normalized before synthesis
Table 2.17.
Examples of transcriptions with the Arpabet format
3 Morphology Sphere
Table 3.1.
Examples of Arabic words derived from the stem k-t-b
Table 3.2.
Examples of words in Turkish
Table 3.3.
Examples of prefixes commonly used in English
Table 3.4.
Examples of suffixes commonly used in English
Table 3.5.
Examples of collocations in three French literary corpora [LEG 12]
Table 3.6.
Examples of colligation
Table 3.7.
Successors of the word read [FRA 92]
Table 3.8.
Bigrams of the words bonbon and bonbonne
Table 3.9.
Some regular expressions with simple sequences
Table 3.10.
Regular expressions with character categories
Table 3.11.
Priority of operators in regular expressions
Table 3.12.
FSM transition table for expressions of encouragement
Table 3.13.
A minimal list of tags
4 Syntax Sphere
Table 4.1.
Clefting patterns
Table 4.2.
Examples of restrictive negation
Table 4.3.
A few examples of word order variation in spoken language
Table 4.4.
Examples of noun phrases and their morphological sequences
Table 4.5.
Summary of formal grammars
Table 4.6.
Adopted notation and variants in the literature
Table 4.7.
Types in HPSG formalism [POL 97]
Table 4.8.
Labels adopted for the annotation of RTN
Table 4.10.
Table of left-corners of the grammar in Figure 4.70
Table 4.11.
Summary of spaces required by the three parsing approaches [RES 92a]
Series Editor: Patrick Paroubek
Mohamed Zakaria Kurdi
First published 2016 in Great Britain and the United States by ISTE Ltd and John Wiley & Sons, Inc.
Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms and licenses issued by the CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the undermentioned address:
ISTE Ltd
27-37 St George’s Road
London SW19 4EU
UK
www.iste.co.uk
John Wiley &amp; Sons, Inc.
111 River Street
Hoboken, NJ 07030
USA
www.wiley.com
© ISTE Ltd 2016
The rights of Mohamed Zakaria Kurdi to be identified as the author of this work have been asserted by him in accordance with the Copyright, Designs and Patents Act 1988.
Library of Congress Control Number: 2016945024
British Library Cataloguing-in-Publication Data
A CIP record for this book is available from the British Library
ISBN 978-1-84821-848-2
Language is one of the central tools in our social and professional life. Among other things, it acts as a medium for transmitting ideas, information, opinions and feelings, as well as for persuading, asking for information, giving orders, etc. Computer science began to take an interest in language as soon as the field itself emerged, notably within the field of Artificial Intelligence (AI). The Turing test, one of the first tests developed to judge whether a machine is intelligent or not, stipulates that to be considered intelligent, a machine must possess conversational abilities comparable to those of a human being [TUR 50]. This implies that an intelligent machine must possess comprehension and production abilities, in the broadest sense of these terms. Historically, natural language processing (NLP) focused very early on the potential for applying this technology to the real world, particularly with machine translation (MT) during the Cold War. This began with the first machine translation system, developed as a joint project between Georgetown University and IBM in the United States [DOS 55, HUT 04]. This work was not crowned with the expected success, as researchers soon realized that a deep understanding of the linguistic system is a prerequisite for any comprehensive application of this kind. This discovery, presented in the famous report by the Automatic Language Processing Advisory Committee (ALPAC), had a considerable impact on machine translation work and on the field of NLP in general. Today, even though NLP is largely industrialized, interest in basic language processing has not waned. In fact, whatever the application of modern NLP, the use of a basic language processing unit such as a morphological analyzer, a syntactic parser, or a speech recognition or synthesis module is almost always indispensable (see [JON 11] for a more complete review of the history of NLP).
Firstly, what is NLP? It is a discipline found at the intersection of several branches of science, such as computer science, artificial intelligence and cognitive psychology. In English, several terms designate fields that are very close to one another. Even though the boundaries between these fields are not always very clear, we are going to try to give a definition, without claiming that it is unanimously accepted in the community. For example, the terms formal linguistics and computational linguistics relate more to the models or linguistic formalisms developed for computational implementation. The terms Human Language Technology and Natural Language Processing, on the other hand, refer to software tools equipped with features related to language processing. Furthermore, speech processing designates a range of techniques running from signal processing to the recognition or production of linguistic units such as phonemes, syllables or words. Except for the dimension dealing with signal processing, there is no major difference between speech processing and NLP. Many techniques initially applied to speech processing have found their way into NLP applications, an example being Hidden Markov Models (HMM). This encouraged us to follow, in this book, the unifying path already taken by other colleagues, such as [JUR 00], grouping NLP and speech processing into the same discipline. Finally, it is probably worth mentioning the term corpus linguistics, which refers to the methods of collecting, annotating and using corpora, both in linguistic research and in NLP. Since corpora play a very important role in the construction of NLP systems, notably those that adopt a machine learning approach, we saw fit to consider corpus linguistics as a branch of NLP.
In the following sections, we will present and discuss the relationships between NLP and related disciplines such as linguistics, AI and cognitive science.
Today, with the democratization of NLP tools, such tools make up part of the toolkit of many linguists conducting empirical work on corpora. Therefore, part-of-speech (POS) taggers, morphological analyzers and syntactic parsers of different types are often used in quantitative studies.
They may also be used to provide the necessary data for a psycholinguistics experiment. Furthermore, NLP offers linguists and cognitive scientists a new perspective by adding a new dimension to research carried out within these fields. This new dimension is testability. Indeed, many theoretical models have been tested empirically with the help of NLP applications.
AI is the study, design and creation of intelligent agents. An intelligent agent is a natural or artificial system with perceptual abilities that allow it to act in a given environment in order to satisfy its desires or successfully achieve planned objectives (see [MAR 14a] and [RUS 10] for a general introduction). Work in AI is generally classified into several sub-disciplines or branches, such as knowledge representation, planning, perception and learning. All these branches are directly related to NLP, which gives the relationship between AI and NLP a very important dimension. Many consider NLP to be a branch of AI, while some prefer to consider NLP a more independent discipline.
In the field of AI, planning involves finding the steps to follow to achieve a given goal. This is achieved based on a description of the initial states and possible actions. In the case of an NLP system, planning is necessary to perform complex tasks involving several sources of knowledge that must cooperate to achieve the final goal.
Knowledge representation is important for an NLP system at two levels. On the one hand, it can provide a framework to represent the linguistic knowledge necessary for the smooth functioning of the whole NLP system, even if the size and the quantity of the declarative pieces of information in the system vary considerably according to the approach chosen. On the other hand, some NLP systems require extralinguistic information to make decisions, especially in ambiguous cases. Therefore, certain NLP systems are paired with ontologies or with knowledge bases in the form of a semantic network, a frame or conceptual graphs.
In theory, perception and language seem far from one another, but in reality, this is not the case, especially when we are talking about spoken language where the linguistic message is conveyed by sound waves produced by the vocal folds. Making the connection between perception and voice recognition (the equivalent of perception with a comprehension element) is crucial, not only for comprehension, but also to improve the quality of speech recognition. Furthermore, some current research projects are looking at the connection between the perception of spoken language and the perception of visual information.
Machine learning involves building a representation after examining data which may or may not have been previously analyzed. Since the 2000s, machine learning has gained particular attention within the field of AI, thanks to the opportunities it offers: it allows intelligent systems to be built with minimal effort compared to rule-based symbolic systems, which require more work from human experts. In the field of NLP, the extent to which machine learning is used depends highly on the targeted linguistic level, varying from almost total domination within speech recognition systems to limited usage within high-level processing such as discourse analysis and pragmatics, where the symbolic paradigm is still dominant.
As with linguistics, the relationship between cognitive science and NLP goes in two directions. On the one hand, cognitive models can serve as a source of inspiration for an NLP system. On the other hand, constructing an NLP system according to a cognitive model can be a way of testing that model. The practical benefit of an approach which mimics the cognitive process remains an open question, because in many fields, constructing a system inspired by biological models does not prove to be productive. It should also be noted that certain tasks carried out by NLP systems have no parallel in humans, such as searching for information with search engines or mining large volumes of text data to extract useful information. NLP can be seen as an extension of human cognitive capabilities, as part of a decision support system, for example. Other NLP systems are very close to human tasks, such as comprehension and production.
With the availability of more and more digital data, a new discipline has recently emerged: data science. It involves extracting, quantifying and visualizing knowledge, primarily from textual and spoken data. Since these data are found in natural language in many cases, the role of NLP in the extraction and treatment process is obvious. Currently, given the countless industrial uses for this kind of knowledge, especially within the fields of marketing and decision-making, data science has become extremely important, even reminiscent of the beginning of the Internet in the 1990s. This shows that NLP is as useful when applied as it is when considered as a research field.
The aim of this book is to give a panoramic overview of both early and modern research in the field of NLP. It aims to give a unified vision of fields which are often considered separate, for example speech processing, computational linguistics, NLP and knowledge engineering. It aims to be profoundly interdisciplinary and tries to consider the various linguistic and cognitive models, as well as the algorithms and computational applications, on an equal footing. The main postulate adopted in this book is that the best results can only be the outcome of a solid theoretical backbone and a well-thought-out empirical approach. Of course, we are not claiming that this book covers the entirety of the work that has been done, but we have tried to strike a balance between North American, European and international work. Our approach is thus based on a dual perspective, aiming to be accessible and informative on the one hand and, on the other, to present the state of the art of a mature field which is in a constant state of evolution.
As a result, this work uses an approach that consists of making linguistic and computer science concepts accessible by using carefully chosen examples. Furthermore, even though this book seeks to give the maximum amount of detail possible about the approaches presented, it nevertheless remains neutral about implementation details to leave each individual some freedom regarding the choice of a programming language. This must be chosen according to personal preference as well as the specific objective needs of individual projects.
Besides the introduction, this book is made up of four chapters. The first chapter looks at the linguistic resources used in NLP. It presents the different types of corpora that exist, their collection, as well as their methods of annotation. The second chapter discusses speech and speech processing. Firstly, we will present the fundamental concepts in phonetics and phonology and then we will move to the two most important applications in the field of speech processing: recognition and synthesis. The third chapter looks at the word level and it focuses particularly on morphological analysis. Finally, the fourth chapter covers the field of syntax. The fundamental concepts and the most important syntactic theories are presented, as well as the different approaches to syntactic analysis.
Today, the use of good linguistic resources for the development of NLP systems seems indispensable. These resources are essential for creating grammars, in the framework of symbolic approaches or to carry out the training of modules based on machine learning. However, collecting, transcribing, annotating and analyzing these resources is far from being trivial. This is why it seems sensible for us to approach these questions in an introduction to NLP. To find out more about the matter of linguistic data and corpus linguistics, a number of works and articles can be consulted, including [HAB 97, MEY 04, WIL 06a, WIL 06b] and [MEG 03].
At this point, a definition of the term corpus is necessary, given that it is central to the subject of this section. It is important to note that research related to written and spoken language data is not limited to corpus linguistics. It is in fact possible to use individual texts for various forms of literary, linguistic and stylistic analysis. In Latin, the word corpus means body, but when used as a source of data in linguistics, it can be interpreted as a collection of texts. To be more specific, we will quote scholarly definitions of the term corpus from the point of view of modern linguistics:
– A collection of linguistic data, either written texts or a transcription of recorded speech, which can be used as a starting point of linguistic description or as a means of verifying hypotheses about a language [CRY 91].
– A collection of naturally occurring language text, chosen to characterize a state or variety of a language [SIN 91].
– The corpus itself cannot be considered as a constituent of the language: it reflects the character of the artificial situation in which it has been produced and recorded [DUB 94].
From these definitions, it is clear that a corpus is a collection of data selected with a descriptive or applicative aim. However, what exactly are these collections? What are their fundamental properties? It is generally held that a corpus must possess a common set of fundamental properties, including representativeness, a finite size and availability in electronic format.
The problem of the representativeness of a corpus was highlighted by Chomsky. According to him, there exist entirely valid linguistic phenomena which might never be observed due to their rarity. Given the infinite nature of language, due to the possibility of generating an infinite number of different sentences from a finite number of rules and the constant addition of neologisms in living languages, it is clear that whatever the size of a corpus, it is impossible to include all linguistically valid phenomena. In practice, researchers construct corpora whose size is geared to the individual needs of the research project. Thus, the phenomena that Chomsky is talking about are certainly valid from a theoretical point of view, but are almost never used in everyday life. A sentence that is ten thousand words long and formed in accordance with the rules of the English language is of no interest to a researcher who is trying to construct a machine translation system from English to Arabic, for example. Furthermore, we often deal with task-oriented applications, where we are looking to cover the linguistic forms used in a restricted applied context, such as hotel reservations or requests for tourist information. In this sort of application, even though exhaustiveness is impossible, a satisfactory level of coverage can be reached (even though it takes a lot of work).
Often, the size of a corpus is limited to a given number of words (a million words, for example). The size of a corpus is generally determined in advance, during the design phase. Sometimes teams, such as Professor John Sinclair’s team at the University of Birmingham in England, update their corpus continuously (in this case, the term text collection is preferred). This continuous updating is necessary to guarantee the representativeness of a corpus across time: keeping the corpus open-ended is a means of guaranteeing diachronic representativeness. Open-ended corpora are particularly useful for lexicographers who are looking to include neologisms in new editions of their dictionaries.
Today, the word corpus is almost automatically associated with the word digital. Historically, the term referred mainly to printed texts or even manuscripts. The advantages of digitization are undeniable. On the one hand, searching has become much easier and results are obtained more quickly; on the other hand, annotation can be done much more flexibly. Moreover, long-distance teamwork has become much easier. Furthermore, given the extreme popularity of digital technology, having data in an electronic format allows such data to be exchanged and reduces paper usage (which is a good thing, given the impact of paper usage on the environment). However, electronic corpora raise some long-term issues, such as portability. As operating systems and text analysis software evolve, it sometimes becomes difficult to access documents that were encoded with old versions of software in a format that has become obsolete. To get around this problem, researchers try to perpetuate their data using formats that are independent of specific platforms and text-processing software. The XML markup language is one of the main languages used for the annotation of data. More specialized standards, such as the EAGLES Corpus Encoding Standard and XCES, are also available and are under continuous development to allow researchers to encode linguistic phenomena in a precise and reliable way.
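To make the idea of XML-based annotation concrete, here is a minimal sketch of how a POS-annotated sentence could be encoded as XML. The tag names “s” and “w” and the “pos” attribute are illustrative choices of our own, not the actual XCES or EAGLES schema:

```python
import xml.etree.ElementTree as ET

def annotate(tokens):
    """Build an XML <s> (sentence) element from (word, pos) pairs.

    Each token becomes a <w> element whose "pos" attribute carries
    the part-of-speech tag and whose text content is the word itself.
    """
    sentence = ET.Element("s")
    for word, pos in tokens:
        w = ET.SubElement(sentence, "w", pos=pos)
        w.text = word
    return sentence

# Encode one short sentence with hypothetical POS tags.
sentence = annotate([("The", "DET"), ("rose", "NOUN"),
                     ("is", "VERB"), ("red", "ADJ")])
xml_string = ET.tostring(sentence, encoding="unicode")
print(xml_string)
```

Because the annotation is plain XML, it can be read back by any standards-compliant parser on any platform, which is precisely the portability argument made above.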
In the field of NLP, the use of corpora is uncontested. There is, of course, a debate about the place of corpora in the approach used to build NLP systems, but to our knowledge everyone agrees that linguistic data play a very important role in this process. Corpora are also very useful within linguistics itself, especially for those who wish to study a specific linguistic phenomenon such as collocations, fixed expressions or lexical ambiguity. Furthermore, corpora are used more and more in disciplines such as cognitive science and foreign language teaching [NES 05, GRI 06, ATW 08].
To establish a corpus taxonomy, many criteria can be used, such as the distinction between spoken and written corpora, between corpora of a modern language and corpora of an ancient form of a language or a dialect, and the number of languages in a given corpus.
This kind of corpus is made up of a collection of written texts. Often, such corpora contain newspaper articles, webpages, blogs, literary or religious texts, etc. Another source of data from the Internet is written dialogue, whether between two people communicating online (such as in a chat) or between a person and a computer program designed specifically for this kind of activity. Newspaper archives such as The Guardian (for English), Le Monde (for French) and Al-Hayat (for Arabic) are also a very popular source of written texts. They are especially useful in the fields of information retrieval and lexicography. More sophisticated corpora also exist, such as the British National Corpus (BNC), the Brown Corpus and the Susanne Corpus, a 130,000-word subset of the Brown Corpus that has been syntactically analyzed. Written corpora can take many forms, which differ as much in their structures and linguistic functions as in their collection methods.
– Verbal dictations: these are recordings of texts read aloud, often by office software users, gathered to obtain speech data paired with digital texts. Speakers vary in age, and it is necessary to record speakers of different genders to guarantee phonetic variation. Sometimes, geographical variation is also covered, for example (in the case of American English) New York English versus Midwest English.
– Spoken commands: this kind of corpus is made up of a collection of commands whose purpose is to control a machine such as a television or a robot. The structures of the utterances are often quite limited, since short imperative sentences are naturally very frequent. Performance phenomena such as hesitation, self-correction or incompleteness are not very common.
– Human–machine dialogues: in this kind of corpus, we try to capture a spoken or written exchange between a human user and a computer. The diversity of linguistic phenomena that can be observed is quite limited, mainly because machines are far from matching human conversational ability; humans therefore adapt to the level of the machine by simplifying their utterances [LUZ 95].
– Human–human dialogues mediated by machines: here, we have an exchange (spoken or written) between two human users. The mediating role of the machine may quite simply involve transmitting written sequences or sound waves (often with some loss of sound quality). Machines can also be more directly involved, especially in the case of translation systems. An example of such a situation: speaker A, who speaks French, tries to reserve a hotel room in Tokyo through such a system by speaking with a Japanese agent (speaker B) who does not speak French.
– Multimodal dialogues: whether between a human and a machine or mediated by a machine, these dialogues combine gestures and words. For example, in a drawing task, the user could ask the machine to move a blue square from one place to another: “Put this square <pointing gesture towards the blue square> here <pointing gesture towards the desired location>”.
The period that a corpus represents can also serve as a criterion for distinguishing between corpora. Some corpora represent linguistic usage at a specific period in the history of a given language. The data covered by ancient texts often consist of a collection of literary texts and official texts (political speeches, state archives). Given the fleeting nature of oral speech, it is virtually impossible to accurately capture all the subtleties of a language as it was spoken long ago.
A corpus is expressed in one or several languages, which leads us to distinguish between monolingual corpora, multilingual corpora and parallel corpora.
Monolingual corpora are corpora whose content is formulated in a single language. The majority of corpora available today are of this type, so examples are very common: the Brown Corpus and the Switchboard Corpus for written and spoken English respectively, and the Frantext and OTG corpora for written and spoken French respectively.
Parallel corpora, by contrast, are collections of texts in which versions of the same text in several languages are connected to one another. Such a corpus can be represented as a graph, or even as a two-dimensional n × m matrix, where n is the number of texts in the source language and m is the number of languages. News reports from press agencies such as Agence France-Presse (AFP) or Reuters are classic sources of such corpora: each report is translated into several languages. Several organizations and international companies, such as the United Nations, the Canadian Parliament and Caterpillar, also maintain parallel corpora for various purposes. Some research laboratories have likewise collected this type of corpus, such as the CRATER corpus of the University of Lancaster, a parallel corpus of English, French and Spanish. For a parallel corpus to be really useful, fine alignments must be made at the sentence or word level: each sentence of text T1 in language L1 must be connected to a sentence of text T2 in language L2. An extract from a parallel corpus with aligned sentences is shown in Figure 1.1.
Figure 1.1.Extract from a parallel corpus [MCE 96]
Note that many multilingual corpora exist which are not parallel corpora. For example, the CALLFRIEND Collection is a corpus of telephone conversations available in 12 languages and three dialects, and the CALLHOME corpus is made up of telephone conversations available in six languages. In these two corpora, the dialogues are not identical from one language to another and are therefore not connected in the manner presented above.
Parallel corpora are a fundamental resource for building and testing machine translation software (see [KOE 05]). Once multilingual data have been identified, an important question is how to align their content. To resolve this problem, which is fundamental to the exploitation of multilingual corpora, a number of approaches have been proposed. Some are based on comparing the lengths of sentences in terms of the number of characters [GAL 93] or the number of words [BRO 91] they contain, while others adopt the criterion of vectorial distance between segments of the corpora considered [FUN 94]. Furthermore, there are approaches which make use of lexical information to establish links between two aligned texts [CHE 93], and others which combine sentence length with lexical information [MEL 99, MOO 02]. Note that the GIZA++ toolbox is particularly popular for aligning multilingual corpora.
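The length-based idea can be sketched very simply. The following toy implementation is only a rough illustration in the spirit of [GAL 93], not the published algorithm: it uses dynamic programming over 1-1 matches and skips, scoring a candidate match by the difference in character length, with an arbitrary penalty for unmatched sentences. The example sentences and the penalty value are invented for illustration:

```python
def align_by_length(src, tgt, skip_penalty=25):
    """Greatly simplified length-based sentence alignment:
    find the cheapest path of 1-1 matches and skips, where a
    match costs the difference in character length."""
    n, m = len(src), len(tgt)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            if i < n and j < m:  # 1-1 match
                c = cost[i][j] + abs(len(src[i]) - len(tgt[j]))
                if c < cost[i + 1][j + 1]:
                    cost[i + 1][j + 1], back[i + 1][j + 1] = c, (i, j)
            if i < n:            # source sentence left unmatched
                c = cost[i][j] + skip_penalty
                if c < cost[i + 1][j]:
                    cost[i + 1][j], back[i + 1][j] = c, (i, j)
            if j < m:            # target sentence left unmatched
                c = cost[i][j] + skip_penalty
                if c < cost[i][j + 1]:
                    cost[i][j + 1], back[i][j + 1] = c, (i, j)
    # Trace back the best path, keeping only the 1-1 matches.
    pairs, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        if pi == i - 1 and pj == j - 1:
            pairs.append((src[pi], tgt[pj]))
        i, j = pi, pj
    return list(reversed(pairs))

fr = ["Le chat dort.", "Il pleut beaucoup aujourd'hui."]
en = ["The cat is sleeping.", "It is raining a lot today."]
print(align_by_length(fr, en))
```

Real aligners refine this in many ways, notably by allowing 2-1 and 1-2 matches and by modeling the length ratio between the two languages statistically rather than comparing raw lengths.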
This criterion concerns written corpora which aim to represent an entire language, or at least a large proportion of it. To achieve representativeness at such a broad level, it is essential to select texts from a variety of domains. Three types of composition can be cited:
– Balanced corpora: to guarantee thematic representativeness, texts are collected according to their topics, so as to ensure that each topic is represented equally.
– Pyramidal corpora: in these cases, corpora are constructed using large collections for topics considered central and small collections for topics considered less important.
– Opportunistic corpora: this kind of corpora is used in cases where there are not enough linguistic resources for a given language or for a given application. Therefore, it is indispensable to make the most of all available resources, even if they are not sufficient to guarantee the representativeness aimed for.
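The balanced case above can be sketched in a few lines. The topic labels and text titles below are invented for illustration; the only point is that every topic contributes the same number of texts:

```python
import random

def balanced_sample(texts_by_topic, per_topic, seed=0):
    """Sketch of building a balanced corpus: draw the same
    number of texts from every topic so that each topic is
    represented equally. (A pyramidal corpus would instead use
    a larger per-topic quota for central topics, and an
    opportunistic one would simply take everything available.)"""
    rng = random.Random(seed)
    corpus = []
    for topic, texts in sorted(texts_by_topic.items()):
        if len(texts) < per_topic:
            raise ValueError(f"not enough texts for topic {topic!r}")
        corpus.extend((topic, t) for t in rng.sample(texts, per_topic))
    return corpus

collection = {
    "sport":    ["match report 1", "match report 2", "match report 3"],
    "politics": ["editorial 1", "editorial 2", "editorial 3"],
    "science":  ["article 1", "article 2", "article 3"],
}
corpus = balanced_sample(collection, per_topic=2)
# Every topic contributes exactly two texts.
```

Of course, the hard part in practice is not the sampling but deciding which topic each text belongs to, as the next paragraph explains.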
Note that guaranteeing the thematic representativeness of a corpus is often complicated. In most cases, texts address several topics at once, and it is difficult (especially when a corpus is collected automatically, with the help of a web crawler, for example) to decide exactly which topic a given text covers. Moreover, as [DEW 98] underlines, there is no commonly accepted typology for the classification of texts. Finally, it may be useful to mention that lexicography and online information retrieval are among the areas of application that are the most sensitive to thematic representativeness.
The application or scientific domain often imposes constraints on the age range of the speakers. Certain corpora are made up only of linguistic productions uttered by adult speakers, such as the Air Travel Information System (ATIS) corpus distributed by the LDC. Other corpora, intended for research on first language acquisition, are made up of child utterances; the best-known example is the Child Language Data Exchange System (CHILDES) corpus, collected and distributed at Carnegie Mellon University in the United States. Finally, there are corpora covering the linguistic productions of adolescents, such as the spoken conversation corpora collected at the University of Southern Denmark (SDU) as part of the European project NICE.
The increasingly central role of corpora in the process of creating AI applications has led to the emergence of numerous organizations and projects with a mission to create, transcribe, annotate and distribute corpora.
This is a multilingual library which distributes approximately 45,000 free books. The project makes an extensive choice of books available to Internet users, in terms of both languages and topics, since it distributes literary, scientific and historical works, among others. Nevertheless, since it is not specifically designed to be used as a corpus, the works it distributes need some preprocessing to make them usable as such.
Founded in 1992 and based at the University of Pennsylvania in the United States, this research and development center is financed primarily by the National Science Foundation (NSF). Its main activities consist of collecting, distributing and annotating linguistic resources which correspond to the needs of research centers and American companies working in the field of language technology. The Linguistic Data Consortium (LDC) owns an extensive catalog of written and spoken corpora covering a fairly large number of languages.
This is a centralized not-for-profit organization at the European level. Since its creation in 1995, the European Language Resources Association (ELRA) has been collecting, distributing and validating spoken, written and terminological linguistic resources, as well as software tools. Although it is based in Paris, this organization does not only deal with European languages: many corpora of non-European languages, including Arabic, feature in its catalog. Among its scientific activities, ELRA organizes a biennial conference, the Language Resources and Evaluation Conference (LREC).
The Open Language Archives Community (OLAC) is a consortium of institutions and individuals which is creating a virtual library of linguistic resources on a global scale and developing a consensus on best practices for the digital archiving of linguistic resources, by creating a network of archiving services for these resources.
Given the considerable cost of a quality corpus and the commercial character of most existing organizations, it is often difficult for researchers without a sufficient budget to obtain the corpora they need for their studies. Moreover, many manufacturers and research laboratories jealously guard the linguistic resources they own, even after the projects for which the corpora were collected have finished.
To confront this problem of accessibility, many centers and laboratories have begun to adopt a logic similar to that of free software. Laboratories such as CLIPS-IMAG and Valoria have, for example, taken the initiative of collecting and distributing two corpora of oral dialogues for free: the Grenoble Tourism Office corpus and the Massy School corpus [ANT 02]. In the United States, there are examples such as the Trains Corpus collected by the University of Rochester, whose transcriptions have been made readily available to the community [HEE 95]. In addition, the Google Books Ngrams corpus is used more and more for various purposes.
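As an illustration of how such freely distributed data can be exploited, the following sketch parses lines in the tab-separated layout documented for the Google Books Ngram data (ngram, year, match count, volume count); the sample lines and their counts are invented, and the exact layout should be checked against the version actually downloaded:

```python
# Hypothetical sample lines in the documented tab-separated
# layout: ngram, year, match_count, volume_count.
sample_lines = [
    "corpus\t1990\t1200\t300",
    "corpus\t2000\t3400\t500",
]

# Index yearly match counts by (ngram, year).
counts = {}
for line in sample_lines:
    ngram, year, matches, volumes = line.rstrip("\n").split("\t")
    counts[(ngram, int(year))] = int(matches)

print(counts[("corpus", 2000)])
```

In practice, the real files are large compressed archives, so they are usually streamed line by line rather than loaded whole.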
As artificial objects, corpora only very rarely exist ready-made in the natural world, and their collection often requires substantial resources. From this point of view, the lifecycle of a corpus in some ways resembles that of a piece of software. To get a closer look at this lifecycle, let us examine the flowchart shown in Figure 1.2. As we can see, there are four main steps involved in this process: preparation/planning, acquisition and preparation of the data, use of the data and evaluation of the data. It is a cyclical process, and certain steps are repeated to deal with a lack of linguistic representativeness (often diachronic, geographical or empirical in nature) or to improve the results of an NLP module.
Figure 1.2.Lifecycle of a corpus
Three main steps stand out within a lifecycle:
