This book is at the very heart of linguistics. It provides the theoretical and methodological framework needed to create a successful linguistic project. Potential applications of descriptive linguistics include spell-checkers, intelligent search engines, information extractors and annotators, automatic summary producers, automatic translators, and more. Because these applications have considerable economic potential, it is important for linguists to be able to use and contribute to these technologies. The author gives linguists tools to help them formalize natural languages and to aid in the building of software able to automatically process texts written in natural language (Natural Language Processing, or NLP). Computers are a vital tool for this: characterizing a phenomenon using mathematical rules amounts to formalizing it. NooJ, a linguistic development environment developed by the author, is described and applied in practice to NLP examples.
Cover
Dedication
Title
Copyright
Acknowledgments
1 Introduction: the Project
1.1. Characterizing a set of infinite size
1.2. Computers and linguistics
1.3. Levels of formalization
1.4. Not applicable
1.5. NLP applications
1.6. Linguistic formalisms: NooJ
1.7. Conclusion and structure of this book
1.8. Exercises
1.9. Internet links
PART 1: Linguistic Units
2 Formalizing the Alphabet
2.1. Bits and bytes
2.2. Digitizing information
2.3. Representing natural numbers
2.4. Encoding characters
2.5. Alphabetical order
2.6. Classification of characters
2.7. Conclusion
2.8. Exercises
2.9. Internet links
3 Defining Vocabulary
3.1. Multiple vocabularies and the evolution of vocabulary
3.2. Derivation
3.3. Atomic linguistic units (ALUs)
3.4. Multiword units versus analyzable sequences of simple words
3.5. Conclusion
3.6. Exercises
3.7. Internet links
4 Electronic Dictionaries
4.1. Could editorial dictionaries be reused?
4.2. LADL electronic dictionaries
4.3. Dubois and Dubois-Charlier electronic dictionaries
4.4. Specifications for the construction of an electronic dictionary
4.5. Conclusion
4.6. Exercises
4.7. Internet links
PART 2: Languages, Grammars and Machines
5 Languages, Grammars, and Machines
5.1. Definitions
5.2. Generative grammars
5.3. Chomsky-Schützenberger hierarchy
5.4. The NooJ approach
5.5. Conclusion
5.6. Exercises
5.7. Internet links
6 Regular Grammars
6.1. Regular expressions
6.2. Finite-state graphs
6.3. Non-deterministic and deterministic graphs
6.4. Minimal deterministic graphs
6.5. Kleene’s theorem
6.6. Regular expressions with outputs and finite-state transducers
6.7. Extensions of regular grammars
6.8. Conclusion
6.9. Exercises
6.10. Internet links
7 Context-Free Grammars
7.1. Recursion
7.2. Parse trees
7.3. Conclusion
7.4. Exercises
7.5. Internet links
8 Context-Sensitive Grammars
8.1. The NooJ approach
8.2. NooJ contextual constraints
8.3. NooJ variables
8.4. Conclusion
8.5. Exercises
8.6. Internet links
9 Unrestricted Grammars
9.1. Linguistic adequacy
9.2. Conclusion
9.3. Exercise
9.4. Internet links
PART 3: Automatic Linguistic Parsing
10 Text Annotation Structure
10.1. Parsing a text
10.2. Annotations
10.3. Text annotation structure (TAS)
10.4. Exercise
10.5. Internet links
11 Lexical Analysis
11.1. Tokenization
11.2. Word forms
11.3. Morphological analyses
11.4. Multiword unit recognition
11.5. Recognizing expressions
11.6. Conclusion
11.7. Exercise
12 Syntactic Analysis
12.1. Local grammars
12.2. Structural grammars
12.3. Conclusion
12.4. Exercises
12.5. Internet links
13 Transformational Analysis
13.1. Implementing transformations
13.2. Theoretical problems
13.3. Transformational analysis with NooJ
13.4. Question answering
13.5. Semantic analysis
13.6. Machine translation
13.7. Conclusion
13.8. Exercises
13.9. Internet links
Conclusion
Bibliography
Index
End User License Agreement
1 Introduction: the Project
Figure 1.1. The number of any set of sentences can be doubled
Figure 1.2. Really?
Figure 1.3. Vietnamese–English translation with Google Translate
Figure 1.4. Translation with Google Translate vs. with NooJ
Figure 1.5. Article from the newspaper Le Monde (October 2014) translated with Google Translate
Figure 1.6. Extract from Penn Treebank
Figure 1.7. A single tool for formalization: NooJ
2 Formalizing the Alphabet
Figure 2.1. Two electrical states: a light bulb turned on or off
Figure 2.2. Representation of numbers in binary notation
Figure 2.3. Extract from the Unicode table
Figure 2.4. One possible encoding of the Latin alphabet
Figure 2.5. ASCII encoding
Figure 2.6. Accented Latin letters
Figure 2.7. Character encoding is still problematic as of late 2015
Figure 2.8. Unicode representation of the character “é”
Figure 2.9. A Chinese character that has no Unicode code
Figure 2.10. One Chinese character has three Unicode codes
Figure 2.11. Four graphical variants for a single Unicode code
3 Defining Vocabulary
Figure 3.1. Phablet, Bushism, Chipotlification, tocoupify
4 Electronic Dictionaries
Figure 4.1. Analysis of the lexical entry “artisan”
Figure 4.2. A lexicon-grammar table for English verbs
Figure 4.3. Lexicon-grammar table for phrasal verbs
Figure 4.4. Extract from DELAC dictionary (Nouns)
Figure 4.5. Le Dictionnaire électronique des mots
Figure 4.6. Les Verbes Français dictionary
Figure 4.7. T grammar of constructions
Figure 4.8. Occurrences of the verb “abriter” in a direct transitive construction (T)
Figure 4.9. Occurrences of the verb “abriter” in a pronominal construction (P)
5 Languages, Grammars, and Machines
Figure 5.1. A generative grammar
Figure 5.2. Generation of the sentence “the cat sees a dog”
Figure 5.3. Chomsky-Schützenberger hierarchy
6 Regular Grammars
Figure 6.1. Applying a regular expression to a text
Figure 6.2. Display of a graph using XFST
Figure 6.3. Informal time
Figure 6.4. A non-deterministic graph
Figure 6.5. A deterministic graph
Figure 6.6. A minimal graph
Figure 6.7. Five basic graphs
Figure 6.8. Disjunction and Kleene operator
Figure 6.9. Graph equivalent to a regular expression
Figure 6.10. A finite-state graph
Figure 6.11. Incorporating the node “red”
Figure 6.12. Incorporating the node “pretty”
Figure 6.13. Completing the node “very”
Figure 6.14. Final graph
Figure 6.15. A spelling transducer
Figure 6.16. A terminological transducer
Figure 6.17. A morphological transducer
Figure 6.18. A transducer for translation
Figure 6.19. A query containing syntactic symbols
Figure 6.20. The operator +ONE
7 Context-Free Grammars
Figure 7.1. A NooJ context-free grammar
Figure 7.2. A context-free grammar with syntactic symbols
Figure 7.3. Recursive graph
Figure 7.4. A more general grammar
Figure 7.5. A recursive context-free grammar
Figure 7.6. Right-recursive grammar
Figure 7.7. Finite-state graph equivalent to a right-recursive context-free grammar
Figure 7.8. Left-recursive grammar
Figure 7.9. Finite-state graph equivalent to a left-recursive context-free grammar
Figure 7.10. Middle recursion
Figure 7.11. An ambiguous grammar
Figure 7.12. First parse tree for the ambiguous sentence: This man sees the chair from his house
Figure 7.13. Second derivation for the sentence: This man sees the chair from his house
8 Context-Sensitive Grammars
Figure 8.1. Context-sensitive grammar for the language a^n b^n c^n
Figure 8.2. NooJ grammar for the language a^n b^n c^n
Figure 8.3. NooJ grammar that recognizes the language a^n b^n c^n d^n e^n
Figure 8.4. Grammar of the language a^(2^n)
Figure 8.5. Grammar that recognizes reduplications
Figure 8.6. A German finite-state graph to describe agreement in gender, number and case
Figure 8.7. Agreement with constraints
Figure 8.8. Morphological context-sensitive grammar
Figure 8.9. Checking the presence of a question mark
Figure 8.10. Setting a variable
Figure 8.11. Inheritance: $N → $NPH
9 Unrestricted Grammars
Figure 9.1. Unrestricted grammar
Figure 9.2. NooJ unrestricted grammar
Figure 9.3. Respectively
10 Text Annotation Structure
Figure 10.1. Annotations for the ambiguous sequence “black box”
Figure 10.2. The two terms “big screen” and “screen star” overlap
Figure 10.3. Annotating the contracted form “cannot”
Figure 10.4. Annotating the phrasal verb “call back”
Figure 10.5. A TAS right after the lexical analysis
11 Lexical Analysis
Figure 11.1. Ambiguity triggered by the lack of vowels
Figure 11.2. Hebrew and Latin alphabets together in the same text
Figure 11.3. Itogi Weekly no. 40, October 3rd 2011
Figure 11.4. Transliteration variants
Figure 11.5. Contractions
Figure 11.6. Contractions of “not”
Figure 11.7. Prefixes
Figure 11.8. Numerical determinants
Figure 11.9. Multiple solutions for breaking down a Chinese text
Figure 11.10. Intonation in Armenian
Figure 11.11. Recognizing US phone numbers
Figure 11.12. Roman numerals
Figure 11.13. Paradigm TABLE
Figure 11.14. Inflection codes used in the English NooJ module
Figure 11.15. Paradigm HELP
Figure 11.16. Paradigm for KNOW
Figure 11.17. Morphological operators
Figure 11.19. Paradigm NN
Figure 11.20. France and its derived forms
Figure 11.21. Dictionary produced automatically from a morphological grammar
Figure 11.22. A productive morphological rule
Figure 11.23. Description of Spanish clitics (infinitive form)
Figure 11.24. Agglutination in German
Figure 11.25. A family of terms
Figure 11.26. Checking context for the characteristic constituent
Figure 11.27. Checking context, v2
Figure 11.28. Checking context, v3
Figure 11.29. Annotating phrasal verbs
Figure 11.30. Discontinuous annotation in the TAS
12 Syntactic Analysis
Figure 12.1. A local grammar for common email addresses
Figure 12.2. Graph “on the 3rd of June”
Figure 12.3. Graph “at seven o’clock”
Figure 12.4. Date grammar
Figure 12.5. A syntactic annotation in the TAS
Figure 12.6. Grammar of preverbal particles in French
Figure 12.7. Detecting ambiguities in the word form “this”
Figure 12.8. A syntax tree
Figure 12.9. Structure of a sentence that contains a discontinuous expression
Figure 12.10. A grammar produces structured annotations
Figure 12.11. A structured group of syntactic annotations
Figure 12.12. Syntactic analysis of a lexically ambiguous sentence
Figure 12.13. Analyzing a structurally ambiguous sentence
Figure 12.14. Simplified grammar
Figure 12.15. Another parse tree for a simplified grammar
Figure 12.16. Parse tree for a structured grammar
Figure 12.17. Dependency grammar
Figure 12.18. Dependency tree
Figure 12.19. ALUs in the syntax tree
13 Transformational Analysis
Figure 13.1. The sentence Joe loves Lea is transformed automatically
Figure 13.2. Passive
Figure 13.3. Negation
Figure 13.4. Making the subject into a pronoun
Figure 13.5. A few elementary transformations
Figure 13.6. The operation [Passive-inv]
Figure 13.7. A transformation chain
Figure 13.8. Grammar for declarative transitive sentences
Figure 13.9. Grammar used in mixed “analysis + generation” mode
Figure 13.10. Linking complex sentences to their transformational properties
Figure 13.11. Automatic transformation
Figure 13.12. Simple French → English translation
Figure 13.13. Translation changing the word order
Figure 13.14. Translation with constraints
For Nadia “Nooj” Malinovich Silberztein, the Mensch of the family, without whom neither this book, nor the project named after her, would have happened.
And for my two children, Avram and Rosa, who remind me every day of the priorities in my life.
Series Editor: Patrick Paroubek
Max Silberztein
First published 2016 in Great Britain and the United States by ISTE Ltd and John Wiley & Sons, Inc.
Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms and licenses issued by the CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the undermentioned address:
ISTE Ltd
27-37 St George’s Road
London SW19 4EU
UK
www.iste.co.uk
John Wiley & Sons, Inc.
111 River Street
Hoboken, NJ 07030
USA
www.wiley.com
© ISTE Ltd 2016
The rights of Max Silberztein to be identified as the author of this work have been asserted by him in accordance with the Copyright, Designs and Patents Act 1988.
Library of Congress Control Number: 2015957115
British Library Cataloguing-in-Publication Data
A CIP record for this book is available from the British Library
ISBN 978-1-84821-902-1
I would like to thank the University of Franche-Comté and my colleagues in the ELLIADD laboratory for believing in the NooJ project and supporting the community of NooJ users unfailingly since its inception.
It would be impossible for me to mention every single one of the colleagues and students who have participated, in one way or another, in the extremely ambitious project described in this book – that of formalizing natural languages! The NooJ software has been in use since 2002 by a community of researchers and students; see www.nooj4nlp.net. NooJ was developed in direct cooperation with all its users who devoted their energy to this or that specific problem, or to one language or another. Spelling in Semitic languages, variation in Asian languages, intonation in Armenian, inflection in Hungarian, phrasal verbs in English, derivation in Slavic languages, composition in Greek and in Germanic languages, etc. pose a wide variety of linguistic problems, and without the high standards of these linguists the NooJ project would never have known the success it is experiencing today. Very often, linguistic questions that seemed “trivial” at the time have had a profound influence on the development of NooJ.
Among its users, there are some “NooJ experts” to whom I would like to give particular thanks, as they participated directly in its design, and had the patience to help me with long debugging sessions. I thank them for their ambition and their patience: Héla Fehri, Kristina Kocijan, Slim Mesfar, Cristina Mota, and Simonetta Vietri.
I would also like to thank Danielle Leeman and François Trouilleux for their detailed review of the original book, and Peter Machonis for his review of the English version, as well as for verifying the relevance of the English examples, which contributed greatly to the quality of this book.
Max SILBERZTEIN
November, 2015.
The project described in this book is at the very heart of linguistics; its goal is to describe, exhaustively and with absolute precision, all the sentences of a language likely to appear in written texts. This project fulfills two needs: it provides linguists with tools to help them describe languages exhaustively (linguistics), and it aids in the building of software able to automatically process texts written in natural language (natural language processing, or NLP).
A linguistic project needs a theoretical and methodological framework (how to describe this or that linguistic phenomenon; how to organize the different levels of description); formal tools (how to write each description); development tools to test and manage each description; and engineering tools for sharing, accumulating, and maintaining large quantities of linguistic resources.
There are many potential applications of descriptive linguistics for NLP: spell-checkers, intelligent search engines, information extractors and annotators, automatic summary producers, automatic translators, etc. These applications have the potential for considerable economic usefulness, and it is therefore important for linguists to make use of these technologies and to be able to contribute to them.
For now, we must reduce the overall linguistic project, that of describing all phenomena related to the use of language, to a much more modest one: here, we will confine ourselves to describing the set of all the sentences that may be written or read in natural-language texts. The goal, then, is simply to design a system capable of distinguishing between the two sequences below:
a) Joe is eating an apple
b) Joe eating apple is an
Sequence (a) is a grammatical sentence, while sequence (b) is not.
This project constitutes the mandatory foundation for any more ambitious linguistic projects. Indeed it would be fruitless to attempt to formalize text styles (stylistics), the evolution of a language across the centuries (etymology), variations in a language according to social class (sociolinguistics), cognitive phenomena involved in the learning or understanding of a language (psycholinguistics), etc. without a model, even a rudimentary one, capable of characterizing sentences.
If the number of sentences were finite – that is, if there were a maximum number of sentences in a language – we would be able to list them all and arrange them in a database. To check whether an arbitrary sequence of words is a sentence, all we would have to do is consult this database: it is a sentence if it is in the database, and otherwise it is not. Unfortunately, there are an infinite number of sentences in a natural language. To convince ourselves of this, let us resort to a reductio ad absurdum: imagine for a moment that there are n sentences in English.
Based on this finite number n of initial sentences, we can construct a second set of sentences by putting the sequence Lea thinks that, for example, before each of the initial sentences:
Joe is sleeping → Lea thinks that Joe is sleeping
The party is over → Lea thinks that the party is over
Using this simple mechanism, we have just doubled the number of sentences, as shown in the figure below.
Figure 1.1. The number of any set of sentences can be doubled
This mechanism can be generalized by using verbs other than the verb to think; for example:
Lea (believes | claims | dreams | knows | realizes | thinks | …) that Sentence.
There are several hundred verbs that could be used here. Likewise, we could replace Lea with several thousand human nouns:
(The CEO | The employee | The neighbor | The teacher | …) thinks that Sentence.
Whatever the size n of an initial set of sentences, we can thus construct n × 100 × 1,000 sentences simply by inserting, before each of the initial sentences, sequences such as Lea thinks that, Their teacher claimed that, My neighbor declared that, etc.
Language has other mechanisms that can be used to expand a set of sentences exponentially. For example, based on n initial sentences, we can construct n × n sentences by combining all of these sentences in pairs and inserting the word and between them. For example:
It is raining + Joe is sleeping → It is raining and Joe is sleeping
This mechanism can also be generalized by using several hundred connectors; for example:
It is raining (but | nevertheless | therefore | where | while |…) Joe is sleeping.
These two mechanisms (linking of sentences and use of connectors) can be used multiple times in a row, as in the following:
Lea claims that Joe hoped that Ida was sleeping. It was raining while Lea was sleeping, however Ida is now waiting, but the weather should clear up as soon as night falls.
Thus these mechanisms are said to be recursive; the number of sentences that can be constructed with recursive mechanisms is infinite. Therefore it would be impossible to define all of these sentences in extenso. Another way must be found to characterize the set of sentences.
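To make this growth concrete, here is a minimal sketch in Python (the word lists are toy examples, not the book’s actual linguistic resources) that applies the two mechanisms to a small initial set of sentences:

# Two recursive mechanisms for growing a set of sentences:
# (1) embedding under "Subject verb that ...", (2) joining pairs with a connector.
initial = ["Joe is sleeping", "the party is over", "it is raining"]
subjects = ["Lea", "the CEO", "my neighbor"]   # thousands of human nouns in reality
verbs = ["thinks", "claims", "believes"]       # hundreds of such verbs in reality
connectors = ["and", "but", "while"]

# Mechanism 1: every subject/verb pair multiplies the set of sentences.
embedded = [f"{s} {v} that {p}" for s in subjects for v in verbs for p in initial]

# Mechanism 2: pairing sentences with a connector squares the size of the set.
joined = [f"{a} {c} {b}" for a in initial for c in connectors for b in initial]

print(len(initial), len(embedded), len(joined))   # 3 27 27

Feeding either mechanism’s output back into itself grows the set without bound, which is exactly why no finite list of sentences can ever be complete.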
Mathematicians have long known how to define sets of infinite size. For example, the two rules below can be used to define the set of all natural numbers:
(a) Each of the ten elements of set {0, 1, 2, 3, 4, 5, 6, 7, 8, 9} is a natural number;
(b) any word that can be written as xy is a natural number if and only if its two constituents x and y are natural numbers.
These two rules constitute a formal definition of all natural numbers. They make it possible to distinguish natural numbers from any other object (decimal numbers or others). For example:
– Is the word “123” a natural number? Thanks to rule (a), we know that “1” and “2” are natural numbers. Rule (b) allows us to deduce from this that “12” is a natural number. Thanks to rule (a) we know that “3” is a natural number; since “12” and “3” are natural numbers, then rule (b) allows us to deduce that “123” is a natural number.
– The word “2.5” is not a natural number. Rule (a) enables us to deduce that “2” is a natural number, but it does not apply to the decimal point “.”. Rule (b) can only apply to two natural numbers, therefore it does not apply to the decimal point because it is not a natural number. In this case, “2.” is not a natural number; therefore “2.5” is not a natural number either.
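Rules (a) and (b) translate directly into a recursive membership test. The following sketch (Python; the function name is mine, not from the book) reproduces the two reasoning steps applied to “123” and “2.5” above:

DIGITS = set("0123456789")   # rule (a): the ten basic elements

def is_natural(word: str) -> bool:
    """True if word is a natural number under rules (a) and (b)."""
    if len(word) == 1:
        return word in DIGITS                      # rule (a)
    # rule (b): some split xy where both x and y are natural numbers
    return any(is_natural(word[:i]) and is_natural(word[i:])
               for i in range(1, len(word)))

print(is_natural("123"))   # True: "12" + "3", and "12" = "1" + "2"
print(is_natural("2.5"))   # False: neither rule applies to "."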
There is an interesting similarity between this definition of a set and the problem of characterizing the sentences in a language:
– Rule (a) describes in extenso the finite set of numerals that must be used to form valid natural numbers. This rule resembles a dictionary in which we would list all the words that make up the vocabulary of a language.
– Rule (b) explains how numerals can be combined to construct an infinite number of natural numbers. This rule is similar to grammatical rules that specify how to combine words in order to construct an infinite number of sentences.
To describe a natural language, then, we will proceed as follows: firstly we will define in extenso the finite number of basic units in a language (its vocabulary); and secondly, we will list the rules used to combine the vocabulary elements in order to construct sentences (its grammar).
Computers are a vital tool for this linguistic project, for at least four reasons:
– From a theoretical point of view, a computer is a device that can verify automatically that an element is part of a mathematically-defined set. Our goal is then to construct a device that can automatically verify whether a sequence of words is a valid sentence in a language.
– From a methodological point of view, the computer will impose a framework to describe linguistic objects (words, for example) as well as the rules for use of these objects (such as syntactic rules). The way in which linguistic phenomena are described must be consistent with the system: any inconsistency in a description will inevitably produce an error (or “bug”).
– When linguistic descriptions have been entered into a computer, it can apply them to very large texts in order to extract from these texts examples or counterexamples that validate (or invalidate) these descriptions. A computer can thus be used as a scientific instrument (this is the corpus linguistics approach), as the telescope is in astronomy or the microscope in biology.
– Describing a language requires a great deal of descriptive work; software is used to help with the development of databases containing numerous linguistic objects as well as numerous grammar rules, much like engineers use computer-aided design (CAD) software to design cars, electronic circuits, etc. from libraries of components.
Finally, the description of certain linguistic phenomena makes it possible to construct NLP software applications. For example, if we have a complete list of the words in a language, we can build a spell-checker; if we have a list of rules of conjugation we can build an automatic conjugator. A list of morphological and phonological rules also makes it possible to suggest spelling corrections when the computer has detected errors, while a list of simple and compound terms can be used to build an automatic indexer. If we have bilingual dictionaries and grammars we can build an automatic translator, and so forth. Thus the computer has become an essential tool in linguistics, so much so that opposing “computational linguists” with “pure linguists” no longer makes sense.
When we characterize a phenomenon using mathematical rules, we formalize it. The formalization of a linguistic phenomenon consists of describing it, by storing both linguistic objects and rules in a computer. Languages are complicated to describe, partly because interactions between their phonological and writing systems have multiplied the number of objects to process, as well as the number of levels of combination rules. We can distinguish five fundamental levels of linguistic phenomena; each of these levels corresponds to a level of formalization.
To analyze a written text, we access letters of the alphabet rather than words; thus it is necessary to describe the link between the alphabet and the orthographic forms we wish to process (spelling). Next, we must establish a link between the orthographic forms and the corresponding vocabulary elements (morphology). Vocabulary elements are generally listed and described in a lexicon that must also show all potential ambiguities (lexicography). Vocabulary elements combine to build larger units such as phrases which then combine to form sentences; therefore rules of combination must be established (syntax). Finally, links between elements of meaning which form a predicate transcribed into an elementary sentence, as well as links between predicates in a complex sentence, must be established (semantics).
We do not always use language to represent and communicate information directly and simply; sometimes we play with language to create sonorous effects (for example in poetry). Sometimes we play with words, or leave some “obvious” information implicit because it stems from the culture shared by the speakers (anaphora). Sometimes we express one idea in order to suggest another (metaphor). Sometimes we use language to communicate statements about the real world or in scientific spheres, and sometimes we even say the opposite of what we really mean (irony).
It is important to clearly distinguish problems that can be solved within a strictly linguistic analytical framework from those that require access to information from other spheres in order to be solved.
Writers, poets, and authors of word games often take the liberty of constructing texts that violate the syntactic or semantic constraints of language. For example, consider the following text:
For her this rhyme is penned, whose luminous eyes
Brightly expressive as the twins of Leda,
Shall find her own sweet name, that nestling lies,
Upon the page, enwrapped from every reader.
This poem is an acrostic, meaning that it contains a puzzle which readers are invited to solve. We cannot rely on linguistic analysis to solve this puzzle. But, to even understand that the poem is a puzzle, the reader must figure out that this rhyme refers to the poem itself. Linguistic analysis is not intended to figure out what in the world this rhyme might be referring to; much less to decide among the possible candidates.
… luminous eyes brightly expressive as the twins of Leda …
The association between the adjective luminous and eyes is not a standard semantic relationship; unless the eyes belong to a robot, eyes are not luminous. This association is, of course, metaphorical: we have to understand that luminous eyes means that the owner of the eyes has a luminous intelligence, and that we are perceiving this luminous intelligence by looking at her eyes.
The twins of Leda are probably the mythological heroes Castor and Pollux (the twin sons of Leda, the wife of the king of Sparta), but they are not particularly known for being expressive. These two heroes gave their names to the constellation Gemini, but I confess that I do not understand what an expressive constellation might be. I suspect the author rather meant to write:
… expressive eyes brightly luminous as the twins of Leda …
The associations between the noun name and the verbal forms lies, nestling, and enwrapped are no more direct; we need to understand that it is the written form of the name which is present on the physical page where the poem is written, and that it is hidden from the reader.
If we wish to make a poetic analysis of this text, the first thing to do is thus to note these non-standard associations, so we will know where to run each poetic interpretive analysis. But if we do not even know that eyes are not supposed to be luminous, we will not be able to even figure out that there is a metaphor, therefore we will not be able to solve it (i.e. to compute that the woman in question is intelligent), and so we will have missed an important piece of information in the poem. More generally, in order to understand a poem’s meaning, we must first note the semantic violations it contains. To do this, we need a linguistic model capable of distinguishing “standard” associations such as an intelligent woman, a bright constellation, a name written on a page, etc. from associations requiring poetic analysis, such as luminous eyes, an expressive constellation, a name lying upon a page.
Analyzing poems can pose other difficulties, particularly at the lexical and syntactic levels. In standard English, word order is less flexible than in poems. To understand the meaning of this poem, a modern reader has to start by rewriting (in his or her mind) the text in standard English, for example as follows:
This rhyme is written for her, whose luminous eyes (as brightly expressive as the twins of Leda) will find her own sweet name, which lies on the page, nestling, enwrapped from every reader.
The objective of the project described in this book is to formalize standard language without solving poetic puzzles, or figuring out possible referents, or analyzing semantically nonstandard associations.
Stylistics studies ways of formulating sentences in speech. For example, in a text we study the use of understatements, metaphors, and metonymy (“figures of style”), the order of the components of a sentence and that of the sentences in a speech, and the use of anaphora. Here are a few examples of stylistic phenomena that cannot be processed in a strictly linguistic context:
Understatement: Joe was not the fastest runner in the race
Metaphor: The CEO is a real elephant
Metonymy: The entire table burst into laughter
In reality, the sentence Joe was not the fastest runner in the race could mean here that Joe came in last; so, in a way, this sentence does not literally say what it means! Unless we know the result of the race, or have access to information about the real Joe, we cannot expect a purely linguistic analysis system to detect understatements, irony or lies.
To understand the meaning of the sentence The CEO is a real elephant, we need to know firstly that a CEO cannot really be an elephant, and therefore that this is a metaphor. Next we need to figure out which “characteristic property” of elephants is being used in the metaphor. Elephants are known for several things: they are big, strong, and clumsy; they have long memories; they are afraid of mice; they are an endangered species; they have big ears; they love to take mud-baths; they live in Africa or India, etc. Is the CEO clumsy? Is he/she afraid of mice? Does he/she love mud-baths? Does he/she have a good memory? To understand this statement, we would have to know the context in which the sentence was said, and we might also need to know more about the CEO in question.
To understand the meaning of the sentence The entire table burst into laughter, it is necessary first to know that a table is not really capable of bursting into laughter, and then to infer that there are people gathered around a table (during a meal or a work meeting) and that it is these people who burst out laughing. The noun table is neither a collective human noun (such as group or colony), nor a place that typically contains humans (such as meeting room or restaurant), nor an organization (such as association or bank); therefore using only the basic lexical properties associated with the noun table will not be enough to comprehend the sentence.
It is quite reasonable to expect a linguistic system to detect that the sentences The CEO is a real elephant and The entire table burst into laughter are not standard sentences; for example, by describing CEO as a human noun, describing table as a concrete noun, and requiring the verb to burst into laughter to take a human subject, we can learn from a linguistic analysis that these sentences are not “standard”, and that it is therefore necessary to initiate an extra-linguistic computation, such as a metaphor or metonymy calculation, in order to interpret them.
The linguistic project described in this book is not intended to solve understatements, metaphors, or metonymy, but it must be able to detect sentences that are deviant in comparison to the standard language.
Coreference: Lea invited Ida for dinner. She brought a bottle of wine.
Anaphora: Phelps returned. The champion brought back 6 medals with him.
Semantic ambiguity: The round table is in room B17.
In order to understand that in the sentence She brought a bottle of wine, she refers to Ida and not Lea, we need to know that it is usually the guest who travels and brings a bottle of wine. This social convention is commonplace throughout the modern Western world, but we would need to be sure that this story does not take place in a society where it is the person who invites who brings beverages.
In order to understand that The champion is a reference to Phelps, we have to know that Phelps is a champion. Note that dozens of other nouns could have been used in this anaphora: the American, the medal-winner, the record-holder, the swimming superstar, the young man, the swimmer, the former University of Florida student, the breakaway, the philanthropist, etc.
In order to eliminate the ambiguity of the sequence round table (between “a table with a round shape” and “a meeting”), we would need to have access to a wider context than the sentence alone.
The linguistic project described in this book is not intended to resolve anaphora or semantic ambiguities.
NOTE. – I am not saying that it is impossible to process poetry, word games, understatements, metaphors, metonymy, coreference, anaphora, and semantic ambiguities; I am only saying that these phenomena lie outside the narrow context of the project presented in this book. There are certainly “lucky” cases in which linguistic software can automatically solve some of these phenomena. For example, in the following sequence:
Joe invited Lea for dinner. She brought a bottle of wine
a simple verification of the pronoun’s gender would enable us to connect She to Lea. Conversely, it is easy to build software which, based on the two sentences Joe invited Lea to dinner and Lea brought a bottle of wine, would produce the sentence She brought a bottle of wine. Likewise, in the sentence:
The round table is taking place in room B17
a linguistic parser could automatically figure out that the noun round table refers to a meeting, provided that it has access to a dictionary in which the noun round table is described as being an abstract noun (synonymous with meeting), and the verb to take place is described as calling for an abstract subject.
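For such “lucky” cases, the gender check mentioned above is easy to sketch. The few lines of Python below use a toy gender lexicon invented for the example; in a real system this information would come from an electronic dictionary:

# Resolve a pronoun by keeping only antecedents that agree in gender.
GENDER = {"joe": "m", "lea": "f", "he": "m", "she": "f"}   # toy lexicon

def resolve(pronoun, candidates):
    """Return the candidate antecedents whose gender matches the pronoun's."""
    g = GENDER.get(pronoun.lower())
    return [c for c in candidates if GENDER.get(c.lower()) == g]

print(resolve("She", ["Joe", "Lea"]))   # ['Lea']: only one candidate agrees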
Consider the following statements:
a) Two dollars plus three dollars make four dollars.
b) Clinton was already president in 1536.
c) The word God has four letters.
d) This sentence is false.
These statements are expressed using sentences that are well-formed because they comply with the spelling, morphological, syntactic, and semantic rules of the English language. However, they express statements that are incorrect in terms of mathematics (a), history (b), spelling (c), or logic (d). To detect these errors, we would need access to knowledge that is not part of our strictly linguistic project.
The project described in this book is confined to the formalization of language, without taking into account speakers’ knowledge about the real world.
Of course, there are fantastic software applications capable of processing extralinguistic problems! For example, the IBM computer Watson won on the game show Jeopardy! in spectacular fashion in 2011; I have a lot of fun asking my smart watch questions. In the car, I regularly ask Google Maps to guide me verbally to my destination; my language-professor colleagues have trouble keeping their students from using Google Translate; and the subtitles added automatically to YouTube videos are a precious resource for people who are hard of hearing [GRE 11], etc.
All of these software platforms have an NLP component, which analyzes or produces a written or spoken statement, and is often accompanied by a specialized module, for example a search engine or GPS navigation software. It is important to distinguish between these components: the fact that Google Maps gives us reliable directions does not mean that it speaks perfect English. It is very possible that IBM Watson can answer a question correctly without having really “understood” the question. Likewise, a software platform might automatically summarize a text using simple techniques to filter out words, phrases or sentences it judges to be unimportant [MAN 01]. Speech-recognition systems use signal-processing techniques to produce a sequence of phonemes and then determine the most “probable” corresponding sequence of words by comparing it to a reference database [JUR 00], etc.
Most pieces of NLP software actually produce spectacular, almost magical results, with a very low degree of linguistic competence. To produce these results, the software uses “tricks”, often based on statistical methods [MAN 99].
Unfortunately, the success of these software platforms is often used to argue that statistical techniques have made linguistics unnecessary. It is important, then, to understand their limitations. In the next two sections I will take a closer look at the performance of the two “flagship” statistical NLP applications: automatic translation and part-of-speech tagging.
Figure 1.2. Really?
Today, the best-known translation software platforms use statistical techniques to suggest translations of texts. These software platforms are regularly cited as examples of the success of statistical techniques, and everyone has seen a “magical” translation demo. It is not surprising, therefore, that most people think the problem of translation has already been solved. I do not agree.
Figure 1.3. Vietnamese–English translation with Google Translate
For example, Figure 1.3 shows how Google Translate translated part of an article in Vietnamese (www.thanhniennews.com, October 2014). The text produced does not make much sense: what does “I’m talking to him on the Navy’s generals” mean? The translated text even contains incorrect constructions (“could very well shielded”, for example).
Figure 1.4 allows us to compare Google Translate’s results with those obtained using a simple Arabic-English dictionary and a very basic translation grammar; see [BAN 15]. For example, the first sentence was wrongly translated by Google Translate as The man who went down to work instead of The man who fell went to work. Google Translate was also wrong about the second sentence, which means The man that you knew went to work and not I knew a man who went to work.
Figure 1.4. Translation with Google Translate vs. with NooJ
When translating between languages that are more similar, such as French into English, the results produced by Google Translate are helpful, but they still could not be used in a professional context, to translate a novel, a technical report or even a simple letter, and especially not to submit a résumé.
Figure 1.5. Article from the newspaper Le Monde (October 2014) translated with Google Translate
Let us look in detail at the result produced by Google Translate. None of the English sentences produced is correct:
– The first sentence has an opposite meaning; the expression ils le font savoir has been translated as they know it instead of they make it known.
– The second sentence has an ungrammatical sequence: …which includes presented… The term action has been wrongly translated as share instead of action.
– In the third sentence, the verb summarized is placed incorrectly; essais cliniques avec des traitements expérimentaux should be translated as clinical trials with experimental treatments; and utilisable should be translated as useable or useful and not as used.
– The term results is placed incorrectly in the last sentence.
To be fair, it should be noted that every attempt to construct good-quality automatic translation software has failed, including those based on linguistic techniques, such as the European program Eurotra (1978–1992). It is my belief that the reasons for Eurotra’s failure have to do with certain scientific and technical choices (as well as real problems with management), and not with a theoretical impossibility of using linguistics to do translations.
I will turn now to another flagship application, less familiar to the general public than machine translation but just as spectacular for NLP specialists: part-of-speech tagging.
Part-of-speech (POS) tagging is often presented as the basic application of any piece of NLP software, and it has historically justified the sidelining of linguistic methods in favor of stochastic ones. The authors of tagging software frequently claim 95% precision; these results seem “magical” too, since POS taggers use neither dictionaries nor grammars, yet analyze the words of any text with great precision. Linguists have difficulty justifying their painstaking descriptive work when shown what a computer can do by itself, without any linguistic data! It is also commonplace to hear that taggers’ results prove that statistical techniques have bypassed linguistic ones; for example:
Automatic part of speech tagging is an area of natural language processing where statistical techniques have been more successful than rule-based methods. [BRI 92]
In their course on NLP (available on YouTube as of December 2015), Dan Jurafsky and Chris Manning consider the problem of the construction of POS taggers as “mostly solved”; more generally, NLP researchers use the spectacular results produced by statistical taggers to validate the massive use of statistical techniques in all NLP applications, always to the detriment of the linguistic approach.
A POS tagger is a program that automatically links each word in a text with a “tag”, in practice its POS category: noun, verb, adjective, etc. To do this, taggers use reference corpora that have been manually tagged. To analyze a text, the tagger examines the context of each word and compares it with the contexts of the occurrences of the same word in the reference corpus, in order to deduce which tag should be linked with the word.
Figure 1.6 shows, for example, an extract from Penn Treebank, one of the reference corpora used by English POS taggers.
Figure 1.6. Extract from Penn Treebank
I do not believe that taggers should be considered as successes, and here are my reasons why.
The number of unambiguous words is so large, compared to the very small number of tags used by taggers, that a simple program that tagged each word by copying its most frequent tag in the reference corpus would already achieve a precision greater than 90% [CHA 97].
For example, in English the words my, his, the (always determiners), at, from, of, with (always prepositions), him, himself, it, me, she, them, you (always pronouns), and, or (always conjunctions), again, always, not, rather, too (always adverbs), am, be, do, have (always verbs), and day, life, moment, thing (always nouns) are extremely frequent but have only a single possible tag, and thus are always tagged correctly.
The vast majority of ambiguous words are actually favored in terms of analysis; for example, in most of their occurrences, the forms age, band, card, detail, eye, etc. represent nouns and not the verbs to age, to band, to card, to detail, to eye, etc. A software platform systematically disregarding the rare verbal hypothesis for these words will therefore almost never be wrong.
Under these conditions, obtaining a 95% correct result when a simple copy already yields 90% precision is not really spectacular; on average the tagger gets only one difficult case out of two right, which is more like a coin-toss than a feat of “learning”.
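To make the comparison concrete, the 90% baseline is only a few lines of code. The sketch below (Python; the miniature tagged corpus and tag names are invented for the example, real taggers train on corpora such as Penn Treebank) tags every word with its most frequent tag:

from collections import Counter, defaultdict

# A miniature "reference corpus" of (word, tag) pairs.
corpus = [("the", "DET"), ("age", "NOUN"), ("of", "PREP"), ("reason", "NOUN"),
          ("they", "PRON"), ("age", "NOUN"), ("age", "VERB"), ("well", "ADV")]

counts = defaultdict(Counter)
for word, tag in corpus:
    counts[word][tag] += 1

def baseline_tag(word):
    """Most frequent tag for the word in the corpus; NOUN for unknown words."""
    return counts[word].most_common(1)[0][0] if word in counts else "NOUN"

print([baseline_tag(w) for w in ["they", "age", "well"]])
# ['PRON', 'NOUN', 'ADV']: "age" always gets NOUN, its majority tag, even when it is a verb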
The degree of precision claimed by taggers is, in reality, not that impressive.
Taggers do not take into account multiword units or expressions, even though these occur frequently in texts. In the Penn Treebank extract shown in Figure 1.6, the compound noun industrial managers, the phrasal verb to buck up, the compound determiner a boatload of, the compound noun samurai warrior, the expression to blow N ashore, the adverb from the beginning, and the expression it takes N to V-inf have all simply been disregarded.
However, processing industrial manager as a sequence of two linguistic units does not make any sense: an industrial manager is not a manager who
