Formalizing Natural Languages

Max Silberztein

Description

This book is at the very heart of linguistics. It provides the theoretical and methodological framework needed to create a successful linguistic project. Potential applications of descriptive linguistics include spell-checkers, intelligent search engines, information extractors and annotators, automatic summary producers, automatic translators, and more. These applications have considerable economic potential, so it is important for linguists to make use of these technologies and to be able to contribute to them. The author gives linguists tools to help them formalize natural languages, and to build software able to automatically process texts written in natural language (Natural Language Processing, or NLP). Computers are vital to this effort: characterizing a phenomenon using mathematical rules amounts to formalizing it. NooJ, a linguistic development environment developed by the author, is described and applied to practical examples of NLP.


Page count: 392

Publication year: 2016




Table of Contents

Cover

Dedication

Title

Copyright

Acknowledgments

1 Introduction: the Project

1.1. Characterizing a set of infinite size

1.2. Computers and linguistics

1.3. Levels of formalization

1.4. Not applicable

1.5. NLP applications

1.6. Linguistic formalisms: NooJ

1.7. Conclusion and structure of this book

1.8. Exercises

1.9. Internet links

PART 1: Linguistic Units

2 Formalizing the Alphabet

2.1. Bits and bytes

2.2. Digitizing information

2.3. Representing natural numbers

2.4. Encoding characters

2.5. Alphabetical order

2.6. Classification of characters

2.7. Conclusion

2.8. Exercises

2.9. Internet links

3 Defining Vocabulary

3.1. Multiple vocabularies and the evolution of vocabulary

3.2. Derivation

3.3. Atomic linguistic units (ALUs)

3.4. Multiword units versus analyzable sequences of simple words

3.5. Conclusion

3.6. Exercises

3.7. Internet links

4 Electronic Dictionaries

4.1. Could editorial dictionaries be reused?

4.2. LADL electronic dictionaries

4.3. Dubois and Dubois-Charlier electronic dictionaries

4.4. Specifications for the construction of an electronic dictionary

4.5. Conclusion

4.6. Exercises

4.7. Internet links

PART 2: Languages, Grammars and Machines

5 Languages, Grammars, and Machines

5.1. Definitions

5.2. Generative grammars

5.3. Chomsky-Schützenberger hierarchy

5.4. The NooJ approach

5.5. Conclusion

5.6. Exercises

5.7. Internet links

6 Regular Grammars

6.1. Regular expressions

6.2. Finite-state graphs

6.3. Non-deterministic and deterministic graphs

6.4. Minimal deterministic graphs

6.5. Kleene’s theorem

6.6. Regular expressions with outputs and finite-state transducers

6.7. Extensions of regular grammars

6.8. Conclusion

6.9. Exercises

6.10. Internet links

7 Context-Free Grammars

7.1. Recursion

7.2. Parse trees

7.3. Conclusion

7.4. Exercises

7.5. Internet links

8 Context-Sensitive Grammars

8.1. The NooJ approach

8.2. NooJ contextual constraints

8.3. NooJ variables

8.4. Conclusion

8.5. Exercises

8.6. Internet links

9 Unrestricted Grammars

9.1. Linguistic adequacy

9.2. Conclusion

9.3. Exercise

9.4. Internet links

PART 3: Automatic Linguistic Parsing

10 Text Annotation Structure

10.1. Parsing a text

10.2. Annotations

10.3. Text annotation structure (TAS)

10.4. Exercise

10.5. Internet links

11 Lexical Analysis

11.1. Tokenization

11.2. Word forms

11.3. Morphological analyses

11.4. Multiword unit recognition

11.5. Recognizing expressions

11.6. Conclusion

11.7. Exercise

12 Syntactic Analysis

12.1. Local grammars

12.2. Structural grammars

12.3. Conclusion

12.4. Exercises

12.5. Internet links

13 Transformational Analysis

13.1. Implementing transformations

13.2. Theoretical problems

13.3. Transformational analysis with NooJ

13.4. Question answering

13.5. Semantic analysis

13.6. Machine translation

13.7. Conclusion

13.8. Exercises

13.9. Internet links

Conclusion

Bibliography

Index

End User License Agreement


List of Illustrations

1 Introduction: the Project

Figure 1.1.

The number of any set of sentences can be doubled

Figure 1.2.

Really?

Figure 1.3.

Vietnamese–English translation with Google Translate

Figure 1.4.

Translation with Google Translate vs. with NooJ

Figure 1.5.

Article from the newspaper Le Monde (October 2014) translated with Google Translate

Figure 1.6.

Extract from Penn Treebank

Figure 1.7.

A single tool for formalization: NooJ

2 Formalizing the Alphabet

Figure 2.1.

Two electrical states: a light bulb turned on or off

Figure 2.2.

Representation of numbers in binary notation

Figure 2.3.

Extract from the Unicode table

Figure 2.4.

One possible encoding of the Latin alphabet

Figure 2.5.

ASCII encoding

Figure 2.6.

Accented Latin letters

Figure 2.7.

Character encoding is still problematic as of late 2015

Figure 2.8.

Unicode representation of the character “é”

Figure 2.9.

A Chinese character that has no Unicode code

Figure 2.10.

One Chinese character has three Unicode codes

Figure 2.11.

Four graphical variants for a single Unicode code

3 Defining Vocabulary

Figure 3.1.

Phablet, Bushism, Chipotlification, tocoupify

4 Electronic Dictionaries

Figure 4.1.

Analysis of the lexical entry “artisan”

Figure 4.2.

A lexicon-grammar table for English verbs

Figure 4.3.

Lexicon-grammar table for phrasal verbs

Figure 4.4.

Extract from DELAC dictionary (Nouns)

Figure 4.5.

Le Dictionnaire électronique des mots

Figure 4.6.

Les Verbes Français dictionary

Figure 4.7.

T grammar of constructions

Figure 4.8.

Occurrences of the verb abriter in a direct transitive construction (T)

Figure 4.9.

Occurrences of the verb “abriter” in a pronominal construction (P)

5 Languages, Grammars, and Machines

Figure 5.1.

A generative grammar

Figure 5.2.

Generation of the sentence “the cat sees a dog”

Figure 5.3.

Chomsky-Schützenberger hierarchy

6 Regular Grammars

Figure 6.1.

Applying a regular expression to a text

Figure 6.2.

Display of a graph using XFST

Figure 6.3.

Informal time

Figure 6.4.

A non-deterministic graph

Figure 6.5.

A deterministic graph

Figure 6.6.

A minimal graph

Figure 6.7.

Five basic graphs

Figure 6.8.

Disjunction and Kleene operator

Figure 6.9.

Graph equivalent to a regular expression

Figure 6.10.

A finite-state graph

Figure 6.11.

Incorporating the node “red”

Figure 6.12.

Incorporating the node “pretty”

Figure 6.13.

Completing the node “very”

Figure 6.14.

Final graph

Figure 6.15.

A spelling transducer

Figure 6.16.

A terminological transducer

Figure 6.17.

A morphological transducer

Figure 6.18.

A transducer for translation

Figure 6.19.

A query containing syntactic symbols

Figure 6.20.

The operator +ONE

7 Context-Free Grammars

Figure 7.1.

A NooJ context-free grammar

Figure 7.2.

A context-free grammar with syntactic symbols

Figure 7.3.

Recursive Graph

Figure 7.4.

A more general grammar

Figure 7.5.

A recursive context-free grammar

Figure 7.6.

Right recursive grammar

Figure 7.7.

Finite-state graph equivalent to a right-recursive context-free grammar

Figure 7.8.

Left recursive grammar

Figure 7.9.

Finite-state graph equivalent to a left-recursive context-free grammar

Figure 7.10.

Middle recursion

Figure 7.11.

An ambiguous grammar

Figure 7.12.

First parse tree for the ambiguous sentence: This man sees the chair from his house

Figure 7.13.

Second derivation for the sentence: This man sees the chair from his house

8 Context-Sensitive Grammars

Figure 8.1.

Context-sensitive grammar for the language a^n b^n c^n

Figure 8.2.

NooJ grammar for the language a^n b^n c^n

Figure 8.3.

NooJ grammar that recognizes the language a^n b^n c^n d^n e^n

Figure 8.4.

Grammar of the language a^(2^n)

Figure 8.5.

Grammar that recognizes reduplications

Figure 8.6.

A German finite-state graph to describe agreement in gender, number and case.

Figure 8.7.

Agreement with constraints

Figure 8.8.

Morphological context-sensitive grammar

Figure 8.9.

Checking the presence of a question mark

Figure 8.10.

Setting a variable

Figure 8.11.

Inheritance: $N → $NPH

9 Unrestricted Grammars

Figure 9.1.

Unrestricted grammar

Figure 9.2.

NooJ unrestricted grammar

Figure 9.3.

Respectively

10 Text Annotation Structure

Figure 10.1.

Annotations for the ambiguous sequence “black box”

Figure 10.2.

The two terms “big screen” and “screen star” overlap

Figure 10.3.

Annotating the contracted form “cannot”

Figure 10.4.

Annotating the phrasal verb “call back”

Figure 10.5.

A TAS right after the lexical analysis

11 Lexical Analysis

Figure 11.1.

Ambiguity triggered by the lack of vowels

Figure 11.2.

Hebrew and Latin alphabets together in the same text

Figure 11.3.

Itogi Weekly no. 40, October 3rd 2011

Figure 11.4.

Transliteration variants

Figure 11.5.

Contractions

Figure 11.6.

Contractions of “not”

Figure 11.7.

Prefixes

Figure 11.8.

Numerical determiners

Figure 11.9.

Multiple solutions for breaking down a Chinese text

Figure 11.10.

Intonation in Armenian

Figure 11.11.

Recognizing US Phone Numbers

Figure 11.12.

Roman numerals

Figure 11.13.

Paradigm TABLE

Figure 11.14.

Inflection codes used in the English NooJ module

Figure 11.15.

Paradigm HELP

Figure 11.16.

Paradigm for KNOW

Figure 11.17.

Morphological operators

Figure 11.19.

Paradigm NN

Figure 11.20.

France and its derived forms

Figure 11.21.

Dictionary produced automatically from a morphological grammar

Figure 11.22.

A productive morphological rule

Figure 11.23.

Description of Spanish clitics (infinitive form)

Figure 11.24.

Agglutination in German

Figure 11.25.

A family of terms

Figure 11.26.

Checking context for the characteristic constituent

Figure 11.27.

Checking context, v2

Figure 11.28.

Checking context, v3

Figure 11.29.

Annotate phrasal verbs

Figure 11.30.

Discontinuous annotation in the TAS

12 Syntactic Analysis

Figure 12.1.

A local grammar for common email addresses

Figure 12.2.

Graph “on the 3rd of June”

Figure 12.3.

Graph “at seven o’clock”

Figure 12.4.

Date grammar

Figure 12.5.

A syntactic annotation in TAS

Figure 12.6.

Grammar of preverbal particles in French

Figure 12.7.

Detecting ambiguities in the word form “this”

Figure 12.8.

A syntax tree

Figure 12.9.

Structure of a sentence that contains a discontinuous expression

Figure 12.10.

A grammar produces structured annotations

Figure 12.11.

A structured group of syntactic annotations

Figure 12.12.

Syntactic analysis of a lexically ambiguous sentence

Figure 12.13.

Analyzing a structurally ambiguous sentence

Figure 12.14.

Simplified grammar

Figure 12.15.

Another parse tree for a simplified grammar

Figure 12.16.

Parse tree for a structured grammar

Figure 12.17.

Dependency grammar

Figure 12.18.

Dependency tree

Figure 12.19.

ALUs in the syntax tree

13 Transformational Analysis

Figure 13.1.

The sentence Joe loves Lea is transformed automatically

Figure 13.2.

Passive

Figure 13.3.

Negation

Figure 13.4.

Making the subject into a pronoun

Figure 13.5.

A few elementary transformations

Figure 13.6.

The operation [Passive-inv]

Figure 13.7.

A transformation chain

Figure 13.8.

Grammar for declarative transitive sentences

Figure 13.9.

Grammar used in mixed “analysis + generation” mode

Figure 13.10.

Linking complex sentences to their transformational properties

Figure 13.11.

Automatic transformation

Figure 13.12.

Simple French → English translation

Figure 13.13.

Translation changing the word order

Figure 13.14.

Translation with constraints


For Nadia “Nooj” Malinovich Silberztein, the Mensch of the family, without whom neither this book, nor the project named after her, would have happened.

And for my two children, Avram and Rosa, who remind me every day of the priorities in my life.

Series Editor

Patrick Paroubek

Formalizing Natural Languages

The NooJ Approach

Max Silberztein

First published 2016 in Great Britain and the United States by ISTE Ltd and John Wiley & Sons, Inc.

Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms and licenses issued by the CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the undermentioned address:

ISTE Ltd
27-37 St George’s Road
London SW19 4EU
UK
www.iste.co.uk

John Wiley & Sons, Inc.
111 River Street
Hoboken, NJ 07030
USA
www.wiley.com

© ISTE Ltd 2016

The rights of Max Silberztein to be identified as the author of this work have been asserted by him in accordance with the Copyright, Designs and Patents Act 1988.

Library of Congress Control Number: 2015957115

British Library Cataloguing-in-Publication Data
A CIP record for this book is available from the British Library
ISBN 978-1-84821-902-1

Acknowledgments

I would like to thank the University of Franche-Comté and my colleagues in the ELLIADD laboratory for believing in the NooJ project and supporting the community of NooJ users unfailingly since its inception.

It would be impossible for me to mention every single one of the colleagues and students who have participated, in one way or another, in the extremely ambitious project described in this book – that of formalizing natural languages! The NooJ software has been in use since 2002 by a community of researchers and students; see www.nooj4nlp.net. NooJ was developed in direct cooperation with all its users who devoted their energy to this or that specific problem, or to one language or another. Spelling in Semitic languages, variation in Asian languages, intonation in Armenian, inflection in Hungarian, phrasal verbs in English, derivation in Slavic languages, composition in Greek and in Germanic languages, etc. pose a wide variety of linguistic problems, and without the high standards of these linguists the NooJ project would never have known the success it is experiencing today. Very often, linguistic questions that seemed “trivial” at the time have had a profound influence on the development of NooJ.

Among its users, there are some “NooJ experts” to whom I would like to give particular thanks, as they participated directly in its design, and had the patience to help me with long debugging sessions. I thank them for their ambition and their patience: Héla Fehri, Kristina Kocijan, Slim Mesfar, Cristina Mota, and Simonetta Vietri.

I would also like to thank Danielle Leeman and François Trouilleux for their detailed review of the original book, and Peter Machonis for his review of the English version, as well as for verifying the relevance of the English examples, which contributed greatly to the quality of this book.

Max SILBERZTEIN

November 2015

1 Introduction: the Project

The project described in this book is at the very heart of linguistics; its goal is to describe, exhaustively and with absolute precision, all the sentences of a language likely to appear in written texts1. This project fulfills two needs: it provides linguists with tools to help them describe languages exhaustively (linguistics), and it aids in the building of software able to automatically process texts written in natural language (natural language processing, or NLP).

A linguistic project2 needs to have a theoretical and methodological framework (how to describe this or that linguistic phenomenon; how to organize the different levels of description); formal tools (how to write each description); development tools to test and manage each description; and engineering tools to be used in sharing, accumulating, and maintaining large quantities of linguistic resources.

There are many potential applications of descriptive linguistics for NLP: spell-checkers, intelligent search engines, information extractors and annotators, automatic summary producers, automatic translators, etc. These applications have the potential for considerable economic usefulness, and it is therefore important for linguists to make use of these technologies and to be able to contribute to them.

For now, we must reduce the overall linguistic project of describing all phenomena related to the use of language to a much more modest one: here, we will confine ourselves to describing the set of all of the sentences that may be written or read in natural-language texts. The goal, then, is simply to design a system capable of distinguishing between the two sequences below:

a)

Joe is eating an apple

b)

Joe eating apple is an

Sequence (a) is a grammatical sentence, while sequence (b) is not.

This project constitutes the mandatory foundation for any more ambitious linguistic projects. Indeed it would be fruitless to attempt to formalize text styles (stylistics), the evolution of a language across the centuries (etymology), variations in a language according to social class (sociolinguistics), cognitive phenomena involved in the learning or understanding of a language (psycholinguistics), etc. without a model, even a rudimentary one, capable of characterizing sentences.

If the number of sentences were finite – that is, if there were a maximum number of sentences in a language – we would be able to list them all and arrange them in a database. To check whether an arbitrary sequence of words is a sentence, all we would have to do is consult this database: it is a sentence if it is in the database, and otherwise it is not. Unfortunately, there are an infinite number of sentences in a natural language. To convince ourselves of this, let us resort to a reductio ad absurdum: imagine for a moment that there are n sentences in English.

Based on this finite number n of initial sentences, we can construct a second set of sentences by putting the sequence Lea thinks that, for example, before each of the initial sentences:

Joe is sleeping → Lea thinks that Joe is sleeping

The party is over → Lea thinks that the party is over

Using this simple mechanism, we have just doubled the number of sentences, as shown in the figure below.

Figure 1.1. The number of any set of sentences can be doubled

This mechanism can be generalized by using verbs other than the verb to think; for example:

Lea (believes | claims | dreams | knows | realizes | thinks | …) that Sentence.

There are several hundred verbs that could be used here. Likewise, we could replace Lea with several thousand human nouns:

(The CEO | The employee | The neighbor | The teacher | …) thinks that Sentence.

Whatever the size n of an initial set of sentences, we can thus construct n × 100 × 1,000 sentences simply by inserting, before each of the initial sentences, sequences such as Lea thinks that, Their teacher claimed that, My neighbor declared that, etc.

Language has other mechanisms that can be used to expand a set of sentences exponentially. For example, based on n initial sentences, we can construct n × n sentences by combining all of these sentences in pairs and inserting the word and between them. For example:

It is raining + Joe is sleeping → It is raining and Joe is sleeping

This mechanism can also be generalized by using several hundred connectors; for example:

It is raining (but | nevertheless | therefore | where | while |…) Joe is sleeping.

These two mechanisms (linking of sentences and use of connectors) can be used multiple times in a row, as in the following:

Lea claims that Joe hoped that Ida was sleeping. It was raining while Lea was sleeping, however Ida is now waiting, but the weather should clear up as soon as night falls.

Thus these mechanisms are said to be recursive; the number of sentences that can be constructed with recursive mechanisms is infinite. Therefore it would be impossible to define all of these sentences in extenso. Another way must be found to characterize the set of sentences.
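The two growth mechanisms just described (embedding under a verb of thought, and conjunction with a connector) can be sketched in a few lines of Python. The tiny lexicon below is purely illustrative; the book cites several hundred usable verbs and several thousand human nouns:

```python
# A sketch of the two growth mechanisms described above.
# The mini-lexicon is illustrative, not taken from the book.
sentences = {"Joe is sleeping", "The party is over"}
verbs = ["thinks", "believes", "claims"]
nouns = ["Lea", "The CEO", "The neighbor"]

def embed(sentences):
    """Insert 'Noun verb that' before each sentence:
    the set grows by a factor of len(nouns) * len(verbs)."""
    return {f"{n} {v} that {s}" for n in nouns for v in verbs for s in sentences}

def conjoin(sentences):
    """Join every pair of sentences with 'and': the set size is squared."""
    return {f"{a} and {b}" for a in sentences for b in sentences}

# 2 initial sentences yield 2 * 3 * 3 = 18 embedded sentences and
# 2 * 2 = 4 conjoined ones; since both functions can be applied to their
# own output, the number of constructible sentences has no upper bound.
```

Because embed and conjoin accept their own output, iterating them models the recursion that makes the set of sentences infinite.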

1.1. Characterizing a set of infinite size

Mathematicians have known for a long time how to define sets of infinite size. For example, the two rules below can be used to define the set of all natural numbers:

(a) Each of the ten elements of set {0, 1, 2, 3, 4, 5, 6, 7, 8, 9} is a natural number;

(b) any word that can be written as xy is a natural number if and only if its two constituents x and y are natural numbers.

These two rules constitute a formal definition of all natural numbers. They make it possible to distinguish natural numbers from any other object (decimal numbers or others). For example:

– Is the word “123” a natural number? Thanks to rule (a), we know that “1” and “2” are natural numbers. Rule (b) allows us to deduce from this that “12” is a natural number. Thanks to rule (a) we know that “3” is a natural number; since “12” and “3” are natural numbers, then rule (b) allows us to deduce that “123” is a natural number.

– The word “2.5” is not a natural number. Rule (a) enables us to deduce that “2” is a natural number, but it does not apply to the decimal point “.”. Rule (b) can only apply to two natural numbers, therefore it does not apply to the decimal point because it is not a natural number. In this case, “2.” is not a natural number; therefore “2.5” is not a natural number either.
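Rules (a) and (b) are directly executable. A minimal Python recognizer implementing only these two rules, sketched below, reproduces the two walkthroughs above; the function name is our own, not the book’s:

```python
DIGITS = set("0123456789")

def is_natural_number(word):
    """Decide membership using only the two rules above:
    (a) each of the ten digits is a natural number;
    (b) xy is a natural number iff x and y both are."""
    if word == "":
        return False
    if len(word) == 1:
        return word in DIGITS            # rule (a)
    # rule (b): split off the first character and recurse on the rest
    return is_natural_number(word[0]) and is_natural_number(word[1:])
```

is_natural_number("123") succeeds by the same chain of deductions as in the text, while is_natural_number("2.5") fails because rule (a) does not cover the decimal point and rule (b) only combines natural numbers.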

There is an interesting similarity between this definition of a set and the problem of characterizing the sentences in a language:

– Rule (a) describes in extenso the finite set of numerals that must be used to form valid natural numbers. This rule resembles a dictionary in which we would list all the words that make up the vocabulary of a language.

– Rule (b) explains how numerals can be combined to construct an infinite number of natural numbers. This rule is similar to grammatical rules that specify how to combine words in order to construct an infinite number of sentences.

To describe a natural language, then, we will proceed as follows: firstly we will define in extenso the finite number of basic units in a language (its vocabulary); and secondly, we will list the rules used to combine the vocabulary elements in order to construct sentences (its grammar).

1.2. Computers and linguistics

Computers are a vital tool for this linguistic project, for at least four reasons:

– From a theoretical point of view, a computer is a device that can verify automatically that an element is part of a mathematically-defined set. Our goal is then to construct a device that can automatically verify whether a sequence of words is a valid sentence in a language.

– From a methodological point of view, the computer will impose a framework to describe linguistic objects (words, for example) as well as the rules for use of these objects (such as syntactic rules). The way in which linguistic phenomena are described must be consistent with the system: any inconsistency in a description will inevitably produce an error (or “bug”).

– When linguistic descriptions have been entered into a computer, a computer can apply them to very large texts in order to extract from these texts examples or counterexamples that validate (or not) these descriptions. Thus a computer can be used as a scientific instrument (this is the corpus linguistics approach), as the telescope is in astronomy or the microscope in biology.

– Describing a language requires a great deal of descriptive work; software is used to help with the development of databases containing numerous linguistic objects as well as numerous grammar rules, much like engineers use computer-aided design (CAD) software to design cars, electronic circuits, etc. from libraries of components.

Finally, the description of certain linguistic phenomena makes it possible to construct NLP software applications. For example, if we have a complete list of the words in a language, we can build a spell-checker; if we have a list of rules of conjugation we can build an automatic conjugator. A list of morphological and phonological rules also makes it possible to suggest spelling corrections when the computer has detected errors, while a list of simple and compound terms can be used to build an automatic indexer. If we have bilingual dictionaries and grammars we can build an automatic translator, and so forth. Thus the computer has become an essential tool in linguistics, so much so that opposing “computational linguists” with “pure linguists” no longer makes sense.
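The first of these applications really is that direct: once the word list exists, a basic spell-checker reduces to dictionary lookup. In the sketch below the five-word lexicon is a stand-in for a real electronic dictionary of the kind discussed in Chapter 4, and the function name is our own:

```python
# Toy lexicon standing in for a full electronic dictionary
# (Chapter 4 discusses how such dictionaries are actually built).
lexicon = {"the", "cat", "sees", "a", "dog"}

def misspelled(text):
    """Return the word forms of `text` that are absent from the lexicon."""
    return [w for w in text.lower().split() if w not in lexicon]
```

Here misspelled("The cat seees a dog") flags only "seees". A real spell-checker would of course also need tokenization and morphology (Chapter 11) so that inflected forms are not flagged as errors.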

1.3. Levels of formalization

When we characterize a phenomenon using mathematical rules, we formalize it. The formalization of a linguistic phenomenon consists of describing it, by storing both linguistic objects and rules in a computer. Languages are complicated to describe, partly because interactions between their phonological and writing systems have multiplied the number of objects to process, as well as the number of levels of combination rules. We can distinguish five fundamental levels of linguistic phenomena; each of these levels corresponds to a level of formalization.

To analyze a written text, we access letters of the alphabet rather than words; thus it is necessary to describe the link between the alphabet and the orthographic forms we wish to process (spelling). Next, we must establish a link between the orthographic forms and the corresponding vocabulary elements (morphology). Vocabulary elements are generally listed and described in a lexicon that must also show all potential ambiguities (lexicography). Vocabulary elements combine to build larger units such as phrases which then combine to form sentences; therefore rules of combination must be established (syntax). Finally, links between elements of meaning which form a predicate transcribed into an elementary sentence, as well as links between predicates in a complex sentence, must be established (semantics).

1.4. Not applicable

We do not always use language to represent and communicate information directly and simply; sometimes we play with language to create sonorous effects (for example in poetry). Sometimes we play with words, or leave some “obvious” information implicit because it stems from the culture shared by the speakers (anaphora). Sometimes we express one idea in order to suggest another (metaphor). Sometimes we use language to communicate statements about the real world or in scientific spheres, and sometimes we even say the opposite of what we really mean (irony).

It is important to clearly distinguish problems that can be solved within a strictly linguistic analytical framework from those that require access to information from other spheres in order to be solved.

1.4.1. Poetry and plays on words

Writers, poets, and authors of word games often take the liberty of constructing texts that violate the syntactic or semantic constraints of language. For example, consider the following text3:

For her this rhyme is penned, whose luminous eyes

Brightly expressive as the twins of Leda,

Shall find her own sweet name, that nesting lies,

Upon the page, enwrapped from every reader.

This poem is an acrostic, meaning that it contains a puzzle which readers are invited to solve. We cannot rely on linguistic analysis to solve this puzzle. But, to even understand that the poem is a puzzle, the reader must figure out that this rhyme refers to the poem itself. Linguistic analysis is not intended to figure out what in the world this rhyme might be referring to; much less to decide among the possible candidates.

… luminous eyes brightly expressive as the twins of Leda …

The association between the adjective luminous and eyes is not a standard semantic relationship; unless the eyes belong to a robot, eyes are not luminous. This association is, of course, metaphorical: we have to understand that luminous eyes means that the owner of the eyes has a luminous intelligence, and that we are perceiving this luminous intelligence by looking at her eyes.

The twins of Leda are probably the mythological heroes Castor and Pollux (the twin sons of Leda, the wife of the king of Sparta), but they are not particularly known for being expressive. These two heroes gave their names to the constellation Gemini, but I confess that I do not understand what an expressive constellation might be. I suspect the author rather meant to write:

… expressive eyes brightly luminous as the twins of Leda …

The associations between the noun name and the verbal forms lies, nestling, and enwrapped are no more direct; we need to understand that it is the written form of the name which is present on the physical page where the poem is written, and that it is hidden from the reader.

If we wish to make a poetic analysis of this text, the first thing to do is thus to note these non-standard associations, so we will know where to run each poetic interpretive analysis. But if we do not even know that eyes are not supposed to be luminous, we will not be able to even figure out that there is a metaphor, therefore we will not be able to solve it (i.e. to compute that the woman in question is intelligent), and so we will have missed an important piece of information in the poem. More generally, in order to understand a poem’s meaning, we must first note the semantic violations it contains. To do this, we need a linguistic model capable of distinguishing “standard” associations such as an intelligent woman, a bright constellation, a name written on a page, etc. from associations requiring poetic analysis, such as luminous eyes, an expressive constellation, a name lying upon a page.

Analyzing poems can pose other difficulties, particularly at the lexical and syntactic levels. In standard English, word order is less flexible than in poems. To understand the meaning of this poem, a modern reader has to start by rewriting (in his or her mind) the text in standard English, for example as follows:

This rhyme is written for her, whose luminous eyes (as brightly expressive as the twins of Leda) will find her own sweet name, which lies on the page, nestling, enwrapped from every reader.

The objective of the project described in this book is to formalize standard language without solving poetic puzzles, or figuring out possible referents, or analyzing semantically nonstandard associations.

1.4.2. Stylistics and rhetoric

Stylistics studies ways of formulating sentences in speech. For example, in a text we study the use of understatements, metaphors, and metonymy (“figures of style”), the order of the components of a sentence and that of the sentences in a speech, and the use of anaphora. Here are a few examples of stylistic phenomena that cannot be processed in a strictly linguistic context:

Understatement: Joe was not the fastest runner in the race

Metaphor: The CEO is a real elephant

Metonymy: The entire table burst into laughter

In reality, the sentence Joe was not the fastest runner in the race could mean here that Joe came in last; so, in a way, this sentence is not saying what it is expressing! Unless we know the result of the race, or have access to information about the real Joe, we cannot expect a purely linguistic analysis system to detect understatements, irony or lies.

To understand the meaning of the sentence The CEO is a real elephant, we need to know firstly that a CEO cannot really be an elephant, and therefore that this is a metaphor. Next we need to figure out which “characteristic property” of elephants is being used in the metaphor. Elephants are known for several things: they are big, strong, and clumsy; they have long memories; they are afraid of mice; they are an endangered species; they have big ears; they love to take mud-baths; they live in Africa or India, etc. Is the CEO clumsy? Is he/she afraid of mice? Does he/she love mud-baths? Does he/she have a good memory? To understand this statement, we would have to know the context in which the sentence was said, and we might also need to know more about the CEO in question.

To understand the meaning of the sentence The entire table burst into laughter, it is necessary first to know that a table is not really capable of bursting into laughter, and then to infer that there are people gathered around a table (during a meal or a work meeting) and that it is these people who burst out laughing. The noun table is neither a collective human noun (such as group or colony), nor a place that typically contains humans (such as meeting room or restaurant), nor an organization (such as association or bank); therefore using only the basic lexical properties associated with the noun table will not be enough to comprehend the sentence.

It is quite reasonable to expect a linguistic system to detect that the sentences The CEO is a real elephant and The entire table burst into laughter are not standard sentences; for example, by describing CEO as a human noun, describing table as a concrete noun, and requiring the verb to burst into laughter to take a human subject, we can learn from a linguistic analysis that these sentences are not “standard”, and that it is therefore necessary to initiate an extra-linguistic computation such as metaphor or metonymy calculations in order to interpret them.
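The feature check just described can be sketched with a toy lexicon. The feature names, entries, and predicate constraints below are illustrative assumptions, not NooJ’s actual dictionary format:

```python
# Toy sketch: detecting "non-standard" sentences via selectional
# restrictions. Lexicon entries and feature names are illustrative.

LEXICON = {
    "CEO":      {"human"},
    "elephant": {"animal", "concrete"},
    "table":    {"concrete"},
    "group":    {"human", "collective"},
}

# Each predicate is paired with the semantic feature its subject must carry.
SUBJECT_CONSTRAINTS = {
    "burst into laughter": "human",
    "be an elephant":      "animal",
}

def is_standard(subject, predicate):
    """True if the subject noun carries the feature the predicate requires."""
    required = SUBJECT_CONSTRAINTS[predicate]
    return required in LEXICON.get(subject, set())

print(is_standard("table", "burst into laughter"))  # False: flags the metonymy
print(is_standard("CEO", "be an elephant"))         # False: flags the metaphor
print(is_standard("group", "burst into laughter"))  # True: standard sentence
```

A failed check does not interpret the metaphor or metonymy; it only signals that an extra-linguistic computation is needed, which is exactly the division of labor argued for above.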

The linguistic project described in this book is not intended to solve understatements, metaphors, or metonymy, but it must be able to detect sentences that are deviant in comparison to the standard language.

1.4.3. Anaphora, coreference resolution, and semantic disambiguation

Coreference: Lea invited Ida for dinner. She brought a bottle of wine.

Anaphora: Phelps returned. The champion brought back 6 medals with him.

Semantic ambiguity: The round table is in room B17.

In order to understand that in the sentence She brought a bottle of wine, she refers to Ida and not Lea, we need to know that it is usually the guest who travels and brings a bottle of wine. This social convention is commonplace throughout the modern Western world, but we would need to be sure that this story does not take place in a society where it is the host who provides the beverages.

In order to understand that The champion is a reference to Phelps, we have to know that Phelps is a champion. Note that dozens of other nouns could have been used in this anaphora: the American, the medal-winner, the record-holder, the swimming superstar, the young man, the swimmer, the former University of Florida student, the breakaway, the philanthropist, etc.

In order to eliminate the ambiguity of the sequence round table (between “a table with a round shape” and “a meeting”), we would need to have access to a wider context than the sentence alone.

The linguistic project described in this book is not intended to resolve anaphora or semantic ambiguities.

NOTE. – I am not saying that it is impossible to process poetry, word games, understatements, metaphors, metonymy, coreference, anaphora, and semantic ambiguities; I am only saying that these phenomena lie outside the narrow context of the project presented in this book. There are certainly “lucky” cases in which linguistic software can automatically solve some of these phenomena. For example, in the following sequence:

Joe invited Lea for dinner. She brought a bottle of wine

a simple verification of the pronoun’s gender would enable us to connect She to Lea. Conversely, it is easy to build software which, based on the two sentences Joe invited Lea to dinner and Lea brought a bottle of wine, would produce the sentence She brought a bottle of wine. Likewise, in the sentence:

The round table is taking place in room B17

a linguistic parser could automatically figure out that the noun round table refers to a meeting, provided that it has access to a dictionary in which the noun round table is described as being an abstract noun (synonymous with meeting), and the verb to take place is described as calling for an abstract subject.
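The “lucky” gender check mentioned above can be sketched as follows; the tiny gender lexicon and the candidate lists are illustrative assumptions:

```python
# Minimal sketch of gender-agreement pronoun resolution.
# The gender lexicon is an illustrative assumption.

GENDER = {"Joe": "m", "Lea": "f", "Ida": "f", "Phelps": "m"}

def resolve_pronoun(pronoun, candidates):
    """Keep only candidates whose gender agrees with the pronoun;
    succeed only when exactly one candidate remains."""
    wanted = {"she": "f", "he": "m"}[pronoun.lower()]
    matches = [c for c in candidates if GENDER.get(c) == wanted]
    return matches[0] if len(matches) == 1 else None

# "Joe invited Lea for dinner. She brought a bottle of wine."
print(resolve_pronoun("She", ["Joe", "Lea"]))   # Lea

# "Lea invited Ida for dinner. She brought ..." stays ambiguous:
print(resolve_pronoun("She", ["Lea", "Ida"]))   # None
```

The second call illustrates the limit discussed above: when both antecedents agree in gender, agreement alone cannot resolve the coreference, and extra-linguistic knowledge (such as the guest-brings-the-wine convention) would be required.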

1.4.4. Extralinguistic calculations

Consider the following statements:

a) Two dollars plus three dollars make four dollars.

b) Clinton was already president in 1536.

c) The word God has four letters.

d) This sentence is false.

These statements are expressed using sentences that are well-formed because they comply with the spelling, morphological, syntactic, and semantic rules of the English language. However, they express statements that are incorrect in terms of mathematics (a), history (b), spelling (c), or logic (d). To detect these errors we would need to access knowledge that is not part of our strictly linguistic project4.

The project described in this book is confined to the formalization of language, without taking into account speakers’ knowledge about the real world.

1.5. NLP applications

Of course, there are fantastic software applications capable of processing extralinguistic problems! For example, the IBM computer Watson won the game show Jeopardy! in spectacular fashion in 2011; I have a lot of fun asking my smart watch questions. In the car, I regularly ask Google Maps to guide me verbally to my destination; my language-professor colleagues have trouble keeping their students from using Google Translate; and the subtitles added automatically to YouTube videos are a precious resource for people who are hard of hearing [GRE 11], etc.

All of these software platforms have an NLP component, which analyzes or produces a written or spoken statement, often accompanied by a specialized module, for example a search engine or GPS navigation software. It is important to distinguish between these components: just because we are impressed by the fact that Google Maps gives us reliable directions, it does not mean it speaks perfect English. It is very possible that IBM Watson can answer a question correctly without having really “understood” the question. Likewise, a software platform might automatically summarize a text using simple techniques to filter out words, phrases or sentences it judges to be unimportant [MAN 01]5. Speech-recognition systems use signal processing techniques to produce a sequence of phonemes and then determine the most “probable” corresponding sequence of words by comparing it to a reference database [JUR 00]6, etc.

Most pieces of NLP software actually produce spectacular, almost magical results, with a very low degree of linguistic competence. To produce these results, the software uses “tricks”, often based on statistical methods [MAN 99].

Unfortunately, the success of these software platforms is often used in order to show that statistical techniques have made linguistics unnecessary7. It is important, then, to understand their limitations. In the next two sections I will take a closer look at the performances of the two “flagship” statistical NLP software platforms: automatic translation and part-of-speech tagging.

Figure 1.2. Really?

1.5.1. Automatic translation

Today, the best-known translation software platforms use statistical techniques to suggest a translation of texts. These software platforms are regularly cited as examples of the success of statistical techniques, and everyone has seen a “magical” translation demo8. It is not surprising, therefore, that most people think the problem of translation has already been solved. I do not agree.

Figure 1.3. Vietnamese–English translation with Google Translate

For example, Figure 1.3 shows how Google Translate has translated part of an article in Vietnamese (www.thanhnieunnews.com, October 2014). The text produced does not make very much sense; what does “I’m talking to him on the Navy’s generals” mean? The translated text even contains incorrect constructions (“could very well shielded”, for example).

Figure 1.4 allows us to compare Google Translate’s results with those obtained using a simple Arabic-English dictionary and a very basic translation grammar; see [BAN 15]. For example, the first sentence was wrongly translated by Google Translate as The man who went down to work instead of The man who fell went to work. Google Translate was also wrong about the second sentence, which means The man that you knew went to work and not I knew a man who went to work.

Figure 1.4. Translation with Google Translate vs. with NooJ

When translating languages that are more similar, such as French into English, the results produced by Google Translate are helpful, but still could not be used in a professional context or to translate a novel, a technical report, or even a simple letter, and especially not when submitting a résumé.

Figure 1.5. Article from the newspaper Le Monde (October 2014) translated with Google Translate

Let us look in detail at the result produced by Google Translate. None of the English sentences produced is correct:

– The first sentence has an opposite meaning: the expression ils le font savoir has been translated as they know it instead of they make it known.

– The second sentence contains the ungrammatical sequence …which includes presented…, and the term action has been wrongly translated as share instead of action.

– In the third sentence, the verb summarized is placed incorrectly; essais cliniques avec des traitements expérimentaux should be translated as clinical trials with experimental treatments; and utilisable should be translated as useable or useful, not as used.

– The term results is placed incorrectly in the last sentence.

To be fair, it should be noted that every attempt to construct good-quality automatic translation software has failed, including those based on linguistic techniques, such as the European program Eurotra (1978–1992). It is my belief that the reasons for Eurotra’s failure have to do with certain scientific and technical choices (as well as real problems with management), and not with a theoretical impossibility of using linguistics to do translations.

I will turn now to another flagship application, which is less “public-at-large” than machine translation, but just as spectacular for NLP specialists: part-of-speech tagging.

1.5.2. Part-of-speech (POS) tagging

Part-of-speech (POS) tagging is often presented as the basic application of any piece of NLP software, and has historically justified the sidelining of linguistic methods in favor of stochastic ones. The authors of tagging software frequently speak of 95% precision; these results seem “magical” too, since POS taggers use neither dictionaries nor grammars to analyze the words of any text with such great precision. Linguists have difficulty justifying their painstaking descriptive work when shown what a computer can do by itself and without linguistic data! It is also commonplace to hear that taggers’ results prove that statistical techniques have outperformed linguistic ones; for example:

Automatic part of speech tagging is an area of natural language processing where statistical techniques have been more successful than rule-based methods. [BRI 92]

In their course on NLP (available on YouTube as of December 2015), Dan Jurafsky and Chris Manning consider the problem of the construction of POS taggers as “mostly solved”; more generally, NLP researchers use the spectacular results produced by statistical taggers to validate the massive use of statistical techniques in all NLP applications, always to the detriment of the linguistic approach.

A POS tagger is an automatic program that links each word in a text with a “tag”, in practice its POS category: noun, verb, adjective, etc. To do this, taggers use reference corpora which have been manually tagged9. To analyze a text, the tagger examines the context of each word in the text and compares it with the contexts of the occurrences of this same word in the reference corpus in order to deduce which tag should be linked with the word.

Figure 1.6 shows, for example, an extract from Penn Treebank, which is one of the reference corpora10 used by English POS taggers.

Figure 1.6. Extract from Penn Treebank

I do not believe that taggers should be considered as successes, and here are my reasons why.

1.5.2.1. The results of statistical methods are not actually so spectacular

The number of unambiguous words is so large compared to the very small number of tags used by taggers that a simple software application that would tag words just by copying the most frequent tag in the reference corpus would already have a degree of precision greater than 90% [CHA 97].

For example, in English the words my, his, the (always determiners), at, from, of, with (always prepositions), him, himself, it, me, she, them, you (always pronouns), and, or (always conjunctions), again, always, not, rather, too (always adverbs), am, be, do, have (always verbs), and day, life, moment, thing (always nouns) are extremely frequent but have only a single possible tag, and thus are always tagged correctly.

The vast majority of ambiguous words are actually favored in terms of analysis; for example, in most of their occurrences, the forms age, band, card, detail, eye, etc. represent nouns and not the verbs to age, to band, to card, to detail, to eye, etc. A software platform systematically disregarding the rare verbal hypothesis for these words will therefore almost never be wrong.

Under these conditions, obtaining a 95% correct result when a simple copy already yields 90% precision is not really spectacular; on average we get one correct result out of two for difficult cases, which is more like a coin-toss than a feat of “learning”.
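The most-frequent-tag baseline discussed above can be sketched as follows, using a toy hand-tagged corpus as a stand-in for a real resource such as the Penn Treebank (the word/tag pairs below are illustrative):

```python
# Baseline tagger: tag each word with its most frequent tag in a
# reference corpus, as described above. The toy corpus is illustrative.
from collections import Counter, defaultdict

tagged_corpus = [
    ("the", "DET"), ("eye", "NOUN"), ("of", "PREP"),
    ("the", "DET"), ("detail", "NOUN"),
    ("eye", "NOUN"), ("detail", "NOUN"), ("detail", "VERB"),
]

# Count how often each tag was assigned to each word.
counts = defaultdict(Counter)
for word, tag in tagged_corpus:
    counts[word][tag] += 1

def baseline_tag(word):
    """Return the most frequent tag seen for this word (None if unseen)."""
    return counts[word].most_common(1)[0][0] if word in counts else None

print(baseline_tag("eye"))     # NOUN
print(baseline_tag("detail"))  # NOUN: the rarer VERB reading is discarded
```

Note that this baseline systematically discards the rare verbal readings of forms like detail or eye, yet on a real corpus it is already right roughly 90% of the time, which is precisely why a reported 95% precision is less impressive than it sounds.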

The degree of precision claimed by taggers is, in reality, not that impressive.

1.5.2.2. POS taggers disregard the existence of multiword units

Taggers do not take into account multiword units or expressions, though these frequently occur in texts11. In the Penn Treebank extract shown in Figure 1.6, the compound noun industrial managers, the phrasal verb to buck up, the compound determiner a boatload of, the compound noun samurai warrior, the expression to blow N ashore, the adverb from the beginning, and the expression it takes N to V-inf have all simply been disregarded.

However, processing industrial manager as a sequence of two linguistic units does not make any sense: an industrial manager is not a manager who