Information Theory Meets Power Laws - Lukasz Debowski - E-Book

Description

Discover new theoretical connections between stochastic phenomena and the structure of natural language with this powerful volume! Information Theory Meets Power Laws: Stochastic Processes and Language Models presents readers with a novel subtype of a probabilistic approach to language, which is based on statistical laws of texts and their analysis by means of information theory. The distinguished author insightfully and rigorously examines the linguistic and mathematical subject matter while eschewing needlessly abstract and superfluous constructions. The book begins with a less formal treatment of its subjects in the first chapter, introducing its concepts to readers without mathematical training and allowing those unfamiliar with linguistics to learn the book's motivations. Despite its inherent complexity, Information Theory Meets Power Laws: Stochastic Processes and Language Models is a surprisingly approachable treatment of idealized mathematical models of human language. The author succeeds in developing some of the theory underlying fundamental stochastic and semantic phenomena, like strong nonergodicity, in a way that has not previously been seriously attempted. 
In doing so, he covers topics including:

* Zipf's and Herdan's laws for natural language
* Power laws for information, repetitions, and correlations
* Markov, finite-state, and Santa Fe processes
* Bayesian and frequentist interpretations of probability
* Ergodic decomposition, Kolmogorov complexity, and universal coding
* Theorems about facts and words
* Information measures for fields
* Rényi entropies, recurrence times, and subword complexity
* Asymptotically mean stationary processes

Written primarily for mathematics graduate students and professionals interested in information theory or discrete stochastic processes, Information Theory Meets Power Laws: Stochastic Processes and Language Models also belongs on the bookshelves of doctoral students and researchers in artificial intelligence, computational and quantitative linguistics, as well as the physics of complex systems.

Page count: 535

Publication year: 2020



Table of Contents

Cover

Information Theory Meets Power Laws

Copyright

dedication-page

Preface

Acknowledgments

Basic Notations

1 Guiding Ideas

1.1 The Motivating Question

1.2 Further Questions About Texts

1.3 Zipf's and Herdan's Laws

1.4 Markov and Finite‐State Processes

1.5 More General Stochastic Processes

1.6 Two Interpretations of Probability

1.7 Insights from Information Theory

1.8 Estimation of Entropy Rate

1.9 Entropy of Natural Language

1.10 Algorithmic Information Theory

1.11 Descriptions of a Random World

1.12 Facts and Words Related

1.13 Repetitions and Entropies

1.14 Decay of Correlations

1.15 Recapitulation

2 Probabilistic Preliminaries

2.1 Probability Measures

2.2 Product Measurable Spaces

2.3 Discrete Random Variables

2.4 From IID to Finite‐State Processes

3 Probabilistic Toolbox

3.1 Borel σ-Fields and a Fair Coin

3.2 Integral and Expectation

3.3 Inequalities and Corollaries

3.4 Semidistributions

3.5 Conditional Probability

3.6 Modes of Convergence

3.7 Complete Spaces

4 Ergodic Properties

4.1 Plain Relative Frequency

4.2 Birkhoff Ergodic Theorem

4.3 Ergodic and Mixing Criteria

4.4 Ergodic Decomposition

5 Entropy and Information

5.1 Shannon Measures for Partitions

5.2 Block Entropy and Its Limits

5.3 Shannon Measures for Fields

5.4 Block Entropy Limits Revisited

5.5 Convergence of Entropy

5.6 Entropy as Self‐Information

6 Equipartition and Universality

6.1 SMB Theorem

6.2 Universal Semidistributions

6.3 PPM Probability

6.4 SMB Theorem Revisited

6.5 PPM‐based Statistics

7 Coding and Computation

7.1 Elements of Coding

7.2 Kolmogorov Complexity

7.3 Algorithmic Coding Theorems

7.4 Limits of Mathematics

7.5 Algorithmic Randomness

8 Power Laws for Information

8.1 Hilberg Exponents

8.2 Second Order SMB Theorem

8.3 Probabilistic and Algorithmic Facts

8.4 Theorems About Facts and Words

9 Power Laws for Repetitions

9.1 Rényi–Arimoto Entropies

9.2 Generalized Entropy Rates

9.3 Recurrence Times

9.4 Subword Complexity

9.5 Two Maximal Lengths

9.6 Logarithmic Power Laws

10 AMS Processes

10.1 AMS and Pseudo AMS Measures

10.2 Quasiperiodic Coding

10.3 Synchronizable Coding

10.4 Entropy Rate in the AMS Case

11 Toy Examples

11.1 Finite and Ultrafinite Energy

11.2 Santa Fe Processes and Alike

11.3 Encoding into a Finite Alphabet

11.4 Random Hierarchical Association

11.5 Toward Better Models

Future Research

Bibliography

Index

End User License Agreement

List of Tables

Chapter 1

Table 1.1 The beginning of the rank list for Shakespeare's First Folio/35 Pla...

Chapter 7

Table 7.1 Examples of code words

List of Illustrations

Chapter 1

Figure 1.1 Zipf's law for Shakespeare's First Folio/35 Plays an...

Figure 1.2 Herdan's law for Shakespeare's First Folio/35 Plays and a random ...

Figure 1.3 Histograms of word length for Shakespeare's First Folio/35 Plays ...

Figure 1.4 Graphs of the bounds for conditional entropy for English in fun...

Figure 1.5 The PPM order versus the input length for Shakespeare's First...

Figure 1.6 The cardinality of the PPM vocabulary versus the input length

Figure 1.7 Maximal repetition for Shakespeare's First Folio/35 Plays and a r...


Information Theory Meets Power Laws

Stochastic Processes and Language Models

Łukasz Dębowski

Polish Academy of Sciences

Copyright

This edition first published 2021

© 2021 John Wiley & Sons, Inc.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.

The right of Łukasz Dębowski to be identified as the author of this work has been asserted in accordance with law.

Registered Office

John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA

Editorial Office

111 River Street, Hoboken, NJ 07030, USA

For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.

Wiley also publishes its books in a variety of electronic formats and by print‐on‐demand. Some content that appears in standard print versions of this book may not be available in other formats.

Limit of Liability/Disclaimer of Warranty

While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

Library of Congress Cataloging‐in‐Publication Data

Names: Dębowski, Łukasz Jerzy, 1975‐ author.

Title: Information theory meets power laws : stochastic processes and language models / Łukasz Jerzy Dębowski, Polish Academy of Sciences.

Description: Hoboken : Wiley, 2021. | Includes bibliographical references and index.

Identifiers: LCCN 2020015366 (print) | LCCN 2020015367 (ebook) | ISBN 9781119625278 (hardback) | ISBN 9781119625360 (adobe pdf) | ISBN 9781119625377 (epub)

Subjects: LCSH: Computational linguistics. | Stochastic processes.

Classification: LCC P98 .D35 2020 (print) | LCC P98 (ebook) | DDC 410.1/5195–dc23

LC record available at https://lccn.loc.gov/2020015366

LC ebook record available at https://lccn.loc.gov/2020015367

Cover design by Wiley

Cover image: © Jan Hakan Dahlstrom/Getty Images

In memory of my Mother

Preface

This book concerns some idealized mathematical models of human language. Nowadays, two approaches to mathematical modeling of language are recognized. One stems historically from theoretical computer science and formal linguistics; the other applies probability, statistics, and information theory. The first approach gave rise to formal language theory, formal syntax, and formal semantics, whereas the second approach is mostly followed by engineers working in computational linguistics and machine learning. In this book, I would like to lay the foundations of a novel subtype of a probabilistic approach to language, based on empirical quantitative laws of texts and their deductive analysis by means of information theory and stochastic processes.

Having originally done my master's studies in theoretical physics, gained some experience in computational linguistics, and later switched to information theory and stochastic processes, I believe that good science is a dialogue between theory and experiment. Concerning probability models for language, I think that there are many computational applications and ambitious experiments. In contrast, the theoretical understanding of some relatively basic stochastic-semantic phenomena, such as strong nonergodicity, remains surprisingly underdeveloped, untouched by linguists, engineers, or mathematicians. Hence, in my opinion, comes the necessity of the present work, which aims to sketch at least some boundaries of this no-man's land.

The core idea of this book is to explain a few quantitative power laws satisfied by texts in natural language in terms of non-Markovian and nonhidden Markovian discrete stochastic processes with some sort of long-range dependence. The most important novel results discussed in this book, the theorems about facts and words, characterize arbitrary nonergodic or perigraphic stationary processes but have a clear linguistic inspiration. To achieve my goals, I use information theory, probability measures, and some related technical tools such as the ergodic decomposition and Kolmogorov complexity. These tools, although they arise in a natural way, are black magic for linguists and are not everyday bread for engineers. For this reason, this book will likely be treated as a monograph in mathematics rather than linguistics.

In fact, I wrote this book primarily for mathematics graduate students and professionals who are interested in information theory or discrete stochastic processes and are intrigued by statistical modeling of natural language – to show them that there are a few novel formal ideas and open problems in this domain. However, since the use of mathematics in this book is motivated linguistically in quite an unusual way, as a secondary audience I foresee doctoral students and researchers in artificial intelligence, computational and quantitative linguistics, as well as the physics of complex systems. They may find this book a helpful reference explaining the mathematical foundations of information theory, discrete stochastic processes, and statistical modeling of natural language. In the future, there may also be a need for a more popular book touching on the topics of this monograph, but I am convinced that a rigorous exposition should be published in its own right first.

My father, who is a professor of engineering, once joked that there are no scientific monographs, only textbooks for fewer and fewer students. I hope that this book will nevertheless attract quite a few students. To increase the number of potential readers, I tried to avoid overly abstract or redundant constructions while striving to be rigorous (in particular, no topology or linear algebra). I also summarized the main ideas of this book in the first chapter in a lighter form, so that an interested reader without sufficient mathematical training could sense what the rest is about, whereas a mathematician could learn about the linguistic motivations. The remaining contents of the book are driven by a series of papers, mostly mine, preceded by introductory material of a similar volume. This introductory material is due to many good people but has likely not been put together into a single convenient reference before.

To be concrete, the book is divided into 11 chapters with the following contents:

Chapter 1. Guiding ideas: As stated in the previous paragraph, this chapter summarizes the principal ideas of the book in a half-formal way. I pose the question of what kind of a probability distribution should be assigned to particular human utterances. Sketching possible answers, I discuss quantitative linguistic laws and hypotheses such as Zipf's law and Hilberg's hypothesis, and I introduce helpful mathematical concepts and results. In particular, I argue that a reasonable statistical language model should be either nonergodic or uncomputable.

Chapter 2. Probabilistic preliminaries: This chapter opens the formal exposition, beginning with established introductory material. I define probability measures and some discrete stochastic processes, introduce Markov and hidden Markov processes, and recall the ergodic theorem for irreducible Markov processes.

Chapter 3. Probabilistic toolbox: Here, real random variables are introduced and many important technical results of measure-theoretic probability calculus are stated. These include properties of nonatomic measures, Lebesgue integrals, important inequalities, conditional probability, convergence of random variables, and completeness of probability spaces.

Chapter 4. Ergodic properties: In this chapter, I discuss the frequency interpretation of probability and demonstrate several celebrated theorems concerning ergodic properties of stationary processes, such as the Birkhoff ergodic theorem and the ergodic decomposition. These results matter for language modeling since, as argued in Chapter 1, a reasonable statistical language model should be either nonergodic or uncomputable.

Chapter 5. Entropy and information: Here I introduce the concepts of entropy, mutual information, entropy rate, and excess entropy. These are functionals of random variables or stationary stochastic processes which can help capture the nondeterminism or infinite memory of the respective processes. I also generalize conditional entropy and mutual information to functionals of arbitrary fields, which allows me to prove the ergodic decomposition of entropy rate and excess entropy. This decomposition yields a simple insight into strongly nonergodic processes, which seem to arise naturally in statistical language modeling.

Chapter 6. Equipartition and universality: In this chapter, the Shannon–McMillan–Breiman theorem is stated and an example of a universal distribution called prediction by partial matching (PPM) is exhibited. I show that the PPM distribution can be used not only to estimate the entropy rate but also to upper bound the amount of memory stored by a stationary stochastic process.

Chapter 7. Coding and computation: Here I introduce prefix-free codes, the idea of Shannon–Fano coding, the notion of Kolmogorov complexity, and the concept of algorithmic randomness. The fundamental Kraft inequality satisfied by prefix-free codes resurfaces in this chapter several times, in particular to show the chain rule for Kolmogorov complexity. This chapter closes the cycle of introductory results.

Chapter 8. Power laws for information: This chapter opens a discussion of more original research, done mostly by the author of this book. The point of departure is the power-law growth of mutual information hypothetically obeyed by texts in natural language. Discussed within the chapter are the concept of Hilberg exponents, strong nonergodicity, Santa Fe processes, and theorems about facts and words. The latter propositions seem to formalize some important intuitions about natural language but can be applied to general stationary processes.

Chapter 9. Power laws for repetitions: The point of departure of this chapter is the power-law logarithmic growth of maximal repetition, also hypothetically obeyed by texts in natural language. Trying to characterize this phenomenon mathematically, I discuss Rényi entropies and entropy rates, recurrence times, subword complexity, the longest match length, and the maximal repetition. The respective results are of a preliminary character.

Chapter 10. AMS processes: This chapter concerns asymptotically mean stationary (AMS) processes, a subject usually deemed quite technical. However, in the context of language modeling, AMS processes arise naturally when we switch between two views of a stochastic process: one in which the process consists of letters and another in which it consists of words. Therefore, AMS processes are a convenient tool for the subsequent chapter, where I construct some examples of processes.

Chapter 11. Toy examples: Here I discuss some particular examples of stochastic processes which can or cannot model specific quantitative laws of natural language. Several important classes of processes are discussed: finite and ultrafinite energy processes, the aforementioned Santa Fe processes and their stationary encodings into a finite alphabet, and random hierarchical association (RHA) processes. Progressively, these processes can model more and more quantitative linguistic laws, but none of them is fully satisfactory, which opens an avenue for future research.

The book is concluded by a list of open problems. I hope that these problems are salient enough to draw the interest of greater mathematical minds to problems of statistical language modeling. I suppose that many important developments can be made if sufficient attention is devoted to them. The motivating intuitions lying at the heart of this book are not very complicated, but I hope that they may prove fertile when translated into the language of mathematics.

Warsaw, 2 January 2020

Łukasz Dębowski

Acknowledgments

This book recapitulates a path of stubborn research onto which I stepped around the year 2000, when I quit physics in favor of computational linguistics and accidentally came across the work of Hilberg (1990) concerning a hypothetical power-law growth of entropy for natural language. In order to understand it properly, I started learning information theory and related mathematical subjects. Although I mostly wrote my research papers alone, it would be difficult to name all the important people who may have influenced the general outlook of this book. Let me try.

First, I express my gratitude to my high‐school teachers Olga Stande, Krzysztof Przybyszewski, and Lilianna Szymańska, who stimulated my independent thinking about mathematics, physics, and language, respectively.

Subsequently, I thank my doctoral mentor Jan Mielniczuk and my former boss Jacek Koronacki, who provided me with a friendly and very patient environment, gave me lots of precious advice, and assisted me in discoveries leading toward the conception of this book, whose first critical readers they were later as well.

I am grateful to David Adger, Gabriel Altmann, Janusz Bień, Richard Bradley, Mark Braverman, Gregory Chaitin, Alexander Clark, James Crutchfield, Magdalena Derwojedowa, Jarosław Duda, David Feldman, Ramon Ferrer‐i‐Cancho, Richard Futrell, Richard Gill, Peter Grünwald, Michael Hahn, Jan Hajič, Peter Harremoës, Frederick Jelinek, Brunon Kamiński, Reinhard Köhler, Andrzej Komisarski, Alfonso Martinez, Krzysztof Oleszkiewicz, Rajmund Piotrowski, Adam Przepiórkowski, Arthur Ramer, Alexandr Rosen, Zygmunt Saloni, Cosma Shalizi, Marek Świdziński, Hayato Takahashi, Kumiko Tanaka‐Ishii, Flemming Topsøe, Paul Vitányi, Vladimir Vovk, Vladimir V'yugin, Mark Daniel Ward, and En‐Hui Yang for correspondence, conversations, or invitations, longer or shorter, which over years guided or refined my positions on topics discussed in this book.

More directly, I thank Christian Bentz, Antoni Hernández‐Fernández, Dariusz Kalociński, Ioannis Kontoyiannis, Rodrigo Lambert, Dmitrii Manin, Tomasz Steifer, and three anonymous referees, who read some parts of this book at different stages of their writing and offered me helpful comments regarding their composition.

I am grateful to my late uncle Tadeusz Krauze and my colleagues Marcin Woliński and Joanna Rączaszek‐Leonardi, with whom I exercised discussing the guiding ideas of this book over dinner, during a walk in the forest, or hiking in the mountains.

Finally, I am indebted to my close family for their love and care, without which this book would never have seen the light of day.

Basic Notations

In the main matter of this book, let us switch to the customary plural personal pronoun. Throughout the book, we use the following general notations: ℕ is the set of natural numbers without zero, ℤ is the set of integers, ℚ is the set of rational numbers, and ℝ is the set of real numbers.

A set is called countable if its elements can be mapped one-to-one to a subset of natural numbers. Sets ℕ, ℤ, and ℚ are countable but ℝ is not.

For a countable set 𝕏, called an alphabet, 𝕏^n is the set of sequences of length n and 𝕏^+ = ⋃_{n∈ℕ} 𝕏^n, called the Kleene plus, is the set of nonempty finite sequences. Symbol λ denotes the empty sequence, whereas 𝕏^0 = {λ}. Then, 𝕏^* = 𝕏^+ ∪ {λ}, called the Kleene star, is the set of all finite sequences, also called strings, 𝕏^ℕ is the set of one-sided infinite sequences, and 𝕏^ℤ is the set of two-sided infinite sequences.

Relation x ≥ y will be pronounced as "x is greater than y" and "y is less than x". Relation x > y will be pronounced as "x is strictly greater than y" and "y is strictly less than x". Relations x ≥ 0 or x ≤ 0 will be pronounced as "x is positive" or "x is negative", respectively. Relations x > 0 or x < 0 will be pronounced as "x is strictly positive" or "x is strictly negative", respectively. Similar conventions concerning the use of the adverb "strictly" will concern notions such as "more", "fewer", "smaller", "larger", "growing/increasing", "decreasing", "convex", and "concave", which are formally defined using inequalities.

Relation A ⊂ B is the inclusion of sets, whereas A ⊊ B is the proper inclusion of sets (i.e. A ⊂ B and A ≠ B). Similarly, we write w ⊂ x when string w is a substring of sequence x (i.e. x = uwv for some possibly empty strings u and v).
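These string-handling conventions are easy to mirror in code. A minimal illustrative sketch (the function names are ours, not the book's): enumerating 𝕏^n for a small alphabet, a finite slice of the Kleene plus, and the substring relation.

```python
from itertools import product

def sequences(alphabet, n):
    """All sequences of length n over the alphabet (the set X^n)."""
    return ["".join(t) for t in product(alphabet, repeat=n)]

def kleene_plus(alphabet, max_len):
    """Nonempty sequences of length 1..max_len (a finite slice of X^+)."""
    return [s for n in range(1, max_len + 1) for s in sequences(alphabet, n)]

def is_substring(w, x):
    """w is a substring of x iff x = uwv for possibly empty strings u, v."""
    return w in x

print(sequences("ab", 2))         # ['aa', 'ab', 'ba', 'bb']
print(len(kleene_plus("ab", 3)))  # 2 + 4 + 8 = 14
print(is_substring("ab", "xaby")) # True
```

Of course, the full sets 𝕏^+ and 𝕏^* are infinite; code can only enumerate finite slices of them.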

1 Guiding Ideas

This book concerns the mathematical foundations of statistical language modeling, i.e. the question of what kind of a probability distribution should be assigned to particular utterances of human languages. In this chapter, we will describe the core ideas of this book in a way which is less formalized mathematically but more motivated linguistically. Based on the intuitions sketched in this chapter, in the following chapters we will build rigorous mathematical constructions. The general goal is to develop a theory of discrete stochastic processes that can account for certain statistical phenomena exhibited by human texts. The considered statistical phenomena take the form of several power laws. We hope that if we succeed in better modeling these power laws, then in the long run we may also obtain probabilistic models of language which are better in terms of the performance measures used by engineers in computational linguistics. In other words, we hope that our quest for stochastic processes may turn out to be fruitful not only for purely theoretical interest but also for practical applications in engineering. We also hope that the considered problems are interesting enough on the theoretical side to draw the interest of professional mathematicians.

1.1 The Motivating Question

The fundamental question that motivates this book is

What kind of a statistical model may explain generation of texts in natural language, such as books, our daily dialogues, or maybe even our internal stream of consciousness?

We perceive ourselves as free to say or think whatever we like, but, just to raise some doubts about our naive feeling of free will, are we more deterministic or more random in what we think? Either of these two extreme options seems scary. Surely, we are constrained by rules of grammar, meanings of words, habits of culture, daily chores, as well as the search for meaning, good, beauty, truth, or happiness, which also impact our sense of freedom and the patterns of our deeds. But does it follow that the question of the typical paths of our thoughts is worth investigating? Is there any mathematics there? Can we meaningfully quantify the complexity of a song (Knuth, 1984) or a poem (Manin, 2019)? Is it a scientific question, does it bear practical importance, is it a cultural taboo one should not break, or all of that simultaneously?

Many eminent intellectuals and scholars have asked the question about the probability of texts. For example, Jonathan Swift, an Irish writer and clergyman, in his highly ironic novel popularly called Gulliver's Travels, in the chapter about the academy of sciences of Lagado, described a machine for automated text generation. The author of the machine assured Gulliver that

this invention had employed all his thoughts from his youth; that he had emptied the whole vocabulary into his frame, and made the strictest computation of the general proportion there is in books between the numbers of particles, nouns, and verbs, and other parts of speech.

(Swift, 1726)

We may then suppose that this machine, while being deterministic, made use of the frequencies of particular words and should thus have implemented a certain statistical model of texts. Swift derided this machine as a highly unsuccessful project.
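As a playful anachronism, such a word-frequency model of texts can be sketched in a few lines of code. The word list and its frequencies below are invented purely for illustration; this is a unigram sampler in the spirit of the Lagado machine, not anything described by Swift himself.

```python
import random

# Hypothetical word frequencies, standing in for "the general proportion
# there is in books between the numbers of particles, nouns, and verbs".
frequencies = {"the": 6, "in": 3, "unicorn": 2, "valley": 2, "spoke": 1}

def lagado_machine(freqs, length, seed=0):
    """Generate text by sampling words proportionally to their frequencies."""
    rng = random.Random(seed)
    words = list(freqs)
    weights = [freqs[w] for w in words]
    return " ".join(rng.choices(words, weights=weights, k=length))

print(lagado_machine(frequencies, 8))
```

The output is grammatical nonsense, which is precisely Swift's point: word frequencies alone are a very weak statistical model of texts.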

Despite many conceptual difficulties, the question about the probability of texts in natural language arises from time to time in the minds of scientists representing several domains: mathematics, linguistics, computer science, psychology, as well as the physics of complex systems. In this book, we will build mostly on the expertise of mathematicians. Thus, let us recall that a few important mathematical concepts were in fact inspired by our question, such as Markov chains (Markov, 1913, translated in: Markov, 2006), entropy (Shannon, 1948, 1951), fractals (Mandelbrot, 1953, 1954), and algorithmic complexity (Kolmogorov, 1965). In contrast, among linguists, the question of probability in language caused a debate. Some linguists, such as Zipf (1935, 1949) and Herdan (1964), were highly interested in statistical properties of texts, whereas Chomsky (1956, 1957, 1959), while being more influential at that time, dismissed the question of probabilities in language as irrelevant for linguistics, for the reason that human language cannot in principle be a finite-state stochastic process; see also Chomsky and Miller (1959). A brief historical account of this debate was given by Pereira (2000). As Christian Bentz, my friend, has also put it:

It is an irony of 20th century linguistics that Shannon's theory of information, though explicitly linked to semantics, was deemed irrelevant by linguists, while Chomsky's formal syntax, though explicitly dissociated from semantics, was adopted as the default theory of natural language.

(Bentz, 2018, p. 62)

Nowadays, in spite of the debates and reservations among the older generation of linguists, there is growing pressure coming from engineering, technology, and experimental science to seriously ponder the question of statistical models of texts. While modern technology provides us with large amounts of textual data available to relatively powerful computers, applications such as predictive keyboards and speech-to-text recognizers in our smartphones already rely on some imperfect statistical models of texts. Machine learning engineers working on these problems continuously release better and better models. Initially, natural language engineers used finite-state models for this goal (Jelinek, 1997; Manning and Schütze, 1999), exactly those criticized by Chomsky. However, since the introduction of deep neural networks to the domain of natural language processing, the improvement of statistical language model quality has been gaining breathtaking momentum.

In particular, there was quite a lot of publicity in the media concerning GPT-2, a statistical model of English announced by the OpenAI team on their blog on 14 February 2019. GPT-2 is nothing but a statistical language model encoded as a neural network with 1.5 billion parameters, trained on 40 gigabytes of text. Given this, it is capable of generating quite a coherent narration a few paragraphs long, such as the following text, which we quote in full since it is spectacular:

SYSTEM PROMPT (HUMAN-WRITTEN):

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

MODEL COMPLETION:

The scientist named the population, after their distinctive horn, Ovid's Unicorn. These four‐horned, silver‐white unicorns were previously unknown to science.

Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally solved.

Dr. Jorge Pérez, an evolutionary biologist from the University of La Paz, and several companions, were exploring the Andes Mountains when they found a small valley, with no other animals or humans. Pérez noticed that the valley had what appeared to be a natural fountain, surrounded by two peaks of rock and silver snow.

Pérez and the others then ventured further into the valley. “By the time we reached the top of one peak, the water looked blue, with some crystals on top,” said Pérez.

Pérez and his friends were astonished to see the unicorn herd. These creatures could be seen from the air without having to move too much to see them – they were so close they could touch their horns.

While examining these bizarre creatures the scientists discovered that the creatures also spoke some fairly regular English. Pérez stated, “We can see, for example, that they have a common ‘language,’ something like a dialect or dialectic.”

Dr. Pérez believes that the unicorns may have originated in Argentina, where the animals were believed to be descendants of a lost race of people who lived there before the arrival of humans in those parts of South America.

While their origins are still unclear, some believe that perhaps the creatures were created when a human and a unicorn met each other in a time before human civilization. According to Pérez, “In South America, such incidents seem to be quite common.”

However, Pérez also pointed out that it is likely that the only way of knowing for sure if unicorns are indeed the descendants of a lost alien race is through DNA. “But they seem to be able to communicate in English quite well, which I believe is a sign of evolution, or at least a change in social organization,” said the scientist.

(Radford et al., 2019)

 

 

A simple numerical trick called calibration can improve the outputs of GPT‐2 even more (Braverman et al., 2019). Takahashi and Tanaka‐Ishii (2019) show that neural network language models can also mimic many statistical patterns of texts created by humans. But most importantly, as we can simply understand the machine‐generated text, GPT‐2 can write a text which is a complete fiction but sounds like a readable story. (Even the self‐contradictory “four‐horned unicorns” could be wishfully attributed to the machine's singular sense of humor.) Since similar statistical models could possibly flood the Internet not only with nice literary fiction but also with dangerous fake news, the OpenAI team initially decided not to publish the full GPT‐2 model. This decision caused quite a stir in the media. A less publicized, related achievement in natural language processing was the publication by Springer of a book (Beta Writer, 2019) which is a fully machine‐generated summary of 150 scientific articles in chemistry concerning lithium‐ion batteries. This achievement concerns, however, the summarization of existing texts rather than the generation of a novel text.

We agree that there is a far‐reaching ethical question concerning our ambitious dream of statistical language modeling, as embodied in a stream of consciousness. Creating an artificial intelligence, be it in the form of an artificial stream of consciousness, may be dangerous, also because of the risk that this intelligence will be smarter than us and may simply desire to get rid of us. Thus, we can perfectly understand the fears of those who think that any progress in imitating human intelligence will inevitably contribute to constructing another intelligence which will replace our own. But given the current state of engineering, it is possible that we may create a dangerous artificial intelligence accidentally. In such a situation, paradoxically, it is advisable to put as much effort as possible into understanding the dynamics of intelligence in general, precisely so as not to create a dangerous intelligent entity by accident.

Consequently, our initial question about statistical language models is no longer so strange. It is one of the fundamental research topics in artificial intelligence. Moreover, besides purely engineering approaches, we need some orchestrated fundamental mathematical research to understand the largely black‐boxed progress of neural network language models, see Tishby and Zaslavsky (2015) and Shwartz‐Ziv and Tishby (2017). Hopefully, for some and disappointingly to others, in this book, we will not give the ultimate and potentially risky answer to the question about the statistical model for natural language, but we hope to provide some insights which may inspire further research. The modest goal of this book is to understand some aspects of ideal statistical language models as seen via quantitative laws exhibited by texts spontaneously created by humans. That is, we will look into some general statistical patterns of texts.

1.2 Further Questions About Texts

What are the general statistical properties of texts created by humans? For brevity, let us call the texts or sequences that get reproduced in the process of human culture natural texts. We may observe that in all human languages, natural texts can be written down – at least phonetically – using symbols from a finite set. In information theory, such a set is called an alphabet. For a finite alphabet, the number of all combinatorially possible sequences grows exponentially with the considered length of the sequence.

The vision of this potential exponential growth must have haunted Jorge Luis Borges, an Argentinian writer and librarian. In the short story La Biblioteca de Babel (Borges, 1941) – see also Bloch (2008) – he described a fictitious library which contained every combinatorially possible text of a given length, in the form of a book meticulously printed out and stored on some bookshelf. When a random book was selected from this collection, one could only read a meaningless sequence of random letters. Surely, real books and libraries do not look like this! We may suppose that there are some patterns in natural texts – maybe there are unboundedly many distinct sorts of patterns – and hence the number of texts that remain in cultural discourse is severely limited.

In using the word patterns, we must be careful, however. Obviously, some simple patterns, such as regular cycles, arise in simple deterministic sequences. But completely random sequences of symbols also exhibit some statistical order, such as the law of large numbers. Thus, what kind of patterns are specific to natural texts? We suppose that natural texts are at neither of the two extremes – they are neither deterministic in a simple way nor completely random. In fact, George Kingsley Zipf already put this intuition into the following beautiful words:

If a Martian scientist sitting before his radio in Mars accidentally received from Earth the broadcast of an extensive speech which he recorded perfectly through the perfection of Martian apparatus and studied at his leisure, what criteria would he have to determine whether the reception represented the effect of animate process on Earth, or merely the latest thunderstorm on Earth? It seems that the only criteria would be the arrangement of occurrences of the elements, and the only clue to the animate origin would be this: the arrangement of the occurrences would be neither of rigidly fixed regularity such as frequently found in wave emissions of purely physical origin nor yet a completely random scattering of the same.

(Zipf, 1935, p. 187)

 

 

Rejecting simple determinism from the process of natural text generation means that we have to consider some model of randomness. Consequently, searching for models of natural texts, in this book, we will use the framework of probability calculus and information theory. This approach does lead to certain insights.

Whereas the question about the probability of a particular text or a sequence of symbols seems very difficult and dependent on their context, we may consider relaxed versions of the problem. If we abstract from the concrete numerical value of a text probability and we assume that all texts with a strictly positive probability are equally important, then our motivating question from Section 1.1 boils down to:

Which texts or sequences of symbols get reproduced in the process of human culture or in the stream of consciousness?

Let us observe that different sequences may get reproduced in the streams of consciousness of different persons. Moreover, if there is some randomness involved in the process of preselecting texts or sequences that get further reproduced in a particular stream of consciousness – a possibility which we cannot exclude a priori – then we may not be able to answer the above question constructively.

In view of this simple idea, we may find the task of finding a statistical model of language completely vain. But we think that we should not lose hope in this scientific project. There is a way of looking at the problem that may bring us closer to the solution. The idea is to introduce numbers. So let us rather ask:

How many different texts or sequences of symbols of a given length get reproduced in the process of human culture or in the stream of consciousness?

We can see that answering how many sequences get reproduced does not automatically answer which sequences these are. In particular, even if the numbers of distinct sequences are the same for different streams of consciousness, there may be different sequences in different streams of consciousness. But building a partial quantitative model may shed some light onto principles of human communication and inspire further progress. At some point, we may be able to answer also the question:

Why only a certain well‐defined number of texts or sequences of symbols of a given length gets reproduced in the process of human culture or in the stream of consciousness?

This question concerns both the dynamics and the purpose of thinking and culture. If we were able to answer this question, then, given enough computational resources, we might be able to simulate a random process of culture or a stream of consciousness in silico, or in other words create an artificial intelligence.

Asking about the dynamics and purpose of thinking and culture is a difficult and potentially controversial question. Personally, we suppose that there may be some purpose of culture. The argument goes like this. We suppose that there is some objective truth that is possible to discover. We suppose that this truth cannot be expressed in finitely many words but can be learned partially and incrementally, given a certain effort. Moreover, we suppose that under certain physical circumstances, it is profitable to accumulate the partial knowledge of the truth. Hence, according to our views, the ultimate purpose of culture is a never‐ending accumulation of the partial knowledge of the truth. At the beginning of cultural development, this purpose is rather a derivative of the special physical circumstances which make knowledge accumulation profitable in some sense. This, however, does not exclude the possibility that culture, understood as a physical system of knowledge accumulation, gains its autonomy and begins to create conditions under which knowledge accumulation remains profitable. Then the ultimate purpose of culture becomes the mere accumulation of knowledge.

It is a good question whether the system of knowledge accumulation can actually assume total control over its physical environment to ensure the unbounded growth. Unbounded growth of anything besides thermodynamical entropy seems in general physically impossible, but we refrain from giving a definite answer. Maybe knowledge is a peculiar sort of thermodynamical entropy (Lem, 1974; Feistel and Ebeling, 2011).

There is a large gap between the informal considerations of the above paragraph and the actual contents of this book. We suppose, however, that this gap is not infinite and may be filled some day with mathematical models. Therefore, it may be good to look at this gap from time to time to measure our progress. In the following, we will investigate some particular statistical patterns of natural texts that may inform us of more general theoretical properties of human language.

1.3 Zipf's and Herdan's Laws

Speaking of concrete statistical patterns of language, we should begin with Zipf's law. Zipf's law, discovered by Estoup (1916) and later popularized by Zipf (1935) and Mandelbrot (1953, 1954), is undoubtedly the most famous of all quantitative patterns exhibited by natural texts known so far – for an overview of many other patterns see Köhler et al. (2005). Also more recent books on quantitative linguistics by Esposti et al. (2016) and Hernández and Ferrer i Cancho (2019) begin with a discussion of Zipf's law. Thus, commencing quantitative considerations about natural language with Zipf's law is quite expected and natural.

To explain what Zipf's law is, we have to introduce the concepts of the word frequency and the word rank. The word frequency is simply the number of occurrences of a given word in a given text. Now let us sort the words according to their frequencies and assign rank $r = 1$ to the most frequent word, rank $r = 2$ to the second most frequent word, and in general rank $r$ to the $r$th most frequent word. In Table 1.1, we present an example of a rank list, i.e. the list of ranks $r$, frequencies $f(r)$, and words sorted according to the frequencies. By definition, the word frequency $f(r)$ is a decreasing function of the word rank $r$. Zipf's law, also called the Zipf–Mandelbrot law, says something more, namely, that the word frequency is approximately inversely proportional to a power of the word rank,

$f(r) \propto r^{-B}$,   (1.1)

where the exponent is $B \approx 1$. In particular, if we plot the frequency versus the rank in the doubly logarithmic scale, we obtain almost a straight line with the slope $-B \approx -1$, see Figure 1.1. Miracle?

Table 1.1 The beginning of the rank list for Shakespeare's First Folio/35 Plays.

Source: Text downloaded from Project Gutenberg (https://www.gutenberg.org).

 r   f(r)    word
 1   21 557  I
 2   19 059  and
 3   16 571  to
 4   14 921  of
 5   14 491  a
 6   12 077  my
 7   10 463  you
 8    9789   in
 9    8754   is
10    7428   that

Figure 1.1 Zipf's law for Shakespeare's First Folio/35 Plays and a random permutation of the text's characters. We note that we use the doubly logarithmic scale here and later. Source: Author's figure based on the text downloaded from Project Gutenberg (https://www.gutenberg.org).
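The rank‐frequency statistics behind Figure 1.1 are straightforward to compute. The following sketch (in Python, which this book itself does not use) builds a rank list from a text and estimates the exponent in (1.1) as the least‐squares slope in the doubly logarithmic scale; the whitespace tokenization is a simplification of the actual preprocessing:

```python
# A sketch of the computation behind Figure 1.1: build a rank list and
# estimate the Zipf exponent B of equation (1.1).
import collections
import math

def rank_frequency(text):
    """Word frequencies f(r) sorted in decreasing order, i.e. by rank r."""
    counts = collections.Counter(text.split())
    return sorted(counts.values(), reverse=True)

def zipf_slope(freqs):
    """Least-squares slope of log f(r) versus log r; ideal Zipf gives -1."""
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den
```

On an exact power law $f(r) = C r^{-1}$, the estimated slope is exactly $-1$; on real texts, a least‐squares fit in the doubly logarithmic scale is only a crude estimator of $B$.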

Unfortunately, Zipf's law, as stated above, is not a pattern that is unique to natural texts. Let us take a unigram text, that is, a random permutation of the characters of a given natural text – to state it clearly, a space between the words is also considered a character. (This process is metaphorically called “monkey typing” but, of course, real monkeys do not type in this way.) Subsequently, let us make a list of “words” defined as space‐to‐space sequences of letters in the random permutation of characters, compute ranks and frequencies of those “words,” and plot the obtained numbers in the doubly logarithmic scale. Then we obtain a very similar plot as for the text in natural language, see Figure 1.1! Hence, some simple kinds of random texts, such as unigram texts, also exhibit Zipf's law. This observation was made by Mandelbrot (1954) and Miller (1957) and rigorously explained by Conrad and Mitzenmacher (2004).
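The unigram text construction itself can be sketched in Python as follows, with a fixed seed for reproducibility; “words” are the nonempty space‐to‐space chunks:

```python
# A sketch of the "monkey typing" construction: permute the characters of a
# text at random (spaces included) and read off space-to-space "words".
import random

def unigram_text(text, seed=0):
    """Random permutation of the characters of the text."""
    chars = list(text)
    random.Random(seed).shuffle(chars)
    return "".join(chars)

def space_to_space_words(text):
    """Nonempty maximal chunks of non-space characters."""
    return [w for w in text.split(" ") if w]
```

Feeding the output of `unigram_text` into the rank‐frequency computation reproduces the unigram curve of Figure 1.1.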

Before we start producing alternative explanations for Zipf's law, let us make a general methodological remark. There can be at least four different sorts of reasons for quantitative laws of natural texts:

1. General laws of randomness: We have just seen an example thereof in the instance of the “monkey typing” explanation of Zipf's law.

2. Physical constraints of the human vocal tract: Speech is the primary mode of human language communication and the distribution of breath length may impose certain constraints on the structure of syllables and the lengths of words. For recent relevant research, see Torre et al. (2019) and Coupé et al. (2019).

3. Physical constraints of the human brain: Language is processed by a three‐dimensional network of neurons, whose configuration may imply some unrecognized bounds on the time and space complexity of language processing. For potentially relevant research, see Burger et al. (2019), West (2017), and Zador (2019).

4. Abstract semantic constraints: Texts created by humans are meaningful, i.e. they describe, associate, and control entities which are not the texts themselves. Meaningfulness may harness the texts to exhibit certain patterns or be attracted by some particular asymptotic structure. This sort of explanation will be pursued in this book. We will argue that it is within reach of mathematics.

For different empirical laws of natural language, different sorts of explanations can be suitable. It is possible that a given law can be given multiple explanations of varying degrees of plausibility. Some laws can be true for a bunch of roughly equally plausible reasons – and if these reasons are of a similar complexity, even Occam's razor would not help us to single out the best one. However, finding several equally plausible reasons can also enrich our understanding of the explained phenomenon – like finding multiple proofs of a single mathematical theorem.

Indeed, unigram texts are not the only basis for formal explanations of Zipf's law. Some other identified mathematical or computational mechanisms leading to Zipf's law involve: multiplicative processes, originally discovered by Simon (1955) and nowadays called preferential attachment; large number of rare events (LNRE) distributions by Khmaladze (1988), later researched by Baayen (2001); coding games with entropy loss discovered by Harremoës and Topsøe (2001); a formalization of Pareto's principle of least effort by Ferrer‐i‐Cancho and Solé (2003), curiously resembling the bottleneck method in machine learning introduced by Tishby et al. (1999). Most of these models, however, either ignore the statistical dependence between occurrences of particular words or assume that there is no such dependence.

In contrast, in this book, we will seek explanations of word distributions which combine general laws of randomness with abstract global semantic constraints. We will assume that texts in natural language attempt to describe an infinitely complex reality, be it fictitious or not, whose state can be partitioned into an infinite number of independent elementary facts, see Section 1.11. In Section 1.12, we will see that a power law for the distribution of word‐like strings arises naturally in this case, as predicted by the so‐called theorems about facts and words. These theorems provide an explication of Herdan's law, a corollary of Zipf's law, to be discussed in the next paragraph.

Herdan's law also has many fathers – it was independently rediscovered by Kuraszkiewicz and Łukaszewicz (1951), Guiraud (1954), Herdan (1964), and Heaps (1978). Quite often, Herdan's law is also called Heaps' law. The law describes a relationship between the number of word types $V$, i.e. the number of different words in some initial part of a text, and the number of word tokens $N$, i.e. the number of all words in the same part. Herdan's law states that the number of word types $V$ is approximately proportional to a power of the number of word tokens $N$,

$V \propto N^{\beta}$,   (1.2)

where $0 < \beta < 1$. In particular, if we plot the number of word types versus the number of word tokens in the doubly logarithmic scale, we obtain something similar to a straight line with slope $\beta$, see Figure 1.2.
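The type–token relationship (1.2) can be measured by a single pass over the token sequence of a text. A minimal sketch:

```python
# A sketch of the type-token curve behind Herdan's law (1.2): for each n,
# count the number of word types among the first n word tokens.
def type_token_curve(tokens):
    seen = set()
    curve = []  # curve[n - 1] = number of types among the first n tokens
    for w in tokens:
        seen.add(w)
        curve.append(len(seen))
    return curve
```

Plotting `curve` against the token index in the doubly logarithmic scale yields plots such as Figure 1.2.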

Figure 1.2 Herdan's law for Shakespeare's First Folio/35 Plays and a random permutation of characters. Source: Author's figure based on the text downloaded from Project Gutenberg (https://www.gutenberg.org).

The exponent $\beta$ in Herdan's law (1.2) for natural texts turns out to be smaller than the corresponding exponent for unigram texts implied by Zipf's law (1.1). In particular, the number of word types is smaller in natural texts than in unigram texts.

In spite of many existing mathematical explanations of Zipf's and Herdan's laws, we should also be aware that the actual reasons for these laws in natural texts can be less obvious than in any idealized mathematical model. For example, words in natural texts differ statistically from “words” in unigram texts. If we compute the histograms of word length, see Figure 1.3, we can see that words in natural texts are quite differently distributed than “words” in unigram texts. In particular, for words in natural texts, we obtain a power‐law tail, whereas for “words” in unigram texts, we obtain an exponential tail. Hence, we may conclude that Zipf's law for words in natural texts is qualitatively a different phenomenon than Zipf's law for “words” in unigram texts. In particular, Zipf's law for natural language may require a distinct explanation. Another important phenomenon, which cannot be ignored, is a gradual change of the Zipf's law exponent from $B \approx 1$ for lower ranks to $B \approx 2$ for higher ranks, as discovered by Ferrer‐i‐Cancho and Solé (2001) and later confirmed by Montemurro and Zanette (2002). Thus, we may still think that Zipf's law combined with the special distribution of word length can really be a pattern that distinguishes natural language from “inanimate processes.”
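The word‐length histograms just discussed require only a counter over word lengths; comparing the tails of the two histograms (power law versus exponential) is then a matter of plotting. A minimal sketch:

```python
# A sketch of the word-length histograms compared in Figure 1.3: count how
# many words have each length, for a natural text and its unigram counterpart.
import collections

def length_histogram(words):
    """Map from word length to the number of words of that length."""
    return collections.Counter(len(w) for w in words)
```

Applied to the word lists of a natural text and of its character‐permuted version, the two histograms exhibit the qualitatively different tails described above.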

Figure 1.3 Histograms of word length for Shakespeare's First Folio/35 Plays and a random permutation of characters. Source: Author's figure based on the text downloaded from Project Gutenberg (https://www.gutenberg.org).

The question arises whether, by describing texts in terms of meaningless symbols such as letters (or phonemes) on the fundamental level, we may obtain a simpler statistical description of language.

This question is not obvious at all! In particular, both sentences and words can be phenomena emerging from simpler statistical rules acting on the most fundamental level of language description, the level of letters or phonemes. Whereas linguists prefer to discuss units with a meaning, such as morphemes, words, or sentences, the number of these units can in principle be countably infinite or at least very large, so it is unlikely that we obtain a finite or at least reasonably compact statistical model of language if we treat these high‐level units as primitive units. In contrast, for random texts consisting of finitely many distinct symbols, such as letters or phonemes, we can use a number of convenient theoretical results from probability and information theory. So we are given a certain advantage at the beginning.

1.4 Markov and Finite‐State Processes

The atomic view of natural texts as sequences of relatively few distinct meaningless symbols is well known to linguists but too trivial for them to consider it a sufficient description of language. Linguists usually express a justified interest in where the meaning of a text comes from and, for various reasons, prefer to work with meaningful units, such as morphemes, words, or sentences. However, following the quotation from Zipf (1935) in Section 1.2, we suppose that meaning cannot be captured by the marginal distribution of words or other text units, but that it emerges from their linear interaction – the order of symbols, which is neither purely random nor purely deterministic. The question arises whether we can describe a statistical order of meaningless symbols that necessarily leads to the emergence of some form of meaningfulness for sufficiently long texts, i.e. strings of these symbols.

In Section 1.3, we introduced the concept of a unigram text, i.e. a random text obtained from a given text by permuting its characters at random. In fact, this is not the only possible way of introducing randomness into text generation. There are infinitely many mathematically admissible schemes – so many different ones that it is nontrivial to single out the realistic ones. To begin, in this section, we will consider a sequence of Markov processes, which imitate a given text by allowing for more and more statistical dependence among consecutive letters. By the end of this section, we will also consider a generalization of Markov processes, called hidden Markov processes. There also exist radically different processes, which are neither Markov nor hidden Markov, to be discussed in further sections.

Suppose that we are given probability distributions $P(x_1 x_2 \ldots x_n)$ of strings of symbols drawn from a finite alphabet $\mathbb{X}$ such that

$P(x_1 x_2 \ldots x_n) \ge 0$,   (1.3)

$\sum_{x_1 x_2 \ldots x_n \in \mathbb{X}^n} P(x_1 x_2 \ldots x_n) = 1$,   (1.4)

$P(x_1 x_2 \ldots x_n) = \sum_{x_{n+1} \in \mathbb{X}} P(x_1 x_2 \ldots x_n x_{n+1})$.   (1.5)

Additionally, a probability distribution $P$ is called stationary if

$P(x_1 x_2 \ldots x_n) = \sum_{x_0 \in \mathbb{X}} P(x_0 x_1 \ldots x_n)$.   (1.6)

There are infinitely many possible probability distributions, also stationary ones. The essence of statistical language modeling lies in picking up the probability distribution that resembles natural language the most.

The simplest probability distribution on strings is obtained when we assume that all symbols in the alphabet $\mathbb{X}$ are sampled independently with equal probabilities,

$P(x_1 x_2 \ldots x_n) = D^{-n}$,   (1.7)

where

$D = \operatorname{card} \mathbb{X}$   (1.8)

is the number of symbols in the alphabet.

Probability model (1.7) will be called the uniform distribution. It is easy to check that the uniform distribution is stationary.

In general, probabilities need not satisfy conditions (1.7) and (1.8). Suppose that we have some particular text that we want to imitate, which we will denote as $w_1 w_2 \ldots w_N$. The stationary frequency of a substring $x_1 x_2 \ldots x_n$ in text $w_1 w_2 \ldots w_N$ is

$F(x_1 x_2 \ldots x_n) = \frac{1}{N} \sum_{i=1}^{N} \boldsymbol{1}\{ w_i w_{i+1} \ldots w_{i+n-1} = x_1 x_2 \ldots x_n \}$,   (1.9)

where we will assume a periodic condition $w_{N+i} = w_i$ for $i \ge 1$ and $n \le N$. Subsequently, we will define the empirical distribution

$P(x_1 x_2 \ldots x_n) = F(x_1 x_2 \ldots x_n)$.   (1.10)

It is easy to check that the empirical distribution is also stationary.
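The stationary frequency (1.9) with the periodic condition can be sketched in Python by reading the text cyclically:

```python
# A sketch of the stationary frequency (1.9): the relative number of
# occurrences of a substring in the text read cyclically, which realizes
# the periodic condition.
def periodic_frequency(text, w):
    n, k = len(text), len(w)
    extended = text + text[:k - 1]  # wrap-around for the periodic condition
    hits = sum(1 for i in range(n) if extended[i:i + k] == w)
    return hits / n
```

By construction, summing this frequency over a prepended symbol returns the frequency of the shorter substring, which is exactly the stationarity condition (1.6).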

Having the empirical distribution, we can also define a ladder of Markov processes, as envisaged by Markov (1913) and Shannon (1948). The first step is the Markov process of order 0, which reads

$P(x_1 x_2 \ldots x_n) = \prod_{i=1}^{n} F(x_i)$.   (1.11)

This model has the advantage that the probabilities of individual symbols are equal to the relative frequencies of individual symbols in the imitated text $w_1 w_2 \ldots w_N$.

Continuing the sequence of the Markov processes, we have the Markov process of order 1,

$P(x_1 x_2 \ldots x_n) = F(x_1) \prod_{i=2}^{n} \frac{F(x_{i-1} x_i)}{F(x_{i-1})}$.   (1.12)
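A Markov process of order 1 fitted to a text can also be sampled from, yielding a crude imitation of the text. A minimal sketch in Python, again using the periodic condition for the final bigram:

```python
# A sketch of fitting the Markov process of order 1 (1.12) to a text via
# cyclic bigram counts, and of sampling an imitation of the text from it.
import collections
import random

def fit_markov1(text):
    """Counts approximating P(b | a), from bigrams with the periodic condition."""
    trans = collections.defaultdict(collections.Counter)
    for a, b in zip(text, text[1:] + text[0]):
        trans[a][b] += 1
    return trans

def sample_markov1(trans, start, length, seed=0):
    """Generate a string of the given length from the fitted chain."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        successors = trans[out[-1]]
        out.append(rng.choices(list(successors), list(successors.values()))[0])
    return "".join(out)
```

Higher rungs of the ladder of Markov processes are fitted analogously, by conditioning on longer contexts.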