Introduction to Corpus Linguistics

Sandrine Zufferey

Description

Over the past decades, the use of quantitative methods has become almost generalized in all domains of linguistics. However, using these methods requires a thorough understanding of the principles underlying them. This book aims to provide students with an up-to-date and accessible guide to both corpus linguistics and experimental linguistics. Its objectives are to help students develop critical thinking about the way these methods are used in the literature, and to help them devise their own research projects using quantitative data analysis.


Page count: 478

Year of publication: 2020




Table of Contents

Cover

Title Page

Copyright Page

Preface

1 How to Define Corpus Linguistics

1.1. Defining elements

1.2. Empiricism versus rationalism in linguistics

1.3. Chomsky’s arguments against empiricism in linguistics

1.4. Corpus linguistics and computer tools

1.5. Quantitative versus qualitative methods

1.6. Differences between corpus linguistics and experimental linguistics

1.7. Different types of corpora

1.8. Conclusion

1.9. Revision questions and answer key

1.10. Further reading

2 How to Use Corpora in Theoretical Linguistics

2.1. Phonetics and phonology

2.2. Morphology

2.3. Syntax

2.4. Lexicon

2.5. Discourse analysis

2.6. Pragmatics

2.7. Sociolinguistics

2.8. Diachronic linguistics

2.9. Conclusion

2.10. Revision questions and answer key

2.11. Further reading

3 How to Use Corpora in Applied Linguistics

3.1. Language acquisition

3.2. Language impairments

3.3. Second language acquisition

3.4. Language teaching

3.5. Lexicography

3.6. Stylistics

3.7. Legal linguistics

3.8. Conclusion

3.9. Revision questions and answer key

3.10. Further reading

4 How to Use Multilingual Corpora

4.1. Comparable corpora and parallel corpora

4.2. Looking for a tertium comparationis

4.3. Translations as a discursive genre

4.4. Multilingual corpora and contrastive linguistics

4.5. Parallel corpora and translation studies

4.6. Parallel corpora and bilingual dictionaries

4.7. Conclusion

4.8. Revision questions and answer key

4.9. Further reading

5 How to Find and Analyze Corpora in French

5.1. Corpora formats and their availability

5.2. Reference corpora

5.3. Written French corpora

5.4. Spoken French corpora

5.5. Children and learner corpora

5.6. Multilingual corpora including French

5.7. Corpus consultation tools

5.8. Conclusion

5.9. Revision questions and answer key

5.10. Further reading

6 How to Build a Corpus

6.1. Before deciding to build a corpus

6.2. Establishing the size and representativeness of data

6.3. Choosing language samples

6.4. Preparing and coding corpus files

6.5. Recording and transcribing spoken data

6.6. Ethical and legal issues

6.7. Conclusion

6.8. Revision questions and answer key

6.9. Further reading

7 How to Annotate a Corpus

7.1. Corpus annotations

7.2. Different types of annotations

7.3. Standardization of annotation schemes

7.4. The stages of the annotation process

7.5. Annotation tools

7.6. Measuring the quality and reliability of an annotation

7.7. Sharing your annotations

7.8. Conclusion

7.9. Revision questions and answer key

7.10. Further reading

8 How to Analyze Corpus Data

8.1. Descriptive statistics for corpus data

8.2. Measuring the lexical richness of a corpus

8.3. Measuring lexical dispersion in a corpus

8.4. Basics of inferential statistics

8.5. Typical variables in corpus studies

8.6. Measuring the differences between categories

8.7. Conclusion

8.8. Revision questions and answer key

8.9. Further reading

Conclusion: The Stages for Carrying Out a Corpus Study

C.1. Stage 0: wanting to know more

C.2. Stage 1: identify relevant literature

C.3. Stage 2: formulating research hypotheses

C.4. Stage 3: operationalizing your hypotheses and choosing data

C.5. Stage 4: extracting and annotating corpus data

C.6. Stage 5: analyzing data

C.7. Stage 6: presenting your study in a report or an article

C.8. Conclusion

References

Index

Other titles from ISTE in Cognitive Science and Knowledge Management

End User License Agreement

List of Tables

Chapter 6

Table 6.1. List of the 15 most frequent words in the Sciences Humaines corpus

Table 6.2. Example of a table summarizing corpora metadata

Chapter 7

Table 7.1. Cross-tabulation of the results of a double annotation

Chapter 8

Table 8.1. Total number of passive sentences per text

Table 8.2. Relative frequency of causal connectives every 10,000 words in the Sci...

Table 8.3. Translations of “toutefois” and “néanmoins” into English in a journali...

Table 8.4. Data used for calculating the difference in proportions as a dispersio...

Table 8.5. Examples of relevant variables in corpus linguistics and their types

Table 8.6. Frequency of the word “huitante” every 100,000 words per group of cant...

Table 8.7. Relative frequency of “huitante” every 100,000 words for the three can...

Table 8.8. Number of occurrences of the two words used for expressing 80, by grou...

Table 8.9. Occurrences of regional words by groups of cantons expressed in percen...

Table 8.10. Number of regional word occurrences per canton

Table 8.11. Observed (and expected) frequencies of the two words used for denotin...

Table 8.12. Standardized residuals for the χ² test corresponding to Table 8.10, w...

List of Illustrations

Chapter 4

Figure 4.1. Comparable and parallel corpora that can be retrieved from a bi-...

Figure 4.2. Tertium comparationis for past tenses in English and German

Chapter 5

Figure 5.1. Search results for the word “avis” with AntConc. For a col...

Figure 5.2. Occurrence sorting of the word “avis” according to its nei...

Figure 5.3. Occurrence sorting of the word “avis” according to its nei...

Figure 5.4. Frequency of the words clé and clef from 1800 to 2000 in the Goo...

Figure 5.5. Frequency of the words Saussure and Chomsky from 1900 to 2000 in...

Figure 5.6. Frequency of the words Saussure and Chomsky from 1900 to 2000 in...

Figure 5.7. Frequency of the word orange as a noun and as an adjective from ...

Chapter 6

Figure 6.1. Example of a CLAPI corpus transcription. For a color version of ...

Chapter 7

Figure 7.1. Syntactic representation of a sentence in the form of a tree

Figure 7.2. Stages of the annotation process. For a color version of this fi...

Chapter 8

Figure 8.1. Value dispersion around the mean

Figure 8.2. Chi-square test result, as displayed in VassarStats



Introduction to Corpus Linguistics

Sandrine Zufferey

First published 2020 in Great Britain and the United States by ISTE Ltd and John Wiley & Sons, Inc.

Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms and licenses issued by the CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the undermentioned address:

ISTE Ltd
27-37 St George’s Road
London SW19 4EU
UK
www.iste.co.uk

John Wiley & Sons, Inc.
111 River Street
Hoboken, NJ 07030
USA
www.wiley.com

© ISTE Ltd 2020

The rights of Sandrine Zufferey to be identified as the author of this work have been asserted by her in accordance with the Copyright, Designs and Patents Act 1988.

Library of Congress Control Number: 2020938264

British Library Cataloguing-in-Publication Data
A CIP record for this book is available from the British Library
ISBN 978-1-78630-417-9

Preface

Since the 1990s, linguistics has progressively undergone a fundamental methodological turning point. An essentially rationalist discipline since the works of the American linguist Noam Chomsky in the middle of the 20th Century, it has gradually (re)opened up to the empirical approaches represented by corpus linguistics and experimental linguistics. Over the past decade, this transition has accelerated even further, to the point that the majority of linguistic works published in international journals currently make use of empirical data. Linguistic corpora have thus gradually established themselves as fundamental tools for linguists, and their use has spread to fields of linguistics that traditionally favored a rationalist approach, such as syntax. The development of corpus linguistics has led to the creation of new methods for collecting and analyzing linguistic data, made possible by the development of computers and the arrival of the Internet. This new direction in linguistics has enabled spectacular advances in dealing scientifically with the multiple facets of human language in all its complexity. Our book aims to introduce this wealth of research to readers who are not yet accustomed to the linguistics literature.

Nowadays, the ability to quantitatively analyze corpus data has become an integral part of the linguist’s toolbox. Nevertheless, the use of such data rests on precise theoretical and methodological principles, which require a thorough understanding. This turning point in linguistics makes it necessary to introduce new generations of students to these methods, helping them to understand the issues underlying their use in the scientific literature, to critically assess the results obtained, and to apply them in their own academic work. Our book is intended as an educational support for students and, more generally, for all those wishing to learn how to use corpora in linguistics.

The material introduced in this book presupposes no prior skills other than basic linguistic knowledge and a minimum command of the most common computer tools, such as spreadsheet software. The book has been designed as study material for teaching corpus linguistics in introductory university courses, as well as a tool for students wishing to train themselves in the use of corpora. Students will be able to work independently thanks to the revision questions presented at the end of each chapter and the detailed answers provided.

As an introductory work, this book is necessarily selective and does not deal with all the questions raised by the use of corpora in the different linguistic disciplines. It does not cover certain advanced analysis methods that require a high level of computer and statistical skills. However, further reading is suggested at the end of each chapter for those who wish to explore one or another of the aspects presented in greater depth.

Finally, this book places a special emphasis on French as an object of study. While corpus linguistics has become firmly established in the English-speaking world, and a significant proportion of French-speaking researchers currently use these methods, the teaching of corpus linguistics still remains marginal in France. This book therefore also aims to highlight the vitality and richness of corpus studies devoted to French, and to identify the most important resources developed for this language, in the hope of contributing to the rise of this discipline for the study of French.

Sandrine ZUFFEREY

June 2020

1 How to Define Corpus Linguistics

This chapter offers the main defining elements of corpus linguistics, in order to understand what this field includes. It also lays out the theoretical and methodological foundations on which the discipline rests. In particular, we will introduce the difference between empirical and rationalist methodologies in linguistics, the important role of computer science for corpus linguistics, the difference between quantitative and qualitative studies, as well as the differences between corpus linguistics and experimental linguistics. In conclusion, we will briefly review the different types of corpora. This introduction will help us to tackle, in the upcoming chapters, the research questions that can be answered by means of a corpus study.

1.1. Defining elements

The term corpus is of Latin origin and means “body”. A text corpus is literally a body of texts: a collection of texts gathered for study. For example, it is possible to collect a series of newspaper articles and make a corpus of them in order to study the specificities of the journalistic genre. In the field of language teaching, it is also possible to collect texts written by students at different proficiency levels, and to build a corpus of these writings in order to study the typical errors that students produce at different learning stages. A methodology that uses data from the outside world rather than one’s own knowledge of the language is called an empirical methodology. Corpus linguistics can be defined as an empirical discipline par excellence, since it aims to draw conclusions based on the analysis of external data, rather than on researchers’ own linguistic knowledge.

Working in corpus linguistics therefore implies being in contact with linguistic data in the form of texts, but also in the form of recordings, videos or any other sample containing language. Most of the time, these samples are collected in a computerized format, which makes it possible to study them more effectively than if they were on paper. Let us imagine, for example, that we wish to know how many times, and in what passages, Flaubert evokes the feeling of love in his novel Madame Bovary. If we have a paper version of the book, finding these passages will be a long and tedious task, requiring us to go through the entire text. A computerized version, however, makes the task much easier. We simply need to look up the terms love, in love or the verb to love in its different forms with the search function of a word processor, in order to locate the occurrences and easily count them. For most of the questions addressed by corpus linguistics, it would be impossible to search through a paper database, which is why having computerized corpora is essential.
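To make the idea concrete, the kind of search-and-count just described can be sketched in a few lines of Python. This is only an illustrative sketch: the sample text and the list of terms are invented stand-ins, not data from Madame Bovary.

```python
import re

# Toy stand-in for a computerized text; in practice this would be the
# full text of the novel loaded from a file.
text = "Emma dreamed of love. She loved the idea of being in love."

# One case-insensitive, whole-word search per term, mimicking repeated
# uses of a word processor's search function.
terms = ["love", "in love", "loved"]
counts = {t: len(re.findall(r"\b" + re.escape(t) + r"\b", text, re.IGNORECASE))
          for t in terms}
print(counts)  # {'love': 2, 'in love': 1, 'loved': 1}
```

Note that each term still requires its own search, and that the hits for “love” overlap with those for “in love”; these are exactly the inconveniences discussed in section 1.4.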

The problem of manually tracking and counting occurrences is all the more acute since corpus linguistics is often based on large amounts of data, not drawn from a single book, in order to observe multiple occurrences of a given linguistic phenomenon and thus grasp its specificities. For example, let us suppose that we wish to know whether Flaubert talks about love throughout his work. In this case, focusing solely on Madame Bovary would introduce a bias, because this novel is not representative of his work as a whole. In order to answer this question, it is necessary to go through all of his novels, making the task even more complex to perform manually. Let us now imagine that we want to know whether the French authors of the 19th Century all deal with the question of love as much as Flaubert does. In this case, it would be impossible to look up the occurrences of terms related to love in all of the novels written by French authors in the 19th Century. To get around this problem, it would be necessary to collect a sample of texts representative of the works of this period. We will discuss this topic in Chapter 6, which is devoted to the methodological principles underlying the construction of a corpus. For the moment, the important point to bear in mind is that corpus linguistics often resorts to a quantitative methodology (see section 1.5), so as to be able to generalize the conclusions observed on the basis of a linguistic sample to the language as a whole, or to a particular language register.

As we will see in the following chapters, corpus linguistics may be of use in all areas of linguistics, whether theoretical (see Chapter 2) or applied (see Chapter 3). For example, it is crucial in lexicography, since it makes it possible to draw up an exhaustive inventory of a language’s lexicon. It also makes it easy to find examples of use in different types of sources (literary, journalistic and others), while bringing to light the expressions in which a word is frequently used. In other words, it makes it possible to establish phraseology elements that are very useful for dictionaries. For example, it is useful to know what the word “knowledge” means, but it is just as important to know that this word is frequently used in phrases such as “acquire knowledge” or “having good knowledge of”. Corpus linguistics is a particularly effective method for establishing the frequent contexts in which a word or an expression is used. But corpus linguistics is also used for conducting research in fundamental areas of linguistics such as syntax, since it makes it possible to identify the types of syntactic structures used in different languages. For example, by conducting a corpus study, it is possible to determine in which textual genres the passive voice is most commonly used. Finally, thanks to corpora of spoken data, corpus linguistics also makes it possible to answer questions related to phonology and sociolinguistics. For instance, it makes it possible to establish the geographical distribution of certain pronunciation traits, such as the distinction between the short /a/ in the French word “patte” (paw) and the long /ɑ/ in the word “pâte” (dough). Answering these different questions requires different types of corpora, as well as data about their contents. For example, in order to determine the geographical area of diffusion of a certain pronunciation trait, it is necessary to know where each speaker who contributed to the corpus came from. This type of information is called corpus metadata. We will review the main types of existing corpora at the end of this chapter, and discuss the issue of metadata in Chapter 6.

To sum up, in this section, we have defined corpus linguistics as an empirical discipline, which observes and analyzes quantitative language samples gathered in a computerized format. In the following sections, we will discuss in depth the different central points of the definition, indicated in bold, in order to better understand the theoretical and methodological anchoring of corpus linguistics.

1.2. Empiricism versus rationalism in linguistics

Corpus linguistics is an empirical discipline, which means that it uses data produced by speakers in order to study language. This methodology is opposed to the rationalist method, which looks for answers in one’s own linguistic knowledge rather than in external data. Let us take an example. In order to determine whether the sentence “When do you think he will prepare which cake?” is grammatically correct, an empirical methodology would involve searching large corpora to find out whether this syntactic structure is used by English speakers.

If sentences with this syntactic structure never or almost never appear in the corpus, linguists might conclude that it is rarely, if ever, used in English. A rationalist methodology, on the contrary, would address the same question by relying on the intuitions of linguists. In this particular case, they would ask themselves whether they could produce such a sentence, and whether it seems correct or incorrect according to their knowledge of the language, and would infer a grammaticality judgment from it. Grammaticality judgments are often classified into three types: correct, incorrect or marked, the latter in the event that a sentence seems possible but sounds unnatural.

This example illustrates a fundamental difference between the empirical and rationalist methodologies. While the rationalist methodology leads to the formulation of categorical judgments, the empirical methodology provides a more nuanced answer, since the observation of corpus data offers a precise indication of frequency, rather than a result in terms of presence or absence. This is one of the reasons why many linguists currently consider that the empirical methodology better matches a scientific approach (in the sense of confronting theories with facts) than a purely rationalist method for studying language.

The choice between empirical and rationalist methods is, however, not limited to the field of linguistics. Certain scientific branches, such as physics and chemistry, as well as sociology and history, are essentially empirical disciplines. In fact, both physicists and historians base their insights on external data, which they collect in the world, in order to build a theory, test it and draw conclusions from it. On the other hand, disciplines such as mathematics or philosophy are traditionally based on a rationalist approach, since mathematicians and philosophers build theories and draw conclusions from their own reasoning, rather than from the collection and observation of external data. Philosophers often resort to thought experiments, but these are not experiments in the empirical sense of the term, because they rely on the reflective abilities of researchers.

1.3. Chomsky’s arguments against empiricism in linguistics

Although corpus linguistics has experienced strong growth over the past 20 years, the empirical grounding of linguistics is not new. Linguists have long used observational data. In the 19th Century, for example, linguists worked on the comparison of Indo-European languages in an attempt to reconstruct their common origin. This research was based on existing data from languages spoken in Europe, such as German, French and English. Similarly, in the first half of the 20th Century in the United States, the so-called distributionalist approach to syntax focused on the syntactic structures found in text corpora, and from there tried to infer the general functioning of language. Around the late 1950s, the use of corpora in linguistics was almost completely interrupted in certain fields such as syntax, following the works of the American linguist Noam Chomsky. Chomsky defended a strictly rationalist methodological approach to linguistics, and fiercely opposed any use of external data. His objections against the use of external data in linguistics have been numerous. We will briefly review them, to show in what ways most of them have lost their raison d’être in the context of current research.

Chomsky’s first and most fundamental objection to the use of corpora is that corpora contain language samples produced by speakers. According to him, linguistics should not focus on the linguistic performance of speakers, but on the competence they have in their mother tongue, what he calls their internal language. Here is the problem: when people speak, what they produce (their performance) does not necessarily reflect what they know about their language (their competence). For example, under the effect of stress or fatigue, speakers sometimes produce slips of the tongue or make language mistakes. From time to time, almost everybody misconjugates an irregular verb, mistakenly producing the form “he eated” instead of “he ate”. However, if the person who produced this wrong form were recorded and then asked whether they thought they had spoken correctly, we can be almost certain that they would notice their mistake and be able to state the correct form, “he ate”. Conversely, a speaker could pronounce a word like “serendipity” after having heard it from somebody else’s lips, but without really knowing its meaning. These examples illustrate the fact that the words speakers utter are not always a true reflection of their linguistic competence. In this way, according to Chomsky, studying corpora places linguists on the wrong track, because it leads them to consider language from the point of view of production, which merely offers a biased reflection of the rules of the language.

According to Chomsky, another problem with corpus linguistics stems from the fact that corpora are not representative of the language as a whole. He illustrates this problem in an extreme way, with the case of an aphasic speaker recorded in a corpus. Linguists analyzing this corpus would draw totally incorrect conclusions about the language in question, since this person does not represent the linguistic competence of a typical speaker. Furthermore, even without an atypical speaker, a corpus can never represent more than a tiny sample of the language compared to all the oral and written productions in that language. For the same reason, it is impossible to conclude that a word does not exist in a language simply because it is absent from a corpus. It may simply never have been produced in that particular context, while existing in other language registers or in the speech of speakers not included in the corpus. This problem is particularly acute in the case of rare linguistic phenomena, such as infrequent words or little-used linguistic structures.

This limitation leads to Chomsky’s third criticism of corpora, namely that a corpus can never contain the whole of a language and that, therefore, the above-mentioned biases cannot be resolved. According to him, this problem is all the more serious because even if a corpus were very large and included a representative portion of the language, it would not be fully analyzable by linguists, given that it is impossible to manually analyze the content of billions of sentences.

Chomsky’s last two objections have largely become obsolete due to the advances made in computer science. The size of corpora has increased exponentially over the past 20 years, and corpus analysis tools have also made considerable progress. It has thus become possible to analyze very large amounts of data, which represent a much more accurate mirror of the language than when Chomsky formulated his objections. We will return to this in section 1.4, devoted to the connections between computer science and corpus linguistics. In addition to these technological advances, theoretical and methodological advances have also largely made it possible to eliminate or control the other types of biases mentioned by Chomsky. For example, good practice when building a corpus is to accurately document the type of language it contains. This helps to avoid, for example, mistakenly analyzing the language of a single aphasic subject, as in Chomsky’s example. It is nonetheless true that a corpus can only show what it contains, and therefore the absence of a word or a structure from a corpus cannot constitute definitive proof of its absence from the language. Thus, for certain research questions relating to phenomena that are rare or hard to observe in a corpus, it might be advisable to complement corpus research with another empirical method, namely the experimental method. As we will see later in this chapter, this method shares the use of a quantitative methodology with corpus linguistics.

In conclusion, we should point out that the rationalist method advocated by Chomsky comes with its own non-negligible biases and limitations, which can be corrected by the use of empirical methods. In particular, this method leaves a large space for the subjectivity of linguists and overestimates the linguistic intuitions of speakers. Indeed, the use of grammaticality judgments presupposes that all speakers have a definite and consistent intuition about every sentence in their mother tongue. This is not the case. While all English speakers agree that a sentence like “Mary dog her walks” is incorrect in English, whereas the sentence “Mary walks her dog” is correct, judgments are far less unanimous for complex sentences, such as the one mentioned above: “When do you think he will prepare which cake?”. These divergences become problematic as soon as such judgments are used for building a linguistic theory. What is more, while many English speakers would likely reject a sentence such as “He does be working” as grammatically incorrect, in certain areas of the English-speaking world (such as Ireland) this sentence is acceptable. By drawing on many different speakers, and by including speakers from different geographical areas in reference corpora, corpus linguistics responds to this problem in a much more satisfactory way.

What is more, in many areas of linguistics, such as lexicology, language acquisition and sociolinguistics, the idea of relying on the internal judgments of linguists is simply not conceivable. No one can study children’s language by remembering how they spoke as a child, or make assumptions about language differences between men and women by imagining how they would speak if they were a man or a woman. In all these fields, the use of corpora has long been obvious, and it was never interrupted as a result of Chomsky’s work. The paradigm shift of recent decades has taken place in areas where a purely rationalist methodology is conceivable, such as syntax.

Finally, it is important to remember that linguistic theory and the intuition of researchers are not absent from most corpus studies. Indeed, a majority of linguists consider corpus studies as a tool for validating or invalidating hypotheses about language, formulated in advance on the basis of the scientific literature and their linguistic intuitions. We will see many examples of this approach (empirical validation) throughout this book. This corpus-based research approach is opposed to an approach that considers corpus data as the only point of reference, both theoretically and methodologically. In the latter approach, linguists begin their research without any a priori and simply let hypotheses emerge from corpus data (this is called a corpus-driven approach). This approach is far from unanimously accepted among linguists working with an empirical methodology. On this point, we agree with the opinion that Chomsky expressed metaphorically: working in this way would be the equivalent, for physicists, of hoping to discover the physical laws of the universe by looking out of the window. Observing data without a hypothesis often leads to not being able to make sense of the data. It is for this reason that the approach we adopt in this book is a corpus-based approach, which considers corpora as tools available to linguists for testing their hypotheses.

1.4. Corpus linguistics and computer tools

As we have seen above, corpus linguistics, as performed nowadays, cannot do without computers. Even if works related to corpus linguistics have existed for a long time (such as the indexing of the Bible by theologians or the file-based construction of dictionaries by scholars like Antoine Furetière in French or Samuel Johnson in English), this discipline was only able to properly take off after the arrival of computing.

Corpus linguistics depends on computer science for various reasons. The first one, which we have already mentioned above, is related to the need for computerized texts in order to be able to carry out truly quantitative research. Nevertheless, looking for elements in a corpus, even a computerized one, by using a simple word processing tool is rather inconvenient. Going back to the example of the search for terms related to love in Flaubert, which we discussed earlier, we find that the search function of a typical word processor quickly reaches its limits. First of all, in order to verify that all occurrences found when looking for the verb to love correspond to expressions of love as a feeling rather than to modal uses, as in the phrase "I would love you to keep quiet", it is necessary to examine each occurrence and thus browse the entire text. Second, to find all the occurrences of the verb to love, it is necessary to perform a different search for each verbal form, for example love, loved, etc. It is for this reason that other computing tools, specifically devoted to corpus linguistics, have been developed.
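As a small illustration of what even a modest dedicated tool adds over a word processor's plain search, the sketch below (in Python, on an invented sample sentence) shows how a single regular-expression query retrieves all the inflected forms of to love at once. Note that it still cannot separate modal from emotional uses: that requires reading each occurrence, or an annotated corpus of the kind discussed below.

```python
import re

# Invented sample text mixing emotional, modal and unrelated uses.
text = ("She loved him. He loves her. They love each other. "
        "I would love you to keep quiet. What a lovely day.")

# One pattern covers every inflected form of "love" in a single query.
# The \b word boundaries prevent false hits on "lovely" or "glove".
pattern = re.compile(r"\blov(?:e|es|ed|ing)\b", re.IGNORECASE)

matches = pattern.findall(text)
print(matches)  # ['loved', 'loves', 'love', 'love']
```

A word processor would instead require one search per form (love, loves, loved, loving) and would also stop on lovely.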

In particular, concordancers are useful for searching all the occurrences of a word, plus their context of use and for displaying the results line by line in a single query. These tools also make it possible to establish the list of words contained in the corpus, together with their frequency, and to generate a list of keywords matching the content of a corpus. In the case of corpora containing texts as well as their translation, certain tools called aligners make it possible to align the content of the corpus sentence by sentence. That being done, bilingual concordancers search directly for the occurrences of a word in one of the two languages of the corpus, and simultaneously extract the matching sentence in the other language. We will learn how to use these tools in Chapter 5, which is devoted to the presentation of the main French corpora, as well as the tools for analyzing them.
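The core of a concordancer, the keyword-in-context (KWIC) display and the frequency list, can be sketched in a few lines. The function and sample sentence below are invented for illustration; real concordancers offer far richer query options.

```python
from collections import Counter

def kwic(tokens, node, width=3):
    """A minimal keyword-in-context (KWIC) display: every occurrence
    of `node` on its own line, with `width` tokens of context."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left:>25} | {tok} | {right}")
    return lines

tokens = ("I have not seen Mary since Christmas and since then "
          "nothing has changed").split()
for line in kwic(tokens, "since"):
    print(line)

# Concordancers also produce word-frequency lists, which here is one line:
freq = Counter(tok.lower() for tok in tokens)
```

Each occurrence of the node word is aligned in a column, which is precisely what makes it possible to scan hundreds of contexts quickly.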

Then, in Chapter 7, we will also see that in order to answer certain research questions, it is necessary to annotate the content of a corpus. For example, let us imagine that we wish to study the different contexts in which we can use the causal adverb since. If we only look up the word since in the corpus, we will also find occurrences which do not correspond to the use of this word as a causal adverb, but to its use as a preposition, for example in “I haven’t seen Mary since Christmas”. So, to be able to correctly look up the uses of since we are interested in, we should only keep those which are adverbs and exclude prepositions. This search can be greatly simplified if the corpus has been annotated by determining, for each word, its grammatical category. This operation, called part-of-speech tagging, can be performed automatically by certain software.
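The filtering step that part-of-speech annotation makes possible can be sketched as follows. The (word, tag) pairs are invented for illustration, using a simplified tagset in which the causal use of since is tagged ADV and the prepositional use ADP; a real tagger and its tagset may label these uses differently.

```python
# A corpus that has already been part-of-speech tagged, represented as
# (word, tag) pairs. The sentences and the simplified tagset are
# invented for illustration (ADV = causal adverb, ADP = preposition).
tagged = [
    ("I", "PRON"), ("have", "AUX"), ("not", "PART"), ("seen", "VERB"),
    ("Mary", "PROPN"), ("since", "ADP"), ("Christmas", "PROPN"), (".", "PUNCT"),
    ("Since", "ADV"), ("you", "PRON"), ("insist", "VERB"), (",", "PUNCT"),
    ("I", "PRON"), ("will", "AUX"), ("come", "VERB"), (".", "PUNCT"),
]

# Keep only the causal uses of "since", excluding the preposition.
causal_uses = [(w, t) for (w, t) in tagged
               if w.lower() == "since" and t == "ADV"]
print(causal_uses)  # [('Since', 'ADV')]
```

Without the tags, a plain string search for since would return both occurrences and the prepositional one would have to be discarded by hand.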

Another problem might arise if we decide to study the use of relative clauses such as "the girl who is intelligent" or "the violin which was left on the bus". For this study, a good starting point would be to look for relative pronouns such as who or which in order to find occurrences of relative clauses in the corpus. The problem is that these words are also used in interrogative sentences such as "Who do you prefer?" or "Which hat is yours?" In this case, looking up the grammatical category of the word will not solve the problem, because both uses are pronouns. In order to find only the occurrences of who and which as relative pronouns, we should use a corpus in which the syntactic structure of each sentence has been analyzed, in such a way that a grammatical function can be assigned to each word and words can be grouped into syntactic constituents. Tools for analyzing the syntactic structure of sentences have also been developed in the context of work on automatic language processing. These automatic analyses still require human checks in order to avoid errors, but their performance is continually improving. The arrival of these tools has greatly accelerated research in corpus linguistics. We will discuss this issue in Chapter 7, which is devoted to annotations.

But corpus linguistics was not only developed thanks to the creation of such tools. Above all, it is the general development of computers and the digital revolution which have made the greatest advances possible. In fact, the increase in the computing power of machines – as well as in their memory – has made it possible to build ever larger corpora. Until the 1980s, a corpus of a million words was considered to be a very large corpus. For instance, the first reference corpora (such as the Brown corpus, developed for American English in the early 1960s) were about this size. At the same time, the arrival of cassette recorders on the market enabled the creation of the first oral corpora containing an exact transcription of spoken language, rather than a summary taken down in shorthand.

The marketing of scanners in the 1980s later made it possible to digitize a significant amount of data, and corpora began to reach larger sizes, up to 20 million words. Then, with the democratization of computer use, the amount of digitally disseminated text greatly accelerated the growth of corpora. Finally, since the beginning of the 21st century, the wide dissemination of documents online via the Internet has given another dimension to the size of the corpora available to researchers. At present, the Google Books corpus, for example, contains more than 500 billion words, which represents approximately 4% of all the books ever published (Michel et al. 2011). We will discuss the possible uses of such a corpus in the following chapters. In Chapter 6, we will also see that the Internet potentially offers an exceptional data resource for corpus linguistics, but that Internet data cannot be used without an additional processing step if data quality is to be guaranteed.

1.5. Quantitative versus qualitative methods

We have seen that computers help us to work on very large corpora and automatically count word occurrences, find keywords, etc. The need to use a large amount of data and the desire to quantify the presence of linguistic elements in a corpus correspond to a quantitative research methodology. This methodology involves observing or manipulating variables, as well as the use of statistical tests. The main objective is to test a limited number of variables, in as controlled an environment as possible, on a language sample that can be considered representative of the phenomenon studied. This can later make it possible to generalize the results obtained to the whole language or to a part of the target language (e.g. journalistic language). These methods nonetheless imply a certain form of reductionism and a simplification of reality. Ultimately, however, the accumulation of studies with well-defined and properly controlled variables may provide a global and realistic picture of a phenomenon.

Let us take an example. Suppose we want to test the hypothesis that women talk more about their feelings than men. To test this hypothesis by means of a corpus study, we should first make sure that we are comparing recordings of men and women produced in the same context, for example, friendly discussions around a topic, or face-to-face interviews with a researcher. We would also need to make sure that the corpus collected in this way includes approximately the same speaking time, or the same number of words, for men and for women. This control over the linguistic context and the duration of interactions ensures that men and women have had fairly equal motives, and as many chances, to produce words related to emotions and feelings. Second, we would have to choose a list of words to search for within the corpus, representative of the vocabulary related to emotions, for example verbs such as to annoy, adjectives like furious or nouns like anger. Then, by comparing the number of times these words have been produced by the two groups, and by validating the significance of the differences observed between the groups through statistical tests, we would be able to provide an answer to the research question. In this study, we have sought to reduce the number of confounding variables by controlling the context of production of the statements, as well as by limiting the word choice in the examined vocabulary. It is precisely this limited and reductionist aspect that opponents of quantitative methods criticize, arguing that the constructed and unnatural context in which structured interviews take place does not reflect the richness of natural and spontaneous exchanges between speakers.
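The counting and testing steps of such a study can be sketched as follows. The counts are invented, and the two-proportion z-test used here (a normal approximation, written with the standard library only) is just one option among the statistical tests mentioned; a chi-square test, for instance, would serve equally well.

```python
from math import sqrt, erfc

def two_proportion_test(hits1, n1, hits2, n2):
    """Two-sided two-proportion z-test (normal approximation):
    compares the rate of emotion words in two subcorpora."""
    p1, p2 = hits1 / n1, hits2 / n2
    pooled = (hits1 + hits2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = erfc(abs(z) / sqrt(2))  # two-sided p under the normal curve
    return z, p_value

# Invented counts: emotion-word tokens out of 10,000 tokens per group.
z, p = two_proportion_test(hits1=180, n1=10_000, hits2=120, n2=10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

With these invented figures the difference would be declared significant at conventional thresholds; with smaller samples the same proportions might not be, which is exactly why the test is needed.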

The other major methodological paradigm includes so-called qualitative studies. The main objective of these studies is holistic: they aim to study a phenomenon as a whole, in as detailed and thorough a manner as possible, but in a small number of people. By their nature, qualitative studies are interpretative. In linguistics, research paradigms involving a qualitative methodology typically resort to questionnaires with open questions, interviews, observations or introspective techniques, such as think-aloud protocols. For example, in order to study the differences in the way men and women express emotions, a qualitative methodology could involve asking a small number of speakers, for example three men and three women, to describe the way in which they express their emotions, either by talking freely with the experimenter or by talking to each other. The analysis would then require an in-depth study of some of the examples found interesting during the discussion.

One of the main criticisms aimed at qualitative methods is that they are very subjective in nature, insofar as they are largely based on the interpretations made by linguists and the subjective impressions of a few speakers. Thus, the specific cases they describe cannot often be generalized to a population, which, by the way, is not the aim pursued by such studies. Rather than the generalization of results, these studies are based on the possibility of making a transfer from a particular situation so as to understand another one with which it shares common traits. For example, an in-depth case study on the difficulties of expressing emotions in an aphasic patient may help to highlight similar difficulties existing in other patients with the same disorder.

To summarize, each of the two methodological paradigms introduced in this section has both advantages and disadvantages. Quantitative methods enable the generalization of results to the whole of a population, whereas qualitative methods offer a more detailed and nuanced panorama of a real case. Recently, the complementarity between these approaches has started to be broadly accepted in research and many studies are crossing the two types of methodologies, in order to benefit from their advantages and limit their disadvantages.

For example, if we want to know whether learners of French as a foreign language at an advanced level are able to use collocations as native speakers do (collocations such as "prendre une décision" – to make a decision – or "pleuvoir à verse" – to pour with rain), we can search for occurrences of these expressions in corpora of texts produced by learners and compare the number of times these expressions appear – and their frequency – with a corpus of similar texts produced by native speakers. By comparing these frequencies through statistical tests, we will know whether or not learners actually use these expressions as often as native speakers do. Even if we find a difference between the two groups, this study will not tell us why learners do not use these expressions as often as native speakers do, or which expressions they use instead. To find out, we can complete this study with a qualitative analysis, by observing, for example, which verbs other than prendre often accompany the occurrences of the noun décision. If we observe that English-speaking learners, but not German-speaking learners, repeatedly use the verb faire (make) rather than prendre (take), we may conclude that these errors come from a transfer from their mother tongue and, more specifically, from the expression to make a decision in English.
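The qualitative follow-up just described, looking at which verbs accompany décision, can be sketched as a simple collocate count over a tagged corpus. The three learner sentences and their tags are invented for illustration.

```python
from collections import Counter

def verb_collocates(tagged_sentences, noun, window=3):
    """Count the verbs occurring within `window` tokens of `noun`.
    Each sentence is a list of (word, tag) pairs."""
    counts = Counter()
    for sent in tagged_sentences:
        for i, (word, _) in enumerate(sent):
            if word.lower() == noun:
                for w, t in sent[max(0, i - window):i + window + 1]:
                    if t == "VERB":
                        counts[w.lower()] += 1
    return counts

# Invented learner sentences, pre-tagged with a simplified tagset.
learner_corpus = [
    [("Il", "PRON"), ("a", "AUX"), ("fait", "VERB"), ("une", "DET"),
     ("décision", "NOUN")],
    [("Nous", "PRON"), ("avons", "AUX"), ("pris", "VERB"), ("la", "DET"),
     ("décision", "NOUN")],
    [("Elle", "PRON"), ("a", "AUX"), ("fait", "VERB"), ("sa", "DET"),
     ("décision", "NOUN")],
]
print(verb_collocates(learner_corpus, "décision"))
# A predominance of "fait" over "pris" in English-speaking learners
# would point towards transfer from "to make a decision".
```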

In summary, a corpus can be analyzed using a quantitative or qualitative methodology. While we acknowledge the use and importance of combining these two approaches, in the rest of the book we will focus on the quantitative approach to corpus linguistics, which poses its own theoretical and methodological challenges.

1.6. Differences between corpus linguistics and experimental linguistics

Corpus linguistics and experimental linguistics share very important methodological properties, since both are empirical in nature and both generally involve a quantitative rather than a qualitative approach. However, these two types of approaches differ in one very important point. On the one hand, corpus linguistics focuses on the observation of data as found in collections of texts, recordings, etc. On the other hand, experimental linguistics relies on the manipulation of one or more variables in order to study their effect on other variables.

Let us imagine once again that we are interested in the types of language errors produced by learners of French. By means of a corpus study, we will be able to identify all the types of errors produced and then quantify each of them: for example, 30 spelling mistakes, 12 lexical errors, 20 syntax mistakes, etc., per 100 words. Then, by applying statistical tests, we will be able to determine whether one of the error categories is significantly more frequent than the others. We will also be able to compare the number of errors produced in each category by students of different levels and, thanks to statistical tests, determine whether students make progress significantly faster in certain categories than in others. In contrast, what a corpus study will not help us to do is establish with certainty the factors influencing the number of errors. The corpus only shows us the result of the speakers' production, not what led to these results. In order to determine the factors that lead learners to make mistakes or not, we need to resort to experimental methodology.
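The normalization implied by "per 100 words" can be made explicit; without it, raw error counts from texts of different lengths cannot be compared. The counts below simply reuse the invented figures from the example.

```python
def errors_per_100_words(error_counts, total_words):
    """Normalize raw error counts to a rate per 100 words,
    so that texts of different lengths can be compared."""
    return {category: round(n * 100 / total_words, 2)
            for category, n in error_counts.items()}

# Invented counts for one learner text of 2,000 words.
counts = {"spelling": 30, "lexicon": 12, "syntax": 20}
print(errors_per_100_words(counts, total_words=2000))
# {'spelling': 1.5, 'lexicon': 0.6, 'syntax': 1.0}
```

The resulting rates (not the raw counts) are what would then be fed into the statistical tests mentioned above.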

When we conduct an experiment, the goal is to manipulate the possible causes and then to observe their effects. Going back to our example research question, we may wonder what makes some students produce more errors than others, and in certain contexts, what makes the same student produce more errors than in other contexts. As regards the difference between students, we may think that one possible cause is the level of general intelligence of each student, the assumption being that overall smarter students should produce fewer errors than less intelligent students. The level of intelligence thus constitutes the cause that we will manipulate in order to observe its effect on the number of errors produced. In order to measure the effect of the intelligence variable, we will first need to measure the students’ intelligence, for example by means of an IQ test. We will then use the result of this test to determine whether the students who have a higher IQ are also the ones who make the fewest language errors.

In the case of the second research question, which seeks to determine why the same student makes more mistakes in certain contexts, we may assume that stress promotes the production of errors. In order to test this hypothesis, we will have to conduct an experiment in which half of the students are placed in a stressful situation such as an examination context or, for instance, a test with a limited amount of time to complete the task, whereas the other half of the students are placed in a low-stress situation, for example, without any time constraint, performing a task which does not involve marked assessment, etc. Then, we will compare the number of errors in the two groups so as to determine, by means of a statistical test, whether the students under a stressful situation make significantly more errors than the other students, or not. In the two examples of studies that we have just discussed, the approach is the same: to identify a possible cause and to assess its effect through experimental manipulation. Conversely, a corpus study focuses on linguistic productions without manipulating the data before collecting them.

The study of linguistic productions in a corpus and the manipulation of experimental variables both have their advantages and disadvantages. On the one hand, corpus linguistics has the advantage of favoring the observation of natural data, that is, those which are not influenced by an experimental context. A corpus of journalistic texts includes real productions by journalists, which are not produced for the purpose of being observed. Likewise, a text produced by a learner is also natural, insofar as it is produced in its usual conditions, without there having been any particular manipulation. In addition, the use of corpora favors the observation of a very large amount of linguistic data, whereas experiments are based on a limited number of linguistic items for the task to remain feasible for participants, who would not be able to read thousands of sentences at a laboratory, for example. Finally, once a corpus has been created, it can be used for numerous research questions without requiring any additional time or financial costs. On the other hand, experiments require significant time resources as well as the usual obligation of having to financially compensate participants for their cooperation.

Experimental studies also have definite advantages over corpus studies. The first advantage, mentioned above, is that experiments allow us to test the existence of a causal relationship between two variables, such as the fact of being stressed and producing more errors. Corpus studies do not make it possible to draw this type of conclusion. Second, while an experimental paradigm can be developed to test almost any kind of phenomenon, some rare linguistic phenomena may be absent, or too poorly represented, in a corpus to be examined in this way. For example, if we want to determine through a corpus study whether learners have mastered French idioms such as "mettre le feu aux poudres" (to stir up a hornet's nest) or "avoir un poil dans la main" (to be extremely lazy), we will have to look for them in a corpus of learners' productions. Now, it is quite possible that these expressions never appear there, but this does not necessarily mean that the learners do not know how to use them. It only means that they did not have an opportunity to produce them in the corpus. Using experimental methodology, we will be able to test whether learners have mastered these expressions. For instance, we can ask them to read the expressions and then to choose, from among several definitions, the one corresponding to their meaning. Finally, experimental linguistics makes it possible to study the linguistic competence of speakers through language comprehension tasks which can be more or less explicit or implicit, such as the conscious evaluation of sentences, their intuitive reading, etc. Corpora can only reflect the linguistic productions of speakers.

To conclude, corpus studies and experimental studies can often be used in a complementary way, and, when put together, they represent powerful tools for answering a good number of research questions.

1.7. Different types of corpora

As we will see in the following chapters, corpora represent linguistic samples of a very varied nature, and it is precisely this variety that makes it possible to answer diverse research questions in all fields of linguistics. In this last section, we will introduce a first classification of the types of existing corpora, in order to be able to refer back to it in the following chapters.

The first distinction we can make among all the existing corpora is the one that classifies them into sample corpora and monitor corpora. Sample corpora are those in which the data have been collected once and for all, and which no longer evolve thereafter. For this reason, they are also known as closed corpora in the specialized literature. The advantage of these corpora is that they have been designed to contain a set of texts representative of the language, or of the part of the language to be studied, with a balanced representation of the different text genres, for example. Thus, these corpora make it possible to draw conclusions which can be generalized. On the other hand, their main defect is that they age quickly and do not follow changes in the language. Therefore, sample corpora need to be collected anew at regular intervals.

On the other hand, monitor corpora are never finished and constantly continue to integrate new elements, which is why they are described as open corpora in the literature. A typical example of this type of data is a corpus containing newspaper archives or parliamentary debates. Every year, the amount of available data increases. It is for this reason that it is difficult to maintain a perfect balance between the different parts of these corpora, whose representativeness cannot be fully guaranteed. We will return to the problem of representativeness in Chapter 6. On the other hand, these corpora remain up to date. In cases where they cover a period of a few decades, they make it possible to observe the appearance of certain changes in the language.

The second major distinction to be made among existing corpora differentiates general language corpora from specialized language corpora. General language corpora aim to offer a panorama of the whole of a language at a given time. It is evidently impossible to collect a sample of the whole language, but in the same way that a general language dictionary aims to describe the common lexicon of a language, the general corpus seeks to offer a global image, including the main textual genres found in language. These corpora are really valuable when it comes to studying a language as a whole, but they cannot offer precise answers on linguistic phenomena present in certain specific communication means, such as mobile texting, social media, medical reports, etc.

In order to study one of these areas specifically, it is preferable to resort to a specialized corpus. In fact, there are corpora especially devoted to texting, social media, etc. In addition, general corpora include productions by adults who are native speakers of the language represented. Other corpora specialize in representing other population categories, regardless of whether they are monolingual children in the process of acquiring their mother tongue, bilingual children, foreign-language learners, or even children with neuro-developmental disorders influencing language acquisition, such as autism and specific language impairment. Finally, by default, a general corpus includes examples of the variety considered as a language standard, or one of its main varieties. In French, it generally refers to the French language from France and, more precisely, from the Parisian region. In English, general corpora can refer to the English language from the UK or to American English. Conversely, some corpora specialize in the productions of speakers of a certain language variety, such as French from French-speaking Switzerland, Belgium, Canada, etc.

General or specialized language corpora can contain either written language or spoken language samples. For a long time, written language corpora were the norm, but the analysis of spoken language has developed broadly since the 2000s. Corpora of spoken language are typically smaller than written language ones, since they require manual transcription. As a matter of fact, it is easy to record speech, but difficult to carry out searches directly on an audio file. At the same time, speech recognition software does not always produce fully reliable automatic transcriptions. It is for this reason that oral data must be transcribed manually, which often limits the size of spoken corpora. More recently, corpora of audio-visual recordings (also called "multimodal" corpora) have been created, in order to facilitate, for instance, the study of gestures and facial expressions as well as their role in communication. These corpora still pose many codification and interpretation challenges. Finally, let us point out that video corpora are also used for the study of sign language.

Another distinction that can be made regarding the types of existing corpora relates to the type of processing carried out on the linguistic data of the corpus. On the one hand, raw corpora contain nothing but language samples. This scenario represents the majority of the French corpora. On the other hand, some annotated corpora