Description

This is the first book of its kind to provide a practical and student-friendly guide to corpus linguistics that explains the nature of electronic data and how it can be collected and analyzed.

* Designed to equip readers with the technical skills necessary to analyze and interpret language data, both written and (orthographically) transcribed
* Introduces a number of easy-to-use, yet powerful, free analysis resources consisting of standalone programs and web interfaces for use with Windows, Mac OS X, and Linux
* Each section includes practical exercises, a list of sources and further reading, and illustrated step-by-step introductions to analysis tools
* Requires only a basic knowledge of computer concepts in order to develop the specific linguistic analysis skills required for understanding/analyzing corpus data




Practical Corpus Linguistics

An Introduction to Corpus-Based Language Analysis

Martin Weisser

This edition first published 2016 © 2016 John Wiley & Sons, Inc

Registered Office: John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK

Editorial Offices: 350 Main Street, Malden, MA 02148-5020, USA; 9600 Garsington Road, Oxford, OX4 2DQ, UK; The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK

For details of our global editorial offices, for customer services, and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com/wiley-blackwell.

The right of Martin Weisser to be identified as the author of this work has been asserted in accordance with the UK Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book.

Limit of Liability/Disclaimer of Warranty: While the publisher and authors have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. It is sold on the understanding that the publisher is not engaged in rendering professional services and neither the publisher nor the author shall be liable for damages arising herefrom. If professional advice or other expert assistance is required, the services of a competent professional should be sought.

Library of Congress Cataloging-in-Publication Data

Weisser, Martin, author. Practical corpus linguistics : an introduction to corpus-based language analysis / Martin Weisser. – First edition. pages cm. Includes bibliographical references and index. ISBN 978-1-118-83187-8 (hardback) – ISBN 978-1-118-83188-5 (paper) 1. Linguistic analysis (Linguistics)–Databases. 2. Linguistic analysis (Linguistics)–Software. 3. Corpora (Linguistics)–Methodology. 4. Corpora (Linguistics)–Technological innovations. 5. Computational linguistics–Methodology. 6. Computer network resources–Evaluation. 7. Citation of electronic information resources. I. Title. P128.D37W45 2016 410.1′88–dc23

2015023709

A catalogue record for this book is available from the British Library.

Cover image: © Martin Weisser

To Ye & Emma,
who've had to suffer
from an undue lack of attention
throughout the final months
of writing this book

CONTENTS

Acknowledgements

Chapter 1: Introduction

1.1 Linguistic Data Analysis

1.2 Outline of the Book

1.3 Conventions Used in this Book

1.4 A Note for Teachers

1.5 Online Resources

Chapter 2: What's Out There?

2.1 What's a Corpus?

2.2 Corpus Formats

2.3 Synchronic vs. Diachronic Corpora

2.4 General vs. Specific Corpora

2.5 Static vs. Dynamic Corpora

2.6 Other Sources for Corpora

Solutions to/Comments on the Exercises

Note

Sources and Further Reading

Chapter 3: Understanding Corpus Design

3.1 Food for Thought – General Issues in Corpus Design

3.2 What's in a Text? – Understanding Document Structure

3.3 Understanding Encoding: Character Sets, File Size, etc.

Solutions to/Comments on the Exercises

Sources and Further Reading

Chapter 4: Finding and Preparing Your Data

4.1 Finding Suitable Materials for Analysis

4.2 Collecting Written Materials Yourself (‘Web as Corpus’)

4.3 Collecting Spoken Data

4.4 Preparing Written Data for Analysis

Solutions to/Comments on the Exercises

Sources and Further Reading

Chapter 5: Concordancing

5.1 What's Concordancing?

5.2 Concordancing with AntConc

Solutions to/Comments on the Exercises

Sources and Further Reading

Chapter 6: Regular Expressions

6.1 Character Classes

6.2 Negative Character Classes

6.3 Quantification

6.4 Anchoring, Grouping and Alternation

6.5 Further Exercises

Solutions to/Comments on the Exercises

Sources and Further Reading

Chapter 7: Understanding Part-of-Speech Tagging and Its Uses

7.1 A Brief Introduction to (Morpho-Syntactic) Tagsets

7.2 Tagging Your Own Data

Solutions to/Comments on the Exercises

Sources and Further Reading

Chapter 8: Using Online Interfaces to Query Mega Corpora

8.1 Searching the BNC with BNCweb

8.2 Exploring COCA through the BYU Web-Interface

Solutions to/Comments on the Exercises

Sources and Further Reading

Chapter 9: Basic Frequency Analysis – or What Can (Single) Words Tell Us About Texts?

9.1 Understanding Basic Units in Texts

9.2 Word (Frequency) Lists in AntConc

9.3 Word Lists in BNCweb

9.4 Keyword Lists in AntConc and BNCweb

9.5 Comparing and Reporting Frequency Counts

9.6 Investigating Genre-Specific Distributions in COCA

Solutions to/Comments on the Exercises

Sources and Further Reading

Chapter 10: Exploring Words in Context

10.1 Understanding Extended Units of Text

10.2 Text Segmentation

10.3 N-Grams, Word Clusters and Lexical Bundles

10.4 Exploring (Relatively) Fixed Sequences in BNCweb

10.5 Simple, Sequential Collocations and Colligations

10.6 Exploring Colligations in COCA

10.7 N-grams and Clusters in AntConc

10.8 Investigating Collocations Based on Statistical Measures in AntConc, BNCweb and COCA

Solutions to/Comments on the Exercises

Sources and Further Reading

Chapter 11: Understanding Markup and Annotation

11.1 From SGML to XML – A Brief Timeline

11.2 XML for Linguistics

11.3 ‘Simple XML’ for Linguistic Annotation

11.4 Colour Coding and Visualisation

11.5 More Complex Forms of Annotation

Solutions to/Comments on the Exercises

Sources and Further Reading

Chapter 12: Conclusion and Further Perspectives

Appendix A: The CLAWS C5 Tagset

Appendix B: The Annotated Dialogue File

Appendix C: The CSS Style Sheet

Glossary

References

Index

EULA

List of Tables

Chapter 2

Table 2.1

Table 2.2

Table 2.3

Table 2.4

Table 2.5

Table 2.6

Table 2.7

Table 2.8

Table 2.9

Table 2.10

Chapter 3

Table 3.1

Chapter 7

Table 7.1

Table 7.2

Table 7.3

Chapter 8

Table 8.1

Table 8.2

Table 8.3

Chapter 9

Table 9.1

Table 9.2

Chapter 11

Table 11.1

Table 11.2

Table 11.3

List of Illustrations

Chapter 3

Figure 3.1 Illustration of basic document structure

Chapter 4

Figure 4.1 The ICEweb interface

Chapter 5

Figure 5.1 Example of a KWIC concordance output

Figure 5.2 The AntConc startup screen

Figure 5.3 AntConc file opening options

Figure 5.4 AntConc file settings

Figure 5.5 AntConc ‘Corpus Files’ window (two files loaded)

Figure 5.6 AntConc ‘Search Term’ and search options

Figure 5.7 AntConc results for round in two novels by Jane Austen

Figure 5.8 AntConc ‘Search Window Size’ options

Figure 5.9 AntConc ‘Kwic Sort’ options

Chapter 6

Figure 6.1 Sample paragraph for practising and understanding regex patterns

Chapter 7

Figure 7.1 Sample output of the Simple PoS Tagger

Chapter 8

Figure 8.1 The BNCweb startup screen

Figure 8.2 Results for simple search for assume

Figure 8.3 BNCweb query follow-on options

Figure 8.4 The basic COCA interface

Figure 8.5 Display of antonyms thoughtful and thoughtless as alternatives

Figure 8.6 Side-by-side comparison for the lemma of movie in the COCA and BNC

Chapter 9

Figure 9.1 Output of a basic frequency list in AntConc

Figure 9.2 Token (word) (re-)definition in AntConc

Figure 9.3 AntConc Word List preferences

Figure 9.4 BNCweb frequency list selection options

Figure 9.5 Options for defining subcorpora in BNCweb

Figure 9.6 Excel text import wizard (stage 1)

Figure 9.7 Excel text import wizard (stage 3)

Figure 9.8 Sort options in Excel

Figure 9.9 Options for defining subcorpora according to genre

Figure 9.10 BNCweb keyword and title scan

Figure 9.11 AntConc Keyword preferences

Figure 9.12 Keyword options in BNCweb

Figure 9.13 Keyword comparison of university essays and written component of the BNC (top 31 entries)

Chapter 10

Figure 10.1 Illustration of a collocation span

Figure 10.2 Options for statistical collocation measures in BNCweb

Chapter 11

Figure 11.1 A brief SGML sample

Figure 11.2 CSS sample paragraph styling

Figure 11.3 TEI header for BNC file KST


Acknowledgements

I'd first like to start by thanking my former students in Chemnitz, Bayreuth and Hong Kong who ‘suffered through’ the initial sets of teaching materials that eventually formed the basis for writing this textbook. The next big thanks needs to go to my colleagues here at Guangdong University of Foreign Studies, who attended a series of workshops where I tested out the materials from the preliminary drafts of several chapters and who provided me with highly useful feedback. Particular mention here deserves to go to Junyu (Mike) ZHANG, who not only commented on the content of several chapters, but also pointed out certain issues of style that have hopefully helped me to make the writing more accessible to an international readership.

My next round of thank yous goes to Yanping DONG and Hai XU, for allowing me to join the National Key Research Center for Linguistics and Applied Linguistics at Guangdong University of Foreign Studies, which has provided me with more of the desperately needed time to focus on writing this book, while also allowing me to conduct other types of research that have influenced the contents of the book in various ways. To Hai XU, I give additional thanks for sharing his experience in, and knowledge of, corpora of Chinese, which has, unfortunately, only partially found its way into this book, due to limits of space. To my other colleagues, especially Yiqiong ZHANG, I also give thanks for providing a more moral type of support through engaging in further discussions and making me feel at home in the Center.

I also owe a great debt to Laurence Anthony for allowing me to pester him with a series of questions about and suggestions for improving AntConc. More or less the same, though possibly to a slightly lesser extent, goes for Sebastian Hoffmann and Mark Davies for answering questions about particular features of, and again partly responding to requests for improving, BNCweb and the COCA interface, respectively.

Next, I'd sincerely like to thank the anonymous reviewers of this book, who, through their many invaluable constructive comments, have not only encouraged me in writing the book, but also hopefully allowed me to improve the contents substantially.

My final – but most important and heartfelt – note of thanks goes to Geoff Leech. Although credit for introducing me to the study of corpus linguistics has to go to someone else, he's certainly been the single most important influence on my career and thinking as a corpus linguist. I'll forever be grateful for his ongoing support throughout my years spent at Lancaster, working with him, as external examiner of my PhD, and later up to his untimely demise in August 2014. I sincerely hope that he would have appreciated the design and critical aims of this textbook, and perhaps also recognised his implicit hand in shaping it…

CHAPTER 1 Introduction

This textbook aims to teach you how to analyse and interpret language data in written or orthographically transcribed form (i.e. represented as if it were written, if the original data is spoken). It will do so in a way that should not only provide you with the technical skills for such an analysis for your own research purposes, but also raise your awareness of how corpus evidence can be used in order to develop a better understanding of the forms and functions of language. It will also teach you how to use corpus data in more applied contexts, such as identifying suitable materials/examples for language teaching, investigating sociolinguistic phenomena, or even trying to verify existing linguistic theories, as well as to develop your own hypotheses about the many different aspects of language that can be investigated through corpora. The focus will primarily be on English-language data, although we may occasionally, whenever appropriate, refer to issues that could be relevant to the analysis of other languages. In doing so, we'll try to stay as theory-neutral as possible, so that no matter which ‘flavour(s)’ of linguistics you may have been exposed to before, you should always be able to understand the background to all the exercises or questions presented here.

The book is aimed at a variety of readers, ranging mainly from linguistics students at senior undergraduate, Masters, or even PhD levels who are still unfamiliar with corpus linguistics, to language teachers or textbook developers who want to create or employ more real-life teaching materials. As many of the techniques we'll be dealing with here also allow us to investigate issues of style in both literary and non-literary text, and much of the data we'll initially use actually consists of fictional works because these are easier to obtain and often don't cause any copyright issues, the book should hopefully also be useful to students of literary stylistics. To some extent, I also hope it may be beneficial to computer scientists working on language processing tasks, who, at least in my experience, often lack some crucial knowledge in understanding the complexities and intricacies of language, and frequently tend to resort to mathematical methods when more linguistic (symbolic) ones would be more appropriate, even if these may make the process of writing ‘elegant’ and efficient algorithms more difficult.

You may also be asking yourself why you should still be using a textbook at all in this day and age, when there are so many video tutorials available, and most programs offer at least some sort of online help to get you started. Essentially, there are two main reasons for this: a) such sources of information are only designed to provide you with a basic overview, but don't actually teach you, simply demonstrating how things are done. In other words, they may do a relatively good job in showing you one or more ways of doing a few things, but often don't really allow you to use a particular program independently and for more complex tasks than the author of the tutorial/help file may actually have envisaged. And b) online tutorials, such as the ones on YouTube, may not only take a rather long time to (down)load, but might not even be (easily) accessible in some parts of the world at all, due to internet censorship.

If you're completely new to data analysis on the computer and working with – as opposed to simply opening and reading – different file types, some of the concepts and methods we'll discuss here may occasionally make you feel like you're doing computer science instead of working with language. This is, unfortunately, something you'll need to try and get used to, until you begin to understand the intricacies of working with language data on the computer better, and, by doing so, will also develop your understanding of the complexity inherent in language (data) itself. This is by no means an easy task, so working with this book, and thereby trying to develop a more complete understanding of language and how we can best analyse and describe it, be it for linguistic or language teaching purposes, will often require us to do some very careful reading and thinking about the points under discussion, so as to be able to develop and verify our own hypotheses about particular language features. However, doing so is well worth it, as you'll hopefully realise long before reaching the end of the book, as it opens up possibilities for understanding language that go far beyond a simple manual, small-scale, analysis of texts.

In order to achieve the aims of the book, we'll begin by discussing which types of data are already readily available, exploring ways of obtaining our own data, and developing an understanding of the nature of electronic documents and what may make them different from the more traditional types of printed documents we're all familiar with. This understanding will be developed further throughout the book, as we take a look at a number of computer programs that will help us to conduct our analyses at various levels, ranging from words to phrases, and to even larger units of text. At the same time, of course, we cannot ignore the fact that there may be issues in corpus linguistics related to lower levels, such as that of morphology, or even phonology. Having reached the end of the book, you'll hopefully be aware of many of the different issues involved in collecting and analysing a variety of linguistic – as well as literary – data on the computer, which potential problems and pitfalls you may encounter along the way, and ideally also how to deal with them efficiently. Before we start discussing these issues, though, let's take a few minutes to define the notion of (linguistic) data analysis properly.

1.1 Linguistic Data Analysis

1.1.1 What's data?

In general, we can probably see all different types of language manifestation as language data that we may want/need to investigate, but unfortunately, it's not always possible to easily capture all such ‘available’ material for analysis. This is why, apart from the ‘armchair’ data available through introspection (cf. Fillmore 1992: 35), we usually either have to collect our materials ourselves or use data that someone else has previously collected and provided in a suitable form, or at least a form that we can adapt to our needs with relative ease. In both of these approaches, there are inherent difficulties and problems to overcome, and therefore it's highly important to be aware of these limitations in preparing one's own research, be it in order to write a simple assignment, a BA dissertation, MA/PhD thesis, research paper, etc.

Before we move on to a more detailed discussion of the different forms of data, it's perhaps also necessary to clarify the term data itself a little more, in order to avoid any misunderstandings. The word itself originally comes from the plural of the Latin word datum, which literally means ‘(something) given’, but can usually be better translated as ‘fact’. In our case, the data we'll be discussing throughout this book will therefore represent the ‘facts of language’ we can observe. And although the term itself, technically speaking, is originally a plural form referring to the individual facts or features of language (and can be used like this), more often than not we tend to use it as a singular mass noun that represents an unspecified amount or body of such facts.

1.1.2 Forms of data

Essentially, linguistic data comes in two general forms, written or spoken. However, there are also intermediate categories, such as texts that are written to be spoken (e.g. lectures, plays, etc.), and which may therefore exhibit features that are in between the two clear-cut variants. The two main media types often require rather radically different ways of ‘recording’ and analysis, although at least some of the techniques for analysing written language can also be used for analysing transliterated or (orthographically) transcribed speech, as we'll see later when looking at some dialogue data. Beyond this distinction based on medium, there are of course other classification systems that can be applied to data, such as according to genre, register, text type, etc., although these distinctions are not always very clearly formalised and distinguished from one another, so that different scholars may sometimes be using distinct, but frequently also overlapping, terminology to represent similar things. For a more in-depth discussion of this, see Lee (2002).

To illustrate some of the differences between the various forms of language data we might encounter, let's take a look at some examples, taken from the Corpus of English Novels (CEN) and Corpus of Late Modern English Texts, version 3.0 (CLMET3.0; De Smet, 2005), respectively. To get more detailed information on these corpora, you can go to https://perswww.kuleuven.be/~u0044428/, but for our purposes here, it's sufficient for you to know that these are corpora that are mainly of interest to researchers interested in literary stylistic analyses or historical developments within the English language. However, as previously stated, throughout the book, we'll often resort to literary data to illustrate specific points related to both the mechanics of processing language and as examples of genuinely linguistic features. In addition to being fictional, this data will often not be contemporary, simply because much contemporary data is often subject to copyright. Once you understand more about corpora and how to collect and compile them yourself, though, you'll be able to gather your own contemporary data, should you wish to do so, and explore actual, modern language in use.

Apart from being useful examples of register differences, the extracts provided below also exhibit some characteristics that make them more difficult to process using the computer. We'll discuss these further below, but I've highlighted them with boxes here.

Sample A – from The Glimpses of the Moon by Edith Wharton, published 1922

IT rose for them--their honey-moon--over the waters of a lake so famed as the scene of romantic raptures that they were rather proud of not having been afraid to choose it as the setting of their own.

“It required a total lack of humour, or as great a gift for it as ours, to risk the experiment,” Susy Lansing opined, as they hung over the inevitable marble balustrade and watched their tutelary orb roll its magic carpet across the waters to their feet.

“Yes--or the loan of Strefford's villa,” her husband emended, glancing upward through the branches at a long low patch of paleness to which the moonlight was beginning to give the form of a white house-front.

Sample B – from Eminent Victorians by Lytton Strachey, published 1918

Preface

THE history of the Victorian Age will never be written; we know too much about it. For ignorance is the first requisite of the historian—ignorance, which simplifies and clarifies, which selects and omits, with a placid perfection unattainable by the highest art. Concerning the Age which has just passed, our fathers and our grandfathers have poured forth and accumulated so vast a quantity of information that the industry of a Ranke would be submerged by it, and the perspicacity of a Gibbon would quail before it. It is not by the direct method of a scrupulous narration that the explorer of the past can hope to depict that singular epoch. If he is wise, he will adopt a subtler strategy. He will attack his subject in unexpected places; he will fall upon the flank, or the rear; he will shoot a sudden, revealing searchlight into obscure recesses, hitherto undivined. He will row out over that great ocean of material, and lower down into it, here and there, a little bucket, which will bring up to the light of day some characteristic specimen, from those far depths, to be examined with a careful curiosity.

Sample C – from The Big Drum by Arthur Wing Pinero, published 1915

Noyes.

[Announcing Philip.] Mr. Mackworth.

Roope.

[A simple-looking gentleman of fifty, scrupulously attired—jumping up and shaking hands warmly with Philip as the servant withdraws.] My dear Phil!

Philip.

[A negligently—almost shabbily—dressed man in his late thirties, with a handsome but worn face.] My dear Robbie!

Roope.

A triumph, to have dragged you out! [Looking at his watch.] Luncheon isn't till a quarter-to-two. I asked you for half-past-one because I want to have a quiet little jaw with you beforehand.

Philip.

Delightful.

Roope.

Er—I'd better tell you at once, old chap, whom you'll meet here to-day.

Sample A is clearly a piece of narrative fiction, mixing narrative description and simulated reported speech, references to characters and situations that are depicted as life-like, as well as featuring a number of at least partly evaluative reporting verbs, such as opined and emended. Sample B, on the other hand, contains no reported speech and reporting verbs, although it's clearly also narrative – albeit non-fictional – with a relatively complex sentence structure, including numerous relative and adverbial clauses, and an overall high degree of formality. Sample C, in contrast, exhibits clear characteristics of (simulated) spoken language, much shorter and less complex syntax, even single-word ‘sentences’, with names, titles and informal terms of address (old chap) used when the characters are addressing/introducing each other, exclamations, contractions, and at least one hesitation marker (Er). And even though the language in the latter sample seems fairly natural, we can still easily see that it comes from a scripted text, partly because of the indication of speakers (which I've highlighted in bold-face), and partly due to the stage instructions included in square brackets.

As we haven't discussed any of the issues in processing such text samples yet, it may not be immediately obvious to you that these different types of register may potentially require different analysis approaches, depending on what our exact aims in analysing them are. For instance, for Sample A, do we want to conceptually treat the reported speech as being of the same status as the descriptive parts, and do we thus want to analyse them together or separately? Or are we possibly just interested in how the author represents the direct speech of the characters in the novel, and would therefore want to extract only that? And if so, how would we best go about this?

Sample B is probably relatively straightforward to analyse in terms of perhaps a frequency analysis of the words, but what if we're also interested in particular aspects of syntax or lexis that may be responsible for its textual complexity or the perceived level of formality, respectively? And, last but not least, concerning Sample C, similarly to Sample A, which parts of the text would we be interested in here and how would we extract them? Are the stage instructions equally important to us as the direct speech exchanges between the characters? Or, if, for example, we're interested in the average number of words uttered by each character, how do we deal with hesitation markers? Do we treat them as words or ‘non-words’ simply to be deleted? As I've already tried to hint at in the beginning of this paragraph, the answers to these questions really depend on our research purpose(s), and can thus not be conclusively stated here.
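Questions like these can often be answered mechanically once the structure of the text is understood. As a small foretaste of techniques covered in later chapters, here's a minimal sketch in Python (my own illustration; the book itself doesn't prescribe this approach) that separates the stage instructions in square brackets from the actual speech in a single turn from Sample C:

    import re

    # one turn from Sample C, with the stage instruction in square brackets
    turn = "[Looking at his watch.] Luncheon isn't till a quarter-to-two."

    # collect everything inside [...] as stage instructions ...
    instructions = re.findall(r"\[([^\]]*)\]", turn)
    # ... and strip it out to leave only the spoken words
    speech = re.sub(r"\[[^\]]*\]", "", turn).strip()

    print(instructions)  # ['Looking at his watch.']
    print(speech)        # Luncheon isn't till a quarter-to-two.

Whether we then count the instructions, the speech, or both depends entirely on the research question, which is exactly the point made above.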

Something else you may have noticed when looking at the samples I've provided above is that they're all from the early 20th century. As such, the language we encounter in them may sometimes appear overly formal (or even archaic) to some extent, compared to the perhaps more ‘colloquial’ language we're used to from the different registers these days. I've chosen extracts from these three particular texts and this period for a number of reasons: a) their authors all died more than 70 years ago, so the texts are in the public domain; in other words, there are no copyright issues, even when quoting longer passages; b) they are included in corpus compilations; and c) they not only illustrate register/genre differences but also how the conventions for these may change over time, as can be seen, for example, in the spelling of to-day in the final extract.

As pointed out above, another interesting aspect of these samples is that they exhibit particular formatting issues, which again may not be immediately apparent to you yet, but are due to somewhat bizarre typographical conventions. If you look closely at the samples, you can see that in Sample A there are double dashes marking the parenthetical counterpart (i.e. reference resolution) “their honey-moon” to the sentence-initial cataphoric pronoun “IT”. What is in fact problematic to some extent for processing the text is that these dashes actually look like double hyphens, i.e. they're not surrounded by spaces on either side, as would be the normal convention. Now, many computer programs designed to count words will split the input text on spaces and punctuation. Unfortunately, though, this would leave us with some very strange ‘words’ (that superficially look like hyphenated compounds), them--their and honey-moon--over, in any resulting word-frequency list. This is obviously something we don't want, as it introduces errors into any automatic analysis of the data. Something similar, albeit not to signal a parenthetical but instead some kind of pseudo-punctuation, happens again for “Yes--or” a little further down in the text. We can already see, therefore, from this relatively short sample of text that a failure to deal with this feature could cause issues in a number of places throughout the text. The same problem occurs in the other two samples, only that there the dash doesn't actually consist of two separate characters, but of a single em-dash.
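To make the problem concrete, here's a small Python sketch (my own illustration, not a recipe from the book) of what a naive whitespace-based word counter does with the Wharton sentence, and one possible repair, namely padding the double hyphens with spaces before splitting:

    import re

    text = "IT rose for them--their honey-moon--over the waters of a lake"

    # naive tokenisation: the unspaced double hyphens stay glued to the words
    print(text.split())
    # ['IT', 'rose', 'for', 'them--their', 'honey-moon--over', ...]

    # one possible fix: surround double hyphens with spaces before splitting
    cleaned = re.sub(r"--", " -- ", text)
    print(cleaned.split())
    # ['IT', 'rose', 'for', 'them', '--', 'their', 'honey-moon', '--', 'over', ...]

Note that a genuine hyphenated compound like honey-moon survives this treatment intact; only the pseudo-dashes are separated out.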

A different problem occurs in the use of initial capitals in Samples A and B. As you can see, the words it and the appear in capital letters throughout, signalling the beginning of the chapter typographically. Again, this causes no processing problems for us as ‘human consumers’ of the text, but for the computer, the, The, and THE are in fact three different ‘words’, or at least word forms. Thus, even single initial capitals at the beginning of sentences may become issues in identifying and counting words on the computer. We'll talk more about this type of issue in Section 4.4.1, where we'll explore ways of dealing with such features of the text in order to retain relatively ‘clean’ data.
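The following toy example (again my own sketch of the general point) shows how a frequency count treats the, The and THE as three distinct forms unless the case is normalised first:

    from collections import Counter

    words = "THE history of the Victorian Age and The explorer of the past".split()

    counts = Counter(words)
    print(counts["the"], counts["The"], counts["THE"])   # 2 1 1

    # lower-casing merges the three forms, at the price of losing genuine
    # case distinctions (e.g. proper names)
    print(Counter(w.lower() for w in words)["the"])      # 4

Blanket lower-casing is therefore itself a design decision, not a neutral clean-up step.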

1.1.3 Collecting and analysing data

When collecting our own data, we obviously need to consider methodologies that allow us to collect the right types and amount(s) of data to answer our particular research questions. This, however, isn't the only necessary consideration: we also need to bear in mind the ethical issues involved in the collection – such as asking people for permission to record them or to publish their recordings – and which format the data should be stored in so as to be most useful to us, and potentially also to other researchers.

When using other people's existing data, there are usually issues in accessing data stored in their specific format(s) or converting the data to a format that is more suitable to one's own needs, as we've just seen above, such as removing unwanted types of information or transforming overly specific information into simpler forms of representation. In this textbook, we'll also look at some of the important aspects of collecting or adapting data to one's needs, as well as how to go about analysing and presenting them in various ways, once a suitable format has been established.

In order to be able to work with electronic data, we also need to become familiar with a variety of different programs, some written specifically for linguistic analysis, some for more general purposes of working with texts. One of the key features of this book is that the programs I'll recommend to you are almost exclusively obtainable free of charge, i.e. so-called freeware. This doesn't mean that there aren't other excellent programs out there that may do some of the tasks we want to perform even better, or in simpler or more powerful ways, but simply reflects the fact that there are already many useful free programs available, and also my own philosophy that we shouldn't need to spend substantial amounts of money just to enable us to do research. This is at least part of the reason why I make most of my own programs available to the research community in this way, apart from the fact that this makes my own research (results) more easily reproducible by others, and therefore helps to satisfy the aims of accountability and academic honesty. For the sake of completeness, though, I'll generally try to at least refer to alternative commercial programs, but without discussing them in any detail.

Corpus linguistics, as a form of data analysis methodology, can of course be carried out on a number of different operating systems, so I'll also try to make recommendations as to which programs may be useful for the most commonly used ones, Windows, Mac OS X, and Linux. Because there are many different ‘flavours’ of Linux, though, with a variety of different windowing interfaces, I'll restrict my discussions to two of the most popular ones, KDE and Gnome. Unfortunately, I won't be able to provide detailed support on how to actually install the programs themselves, as this may sometimes involve relatively detailed information about your system that I cannot predict. Instead, however, I'll actually try to avoid/pre-empt such issues by recommending default programs that are probably already installed, provided that they do in fact fulfil all or at least most of our needs.

1.2 Outline of the Book

This book is organised into four sections. The first section (comprising Chapters 1 and 2) begins with a very brief introduction to the history and general design of corpora, simply to ‘set the scene’, rather than to provide an extensive coverage of the multitude of corpora that have been created for different purposes and possibly also made available for free or in the form of various types of interfaces. More extensive coverage of the subject, including more theoretical implications, is already provided in books like Kennedy (1998), Meyer (2002), or Lindquist (2009), so these texts can always be consulted for reference if necessary, and we can instead focus on more practical issues. For a more detailed discussion of some of the different ‘philosophical’ approaches to corpus linguistics, you can consult McEnery and Hardie (2012).

The introductory section is followed by an overview of different methods to compile and prepare corpora from available online resources, such as text archives or the WWW. This section (spanning Chapters 3 and 4) should essentially provide the basis for you to start building your own corpora, but also introduces you to various issues related to handling language on the computer, including explanations of different file types you may encounter or want to use, as well as certain types of meta-information about texts.

Section 3 (Chapters 5 to 10) then deals with different approaches to corpus-based linguistic data analysis, ranging from basic searching (concordancing) via learning about more complex linguistic patterns, expressed in the form of regular expressions, to simple and extended word (frequency) list analyses. This part already contains information on how to tag your data morpho-syntactically, using freely available tagging resources, and how to make use of tagging in your analyses. The final section then takes the notion of adding linguistic information to your data further, and illustrates how to enrich corpus data using basic forms of XML in order to cyclically improve your analyses or publish/visualise analysis results effectively.

As corpus linguistics is a methodology that allows us to develop insights into how language works by ‘consulting’ real-life data, it should be fairly obvious that we cannot learn how to do corpus research on a purely theoretical basis. Therefore, as far as possible, all sections of this book will be accompanied by practical exercises. Some of these will appear to be relatively straightforward, almost mechanical, ones where you simply get to follow a sequence of steps in order to learn how to use a specific function inside a program or web interface, while others are more explicitly designed to enable you to develop your own strategies for solving problems and testing hypotheses in linguistics. Please bear in mind, though, that for the former type of exercise, simply following the steps blindly without trying to understand why you're doing them will not allow you to learn properly. So, as far as possible, at each point you should try to understand what we're trying to achieve and how the particular program we're using only gives us a handle on producing the relevant data, but does not actually answer our research questions for us. In the same vein, it's also important to understand that once we actually have extracted some relevant data from a corpus, this is rarely ever the ‘final product’. Such data generally either still needs to be interpreted, filtered, or evaluated as to its usefulness, if necessary by (re-)adjusting the search strategy or initial hypotheses and/or conclusions, or, if it's to be used for more practical purposes, such as in the creation of teaching materials or exercises, to be brought into an appropriate form.

As we move on and you learn more and more techniques, the exercises will also get more complex, sometimes assuming the size of small research projects, if carried out in full detail. As a matter of fact, as these exercises require and consolidate a lot of the knowledge gained in prior sections, they might well be suitable for small research projects to be set by teachers, and possibly even form the basis of BA theses or MA dissertations.

Of course, you won't be left alone in figuring out the solutions to these exercises; both types will be solved at the end of each respective section, either in the form of detailed and precise explanations, or, whenever the results might be open to interpretation, by appropriate comments illustrating what you could/should be able to observe. For the more extensive exercises referred to in the previous paragraph, I'll often start you off with suitable explanations regarding the procedures to follow, and also hint at some potential issues that may arise, but will leave the completion up to you, to help you develop your awareness independently. Furthermore, as real corpus linguistics is not just about getting some ‘impressive’ numbers but should in fact allow you to gain real insights into different aspects of language, you should always try to relate your results to what you know from established theories and other methods used in linguistics, or even other related disciplines, such as for example sociology, psychology, etc., as far as they may be relevant to answering your research questions. This is also why the solutions to, and discussions of, the exercises may often represent those parts of the book that cover some of the more theoretical aspects of corpus linguistics, aspects that you'll hopefully be able to master once you've acquired the more practical tools of the trade. Thus, even if you may think you've already found a perfect answer to an exercise, you should probably still spend some time reading carefully through each solution.

As this textbook is more practical in nature than other textbooks on corpus linguistics, at the end of almost all chapters, I've also added a section entitled ‘Sources and Further Reading’. These sections essentially provide lists of references I've consulted and/or have found most useful and representative in illustrating the particular topic(s) discussed in the chapter. You can consult these references if you want to know more about theoretical or practical issues that I am unable to cover here, due to reasons of space. These sections may not necessarily contain highly up-to-date references, for the simple reason that, unfortunately, later writings may not always represent improvements over the more fundamental works produced in some of the areas covered. Once you understand more about corpus linguistics, though, you may want to consult the individual chapters in two of the recent handbooks, O'Keeffe & McCarthy (2010) and Lüdeling & Kytö (2008), so that you can evaluate the progress made over recent years yourself.

1.3 Conventions Used in this Book

In linguistics, there are many conventions that help us to distinguish between different levels of analysis and/or description, so as to better illustrate which different types of language phenomena we're dealing with at any given point in time. Throughout this book, I'm going to make use of many, if not most, of these conventions, so it's important to introduce them at this point. In addition to using these conventions as is done in linguistics, I may also use some of them to indicate special types of textual content relevant to the presentation of resources in this book, etc.

Double quotes (“…”) indicate direct speech or short passages quoted from books.

Single quotes (‘…’) signal that an expression is being used in an unusual or unconventional way, that we're referring to the meaning of a word or construction on the semantic level, or that we're referring to menu items or sometimes button text in the programs used. The latter may also be represented by stylised button text.

Curly brackets ({…}) are used to represent information pertaining to the level of morphology.

Angle brackets (<…>) indicate that we're dealing with issues related to orthography or spelling. Alternatively, they're also used in certain types of linguistic annotation.

Forward slashes/square brackets generally indicate that we're discussing issues on the levels of phonology or phonetics. Within quoted material, they may also signal amendments to the original material made in order to fit it into the general sentence structure.

Italics are used to represent words or expressions, sometimes whole sentences, that illustrate language materials under discussion. In some cases, they may also be used to indicate emphasis/highlighting, especially if co-occurring with boldface.

Small caps are used to indicate lemmas, i.e. forms that allow us to conveniently refer to all instances of a verb, noun, etc.

Monospaced font indicates instructions/text to be typed into the computer, such as a search string or regular expression.

1.4 A Note for Teachers

The relatively low number of chapters may make this book appear deceptively short, and you might be wondering whether it would be suitable for a course that runs for a whole semester of up to 18 weeks. There's no need to worry, though: you won't necessarily have to supplement it with further materials, although this is of course possible.

The sections and chapters of the book have been arranged to be thematically coherent, but, if you're planning to use it as a textbook in class, you'll frequently find that one chapter corresponds to more than one classroom unit. I'd therefore suggest that, while preparing specific topics, even – or especially – if you may already be an expert in the field, you at least try out the exercises carefully yourself, and then attempt to gauge how long it may take your students to carry them out. If your audience is already highly technically literate and has a strong background in linguistics, then obviously the exercises can be done much more quickly. If, on the other hand, your students are somewhat ‘technophobic’ or do not yet have a strong background in linguistics, you may either wish to spread the content over multiple units, or set at least some of the exercises as homework. In order to save time, you can also ask your course participants to perform certain preparatory tasks, such as downloading and installing different pieces of software, or registering for online resources, prior to coming to class.

1.5 Online Resources

This book also has an accompanying web page, where you'll be able to find some online exercises, links to my own software, updated information about programs or features discussed in the book, etc. The web address for this page is http://martinweisser.org/pract_cl/online_materials.html, and you'll probably want to bookmark this straight away, so that you'll be able to access it for future reference.

All my own software is provided under GPL 3 licence, so you can download and distribute it freely. The programs were originally designed to run on Windows, but can easily be used through Wine (https://www.winehq.org/) on Mac OS X or Linux. Additional information on how to do this can be found at http://martinweisser.org/ling_soft.html.

CHAPTER 2 What's Out There? A General Introduction to Corpora

2.1 What's a Corpus?

A corpus (pl. corpora) is a collection of spoken or written texts to be used for linguistic analysis and based on a specific set of design criteria influenced by its purpose and scope. There are various, and sometimes conflicting, definitions in the relevant literature (cf. e.g. Kennedy, 1998: 3 or Meyer, 2002: xi) as to what exactly constitutes a corpus, but for our purposes, we'll adopt the relatively simple and straightforward one given above. This basically means that any collection of texts that has been systematically assembled in order to investigate one or more linguistic phenomena can be termed a corpus, even if it may only contain a handful of classroom transcripts, interviews, or plays.

Although, theoretically, corpora can simply consist of texts that are in non-electronic form, and indeed some of the earliest corpora were just collections of index cards or slips of paper (McCarthy & O'Keeffe 2010: 4), these days, almost all corpora in use are computerised. When we talk about corpora from now on, we'll thus always be referring to computerised ones, unless otherwise stated.

2.2 Corpus Formats

Most corpora – unless they're only accessible through an online interface – are stored in plain-text format (to be explained in more detail in Section 3.2.3) and can therefore easily be viewed using any basic text editor, but if a corpus, for example, contains phonetic transcriptions, then of course a specialised typeface (font) may be required in order to view it. Complications may also arise if the character set is not directly supported by the computer the corpus is viewed on. This may for example happen when the corpus is in a language that uses a different alphabet from the standard Western European ones that are supported on all computers by default. Even displaying European languages, such as Greek, may represent a (minor) problem, but the real difficulties start when one wants to work with East Asian character sets (such as for Chinese, Japanese, Korean & Vietnamese), Indic languages (such as Hindi or Urdu), or right-to-left languages like Arabic and Hebrew. For a simple simulation of this, see the online resources at: http://martinweisser.org/pract_cl/encoding.html. In order to overcome these encoding problems, these days more and more corpora are encoded in one of the Unicode encodings, most commonly UTF-8. To illustrate this, let's take a look at an extract from the Helsinki Corpus (see Section 2.3.3) of the first lines of the Old English Beowulf epic, presented in the original (slightly modified) representation in the corpus and a modern recoding in UTF-8.

As we can see from the above examples, the original version from the Helsinki Corpus is much more difficult to read because we always need to mentally replace each transliterated character by the appropriate Old English one. Furthermore, emendations, i.e. corrections to the original text, are indicated in the corpus by surrounding the text with a set of opening and closing square and curly brackets ([{…}]), which again distracts while reading, so that I've rendered the modified text in bold and italic formatting to make it easier to read.
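The practical effect of the encoding choice can be simulated in a few lines of Python (an illustrative sketch of my own, not part of the book's toolset): the snippet encodes the recoded Beowulf opening as UTF-8 and then shows what happens when a program misreads those same bytes as Latin-1:

    text = "Hwæt. We Gardena in geardagum, þeodcyninga, þrym gefrunon."

    data = text.encode("utf-8")    # the bytes as they would be stored on disk

    print(data.decode("utf-8"))    # correct round trip
    print(data.decode("latin-1"))  # same bytes misread: 'Ã¾', 'Ã¦', etc. appear

This kind of garbling is exactly what you'll see when a corpus file is opened under the wrong encoding assumption.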

Another way to store a corpus is to save it into a database. This makes it more difficult to view and process without having the appropriate database management system (DBMS) installed or if a web-based online interface isn't working as expected, due to browser issues or download speed restrictions. On the other hand, though, this makes it possible for the basic text to be linked to various types of data-enriching annotations, as well as to perform more complex search operations, or to store intermediate or final results of such searches for different users and for quicker access or export later. We'll experience the advantages of this when we set up/work with accounts for access to some web-based corpus interfaces, such as BNCweb or COCA.
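As a rough illustration of the principle (a toy sketch of my own; this is not how BNCweb or the COCA interface are actually implemented), a corpus stored in a relational database keeps each token in a table that annotations can be attached to and queried directly:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE token (id INTEGER PRIMARY KEY, word TEXT, pos TEXT)")
    # hypothetical tokens with CLAWS C5-style part-of-speech tags
    con.executemany("INSERT INTO token (word, pos) VALUES (?, ?)",
                    [("the", "AT0"), ("moon", "NN1"), ("rose", "VVD")])

    # e.g. retrieve all singular common nouns in a single indexed lookup
    print(con.execute("SELECT word FROM token WHERE pos = 'NN1'").fetchall())

Storing intermediate query results for individual users, as BNCweb does, follows the same logic on a much larger scale.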

2.3 Synchronic vs. Diachronic Corpora

Corpora can be designed and used for synchronic (i.e. ‘contemporary’) and diachronic (i.e. ‘historical’/comparative) studies. Different issues may apply to the design of these two types of corpora. For instance, historical corpora may contain old-fashioned or unfamiliar words and spellings or a large number of spelling variants (e.g. yeare, hee, generalitie, it selfe, etc.), as well as possibly even characters (letters) that no longer exist in a modern alphabet, such as the Old English thorn (þ), which we've already encountered in the Beowulf extract in Table 2.1.

Historical corpora are also by nature restricted to written materials because there are no recordings of native speakers of Old or Middle English in existence. Furthermore, the restriction does not only apply to the types of material available but also to the amount of data we can draw on because, in former times, there simply wasn't such a wealth of documents available, or from as many different sources as we have today.

2.3.1 ‘Early’ synchronic corpora

Another major distinction between different types of corpora is whether they comprise spoken or written data. This is an extremely important distinction because written language generally tends to be far easier to process than spoken language, as it does not usually contain fillers, hesitations, false starts or ungrammatical constructs. When creating a spoken corpus, one also needs to think about whether an orthographic representation of the text will be sufficient, whether the corpus should be represented in phonetic transcription, or whether it should support annotation on various different levels (see Chapter 11).

Initially, computerised language corpora tended to contain only written language, which was easier to obtain, and presumably also deemed to be more important than spoken language, a notion that unfortunately still seems to be all-too-prevalent in our often ‘literature-focussed’ society and education.

2.3.1.1 Written corpora

Let's start our investigation into the nature of corpora with a look at some of the earliest written ones, accompanied by a short exercise to sensitise you towards certain issues. At the time these first corpora were created, one million words still seemed like a huge amount of data, partly because computers in those days had a hard time handling even this amount, and partly because no-one had ever had such easy access to so much language data in electronic form before.

Table 2.2 provides a brief overview of some of these early corpora. The web addresses given with each corpus link to the online versions of the respective manuals.

Table 2.1 Extract from Beowulf, encoded/represented in two different ways

Original (slightly modified):

[} [\BEOWULF\] }]

Hw+at. We Gardena in geardagum, +teodcyninga, +trym gefrunon, hu +da +a+telingas ellen fremedon. Oft Scyld Scefing [{scea+tena{] +treatum, monegum m+ag+tum, meodosetla ofteah, egsode eorlas. Sy+d+dan +arest [{wear+d{] feasceaft funden, he +t+as frofre gebad, weox under wolcnum, weor+dmyndum +tah, o+d+t+at him +aghwylc +tara ymbsittendra ofer hronrade hyran scolde, gomban gyldan. +t+at w+as god cyning. +d+am eafera w+as +after cenned, geong in geardum, +tone god sende folce to frofre; fyren+dearfe ongeat +te hie +ar drugon [{aldorlease{] lange hwile. Him +t+as liffrea, wuldres wealdend, woroldare forgeaf; Beowulf w+as breme bl+ad wide [{sprang{], Scyldes eafera Scedelandum in. Swa sceal [{geong{] [{guma{] gode gewyrcean, fromum feohgiftum on f+ader [{bearme{], +t+at hine on ylde eft gewunigen wilgesi+tas, +tonne wig cume, leode gel+asten; lofd+adum sceal in m+ag+ta gehw+are man ge+teon.

Re-coded Version:

BEOWULF

Hwæt. We Gardena in geardagum, þeodcyninga, þrym gefrunon, hu ða æþelingas ellen fremedon. Oft Scyld Scefing sceaþena þreatum, monegum mægþum, meodosetla ofteah, egsode eorlas. Syððan ærest wearð feasceaft funden, he þæs frofre gebad, weox under wolcnum, weorðmyndum þah, oðþæt him æghwylc þara ymbsittendra ofer hronrade hyran scolde, gomban gyldan. þæt wæs god cyning. ðæm eafera wæs æfter cenned, geong in geardum, þone god sende folce to frofre; fyrenðearfe ongeat þe hie ær drugon aldorlease lange hwile. Him þæs liffrea, wuldres wealdend, woroldare forgeaf; Beowulf wæs breme blæd wide sprang, Scyldes eafera Scedelandum in. Swa sceal geong guma gode gewyrcean, fromum feohgiftum on fæder bearme, þæt hine on ylde eft gewunigen wilgesiþas, þonne wig cume, leode gelæsten; lofdædum sceal in mægþa gehwære man geþeon.

Table 2.2 Early written corpora (each ca. 1 million words)

Brown – http://clu.uni.no/icame/manuals/BROWN/INDEX.HTM (available from https://archive.org/details/BrownCorpus): first-ever computerised corpus, published in 1964; written American English.

LOB (Lancaster-Oslo-Bergen) – http://clu.uni.no/icame/manuals/LOB/INDEX.HTM: published in 1978; British counterpart to Brown.

Frown – http://clu.uni.no/icame/manuals/FROWN/INDEX.HTM: published in 1999; 90s counterpart to Brown.

FLOB – http://clu.uni.no/icame/manuals/FLOB/INDEX.HTM: published in 1998; 90s counterpart to LOB.

Kolhapur – http://clu.uni.no/icame/manuals/KOLHAPUR/INDEX.HTM: published in 1978; written Indian English.

ACE (Australian Corpus of English) – http://clu.uni.no/icame/manuals/ACE/INDEX.HTM: compiled from 1986; also known as the ‘Macquarie Corpus’.

Wellington Corpus of Written New Zealand English – http://clu.uni.no/icame/manuals/WELLMAN/INDEX.HTM: published in 1993.

A complete set of all manuals of corpora distributed by the ICAME (International Computer Archive of Modern and Medieval English) can also be downloaded from http://clu.uni.no/icame/icamemanuals.html.

Exercise 1

Table 2.3 illustrates the composition of the Brown Corpus. Take a look at this and try to see whether you can gain some insights into the nature of the corpus in terms of its categories and the number of texts it comprises.

Table 2.3 Composition of the Brown Corpus (label: text category/genre, no. of texts)

A: Press: Reportage (44 texts)
B: Press: Editorial (27 texts)
C: Press: Reviews (17 texts)
D: Religion (17 texts)
E: Skills & Hobbies (36 texts)
F: Popular Lore (48 texts)
G: Belles Lettres, Biography, Essays (75 texts)
H: Miscellaneous: Government Documents, Foundation Reports, Industry Reports, College Catalogue, Industry House Organ (30 texts)
J: Learned (80 texts)
K: General Fiction (29 texts)
L: Mystery & Detective Fiction (24 texts)
M: Science Fiction (6 texts)
N: Adventure & Western Fiction (29 texts)
P: Romance & Love Story (29 texts)
R: Humour (9 texts)

What kind of language would you expect inside the different categories, and can you identify anything particularly interesting regarding them?

If you're already planning a research project, do you think data from these will fit your particular needs and help you to answer your research questions?

Once you've done this, also open some of the links to other manuals given in Table 2.2 and compare the composition of these corpora to the Brown Corpus.

Compare the categories in the Brown Corpus to those of the Australian Corpus of English directly. Can you pinpoint the slight cultural difference?

You may have noticed that some letters are missing from the categorisation scheme, notably I, O, and Q. This is probably because the uppercase letter I can easily be confused with the number 1 or lowercase l, and uppercase O with the number 0. I have no logical explanation why Q is not used, unless the assumption is that Q is similar to O, and hence the same options for confusion may arise.

2.3.1.2 Spoken corpora

Next, let's take a look at a selection of important spoken corpora to develop a better understanding of where the differences between written and spoken corpora lie.

Exercise 2

Compare the categories of the SEC to one of the written corpora in Table 2.2 and try to see why/whether those different categories may be important for/representative of written and spoken language, respectively.

Look through the online page for the LLC and try to understand what types of additional information may need to be represented (encoded) in spoken corpora.

2.3.2 Mixed corpora

Mixed corpora try to establish some kind of balance between spoken and written language in order to be more representative of language in general. Earlier mixed corpora were still comparatively small in size, but this has changed with the advent of improved data collection methods and the ensuing creation of mega corpora that now run into hundreds of millions of words.

2.3.2.1 Early mixed corpora

Exercise 3

Open the ICE website and navigate to the ‘Corpus Design’ page. In light of the information you've already come across for individual written and spoken corpora, try to evaluate how similar/different the composition of the ICE corpora is to/from these ‘traditional’ corpora.

2.3.2.2 Modern mega corpora and national corpora

With the use of corpora becoming more popular, and techniques for data analysis improving, researchers soon realised that corpora of one million words were not nearly large enough for observing all interesting linguistic phenomena, especially not those that involve idiomatic structures or collocations (see Chapter 10). Especially for investigating rarer features of the language, the basic notion thus seems to be ‘the bigger, the better’, and so researchers, often supported by publishing houses that wanted to create better dictionaries or textbooks, began to compile corpora of 100 million words or more.

Such corpora, due to their rather large size, are of course more difficult to process on our own computers, and may not even be easy or affordable to obtain for individual research on a smaller scale. However, the latter issue is often not such a big problem after all because most openly accessible mega corpora these days provide online interfaces that users can sign up for free of charge. These interfaces will be covered more extensively in later chapters.