Quantitative Methods in Linguistics offers a practical introduction to statistics and quantitative analysis, with data sets drawn from the field. It covers phonetics, psycholinguistics, sociolinguistics, historical linguistics, and syntax, as well as probability distributions and quantitative methods.
Number of pages: 374
Year of publication: 2011
Contents
Acknowledgments
Design of the Book
1 Fundamentals of Quantitative Analysis
1.1 What We Accomplish in Quantitative Analysis
1.2 How to Describe an Observation
1.3 Frequency Distributions: A Fundamental Building Block of Quantitative Analysis
1.4 Types of Distributions
1.5 Is Normal Data, Well, Normal?
1.6 Measures of Central Tendency
1.7 Measures of Dispersion
1.8 Standard Deviation of the Normal Distribution
EXERCISES
2 Patterns and Tests
2.1 Sampling
2.2 Data
2.3 Hypothesis Testing
2.4 Correlation
EXERCISES
3 Phonetics
3.1 Comparing Mean Values
3.2 Predicting the Back of the Tongue from the Front: Multiple Regression
3.3 Tongue Shape Factors: Principal Components Analysis
EXERCISES
4 Psycholinguistics
4.1 Analysis of Variance: One Factor, More than Two Levels
4.2 Two Factors: Interaction
4.3 Repeated Measures
4.4 The “Language as Fixed Effect” Fallacy
EXERCISES
5 Sociolinguistics
5.1 When the Data are Counts: Contingency Tables
5.2 Working with Probabilities: The Binomial Distribution
5.3 An Aside about Maximum Likelihood Estimation
5.4 Logistic Regression
5.5 An Example from the [∫]treets of Columbus
5.6 Logistic Regression as Regression: An Ordinal Effect – Age
5.7 Varbrul/R Comparison
EXERCISES
6 Historical Linguistics
6.1 Cladistics: Where Linguistics and Evolutionary Biology Meet
6.2 Clustering on the Basis of Shared Vocabulary
6.3 Cladistic Analysis: Combining Character-Based Subtrees
6.4 Clustering on the Basis of Spelling Similarity
6.5 Multidimensional Scaling: A Language Similarity Space
EXERCISES
7 Syntax
7.1 Measuring Sentence Acceptability
7.2 A Psychogrammatical Law?
7.3 Linear Mixed Effects in the Syntactic Expression of Agents in English
7.4 Predicting the Dative Alternation: Logistic Modeling of Syntactic Corpora Data
EXERCISES
Appendix 7A
References
Index
For Erin
© 2008 by Keith Johnson
BLACKWELL PUBLISHING
350 Main Street, Malden, MA 02148-5020, USA
9600 Garsington Road, Oxford OX4 2DQ, UK
550 Swanston Street, Carlton, Victoria 3053, Australia
The right of Keith Johnson to be identified as the author of this work has been asserted in accordance with the UK Copyright, Designs, and Patents Act 1988.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs, and Patents Act 1988, without the prior permission of the publisher.
Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks, or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book.
This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold on the understanding that the publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional should be sought.
First published 2008 by Blackwell Publishing Ltd
1 2008
Library of Congress Cataloging-in-Publication Data
Johnson, Keith, 1958–
Quantitative methods in linguistics/Keith Johnson.
p. cm.
Includes bibliographical references and index.
ISBN 978-1-4051-4424-7 (hardcover: alk. paper) —
ISBN 978-1-4051-4425-4 (pbk.: alk. paper) 1. Linguistics—Statistical methods. I. Title.
P138.5.J64 2008
401.2’1—dc22
2007045515
ISBN-13: 978-1-4051-4424-7 (hardback)
ISBN-13: 978-1-4051-6181-7 (paperback)
A catalogue record for this title is available from the British Library.
Set in Palatino 10/12.5 by Graphicraft Limited, Hong Kong
Printed and bound in Singapore by Utopia Press Pte Ltd
The publisher’s policy is to use permanent paper from mills that operate a sustainable forestry policy, and which has been manufactured from pulp processed using acid-free and elementary chlorine-free practices. Furthermore, the publisher ensures that the text paper and cover board used have met acceptable environmental accreditation standards.
For further information on Blackwell Publishing, visit our website at www.blackwellpublishing.com
Acknowledgments
This book began at Ohio State University and Mary Beckman is largely responsible for the fact that I wrote it. She established a course in “Quantitative Methods in Linguistics” which I also got to teach a few times. Her influence on my approach to quantitative methods can be found throughout this book and in my own research studies, and of course I am very grateful to her for all of the many ways that she has encouraged me and taught me over the years.
I am also very grateful to a number of colleagues from a variety of institutions who have given me feedback on this volume, including: Susanne Gahl, Chris Manning, Christine Mooshammer, Geoff Nicholls, Gerald Penn, Bonny Sands, and a UC San Diego student reading group led by Klinton Bicknell. Students at Ohio State also helped sharpen the text and exercises – particularly Kathleen Currie-Hall, Matt Makashay, Grant McGuire, and Steve Winters. I appreciate their feedback on earlier handouts and drafts of chapters. Grant has also taught me some R graphing strategies. I am very grateful to UC Berkeley students Molly Babel, Russell Lee-Goldman, and Reiko Kataoka for their feedback on several of the exercises and chapters. Shira Katseff deserves special mention for reading the entire manuscript during fall 2006, offering copy-editing and substantive feedback. This was extremely valuable detailed attention – thanks! I am especially grateful to OSU students Amanda Boomershine, Hope Dawson, Robin Dodsworth, and David Durian who not only offered comments on chapters but also donated data sets from their own very interesting research projects. Additionally, I am very grateful to Joan Bresnan, Beth Hume, Barbara Luka, and Mark Pitt for sharing data sets for this book. The generosity and openness of all of these “data donors” is a high standard of research integrity. Of course, they are not responsible for any mistakes that I may have made with their data. I wish that I could have followed the recommendation of Johanna Nichols and Balthasar Bickel to add a chapter on typology. They were great, donating a data set and a number of observations and suggestions, but in the end I ran out of time. I hope that there will be a second edition of this book so I can include typology – and perhaps by then some other areas of linguistic research as well.
Finally, I would like to thank Nancy Dick-Atkinson for sharing her cabin in Maine with us in the summer of 2006, and Michael for the whiffle-ball breaks. What a nice place to work!
Design of the Book
One thing that I learned in writing this book is that I had been wrongly assuming that we phoneticians were the main users of quantitative methods in linguistics. I discovered that some of the most sophisticated and interesting quantitative techniques for doing linguistics are being developed by sociolinguists, historical linguists, and syntacticians. So, I have tried with this book to present a relatively representative and usable introduction to current quantitative research across many different subdisciplines within linguistics.[1]
The first chapter, “Fundamentals of quantitative analysis,” is an overview of, well, fundamental concepts that come up in the remainder of the book. Much of this will be review for students who have taken a general statistics course. The discussion of probability distributions in this chapter is key. Least-squares statistics (the mean and standard deviation) are also introduced.
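As a minimal sketch of what "least-squares" means for the mean and standard deviation: the mean is the value that minimizes the sum of squared deviations from the data, and the standard deviation is built from that minimized sum. The example is written in Python purely for illustration (the book's own examples use R), and the data values and the helper name `sum_of_squares` are made up for this sketch.

```python
import math
import statistics


def sum_of_squares(data, center):
    """Sum of squared deviations of the data from a candidate center."""
    return sum((x - center) ** 2 for x in data)


# Hypothetical measurements, chosen only to keep the arithmetic simple.
data = [2.0, 4.0, 4.0, 5.0, 10.0]

mean = statistics.mean(data)
ss_at_mean = sum_of_squares(data, mean)

# Any other candidate center gives a larger sum of squared deviations.
candidates = [mean + shift for shift in (-1.0, -0.5, 0.5, 1.0)]
ss_at_candidates = [sum_of_squares(data, c) for c in candidates]

# The sample standard deviation is the square root of the minimized
# sum of squares divided by n - 1.
manual_sd = math.sqrt(ss_at_mean / (len(data) - 1))
library_sd = statistics.stdev(data)

print("mean:", mean)
print("sum of squares at the mean:", ss_at_mean)
print("sum of squares at other centers:", ss_at_candidates)
print("standard deviation (manual, library):", manual_sd, library_sd)
```

Comparing the sum of squares at the mean with the sums at nearby centers is only a spot check, not a proof, but it shows the sense in which the mean is the "best" single summary value under a squared-error criterion.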
The remainder of the chapters introduce a variety of statistical methods in two thematic organizations. First, the chapters (after the second general chapter on “Patterns and tests”) are organized by linguistic subdiscipline – phonetics, psycholinguistics, sociolinguistics, historical linguistics, and syntax.
This organization provides some familiar landmarks for students and a convenient backdrop for the other organization of the book, which centers on an escalating degree of modeling complexity culminating in the analysis of syntactic data. To be sure, the chapters do explore some of the specialized methods that are used in particular disciplines (such as principal components analysis in phonetics and cladistics in historical linguistics), but I have also attempted to develop a coherent progression of model complexity in the book.
Thus, students who are especially interested in phonetics are well advised to study the syntax chapter because the methods introduced there are more sophisticated and potentially more useful in phonetic research than the methods discussed in the phonetics chapter! Similarly, the syntactician will find the phonetics chapter to be a useful precursor to the methods introduced finally in the syntax chapter.
The usual statistics textbook introduction suggests what parts of the book can be skipped without a significant loss of comprehension. However, rather than suggest that you ignore parts of what I have written here (naturally, I think that it was all worth writing, and I hope it will be worth your reading), I refer you to Table 0.1, which shows the continuity that I see among the chapters.
The book examines several different methods for testing research hypotheses. These focus on building statistical models and evaluating them against one or more sets of data. The models discussed in the book include the simple t-test, which is introduced in Chapter 2 and elaborated in Chapter 3; analysis of variance (Chapter 4); logistic regression (Chapter 5); and linear mixed effects models and logistic linear mixed effects models (Chapter 7). The progression here is from simple to complex. Several methods for discovering patterns in data are also discussed in the book (in Chapters 2, 3, and 6), again in progression from simpler to more complex. One theme of the book is that despite our different research questions and methodologies, the statistical methods that are employed in modeling linguistic data are quite coherent across subdisciplines and indeed are the same methods that are used in scientific inquiry more generally. I think that one measure of the success of this book will be if the student can move from this introduction, which is oriented explicitly around linguistic data, to more general statistics reference books. If you are able to make this transition I think I will have succeeded in helping you connect your work to the larger context of general scientific inquiry.
Table 0.1 The design of the book as a function of statistical approach (hypothesis testing vs. pattern discovery), type of data, and type of predictor variables.
A Note about Software
One thing that you should be concerned with in using a book that devotes space to learning how to use a particular software package is that some software programs change at a relatively rapid pace.
In this book, I chose to focus on a software package (called “R”) that is distributed under the GNU General Public License. This means that the software is maintained and developed by a user community and is distributed not for profit (students can get it on their home computers at no charge). It is serious software. R implements the S language, which was originally developed at AT&T Bell Labs, and it is used extensively in medical research, engineering, and science. This is significant because community-developed software of this kind tends to be more stable than commercially available software: revisions of the software come out because the user community needs changes, not because the company needs cash. There are also a number of electronic discussion lists and manuals covering various specific techniques using R. You’ll find these resources at the R project web page (http://www.r-project.org).
At various points in the text you will find short tangential sections called “R notes.” I use the R notes to give you, in detail, the command language that was used to produce the graphs or calculate the statistics that are being discussed in the main text. These commands have been student tested using the data and scripts that are available at the book web page, and it should be possible to copy the commands verbatim into an open session of R and reproduce for yourself the results that you find in the text. The aim of course is to reduce the R learning curve a bit so you can apply the concepts of the book as quickly as possible to your own data analysis and visualization problems.
Contents of the Book Web Site
The data sets and scripts that are used as examples in this book are available for free download at the publisher’s web site – www.blackwellpublishing.com. The full listing of the available electronic resources is reproduced here so you will know what you can get from the publisher.
Chapter 2 Patterns and Tests
Script: Figure 2.1.
Script: The central limit function from a uniform distribution (central.limit.unif).
Script: The central limit function from a skewed distribution (central.limit).
Script: The central limit function from a normal distribution.
Script: Figure 2.5.
Script: Figure 2.6 (shade.tails)
Data: Male and female F1 frequency data (F1_data.txt).
Script: Explore the chi-square distribution (chisq).
Chapter 3 Phonetics
Data: Cherokee voice onset times (cherokeeVOT.txt).
Data: The tongue shape data (chaindata.txt).
Script: Commands to calculate and plot the first principal component of tongue shape.
Script: Explore the F distribution (shade.tails.df).
Data: Made-up regression example (regression.txt).
Chapter 4 Psycholinguistics
Data: One observation of phonological priming per listener from Pitt and Shoaf’s (2002) study.
Data: One observation per listener from two groups (overlap versus no overlap) from Pitt and Shoaf’s study.
Data: Hypothetical data to illustrate repeated measures analysis.
Data: The full Pitt and Shoaf data set.
Data: Reaction time data on perception of flap, /d/, and eth by Spanish-speaking and English-speaking listeners.
Data: Luka and Barsalou (2005) “by subjects” data.
Data: Luka and Barsalou (2005) “by items” data.
Data: Boomershine’s dialect identification data for exercise 5.
Chapter 5 Sociolinguistics
Data: Robin Dodsworth’s preliminary data on /l/ vocalization in Worthington, Ohio.
Data: Data from David Durian’s rapid anonymous survey on /str/ in Columbus, Ohio.
Data: Hope Dawson’s Sanskrit data.
Chapter 6 Historical Linguistics
Script: A script that draws Figure 6.1.
Data: Dyen, Kruskal, and Black’s (1984) distance matrix for 84 Indo-European languages based on the percentage of cognate words between languages.
Data: A subset of the Dyen et al. (1984) data coded as input to the Phylip program “pars.”
Data: IE-lists.txt: A version of the Dyen et al. word lists that is readable in the scripts below.
Script: make_dist: This Perl script tabulates all of the letters used in the Dyen et al. word lists.
Script: get_IE_distance: This Perl script implements the “spelling distance” metric that was used to calculate distances between words in the Dyen et al. list.
Script: make_matrix: Another Perl script. This one takes the output of get_IE_distance and writes it back out as a matrix that R can easily read.
Data: A distance matrix produced from the spellings of words in the Dyen et al. (1984) data set.
Data: Distance matrix for eight Bantu languages from the Tanzanian Language Survey.
Data: A phonetic distance matrix of Bantu languages from Ladefoged, Glick, and Criper (1971).
Data: The TLS Bantu data arranged as input for phylogenetic parsimony analysis using the Phylip program pars.
Chapter 7 Syntax
Data: Results from a magnitude estimation study.
Data: Verb argument data from CoNLL-2005.
Script: Cross-validation of linear mixed effects models.
Data: Bresnan et al.’s (2007) dative alternation data.
[1] I hasten to add that, even though there is very much to be gained by studying techniques in natural language processing (NLP), this book is not a language engineering book. For a very authoritative introduction to NLP I would recommend Manning and Schütze’s Foundations of Statistical Natural Language Processing (1999).
