Thoroughly revised and updated, the second edition of The Handbook of Phonetic Sciences provides an authoritative account of the key topics in both theoretical and applied areas of speech communication, written by an international team of leading scholars and practitioners.
Page count: 1865
Publication year: 2012
Table of Contents
Cover
Praise for The Handbook of Phonetic Sciences
Blackwell Handbooks in Linguistics
Title page
Copyright page
Dedication
Contributors
Preface to the Second Edition
Introduction
Part I: Experimental Phonetics
1 Laboratory Techniques for Investigating Speech Articulation
1 Imaging Techniques
2 Point-Tracking Measurements of the Vocal Tract
3 Measurement of Tongue–Palate Interaction
2 The Aerodynamics of Speech
1 Introduction
2 Basic Considerations
3 Aerodynamically Distinct Tract Behaviors
4 Measurement Methods
5 Models Incorporating Aerodynamics
APPENDIX: CONSTANTS AND CONVERSION FACTORS
3 Acoustic Phonetics
1 Introduction
2 Vowels, Vowel-Like Sounds, and Formants
3 Obstruents
4 Nasal Consonants and Nasalized Vowels
5 Concluding Comment
ACKNOWLEDGMENTS
4 Investigating the Physiology of Laryngeal Structures
1 Introduction: Basic Laryngeal Functions
2 Methods of Investigating Laryngeal Function in Speech
3 Laryngeal Structures and the Control of Phonation
4 Laryngeal Adjustments for Different Phonetic Conditions
5 Current Main Issues and the Direction of Future Research
Part II: Biological Perspectives
5 Organic Variation of the Vocal Apparatus
1 Introduction
2 Life-Cycle Changes in the Vocal Apparatus
3 Interpersonal Variation
4 Variation Resulting from Trauma or Disease
5 Conclusion
6 Brain Mechanisms Underlying Speech Motor Control
1 Introduction
2 Macro- and Microstructural Characteristics of the Brain in Subhuman Primates and Man
3 Acoustic Communication in Monkeys and Apes
4 Cerebral Representation of Orofacial and Laryngeal Musculature in Subhuman Primates
5 Morphological Asymmetries of Primary and Nonprimary Motor Areas in Subhuman Primates and Man
6 Cortical Maps of Vocal Tract Muscle Representation in Humans
7 Electro- and Magnetoencephalographic Measurements of the Time Course of Brain Activity Related to Speech Production
8 Clinical Data: Compromised Motor Aspects of Speech Production in Focal Brain Lesions and Degenerative Diseases of the Central Nervous System (CNS)
9 Cerebral Networks of Speech Motor Control: Functional Hemodynamic Imaging Studies
10 Conclusions
7 Development of Neural Control of Orofacial Movements for Speech
1 Introduction
2 Components of Articulatory Motor Control
3 Development of Speech Motor Processes
Part III: Modeling Speech Production and Perception
8 Speech Acquisition
1 The Problem
2 The Broader Context
3 Contemporary Theoretical Perspectives on Speech Acquisition
4 Summary
9 Coarticulation and Connected Speech Processes
1 Speech Contextual Variability
2 Theoretical Accounts of Coarticulation
3 Summary
10 Theories and Models of Speech Production
1 The Speech Signal and Its Description
2 Concepts and Issues in Movement Control
3 Serial Control of Speech Movements
4 Summary
ACKNOWLEDGMENT
11 Voice Source Variation and Its Communicative Functions
1 Introduction
2 Analyzing the Voice Source
3 Some Commonly Occurring Voice Qualities
4 Determinants of Voice Source Variation
5 Future Research
12 Articulatory–Acoustic Relations as the Basis of Distinctive Contrasts
1 The Question: How Is the Discrete Linguistic Representation of an Utterance Related to the Continuously-Varying Speech Signal?
2 One Answer: Quantal Theory and Distinctive Features
3 Enhancement and Overlap: Introducing Variation to the Defining Acoustic Cues
4 Concluding Remarks
13 Aspects of Auditory Processing Related to Speech Perception
1 Introduction
2 Frequency Selectivity
3 Across-Channel Processes in Masking
4 Timbre Perception
5 The Perception of Pitch
6 Temporal Analysis
7 Calculation of the Internal Representation of Sounds
8 Concluding Remarks
14 Cognitive Processes in Speech Perception
1 Introduction
2 Lexical Information
3 Segmental Information
4 Suprasegmental Information
5 Conclusions
Part IV: Linguistic Phonetics
15 The Prosody of Speech: Timing and Rhythm
1 Introduction
2 Lengthenings and Shortenings: The Temporal Signatures of Prosody
3 Speech Timing: A Rhythmic Dimension
4 Tempo and Pausing
5 Concluding Comments
16 Tone and Intonation
1 Introduction
2 The Representation of Tone and Intonation
3 A Taxonomy of Formal Parameters
4 A Taxonomy of Linguistic Functions
5 Is a Typology Needed?
17 The Relation between Phonetics and Phonology
1 Introduction
2 Some History
3 Philosophy
4 The Integration of Phonetics and Phonology
18 Phonetic Notation
1 Introduction
2 Challenges to Notational Categories
3 Elaborating the IPA System of Notation
19 Sociophonetics
1 Introduction
2 Defining Sociophonetic Variation
3 Sociophonetic Studies of Speech Production
4 Sociophonetic Studies and Speech Perception
5 Methodological Issues
6 Theoretical Implications of Sociophonetic Studies
7 Wider Applications of Sociophonetics
8 Conclusion
Part V: Speech Technology
20 An Introduction to Signal Processing for Speech
1 Overview
2 Resonance
3 Sinusoids
4 Linearity
5 Fourier Analysis
6 Filters
7 The Spectrogram
8 Linear Prediction
9 Speech Features
10 Conclusions
21 Speech Synthesis
1 Introduction
2 Speech Synthesis in Text-to-Speech Systems
3 Components of a Generic Text-to-Speech System
4 Notations for Rule-Based Parametric Speech Synthesis
5 Prosodic Descriptions and Implementations
6 Sound Generation Techniques in Parametric Synthesis
7 Synthesis by Concatenation
8 Trends in Speech Synthesis
9 Concluding Remarks
22 Automatic Speech Recognition
1 Introduction
2 A Data-Driven Approach
3 Acoustic Features
4 Acoustic Modeling
5 Pronunciation Modeling
6 Language Modeling
7 Search
8 Evaluation
9 Discussion
10 Further Reading
Index
Praise for The Handbook of Phonetic Sciences
“With this second edition, the Handbook of Phonetic Sciences will continue to be an outstanding resource for students, providing wide-ranging critical overviews of the development of key scientific topics and of the debates which are at the heart of contemporary phonetic research.”
Gerard Docherty, Newcastle University
“This Handbook is an outstanding collection of state-of-the-art surveys and original contributions. Revised and refreshed, it is essential reading for anyone engaged in understanding phonetic aspects of speech.”
John Local, University of York
“This new edition updates its coverage of a wide range of topics, reflecting the most recent trends in research. I will use it as a reference for both my teaching and my research.”
Patricia Keating, University of California, Los Angeles
Blackwell Handbooks in Linguistics
This outstanding multi-volume series covers all the major subdisciplines within linguistics today and, when complete, will offer a comprehensive survey of linguistics as a whole.
This paperback edition first published 2013
© 2013 Blackwell Publishing Ltd except for editorial material and organization © 2013 William J. Hardcastle, John Laver, and Fiona E. Gibbon
Edition History: Blackwell Publishing Ltd (1e, 1997; 2e hardback, 2010)
Blackwell Publishing was acquired by John Wiley & Sons in February 2007. Blackwell’s publishing program has been merged with Wiley’s global Scientific, Technical, and Medical business to form Wiley-Blackwell.
Registered Office
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK
Editorial Offices
350 Main Street, Malden, MA 02148-5020, USA
9600 Garsington Road, Oxford, OX4 2DQ, UK
The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK
For details of our global editorial offices, for customer services, and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com/wiley-blackwell.
The right of William J. Hardcastle, John Laver, and Fiona E. Gibbon to be identified as the authors of the editorial material in this work has been asserted in accordance with the UK Copyright, Designs and Patents Act 1988.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.
Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book. This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold on the understanding that the publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional should be sought.
Library of Congress Cataloging-in-Publication Data
The handbook of phonetic sciences/edited by William J. Hardcastle, John Laver, Fiona E. Gibbon. – 2nd ed.
p. cm. – (Blackwell handbooks in linguistics)
Includes bibliographical references and index.
ISBN 978-1-4051-4590-9 (hardcover : alk. paper) ISBN 978-1-118-35820-7 (paperback: alk. paper)
1. Phonetics–Handbooks, manuals, etc. I. Hardcastle, William J., 1943– II. Laver, John. III. Gibbon, Fiona E.
P221.H28 2009
414′.8–dc22
2009033872
A catalogue record for this book is available from the British Library.
Cover image: Ceremonial, 1999 by Ignacio Auzike/Getty
Cover design by Workhaus
For Peter Ladefoged and Gunnar Fant, who led the field
Contributors
Hermann Ackermann, University of Tübingen
Janet Mackenzie Beck, Queen Margaret University, Edinburgh
Mary E. Beckman, Ohio State University
Rolf Carlson, KTH Royal Institute of Technology, Stockholm
Anne Cutler, Max Planck Institute for Psycholinguistics, Nijmegen; MARCS Auditory Laboratories, University of Western Sydney
Barbara L. Davis, University of Texas
Daniel P. W. Ellis, Columbia University
John H. Esling, University of Victoria
Edda Farnetani, Centro di Studio per le Ricerche di Fonetica del CNR, Padova
Janet Fletcher, University of Melbourne
Paul Foulkes, University of York
Christer Gobl, Trinity College Dublin
Björn Granström, KTH Royal Institute of Technology, Stockholm
Helen M. Hanson, Union College, New York
Jonathan Harrington, University of Munich
Hajime Hirose, Kitasato University
Simon King, University of Edinburgh
Anders Löfqvist, Haskins Laboratories, New Haven
James M. McQueen, Max Planck Institute for Psycholinguistics, Nijmegen; Radboud University Nijmegen
Brian C. J. Moore, University of Cambridge
Ailbhe Ní Chasaide, Trinity College Dublin
John J. Ohala, University of California at Berkeley
Daniel Recasens, Universitat Autònoma de Barcelona
Steve Renals, University of Edinburgh
James M. Scobbie, Queen Margaret University, Edinburgh
Christine H. Shadle, Haskins Laboratories, New Haven
Anne Smith, Purdue University
Kenneth N. Stevens, Massachusetts Institute of Technology
Maureen Stone, University of Maryland
Jennifer J. Venditti, San Jose State University
Dominic Watt, University of York
Wolfram Ziegler, City Hospital, Bogenhausen, Munich
Preface to the Second Edition
It is now over 10 years since the publication of the first edition of The Handbook of Phonetic Sciences. Since then the phonetic sciences have developed substantially and there are now many more disciplines taking a professional interest in speech-related areas. This multidisciplinary orientation continues to be reflected in the second edition.
In this second edition, 32 leading researchers have contributed 22 chapters in 5 major sectors of the contemporary subject. As with the first edition, an elementary knowledge of the field is assumed and each chapter presents an overview of a key area of the expertise which makes up the wide range of the phonetic sciences today.
There are a number of chapters retained from the first edition which have been substantially updated by the authors. These include the chapters by Stone, Shadle, Hirose, Mackenzie Beck, Farnetani and Recasens, Löfqvist, Gobl and Ní Chasaide, Stevens and Hanson, Moore, McQueen and Cutler, Ohala, Carlson and Granström. Other topic areas from the first edition have been given completely new treatment by newly commissioned authors (chapters by Harrington, Ackermann and Ziegler, Smith, Davis, Ellis, Renals and King). There are also two new chapters covering sociophonetics (Scobbie, Foulkes, and Watt) and phonetic notation (Esling). To reflect the increasing significance of the area of prosody in the phonetic sciences we have also included two commissioned chapters covering the areas of timing and rhythm (Fletcher), and tone and intonation (Beckman and Venditti).
For readers with complementary interests in phonology and clinical phonetics and linguistics the companion volumes to this handbook, The Handbook of Phonological Theory (Goldsmith, 2010, 2nd edn.) and The Handbook of Clinical Linguistics (Ball, Perkins, Müller, & Howard, 2008) are recommended.
We would like to thank a number of colleagues for their assistance with editorial work, including Annabel Allen, Pauline Campbell, Erica Clements, Sue Peppe, and Sonja Schaeffler. Special thanks are also due to Anna Oxbury for her meticulous and thoughtful copy-editing.
The editors
Introduction
WILLIAM J. HARDCASTLE, JOHN LAVER, AND FIONA E. GIBBON
As with the first edition, the book is divided into five major sections. The first part begins with an account of the main measurement techniques, methodologies, and instruments found in experimental phonetic laboratories. The next part explores aspects of the anatomical and physiological framework for normal and disordered speech production. The third and largest part of the book focuses on the acquisition of speech and theories and models of speech production and perception. The fourth part deals with the linguistic motivation of much research in the phonetic sciences, covering a number of key areas of linguistic phonetics. The final part returns to experimental approaches to the phonetic sciences, but this time focusing on speech signal processing and engineering in an overview of the main developments in speech technology. There are extensive pointers to further reading in each chapter.
Part I has four chapters on the topic of Experimental Phonetics. The section begins with a critical evaluation by Maureen Stone of current laboratory techniques that measure the oral vocal tract during speech. The focus is on instruments that measure the articulators directly and indirectly. Indirect measurements come from instruments that are remote from the structures of interest, such as imaging techniques (e.g., X-ray, MRI, and ultrasound). Direct measurements come from instruments that contact the structures of interest, such as point-tracking devices and electropalatography. References are made to current research using each instrument in order to indicate its applications and strengths.
Experimental approaches to speech production are explored further by Christine Shadle in the next chapter on the aerodynamics of speech. This chapter begins by defining aerodynamics and reviews the basic concepts of fluid statics and dynamics (including turbulence), and aerodynamically distinct vocal tract behaviors are discussed. This is followed by a section covering measurement methods, divided into basic methods such as pressure and flow velocity measurement, and speech-adapted methods such as the Rothenberg mask and methods for measuring or estimating lung volume and subglottal pressure, and the use of hot-wires to measure flow velocities in the vocal tract. A final section describes models of speech production that incorporate aerodynamics.
Acoustic phonetics is the subject of the third chapter by Jonathan Harrington. This new chapter provides an overview of the acoustic characteristics of consonants and vowels from the perspective of a broad range of research questions in experimental phonetics and laboratory phonology. Various procedures for the phonetic classification of the acoustic speech signal are reviewed including the identification of vowel height and backness from various transformed acoustic spaces, the derivation of place of articulation in oral stops from burst and locus cues, and techniques for distinguishing between fricatives based on parameterizing spectral shape. These techniques are informed by a knowledge of speech production and are related to speech perception, and they also establish links to pattern classification in signal processing.
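The formant-based classification of vowel height and backness mentioned above can be illustrated with a toy sketch. This is a hypothetical example, not material from the handbook: vowel height varies inversely with the first formant (F1), and frontness varies with the second formant (F2). The formant values are approximate adult-male averages, and the threshold values are purely illustrative.

```python
# Illustrative only: approximate adult-male formant averages in Hz.
vowels_hz = {"i": (270, 2290), "a": (730, 1090), "u": (300, 870)}

def classify(f1, f2, f1_split=500, f2_split=1500):
    """Classify a vowel by F1/F2. Threshold values are illustrative."""
    height = "high" if f1 < f1_split else "low"      # low F1 ~ high vowel
    backness = "front" if f2 > f2_split else "back"  # high F2 ~ front vowel
    return f"{height} {backness}"

for vowel, (f1, f2) in vowels_hz.items():
    print(vowel, "->", classify(f1, f2))
# i -> high front, a -> low back, u -> high back
```

Real classification schemes of the kind the chapter reviews work in transformed acoustic spaces (e.g., speaker-normalized or auditory scales) rather than with raw Hz thresholds.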
Investigating the physiology of laryngeal structures is the subject of the final chapter in this section. In this chapter, Hajime Hirose describes specialized, newly developed techniques for observing laryngeal behavior during speech production, including flexible fiberscopy, high-speed digital imaging, laryngeal electromyography, photoglottography, electroglottography, and magnetic resonance imaging. Basic behaviors of the laryngeal structures are described with reference to the results of observation obtained by the above techniques and the nature of laryngeal adjustments that take place under different phonetic conditions.
Part II contains three chapters on biological perspectives and opens with an exploration by Janet Mackenzie Beck of organic variation and the ways it affects the vocal apparatus. She points to two main sources of variation in speech performance: phonetic variation resulting from differences in the way individuals use their vocal apparatus, and organic variation depending on individual differences in inherent characteristics of the vocal organs. The chapter focuses on organic variation, bringing together information from a variety of sources: anatomical, physiological, and anthropological. Three main types of differences in the structure of the vocal apparatus are discussed: the life-cycle changes within an individual; genetic or environmental factors which differentiate between individuals; and differences which result from trauma or disease.
Hermann Ackermann and Wolfram Ziegler in their chapter on brain mechanisms underlying speech motor control begin with an overview of the topic. Their discussions draw upon data derived from three approaches, namely, electrical surface stimulation of the cortex, lesion studies in patients with neurogenic communication disorders, and functional imaging techniques. These discussions are preceded by a review of experimental studies in subhuman primates addressing the corticobulbar representation of orofacial muscles as well as the cerebral correlates of vocal behavior.
The final chapter in Part II is by Anne Smith and concerns the development of neural control for speech. She gives an integrative overview of studies of the development of the neuromotor processes involved in controlling articulatory movements for speech. The area of speech motor development has not been critically reviewed recently and this chapter provides a detailed summary of major advances in understanding the time course of maturation of speech motor control processes, which, contrary to earlier claims, are not adult-like until late adolescence. Discussions of theoretical issues in speech motor development, such as the units involved in the language–motor interface and the issues of neural plasticity and sensitive periods in speech motor development, portray important, ongoing debates in this area.
Part III contains seven chapters on the topic of modeling speech production and perception. The first is a chapter on speech acquisition by Barbara Davis. She addresses the question of how young children integrate biology and cognition to achieve the necessary capacities for the phonological component of linguistic communication. The chapter outlines how contemporary theoretical perspectives and research paradigms consider the nature of speech acquisition. These include formalist phonological perspectives representing a consistent strand of proposals on acquisition of sound patterns in languages. She contrasts this approach with functionalist phonetic science perspectives that have focused on biological characteristics of the developing child and the ways in which these capacities contribute to emergence of complex speech output patterns.
The chapter by Edda Farnetani and Daniel Recasens presents an overview of the current knowledge concerning coarticulation and connected speech processes. The authors address the nature of coarticulatory and assimilatory processes in connected speech, and explore the foundations and predictions of the most relevant theoretical models of labial, velar, and lingual coarticulation (feature spreading, time-locked, locus equation, adaptive variability, window model, and coarticulatory resistance). They describe the significant theoretical and experimental progress in understanding contextual variability, which is reflected in continuously evolving and improving models, and in increasingly rigorous and sophisticated research methodologies.
Theories and models of speech production are developed further by Anders Löfqvist, particularly from the point of view of spatial and temporal control of speech movements. In his chapter, theoretical and empirical approaches to speech production converge in their focus on understanding how the different parts of the vocal tract are flexibly marshaled and coordinated to produce the acoustic signal that the speaker uses to convey a message. He outlines a variety of experimental paradigms and how these are applied to the problem of coordination and control in motor systems with excess degrees of freedom.
An area of key theoretical and technical importance is the nature of the voice source and how it varies in speech. The chapter by Christer Gobl and Ailbhe Ní Chasaide is concerned with acoustic aspects of phonation and its exploitation in speech communication. The early sections focus on the source signal itself, on analysis techniques, and provide acoustic descriptions of different voice qualities. The later sections describe how variations in the voice source are associated with segmental or suprasegmental aspects of the linguistic code, and discuss the role of voice quality in the paralinguistic signaling of emotion, mood, and attitude. The sociolinguistic function in differentiating among linguistic, regional, and social groups is briefly outlined, as well as its important role in speaker identification.
The next chapter by Kenneth Stevens and Helen Hanson focuses on articulatory–acoustic relations as the basis of distinctive contrasts. The chapter provides a physical basis for the inventory of binary distinctive features or phonological contrasts that are observed in language. The chapter is a major update on the quantal nature of speech, and the authors show how aerodynamic and acoustic properties of speech production lead to quantal relations between the articulatory parameters and the acoustic consequences of these variations. The chapter also proposes how listeners might extract additional enhancing cues as well as cues relating to the defining quantally-based properties of the acoustic signal in running speech. Other approaches that have been proposed to account for variability in speech are also described.
The final two chapters in Part III deal with aspects of auditory processing and speech perception. The first chapter by Brian Moore reviews selected aspects of auditory processing, chosen because they play a role in the perception of speech. The review is concerned with basic processes, many of which are strongly influenced by the operation of the peripheral auditory system and which can be characterized using simple stimuli such as pure tones and bands of noise. He discusses the resolution of the auditory system in frequency and time, as revealed by psychoacoustic experiments. A consistent finding is that the resolution of the auditory system usually markedly exceeds the resolution necessary for the identification or discrimination of speech sounds. This partly accounts for the fact that speech perception is robust, and resistant to distortion of the speech and to background noise.
James McQueen and Anne Cutler in their chapter focus on the cognitive processes involved in speech perception. They describe how recognition of spoken language involves the extraction of acoustic-phonetic information from the speech signal, and the mapping of this information onto cognitive representations. They focus on our ability to understand speech from talkers we have never heard before, and to perceive the same phoneme despite acoustically different realizations (e.g., by a child’s voice versus an adult male’s). They show how processing of segmental, lexical and suprasegmental information in word recognition contributes significantly to listeners’ processing decisions.
The five chapters in Part IV cover different aspects of linguistic phonetics, beginning with two new chapters on speech prosody. Janet Fletcher explores rhythm and timing in speech with a particular focus on how durational patterns of segments and syllables contribute to the signaling of stress and/or accent and prosodic phrasing in different languages. The chapter summarizes the contribution of durational patterns of segments, morae, and syllables to the rhythm and tempo of spoken language, and evaluates the different kinds of metrics that are often used in experimental investigations. What emerges is a complex picture of how speech unfolds in time, and crucially how the temporal signatures of prosody in a language are often accompanied by additional qualitative acoustic and articulatory modifications, rather than just adjustment of measurable duration alone.
In the second chapter on speech prosody, Mary Beckman and Jennifer Venditti examine tone and intonation. The authors begin by reviewing the ways in which pitch patterns are represented in work on tone and intonation. A key point in this review is that symbolic representations are phonetically meaningful only if they are tags for parameter settings in an analysis-by-synthesis model of f0 contours. The most salient functions of lexical contrast, prosodic grouping, and prominence marking are described in a way that makes clear that many aspects of the pitch pattern can simultaneously serve one, two, or all three of these functions. The authors conclude by suggesting that broad-scale typologies that differentiate only between two or three language “types” (e.g., “tone languages”) are overly simplistic.
The next chapter by John Ohala explores the relation between phonetics and phonology. In tracing the history of this relationship from the early part of the last century, he shows it has been affected by theoretical frameworks such as structuralist phonology, in which more attention was given to relations between sounds at the expense of substance of sounds. It is proposed that in order to explain sound patterns in language, phonology needs to re-integrate scientific phonetics (as well as psychology and sociolinguistics). The author provides examples where principles of aerodynamics and acoustics are used to explain certain common sound patterns.
John Esling’s chapter on phonetic notation reviews the theoretical constructs of how speech sounds are transcribed using phonetic notation. He presents the International Phonetic Alphabet (IPA) as a common core of standard usage that transcribers of language can universally refer to and understand. Orthographic, iconic, and alphabetic notation are differentiated, and the phonetic relationships between sets of symbols are addressed. A revised version of the IPA consonant chart is developed, as well as a novel way of looking at the IPA vowel chart. Place of articulation, manner of articulation, vowel classification, and secondary articulation are discussed where they present challenges to notational conventions. He also discusses notation for stress and juncture, strength of articulation, voice quality, and clinical usage for transcribing disordered speech.
The last chapter in Part IV is on sociophonetics. In this chapter, Paul Foulkes, James Scobbie, and Dominic Watt provide an overview of sociophonetics as an area of the phonetic sciences which takes into account the systematic subtle differences in phonetic systems which attach to social groups. This structured variation informs theoretical debate in fields such as sociolinguistics, phonetics, phonology, psycholinguistics, typology, and diachronic linguistics. In their chapter, Foulkes, Scobbie, and Watt survey work which touches on all these areas, although sociolinguistics features most strongly. The chapter addresses both production and perception studies, before moving on to consider contemporary methodological issues and the general theoretical implications that arise from the literature.
Part V contains three chapters that are concerned with issues relating to speech technology. Most speech technology applications rely on digital signal processing and Daniel Ellis presents an introduction to the topic of signal processing for speech. His chapter emphasizes an intuitive understanding of signal processing in place of a formal mathematical presentation. He begins with familiar daily experiences of resonance and oscillation, for instance as seen in a pendulum, and builds up to the ideas of decomposing signals into sinusoids (Fourier analysis), filtering, and the familiar speech-related tools of the spectrogram and cepstral coefficients. All of this is done without a single equation, but in a way that may help cement insights even for readers already familiar with more technical presentations.
The next chapter, by Rolf Carlson and Björn Granström, is a survey of speech synthesis systems. They review some of the more popular approaches to speech synthesis and show how it is no longer simply a research tool but has many everyday applications. They describe current trends in speech synthesis research and point to some present and future applications of text-to-speech technology.
Part V concludes with a chapter on automatic speech recognition by Steve Renals and Simon King. They define automatic speech recognition as the task of transforming an acoustic speech signal into the corresponding sequence of words. Their chapter provides an overview of the statistical, data-driven approaches which now constitute the state of the art. The chapter outlines the decomposition of the problem into acoustic modeling and language modeling and provides a flavor of some of the technical details that underpin this research field, as well as outlining some of the major open challenges.
We would like to conclude by offering our warmest thanks to all the contributors. We believe that the 22 chapters in the second edition of this handbook give an exciting as well as a representative flavor of the productive multidisciplinary research that typifies the phonetic sciences today.
Part I
Experimental Phonetics
1
Laboratory Techniques for Investigating Speech Articulation
MAUREEN STONE
This chapter discusses current laboratory techniques that measure the oral vocal tract during speech. The focus is on instruments that measure the articulators directly and indirectly. Indirect measurements come from instruments that are remote from the structures of interest, such as imaging techniques. Direct measurements come from instruments that contact the structures of interest, such as point-tracking devices and electropalatography. Although some references are made to current research using each instrument, to indicate its applications and strengths, the list of studies is not comprehensive, as the goal is to explain the instrument.
Measuring the vocal tract is a challenging task because the articulators differ widely in location, shape, structural composition, and speed and complexity of movement. First, there are large differences in tissue consistency between soft tissue structures (tongue, lips, velum) and hard tissue structures (jaw, palate), which result in substantially different movement complexity. In other words, the fluid deformation of the soft structures and the rigid movements of the bones need different measurement strategies. Second, measurement strategies must differ between structures visible to superficial inspection, such as the lips, and structures deep within the oral cavity, such as the velum. Third, articulator rates of motion vary, so that an instrument with a frequency response appropriate for the slow-moving jaw will be too slow for the fast-moving tongue tip. The final and perhaps most important measurement complication is the interaction among articulators. Some articulatory behaviors are highly correlated, and distinguishing the contributions of each player can be quite difficult. The most dramatic example of this is the tongue–jaw system. It is clear that jaw height is a major factor in tongue tip height. However, the coupling of these two structures becomes progressively weaker as one moves posteriorly, until in the pharynx, tongue movement is only minimally coupled to jaw movement, if at all. Thus, trying to measure the contribution of the jaw to tongue movement becomes a difficult task.
It is difficult to devise a transducer that can be inserted into the mouth, which will not in some way distort the speech event. Thus, the types of instruments used in the vocal tract need to be unobtrusive, such as by resting passively against a surface (e.g., electropalatography), by being small and positioned on noncontact surfaces (e.g., pellet tracking systems), or by not entering the vocal tract at all (e.g., imaging techniques).
Instruments that enter the oral cavity must meet certain criteria. They need to be unaffected by temperature change, moisture, or air pressure. Adhesives must be unaffected by moisture, nontoxic, able to stick to expandable, moist surfaces, and removable without tearing the surface tissue. Devising instruments that are noninvasive, unobtrusive, meet the above criteria, and still measure one or more components of the speech event is so difficult that most researchers prefer to study the speech wave and infer physiological events from it. However, since those inferences are based on, and refined by, physiological data, it is critical to add new physiological databases, lest models of the vocal tract and our understanding of speech production stagnate.
In recent times, physiological measurements have improved at an extraordinary pace. Imaging techniques are revolutionizing the way we view the vocal tract by providing recognizable images of structures deep within the pharynx. They also provide information on local tissue movement and control strategies. Point-tracking systems and palatographic measurements have transformed our ideas about coarticulation by revealing inter-articulator relationships that previously could only be addressed theoretically. Applications to linguistics and rehabilitation are now ongoing. This chapter considers indirect measurements, that is, imaging techniques, and direct measurements, such as point-tracking techniques and tongue–palate measurement devices.
The internal structures of the vocal tract are difficult to measure without impinging upon normal movement patterns. Imaging techniques overcome that difficulty because they register internal movement without directly contacting the structures. Four well-known imaging techniques have been applied to speech research: X-ray, computed tomography (CT), magnetic resonance imaging (MRI), and ultrasound. Imaging systems provide recordings of the entire structure, rather than single points on the structure.
X-ray is the best known of the imaging systems. It is important because it was the first widely used imaging system, and most of our historical knowledge about the pharyngeal portion of the vocal tract came from X-ray data. To make a lateral X-ray image, an X-ray beam is projected from one side of the head through all the tissue, and recorded onto a plate on the other side. The resulting image shows the head from front to back and provides a lengthwise view of the tongue. A frontal or anterior–posterior (AP) X-ray is made by projecting the X-ray beam from the front of the head through to the back of the head and recording the image on a plate behind the head. The resulting images provide a cross-sectional view of the oral cavity. Prior to the advent of MRI, considerable research was done using X-ray imaging. More recent X-ray studies are based on archival databases.
X-ray data have contributed to many aspects of speech production research. Many vocal tract models are based on X-rays (cf. Fant, 1965; Mermelstein, 1973; Harshman et al., 1977; Wood, 1979; Hashimoto & Sasaki, 1982; Maeda, 1990). X-rays have also been used to study normal speech production (Kent & Netsell, 1971; Kent, 1972; Kent & Moll, 1972), nonspeech motions (Kokawa et al., 2006), motor control strategies (Lindblom et al., 2002; Iskarous, 2005), language differences (cf. Gick, 2002b; Gick et al., 2004), and speech disorders (Subtelny et al., 1989; Tye-Murray, 1991).
Soft tissue structures such as the tongue are usually difficult to measure with X-rays, because the beam records everything in its path, including teeth, jaw, and vertebrae. These strongly imaged bony structures obscure the fainter soft tissue. Another limitation of X-ray is that unless a contrast medium is used to mark the midline of the tongue, it is difficult to tell whether the visible edge is the midline surface of the tongue or a lateral edge. This is particularly problematic during speech, because the tongue is often grooved or arched. Finally, the potential hazards of overexposure have reduced the collection of large quantities of X-ray data. Archival X-ray databases are, however, publicly available for research use. One such database (Munhall et al., 1994a, 1994b) was compiled by Advanced Technologies Research Laboratories, Kyoto, and is available from http://psyc.queensu.ca/∼munhallk/05_database.htm.
Tomography is a fundamentally different imaging method from projection X-ray in that it records slices of tissue. These slices are made by projecting a thin, flat beam through the tissue in one of four planes: sagittal, coronal, oblique, and transverse (see Figure 1.1). The mid-sagittal plane is a longitudinal slice, from top to bottom, down the median plane, or midline, of the body (dashed line – upper right). The para-sagittal plane is parallel to the midline of the body and off-center (not shown). The coronal plane is a longitudinal slice perpendicular to the median plane of the body. The oblique plane is inclined between the horizontal and vertical planes. Finally, the transverse plane lies perpendicular to the long axis of the body, and is often called the transaxial or, in MRI, the axial plane. Three tomographic techniques used in speech research are Computed Tomography, Magnetic Resonance Imaging, and Ultrasound Imaging.
Figure 1.1 Scan types used in through-transmission and tomographic imaging. There are two X-ray angles contrasted with four tomographic scanning planes.
Computed Tomography uses X-rays to image slices (sections) of the body as thin as 0.5 mm or less. Coronal tomographic images, for example, are made by projecting very thin X-ray beams through a slice of tissue from multiple origins. The scanner rotates around the body taking these images, and a computer creates a composite, including structures that are visible in some scans but obscured in others. Using this technique, tissue slices can be collected rapidly, at 15 Hz or faster, and multiple slices can be collected simultaneously. CT images soft tissue more clearly than projection X-ray because each section is a composite: by digitally summing a series of scans, it achieves sharper edges and more distinct tissue definition. From the multislice datasets, planar sections can be reconstructed in any direction. CT images can produce excellent resolution of soft and hard tissue structures. Figure 1.2, for example, is a reconstructed image of the midsagittal plane of the vocal tract. Bone appears bright white in the image; soft tissue structures are gray. In this figure, the junction of the velum and hard palate can be seen to be quite complex. The soft tissue below the hard palate widens before the velum emerges as a freestanding structure. It is clear from this image that the shape of the palatine bone is not well reflected in the soft tissue. Measurements of the palate made from an MRI or ultrasound image will therefore differ from measurements made directly in the mouth or from dental impressions. Without this image, those differences would be hard to interpret.
Figure 1.2 Midsagittal CT of vocal tract reconstructed from axial images. Bone is white; soft tissue is gray.
(Reproduced courtesy of Ian Wilson)
Another method of CT data collection is Spiral CT, which acquires a single continuous spiral-shaped slice rather than multiple flat planar slices. In the mid 1980s, the cable-and-drum mechanism for powering the rotation of the CT machine was replaced with a slip ring. The slip ring allows the CT scanner to rotate continuously, creating a spiral image. Spiral CT scans have very high resolution, but currently take 20–30 seconds to create, and hence are too slow for imaging continuous speech, though excellent for static images (Lell et al., 2004).
Electron Beam CT was developed to measure calcium deposits around coronary arteries. Its principles are similar to CT, but it uses an electron “gun” instead of regular X-ray. EBCT collects a set of parallel images that are reconstructed as a 3D volume. EBCT is a fast acquisition technique and therefore has been used to collect vocal tract images for datasets requiring short acquisition times. For example, Tom et al. (2001) scanned the entire vocal tract in under 90 seconds, to compare vocal tract shapes during falsetto and chest registers.
Although CT has been used to image the vocal tract, it is not the instrument of choice for speech research because of radiation exposure and because MRI provides much the same information, albeit at a lower spatial and temporal resolution. In fact, the major limitation of CT is that it has more radiation exposure than traditional X-ray, because it images thinner slices, and each slice is scanned several times to collect multiple images. Another limitation is that the subject is supine or prone, so gravitational effects on the subject differ from upright. On the positive side, 3D reconstructions can be made and sliced in any plane, and images are clear and easy to measure.
Another tomographic technique is Magnetic Resonance Imaging, which uses a magnetic field and radio waves rather than X-rays to image a section of tissue. There are a number of MRI procedures that yield a variety of information: high-resolution MRI, cine MRI, tagged-snapshot MRI, tagged-cine MRI, diffusion tensor MRI, and functional MRI. All of these use identical hardware: typically 1.5 or 3 Tesla machines. The differences lie in the software algorithms, which are designed to exploit different features of the relationship between the hydrogen proton, magnetic fields, and radio waves.
An MRI scanner consists of electromagnets that surround the body and create a magnetic field. MRI scanning detects the presence of hydrogen atoms, which occur in abundance in water and, therefore, in human soft tissue. Figure 1.3 depicts the MRI process. Picture (a) represents hydrogen protons spinning about their axes, which are oriented randomly. (b) shows what happens when a magnetic field is introduced. The protons’ axes align along the direction of the field’s poles. Even when aligned, however, the protons wobble, or precess. In (c) a short-lived radio pulse, vibrating at the same frequency as the precession, is introduced. This momentarily knocks the proton out of alignment. (d) shows the proton realigning, within milliseconds, to the magnetic field. As the proton realigns, it emits a weak radio signal of its own. Period (d) is when the MR image is “read.” The radio signals are summed until the protons return to position (b). The resulting data are constructed into an image that reflects the hydrogen content (i.e., the amount of water or fat) of the different tissues. Because the proton emissions are weak, the process is repeated many times and the data are summed into a single image. If the process is repeated for several minutes, while the subject holds still, high-resolution images result.
Figure 1.3 MRI recording of the amount of hydrogen in tissue. Hydrogen protons spin about axes that are oriented randomly (A). The MRI magnet causes them to align to the long axis of the body, but with a small precession (wobble) (B). A radio-frequency pulse knocks them out of alignment (C). As the protons realign to the magnet (D) they emit a radio pulse that is read by the scanner.
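The benefit of summing repeated acquisitions can be quantified: averaging N independent, equally noisy readings reduces the noise level by a factor of √N. The Python sketch below is a toy illustration of this principle only (the signal strength, noise level, and repetition count are invented, not scanner parameters):

```python
import numpy as np

rng = np.random.default_rng(0)

true_signal = 1.0   # hypothetical proton emission strength (arbitrary units)
noise_sd = 0.5      # assumed per-acquisition noise level
n_reps = 100        # number of summed acquisitions

# Each acquisition is the same weak signal plus independent noise.
acquisitions = true_signal + noise_sd * rng.standard_normal(n_reps)

# Averaging the repetitions shrinks the residual noise by roughly sqrt(n_reps).
averaged = acquisitions.mean()
expected_residual_sd = noise_sd / np.sqrt(n_reps)  # 0.5 / 10 = 0.05

print(averaged, expected_residual_sd)
```

This square-root law is why high-resolution images demand minutes of immobility: halving the residual noise requires four times as many acquisitions.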
MRI measurement of oral structures has replaced X-ray for many research applications. MR images have been used to detail developmental vocal tract anatomy and function (Xue & Hao, 2003; Vorperian et al., 2005). MRI has also provided quite accurate extraction of vocal tract surfaces (Story et al., 1996). These surfaces have been used to calculate 3D vocal tract volumes for modeling the relationship between tract geometry and acoustics (Tameem & Mehta, 2004; Story, 2005). Extracted edges have also been used to model 3D structures within the vocal tract. Serrurier and Badin (2005) modeled velar position for French vowels from MRI and CT images. Engwall (2003) modeled tongue position for Swedish vowels from MRI, Electromagnetic Articulography (EMA), and Electropalatography (EPG). Story et al. (1996) modeled vocal tract airway shapes for 18 English phonemes from MRI. MRI is very good at characterizing different types of soft tissue and therefore is quite successful in identifying tumors and soft tissue pathology. For example, Lenz et al. (2000) used MRI and CT together to stage oral tumors, and Lam et al. (2004) had good success using MRI T1- and T2-weighted images to determine tumor thickness.
Two types of MRI are used particularly to characterize tissue: high-resolution MRI (hMRI) and diffusion tensor MRI (DTI). Figure 1.4 shows a high-resolution sagittal MRI image of the vocal tract at rest. The vocal tract appears black, as do the teeth, since neither contains water. Water and fat, both of which are high in hydrogen, are found in marrow, seen in the palate and mandible. Muscles are visible in the tongue, velum, and lips. The other method of characterizing soft tissue is diffusion tensor MRI (DTI), which measures 3D fiber direction, typically in ex vivo structures. DTI, developed in the early 1990s, visualizes fiber direction by measuring random thermal displacement of water molecules in the tissue. The direction of greatest molecular diffusion parallels the local fiber direction. DTI has virtually microscopic spatial resolution and distinguishes tissue fibers and their orientations in any muscle. A fiber map can be drawn and superimposed on an MRI structural image. The fiber map is 3D and can differentiate among nerve fiber pathways and detail anatomical structures based on their fiber architecture. There are limitations of this technique that impede the measurement of oropharyngeal structures. First, when fiber directions cross within a single voxel (3D pixel), visualization of the underlying fiber structure is reduced. Fiber interdigitation is typical in oral musculature, especially the tongue, lips, and velum. Second, DTI is sensitive to motion, and the structure must remain immobile for several minutes to record a volumetric scan. Using long collection times, DTI has been used to study the excised tongues of animals (Wedeen et al., 2001) and humans (Gilbert & Napadow, 2005). In addition, DTI can be used in vivo with cooperative subjects to collect data in as little as 3–5 minutes. Figure 1.5 shows a fiber map indicating the fan-like fibers of the genioglossus muscle, which run from superior–inferior to anterior–posterior in direction. This image was taken from an in vivo human tongue at rest (Shinagawa et al., 2008, 2009).
Figure 1.4 High-resolution MRI (hMRI) of the midsagittal vocal tract at rest.
Figure 1.5 Diffusion Tensor Image shows fiber directions of the genioglossus muscle (light gray). Fibers are overlaid on a high resolution image of the head.
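The computation at the heart of DTI fiber mapping can be sketched briefly. For each voxel, the scanner estimates a symmetric 3 × 3 diffusion tensor; the eigenvector belonging to the largest eigenvalue is the direction of greatest diffusion, taken as the local fiber direction. The Python example below uses an invented tensor, not real scanner output, purely to show the linear algebra:

```python
import numpy as np

# Hypothetical symmetric diffusion tensor for one voxel (units of mm^2/s),
# with diffusion strongest along a fiber tilted in the x-z plane.
D = np.array([[1.2, 0.0, 0.4],
              [0.0, 0.5, 0.0],
              [0.4, 0.0, 0.9]])

# Eigen-decomposition: the eigenvector of the largest eigenvalue is the
# direction of greatest diffusion, i.e., the inferred fiber direction.
vals, vecs = np.linalg.eigh(D)          # eigh returns eigenvalues in ascending order
fiber_direction = vecs[:, np.argmax(vals)]

# Fractional anisotropy (FA): 0 = isotropic diffusion, values near 1 = strongly
# directional diffusion, as along coherent fiber bundles.
mean_d = vals.mean()
fa = np.sqrt(1.5 * np.sum((vals - mean_d) ** 2) / np.sum(vals ** 2))

print(fiber_direction, fa)
```

A fiber map such as Figure 1.5 is essentially this computation repeated per voxel, with the resulting direction vectors drawn over the structural image.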
When measuring vocal tract motion, Cine-MRI is of particular interest. Cine-MRI is similar to other cine techniques, such as videofluoroscopy or movies, in that it divides a moving event into a number of still frames. Because MRI sums proton emissions over time, it typically takes a long time to reconstruct a single image, and collecting data during speech motion is challenging. Cine-MRI is often done by having the subject repeat a task multiple times and summing data from each frame across repetitions, similar to ensemble averaging. This technique has been used to compare vocal tract behaviors during speech production (Magen et al., 2003), especially vowel production (Hasegawa-Johnson et al., 2003; Story, 2005; McGowan, 2006). However, the subject must produce the repetitions very precisely to prevent image blurring. It is also possible to compare MRI vocal tract data to speech acoustics, either by collecting the speech wave independently from the (quite noisy) MRI data collection, or by using subtraction microphones within the MRI machine itself (cf. NessAiver et al., 2006). Some Cine-MRI data have been collected with single repetitions of a task (Mády et al., 2002; Narayanan et al., 2004). These images usually have a reduced frame rate or a reduced image quality, but typically have sufficient spatial resolution to extract surfaces of vocal tract structures. Faster frame rates and good image quality require several repetitions per slice (Stone et al., 2001a). Rapid collection of MRI frames using a single repetition has been reported using spiral MRI (Narayanan et al., 2004). This method acquires images in interleaved spirals rather than planes. Although Cine-MRI does not produce the quality of anatomy seen in high-resolution (hMRI) images, the speed of pellet-tracking systems, or the level of muscle detail seen in DTI, it can address many speech questions that were previously unstudied. Cine-MRI is therefore very popular in the study of speech.
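The ensemble-averaging strategy described above can be sketched as follows. This toy Python example assumes, optimistically, that every repetition of the utterance is identically timed (the "very precise" repetitions the text requires); all array sizes and noise levels are invented:

```python
import numpy as np

rng = np.random.default_rng(1)

n_frames, n_reps = 8, 20
height, width = 16, 16

# Hypothetical "true" frame sequence for one slice (arbitrary intensity units).
truth = rng.random((n_frames, height, width))

# Each repetition of the utterance yields the same frames plus acquisition noise.
reps = truth[None, ...] + 0.3 * rng.standard_normal((n_reps, n_frames, height, width))

# Ensemble average: frame k of the final movie is the mean of frame k
# across all repetitions.
cine = reps.mean(axis=0)

# The averaged movie is much closer to the truth than any single repetition.
err_single = np.abs(reps[0] - truth).mean()
err_avg = np.abs(cine - truth).mean()
print(err_single, err_avg)
```

If the repetitions are mistimed, frame k no longer shows the same articulatory state in every repetition, and the average blurs, which is why imprecise speakers degrade Cine-MRI image quality.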
Concurrent with the development of Cine-MRI was the development of Tagged Snapshot MRI. Tags are created by applying a spatial gradient to the tissue, which demodulates the spinning protons in alternating planes. In the demodulated planes the protons spin out of phase with the rest of the tissue. The demodulated protons are invisible to the machine when the image is read; thus the invisible proton planes appear as black stripes on the image (see Figure 1.6). After tags are applied to the tissue, the object of interest, such as the tongue or lips, is moved. Because magnetization stays with the tissue, the tags deform to exactly the same extent as the tissue. Figure 1.6 shows reference frames in three planes taken during /∫/ and deformed frames taken during the vowel in “sha.” In Tagged Snapshot MRI only two images are read. One is before the motion (reference) and the other is during or after the motion (deformed). A grid of tags is created by applying tags in horizontal and vertical planes in immediate succession prior to reading the image, or by combining two datasets of orthogonal tags collected separately. Niitsu et al. (1994) examined tag positions from rest (before) and a vowel position (after) to derive the direction of the movement. Napadow et al. (1999a, 1999b) studied swallowing and nonspeech tongue motions by having subjects repeat the task multiple times, collecting the deformed image at a progressively later time on each repetition. These images were then put into a pseudo-motion sequence reflecting the movement.
Figure 1.6 Tagged MR images in three planes. The left column shows the reference state of the tongue, just after the tags were applied during the sound /∫/. The right column shows the tongue in the deformed state, after the tongue has moved into position.
Cine-MRI and Tagged Snapshot MRI can be combined to form Tagged Cine-MRI (tMRI). tMRI captures internal tissue motion over time during the performance of a task. The first applications of tMRI were studies of the heart’s motion and internal tissue characteristics (Zerhouni et al., 1988; Axel & Dougherty, 1989). Continuous tongue deformation during speech has also been measured using tMRI. Figure 1.7 shows deformation of the midsagittal tongue between the two vowels /i/ and . The black and white squares are a visualization device to better depict the deformations. The change in their shapes over time demonstrates features of local tissue deformation. From these images tags can be tracked to directly measure the positions and motion of all tissue points in the tongue (Parthasarathy et al., 2007). From the tissue point motions one can calculate displacement, velocity, and local strain (compressions and expansions).
Figure 1.7 Tagged Cine-MRI sequence shows motion from /i/ to . Checkerboard deformation shows local expansion, compression, and shear.
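The strain computation mentioned above can be illustrated with a toy example. Given the tracked positions of nearby tissue points in the reference and deformed frames, one can estimate the local deformation gradient and, from it, the Green–Lagrange strain, whose diagonal terms are the local expansions (positive) and compressions (negative). The 2D Python sketch below uses invented point positions, not measured tag data:

```python
import numpy as np

# Hypothetical 2D positions (mm) of three nearby tissue points in the
# reference (just-tagged) frame and in a later deformed frame.
ref = np.array([[0.0, 0.0],
                [1.0, 0.0],
                [0.0, 1.0]])
deformed = np.array([[0.0, 0.0],
                     [1.2, 0.0],    # 20% stretch along x
                     [0.0, 0.9]])   # 10% compression along y

# Deformation gradient F maps reference edge vectors to deformed edge vectors.
dX = (ref[1:] - ref[0]).T       # reference edges as columns
dx = (deformed[1:] - deformed[0]).T
F = dx @ np.linalg.inv(dX)

# Green-Lagrange strain: E = 0.5 * (F^T F - I).
E = 0.5 * (F.T @ F - np.eye(2))
print(E)
```

Here E[0, 0] comes out positive (expansion along x) and E[1, 1] negative (compression along y), mirroring the checkerboard stretching and squashing visible in Figure 1.7; tMRI analyses perform this estimation densely across the tongue.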
Functional MRI (fMRI) is a method of MRI scanning that measures changes in blood flow in the brain. Increased blood flow reflects increased uptake of blood in the active region of the brain. Because blood is high in hydrogen, the local increase in activity can be measured by MRI. The premise is that spatially distinct, distributed areas of the brain are connected into functional networks organized to produce specific tasks. If these networks can be imaged, they can be mapped geographically to detail brain function during various behaviors. Increased neural activity in a region causes increased demand for oxygen, and that oxygen is brought to the region by blood. Replacement of deoxygenated blood with oxygenated blood produces a more uniform magnetic environment, which increases the longevity of the MRI signal. fMRI signals are snapshots that are collected at 10–20 Hz. Multislice recording of the brain and multiple repetitions can be recorded. Some limitations exist in the use of fMRI for speech research. First, the visible effects of the increase in oxygen occur some time after the event itself (0.5–8.0 seconds), which results in poor temporal resolution. Second, signals can vary even with no change in brain state. To overcome the latter problem, fMRI usually compares sets of images before and after the task. Despite these difficulties, research on speech and language is being done, including studies of cortical aspects of speech production (Gracco et al., 2005), speech perception (Specht et al., 2005; Pulvermuller et al., 2006; Uppenkamp et al., 2006), and speech disorders (Ackermann & Riecker, 2004; Bonilha et al., 2006). Figure 1.8 shows an axial fMRI scan of the brain during a speech task. The task is to think of as many words as possible that use a specific letter, in a 24-second period. The task is repeated several times and modeled to bring out the contrast between the on and off states. Active regions are circled.
Figure 1.8 Axial fMRI scan showing regions of the brain that are active while subject thinks of words. The white regions (circled) indicate activity in the left hemisphere.
(Photo courtesy of Rao Gullapalli, University of Maryland Medical School)
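The on-versus-off comparison underlying such maps can be sketched numerically. Real analyses fit a general linear model incorporating the delayed hemodynamic response; the toy Python example below (all voxel counts, signal values, and thresholds invented) uses only a difference of block means to show the basic logic:

```python
import numpy as np

rng = np.random.default_rng(2)

n_voxels = 1000
n_on, n_off = 30, 30

# Hypothetical BOLD signal: a common baseline, plus a small task-related
# increase in an "active" subset of voxels during the on blocks.
baseline = 100.0
active = np.zeros(n_voxels, dtype=bool)
active[:50] = True

off_scans = baseline + rng.standard_normal((n_off, n_voxels))
on_scans = baseline + rng.standard_normal((n_on, n_voxels))
on_scans[:, active] += 2.0  # assumed task-related signal increase

# Per-voxel on-minus-off contrast; voxels exceeding a threshold are
# declared "active" and would be highlighted on the anatomical image.
contrast = on_scans.mean(axis=0) - off_scans.mean(axis=0)
detected = contrast > 1.0

print(detected[:50].mean(), detected[50:].mean())
```

Averaging within each block suppresses the scan-to-scan variability the text mentions, which is why the task is repeated several times rather than imaged once.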
As with all instruments, MRI has several drawbacks. The first is the slow capture rate that results from summing the weak radio signals emitted by each proton. Thus, high-resolution images and DTI require long periods of immobility for a good image, and fMRI has a slow response time. Cine- and tMRI require summation of multiple, very precise repetitions for optimal images. A second drawback is the width of the section. Whereas CT sections are less than 1 mm wide, and ultrasound sections are less than 2 mm, MRI sections are usually 5–10 mm wide. A tomographic scan compresses a three-dimensional volume into two dimensions, which is like flattening a cylinder into a circle. For example, in a slice that is 5 mm wide, items that are actually 5 mm apart will appear to be in the same plane. Thus, in the transverse plane, the hyoid bone and epiglottis might appear in the same slice even though one is several millimeters below the other. Narrower MRI sections are possible, but require longer exposure time. A third drawback is that many subjects, as many as 30 percent, experience claustrophobia and cannot tolerate the procedure. Fourth, metal clamps, tooth crowns, and steel implants quench the signal, creating a diffuse dark spot surrounding the metal. A final drawback is that the subject must lie supine or prone, which changes the direction of gravity with respect to the oral structures and alters normal agonist–antagonist muscle relationships. Despite these nontrivial drawbacks, MRI’s strengths make it an important instrument in speech production research.
Ultrasound produces an image by using the reflective properties of sound waves. A piezoelectric crystal stimulated by an electric current emits an ultra-high-frequency sound wave. The crystal both emits the sound wave and receives the reflected echo. The sound wave travels through the soft tissue and reflects back when it reaches an interface with tissue of a different density, like bone, or when it reaches air. The best reflections are perpendicular to the beam (see Figure 1.9). In order to see a section of tissue rather than a single point, one needs an array transducer, in which up to 128 crystals fire sequentially, imaging a section of tissue that is rectangular or wedge-shaped. The image size is proportional to the size of the transducer and the frequency of the crystals; the wedge angle may be up to 140 degrees. The returning echoes are processed by an internal computer and displayed as a video image.
Figure 1.9 Schematic of ultrasound beam emitted from a transducer and reflected by surfaces of different angles. The best reflections are perpendicular to the beam. Soft tissue typically reflects and refracts sound in multiple directions.
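Echo timing is what turns these reflections into an image: the depth of a reflecting interface is the round-trip travel time multiplied by the speed of sound in soft tissue and divided by two (the pulse travels there and back). The short Python sketch below assumes the conventional average soft tissue sound speed of about 1540 m/s; the example echo time is invented:

```python
# Assumed average speed of sound in soft tissue (m/s).
SOUND_SPEED_TISSUE = 1540.0

def echo_depth_mm(round_trip_s: float) -> float:
    """Depth of the reflecting interface, in mm, from the round-trip echo time."""
    return SOUND_SPEED_TISSUE * round_trip_s / 2.0 * 1000.0

# An echo arriving 65 microseconds after the pulse leaves the transducer
# corresponds to an interface roughly 50 mm away.
depth = echo_depth_mm(65e-6)
print(depth)
```

This time-to-depth conversion, repeated for every crystal in the array, is what builds the wedge-shaped section displayed on screen.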
Figure 1.10 shows a sagittal image of the tongue in a 90-degree wedge-shaped scan. To create such an image, the transducer is placed below the chin and the beam passes upward through a 1.9 mm thick section of the tongue. When the sound reaches the air at the surface of the tongue, it reflects back, creating a bright white line. The black area immediately below is the tongue body. The tongue surface is the lower edge of the white line. Interfaces within the tongue are also visible. Figure 1.11 depicts the tongue in coronal section. Here the tongue surface is thinner in cross-section and contains a small midsagittal depression. Measurement error on such images is at most 1 pixel (Stone et al., 1988). Although ultrasound is typically used to study tongue motion, it has been used on occasion to study other structures, such as the lateral pharyngeal wall (Parush & Ostry, 1993; Miller & Watkin, 1997), or vocal folds (Munhall & Ostry, 1985; Ueda et al., 1993).
Figure 1.10 An ultrasound image of the sagittal (lengthwise) tongue. The white line is the upper surface of the tongue.
Figure 1.11 An ultrasound image of the coronal (crosswise) tongue. The upper surface has a midline groove.
The vocal folds may be imaged by placing the ultrasound transducer at the front of the neck, at the thyroid notch (Adam’s apple), and pointing it directly back in the transverse plane. Glottal stops, tumors, and slow-moving behaviors can be seen this way. However, the vocal folds have a very fast vibration rate, at least 80 Hz and usually much higher, while the sampling rate of the fastest ultrasound machines is about 90 Hz. Therefore, vibration may be seen during phonation, but it is undersampled and cannot be measured from individual frames. Other instruments, such as Electroglottography, are more accurate methods for measuring vocal fold vibration.
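The undersampling problem is a direct consequence of the Nyquist criterion: measuring a vibration rate faithfully requires sampling at more than twice that rate, and a vibration above half the frame rate aliases to a spurious lower frequency. A small Python sketch (the frame rate and vibration rates are chosen for illustration):

```python
def aliased_frequency(f_true: float, f_sample: float) -> float:
    """Apparent frequency when a sinusoid at f_true is sampled at f_sample."""
    return abs(f_true - round(f_true / f_sample) * f_sample)

fs = 90.0   # roughly the fastest ultrasound frame rate (Hz)

# A 120 Hz vocal fold vibration sampled at 90 Hz masquerades as 30 Hz,
# and an 80 Hz vibration appears as 10 Hz: motion is visible in the frames,
# but its true rate cannot be recovered from them.
print(aliased_frequency(120.0, fs), aliased_frequency(80.0, fs))
```

Electroglottography avoids this limit because it samples the glottal signal at audio rates, far above twice the vibration frequency.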
Ultrasound presents specific challenges when measuring the tongue. The first challenge is that up to 1 cm of the tongue tip may not be captured in the image, because the ultrasound beam is reflected at the floor of the mouth and the sound wave doesn’t enter the tongue tip. The tip may be imaged, however, if there is sufficient saliva in the mouth, if the tongue is resting against the floor of the mouth, or if the transducer is posterior and angled forward (Stone, 2005). The second limitation is the inability to see beyond a tissue–air or tissue–bone interface. Since the tissue–air interface at the tongue’s surface reflects the sound wave, structures on the far side of the vocal tract, such as the palate and pharyngeal wall, cannot be imaged. Similarly, when ultrasound reaches a bone, the curved shape refracts the sound wave, creating an acoustic shadow or dark area. Thus, the jaw and hyoid bones appear as shadows and their exact positions cannot be reliably measured.
Despite these limitations, a large number of studies successfully use real-time ultrasound to study tongue movements. Normal speech production and exploration of tongue surface features are the most common applications (cf. Davidson, 2006; Slud et al., 2002; Chiang et al., 2003). Ultrasound has also been the basis of tongue surface models in the sagittal plane (Green & Wang, 2003), the coronal plane (Slud et al., 2002), and in 3D (Watkin & Rubin, 1989; Stone & Lundberg, 1996; Yang & Stone, 2002; Bressmann et al., 2005b). Because ultrasound is well tolerated by subjects, it is well suited to the study of disorders (Bressmann et al., 2005a; Schmelzeisen et al., 1996) and to studies of children (Ueda et al., 1993). Ultrasound is also an excellent tool for studying swallowing because it is noninvasive and does not affect the swallow (Miller & Watkin, 1997; Chi-Fishman et al., 1998; Watkin, 1999; Peng et al., 2000; Soder & Miller, 2002). Moreover, it is now possible to align tongue position with the hard palate, if the transducer and head are held still (Epstein & Stone, 2005). This alignment allows better interpretation of the ultrasound data. Finally, applications of ultrasound to linguistics are increasing because portable machines allow ultrasound to be used in fieldwork (cf. Gick, 2002a). Ultrasound provides large quantities of time–motion data with a single repetition, thin slices, and a noninvasive method. Many machines are now digital and can collect 90 or more scans per second, which is fast enough to measure most tongue motions, though not fast enough to measure vocal fold vibration. A second advantage is that ultrasound involves no known biological hazards, since the transduction process involves only sound waves.
