AUTOMATIC SPEECH RECOGNITION and TRANSLATION for LOW-RESOURCE LANGUAGES
This book is a comprehensive exploration of the cutting-edge research, methodologies, and advancements in addressing the unique challenges associated with ASR and translation for low-resource languages. Automatic Speech Recognition and Translation for Low Resource Languages contains groundbreaking research from experts and researchers sharing innovative solutions to the language challenges of low-resource environments.
The book begins by delving into the fundamental concepts of ASR and translation, providing readers with a solid foundation for understanding the subsequent chapters. It then explores the intricacies of low-resource languages, analyzing the factors that contribute to their challenges and the significance of developing tailored solutions to overcome them.
The chapters span both the theoretical and practical aspects of ASR and translation for low-resource languages. The book discusses data augmentation techniques, transfer learning, and multilingual training approaches that leverage existing linguistic resources to improve accuracy and performance. Additionally, it investigates the possibilities offered by unsupervised and semi-supervised learning, as well as the benefits of active learning and crowdsourcing in enriching the training data.
Throughout the book, emphasis is placed on the importance of considering the cultural and linguistic context of low-resource languages, recognizing the unique nuances and intricacies that influence accurate ASR and translation. Furthermore, the book explores the potential impact of these technologies in domains such as healthcare, education, and commerce, empowering individuals and communities by breaking down language barriers.
Audience
The book targets researchers and professionals in the fields of natural language processing, computational linguistics, and speech technology. It will also be of interest to engineers, linguists, and individuals in industries and organizations working on cross-lingual communication, accessibility, and global connectivity.
Page count: 711
Year of publication: 2024
Cover
Table of Contents
Series Page
Title Page
Copyright Page
Dedication Page
Foreword
Preface
Acknowledgement
1 A Hybrid Deep Learning Model for Emotion Conversion in Tamil Language
1.1 Introduction
1.2 Dataset Collection and Database Preparation
1.3 Pre-Trained CNN Architectural Models
1.4 Proposed Method for Emotion Transformation
1.5 Synthesized Speech Evaluation
1.6 Conclusion
References
2 Attention-Based End-to-End Automatic Speech Recognition System for Vulnerable Individuals in Tamil
2.1 Introduction
2.2 Related Work
2.3 Dataset Description
2.4 Implementation
2.5 Results and Discussion
2.6 Conclusion
References
3 Speech-Based Dialect Identification for Tamil
3.1 Introduction
3.2 Literature Survey
3.3 Proposed Methodology
3.4 Experimental Setup and Results
3.5 Conclusion
References
4 Language Identification Using Speech Denoising Techniques: A Review
4.1 Introduction
4.2 Speech Denoising and Language Identification
4.3 The Noisy Speech Signal is Denoised Using Temporal and Spectral Processing
4.4 The Denoised Signal is Classified to Identify the Language Spoken Using Recent Machine Learning Algorithm
4.5 Conclusion
References
5 Domain Adaptation-Based Self-Supervised ASR Models for Low-Resource Target Domain
5.1 Introduction
5.2 Literature Survey
5.3 Dataset Description
5.4 Self-Supervised ASR Model
5.5 Domain Adaptation for Low-Resource Target Domain
5.6 Implementation of Domain Adaptation on wav2vec2 Model for Low-Resource Target Domain
5.7 Results Analysis
5.8 Conclusion
Acknowledgements
References
6 ASR Models from Conventional Statistical Models to Transformers and Transfer Learning
6.1 Introduction
6.2 Preprocessing
6.3 Feature Extraction
6.4 Generative Models for ASR
6.5 Discriminative Models for ASR
6.6 Deep Architectures for Low-Resource Languages
6.7 The DNN-HMM Hybrid System
6.8 Summary
References
7 Syllable-Level Morphological Segmentation of Kannada and Tulu Words
7.1 Introduction
7.2 Related Work
7.3 Corpus Construction and Annotation
7.4 Methodology
7.5 Experiments and Results
7.6 Conclusion and Future Work
References
8 A New Robust Deep Learning-Based Automatic Speech Recognition and Machine Translation Model for Tamil and Gujarati
8.1 Introduction
8.2 Literature Survey
8.3 Proposed Architecture
8.4 Experimental Setup
8.5 Results
8.6 Conclusion
References
9 Forensic Voice Comparison Approaches for Low-Resource Languages
9.1 Introduction
9.2 Challenges of Forensic Voice Comparison
9.3 Motivation
9.4 Review on Forensic Voice Comparison Approaches
9.5 Low-Resource Language Datasets
9.6 Applications of Forensic Voice Comparison
9.7 Future Research Scope
9.8 Conclusion
References
10 CoRePooL—Corpus for Resource-Poor Languages: Badaga Speech Corpus
10.1 Introduction
10.2 CoRePooL
10.3 Benchmarking
10.4 Conclusion
Acknowledgement
References
11 Bridging the Linguistic Gap: A Deep Learning-Based Image-to-Text Converter for Ancient Tamil with Web Interface
11.1 Introduction
11.2 The Historical Significance of Ancient Tamil Scripts
11.3 Realization Process
11.4 Dataset Preparation
11.5 Convolution Neural Network
11.6 Webpage with Multilingual Translator
11.7 Results and Discussions
11.8 Conclusion and Future Work
References
12 Voice Cloning for Low-Resource Languages: Investigating the Prospects for Tamil
12.1 Introduction
12.2 Literature Review
12.3 Dataset
12.4 Methodology
12.5 Results and Discussion
12.6 Conclusion
References
13 Transformer-Based Multilingual Automatic Speech Recognition (ASR) Model for Dravidian Languages
13.1 Introduction
13.2 Literature Review
13.3 Dataset Description
13.4 Methodology
13.5 Experimentation Results and Analysis
13.6 Conclusion
References
14 Language Detection Based on Audio for Indian Languages
14.1 Introduction
14.2 Literature Review
14.3 Language Detector System
14.4 Experiments and Outcomes
14.5 Conclusion
References
15 Strategies for Corpus Development for Low-Resource Languages: Insights from Nepal
15.1 Low-Resource Languages and the Constraints
15.2 Language Resources Map for the Languages of Nepal
15.3 Unicode Inception and Advent in Nepal
15.4 Speech and Translation Initiatives
15.5 Corpus Development Efforts—Sharing Our Experiences
15.6 Constraints to Competitive Language Technology Research for Nepali and Nepal’s Languages
15.7 Roadmap for the Future
15.8 Conclusion
References
16 Deep Neural Machine Translation (DNMT): Hybrid Deep Learning Architecture-Based English-to-Indian Language Translation
16.1 Introduction
16.2 Literature Survey
16.3 Background
16.4 Proposed System
16.5 Experimental Setup and Results Analysis
16.6 Conclusion and Future Work
References
17 Multiview Learning-Based Speech Recognition for Low-Resource Languages
17.1 Introduction
17.2 Approaches of Information Fusion in ASR
17.3 Partition-Based Multiview Learning
17.4 Data Augmentation Techniques
17.5 Conclusion
References
18 Automatic Speech Recognition Based on Improved Deep Learning
18.1 Introduction
18.2 Literature Review
18.3 Proposed Methodology
18.4 Results and Discussion
18.5 Conclusion
References
19 Comprehensive Analysis of State-of-the-Art Approaches for Speaker Diarization
19.1 Introduction
19.2 Generic Model of Speaker Diarization System
19.3 Review of Existing Speaker Diarization Techniques
19.4 Challenges
19.5 Applications
19.6 Conclusion
References
20 Spoken Language Translation in Low-Resource Language
20.1 Introduction
20.2 Related Work
20.3 MT Algorithms
20.4 Dataset Collection
20.5 Conclusion
References
Index
End User License Agreement
Chapter 1
Table 1.1 Objective evaluation using MCD.
Table 1.2 Subjective evaluation of some pretrained CNN architectural models us...
Table 1.3 Subjective evaluation of GMM, FFNN, and proposed model using MOS tes...
Table 1.4 Subjective evaluation of some pretrained CNN architectural models us...
Table 1.5 Subjective evaluation of GMM, FFNN, and proposed model using ABX tes...
Chapter 3
Table 3.1 Number of speech utterances collected.
Table 3.2 Classification report of dialect identification.
Table 3.3 Accuracy scores of GMMs calculated by varying the mixture components...
Chapter 5
Table 5.1 Results of E-content.
Table 5.2 Results of NPTEL.
Table 5.3 Results of E-Content and NPTEL.
Chapter 6
Table 6.1 An excerpt from a Malayalam speech corpus with word level transcript...
Table 6.2 Sample entries in a Malayalam pronunciation lexicon.
Table 6.3 WER of Malayalam ASR model created using CD-DNNHMM compared with var...
Chapter 7
Table 7.1 Sample English words and their segmentation at word level.
Table 7.2 Sample agglutinative words in Kannada and Tulu and their English tra...
Table 7.3 Sample Kannada and Tulu sandhied words with their stem and suffix mo...
Table 7.4 Sample Kannada and Tulu words, their syllables, and consonants and t...
Table 7.5 Sample Kannada compound words, words after removing the duplicate ve...
Table 7.6 Sample annotated Kannada and Tulu corpora.
Table 7.7 Hyperparameters and their values used in the proposed CRFSuite model...
Table 7.8 Features used in the proposed CRF model.
Table 7.9 Results of CRF model for MS of Kannada and Tulu words.
Chapter 9
Table 9.1 Literature review on forensic voice comparison for low-resource lang...
Table 9.2 Low-resource language dataset details [67].
Chapter 10
Table 10.1 CoRePooL v0.1.0: statistics.
Table 10.2 CoRePooL v0.1.0: variations.
Chapter 12
Table 12.1 Mean similarity opinion scores.
Chapter 13
Table 13.1 Proposed hyperparameters for Tamil and Telugu.
Table 13.2 Analysis of model parameters and WER for Tamil and Telugu.
Chapter 15
Table 15.1 Language family-wise distribution of speakers.
Table 15.2 State-wise distribution of language.
Table 15.3 State-wise official language.
Table 15.4 Groups of languages based on vitality.
Table 15.5 Interrelationship between language and ethnicity.
Table 15.6 Major scripts in Nepal.
Table 15.7 Research conducted by the Language Commission.
Table 15.8 Dataset details of NSC.
Table 15.9 Datasets for machine translation in languages of Nepal.
Chapter 16
Table 16.1 Outline data of the Samanatar corpus.
Table 16.2 IndicNLP corpus data.
Table 16.3 Data of Wikipedia articles utilized for preparing linguistic models...
Table 16.4 Corpus statistics for testing, training, and development.
Table 16.5 System requirement.
Table 16.6 BLEU results for English-to-Indian language conversion on Samananta...
Table 16.7 BLEU results for Indian-to-English language conversion on Samananta...
Table 16.8 Description of fluency and adequacy.
Table 16.9 An instance of imprudent conversion for Hindi-to-English.
Table 16.10 Outcomes of manual assessment.
Table 16.11 Excellence of conversion productivities.
Table 16.12 Some example Hindi-to-English translation outputs with adequacy an...
Table 16.13 Comparison of different language pairs using different evaluation ...
Table 16.14 Comparison of proposed DNMT model with Google Translate.
Chapter 17
Table 17.1 Possible approaches to handling low-resource language issues.
Chapter 1
Figure 1.1 Diagram of the five-layer feed-forward neural network model.
Figure 1.2 Block diagram of the proposed model.
Figure 1.3 Block diagram showing testing process.
Figure 1.4 MCD test for objective evaluation.
Figure 1.5 (a) Subjective evaluation of some pre-trained CNN architectural mod...
Figure 1.6 Subjective evaluation using ABX test.
Chapter 2
Figure 2.1 Sample data.
Figure 2.2 Attention-based encoder–decoder RNN network.
Figure 2.3 Working flow of SpeechBrain [4].
Figure 2.4 Architecture of SpeechBrain toolkit [4].
Figure 2.5 Proposed architecture.
Chapter 3
Figure 3.1 Block diagram of the dialect identification system.
Figure 3.2 MFCC features extraction steps.
Chapter 5
Figure 5.1 wav2vec2 Model architecture.
Figure 5.2 Sample audio file.
Figure 5.3 Feature encoder architecture.
Chapter 6
Figure 6.1 Speech recognition—human auditory system vs. ASR system.
Figure 6.2 Speech affected by different noises in each processing stage.
Figure 6.3 Framing.
Figure 6.4 Hamming window.
Figure 6.5 Mel filter bank.
Figure 6.6 Workflow of LPC.
Figure 6.7 Workflow of PLP.
Figure 6.8 Components of a statistical ASR system [25].
Figure 6.9 HMM states of a phoneme corresponding to feature vector X.
Figure 6.10 Types of SVMs.
Figure 6.11 (a) Data plot of vowel i and o, (b) hard margin applied by SVM cla...
Figure 6.12 Data plot of vowels i and e.
Figure 6.13 Introduced slack variable.
Figure 6.14 Multiclass, Malayalam monophthong short vowels.
Figure 6.15 Training time taken by each kernel function.
Figure 6.16 Encoder–decoder model with attention.
Figure 6.17 LSTM-based encoder-decoder with attention [53].
Figure 6.18 The transformer model architecture [54].
Figure 6.19 Architecture of shared hidden layer multilingual DNN.
Figure 6.20 CD-DNN-HMM architecture for ASR.
Chapter 7
Figure 7.1 Distribution of labels in the Kannada and Tulu datasets.
Figure 7.2 Framework of the proposed methodology.
Chapter 8
Figure 8.1 Proposed architecture.
Figure 8.2 Loss of model over the steps.
Figure 8.3 Result of model.
Figure 8.4 ROUGE evaluation of model.
Figure 8.5 Comparison between models.
Chapter 9
Figure 9.1 Forensic audio modulation graph [7].
Figure 9.2 Forensic voice comparison [8].
Figure 9.3 Depicts the forensic voice comparison (FVC) methodology.
Figure 9.4 Using mobile phone speech recordings as evidence in a court of law ...
Figure 9.5 Auditory and acoustic analysis of voice quality variations in norma...
Figure 9.6 Manual voice biometrics in law enforcement of audio forensics [16]....
Figure 9.7 Manual analysis through handwritten generated copies of suspect [17...
Figure 9.8 IKAR Lab 3: forensic audio suite [59].
Figure 9.9 Forensic audio analysis software SIS II [60].
Figure 9.10 Sound Cleaner II [61].
Chapter 10
Figure 10.1 Speech-to-text: evaluation loss and WER for fine-tuning Badaga.
Figure 10.2 Text-to-speech: evaluation loss.
Figure 10.3 Gender identification: evaluation loss and accuracy for fine-tunin...
Figure 10.4 Speaker identification: evaluation loss and accuracy for fine-tuni...
Figure 10.5 Epoch vs. BLEU score (in %).
Figure 10.6 Epoch vs. evaluation loss.
Chapter 11
Figure 11.1 Evolution of ancient Tamil script period-wise [8].
Figure 11.2 Process flow diagram.
Figure 11.3 Pre-processed Tamil characters from the period of the 9th century ...
Figure 11.4 Dataset augmentation technique output image.
Figure 11.5 Original inscription image [20].
Figure 11.6 Rotated image.
Figure 11.7 Pre-processed image.
Figure 11.8 Boxed image of ancient Tamil inscription.
Figure 11.9 CNN architecture.
Figure 11.10 Convolutional layer.
Figure 11.11 Pooling.
Figure 11.12 Web interface of ancient Tamil text translator.
Figure 11.13 Multilanguage choosing option.
Figure 11.14 Accuracy in graphical form.
Figure 11.15 Accuracy in table form.
Figure 11.16 Overall result of the ancient Tamil inscription multilingual tran...
Chapter 12
Figure 12.1 Flow diagram of methodology.
Figure 12.2 Mel spectrogram of original utterance and generated utterance.
Figure 12.3 Mel spectrogram of original utterance and generated utterance.
Chapter 13
Figure 13.1 Block schematic explaining our proposed methodology of transformer...
Figure 13.2 Block schematic explaining mel feature extraction.
Figure 13.3 Block schematic for model architecture.
Figure 13.4 Training loss for Tamil and Telugu.
Figure 13.5 Transcriptions produced by our model on various test cases.
Chapter 14
Figure 14.1 Neural network architecture.
Figure 14.2 Block diagram of (name of the diagram).
Figure 14.3 Data cleaning and preparation workflow.
Figure 14.4 WAV graphic for an audio file.
Figure 14.5 CNN architecture.
Figure 14.6 A glimpse of our unique CNN architecture.
Figure 14.7 Sample data.
Figure 14.8 Extracted features.
Figure 14.9 Features extracted from audio file.
Figure 14.10 Model compiling and fitting.
Figure 14.11 Predicting the class label.
Chapter 15
Figure 15.1 Number of languages and speakers.
Figure 15.2 Development timeline of language technologies in Nepal.
Figure 15.3 Speech corpus development process.
Figure 15.4 Iterative view of the short- and long-term goals.
Chapter 16
Figure 16.1 Structure of DNN.
Figure 16.2 Structure of CNN.
Figure 16.3 Structure of RNN.
Figure 16.4 Structure of DBN.
Figure 16.5 Structure of SAE.
Figure 16.6 Simplified architecture of machine translation.
Figure 16.7 Overview and architecture of the proposed DNMT model.
Figure 16.8 Neural network block diagram.
Figure 16.9 SAE-based encoder–decoder architecture of DNMT.
Figure 16.10 The structure of the DBN autoencoder.
Figure 16.11 Hybrid DNMT model—Working.
Figure 16.12 Experimental architecture [47].
Chapter 17
Figure 17.1 Challenges in low-resource languages.
Figure 17.2 Partition-based multiview learning.
Chapter 18
Figure 18.1 The process of the proposed automatic speech recognition using an ...
Figure 18.2 Process of MFCC.
Figure 18.3 Recurrent neural network model.
Figure 18.4 Comparison of recognition methods with precision.
Figure 18.5 Comparison of recall and recognition methods.
Figure 18.6 F-measures of the proposed and current speech recognition algorith...
Figure 18.7 Comparison of accuracy using recognition methods.
Chapter 19
Figure 19.1 Generic model of speaker diarization system.
Chapter 20
Figure 20.1 Tree structure of low-resource MT [17].
Figure 20.2 General architecture of end-to-end speech translation.
Figure 20.3 Layer freezing approach transformer model [33].
Figure 20.4 GAN and LAC MT system model [35].
Figure 20.5 Multi-pattern text filtering word2vec model for Uyghur language [3...
Figure 20.6 Universal MT LRL [41].
Scrivener Publishing
100 Cummings Center, Suite 541J
Beverly, MA 01915-6106
Publishers at Scrivener
Martin Scrivener ([email protected])
Phillip Carmical ([email protected])
Edited by
L. Ashok Kumar
PSG College of Technology, Coimbatore, India
D. Karthika Renuka
PSG College of Technology, Coimbatore, India
Bharathi Raja Chakravarthi
School of Computer Science, University of Galway, Ireland
and
Thomas Mandl
Institute for Information Science and Language Technology, University of Hildesheim, Germany
This edition first published 2024 by John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA and Scrivener Publishing LLC, 100 Cummings Center, Suite 541J, Beverly, MA 01915, USA
© 2024 Scrivener Publishing LLC
For more information about Scrivener publications please visit www.scrivenerpublishing.com.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.
Wiley Global Headquarters
111 River Street, Hoboken, NJ 07030, USA
For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.
Limit of Liability/Disclaimer of Warranty
While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials, or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read.
Library of Congress Cataloging-in-Publication Data
ISBN 978-1-394-21358-0
Cover image: Pixabay.com
Cover design by Russell Richardson
To my wife, Ms. Y. Uma Maheswari, and daughter, A. K. Sangamithra, for their constant support and love.
Dr. L. Ashok Kumar
To my family and friends who have been highly credible and a great source of inspiration and motivation.
Dr. D. Karthika Renuka
Dr. Bharathi Raja Chakravarthi would like to thank his students.
Dr. Bharathi Raja Chakravarthi
Recent advancements in Automatic Speech Recognition (ASR) and Machine Translation (MT) technologies have brought about a new era of hope and possibility for these low-resource languages. The convergence of cutting-edge research, powerful algorithms, and increased computational capacity has paved the way for groundbreaking applications that can revolutionize linguistic accessibility and inclusion.
This book stands as a testament to the transformative potential of ASR and MT technologies for marginalized languages. It brings together a diverse group of experts, researchers, and practitioners who have dedicated their efforts to addressing the unique challenges faced by low-resource languages and finding ways to overcome them with ASR and MT.
The chapters herein explore a wide range of topics related to ASR and MT for low-resource languages. The book delves into the theoretical foundations of ASR and MT, providing readers with a comprehensive understanding of the underlying principles and methodologies. It examines the technical intricacies and practical considerations of developing ASR and MT systems that are specifically tailored to low-resource languages, taking into account the scarcity of data and linguistic resources.
Moreover, this book sheds light on the potential applications of ASR and MT technologies beyond mere transcription and translation. It explores how these technologies can be harnessed to preserve endangered languages, facilitate cross-cultural communication, enhance educational resources, and empower marginalized communities. By offering real-world case studies, success stories, and lessons learned, the contributors provide invaluable insights into the impact of ASR and MT on low-resource languages and the people who speak them.
As you embark on this enlightening journey through the pages of this book, you will discover the tremendous potential of ASR and MT technologies to bridge the digital divide and empower low-resource languages. You will witness the strides made in linguistic accessibility and cultural preservation, and you will gain a deeper appreciation for the profound impact these technologies can have on societies, both large and small.
I extend my heartfelt appreciation to the editors and authors who have contributed their expertise, dedication, and passion to this volume. Their collective wisdom and tireless efforts have given rise to a comprehensive resource that will undoubtedly serve as a guiding light for researchers, practitioners, and policymakers committed to advancing the cause of linguistic diversity and inclusivity.
Together, let us embrace the power of ASR and MT technologies as instruments of empowerment and change. Let us work collaboratively to ensure that no language, no matter how small or remote, is left behind in the digital era. Through our collective endeavors, we can unleash the full potential of low-resource languages, fostering a world where linguistic diversity thrives, cultures flourish, and global understanding is truly within reach.
Sheng-Lung Peng
Dean, College of Innovative Design and Management, National Taipei University of Business, Creative Technologies and Product Design, Taiwan
In today’s interconnected world, effective communication across different languages is vital for fostering understanding, collaboration, and progress. However, language barriers pose significant challenges, particularly for languages that lack extensive linguistic resources and technological advancements. In this context, the field of Automatic Speech Recognition (ASR) and translation assumes paramount importance.
ASR and Translation for Low Resource Languages is a comprehensive exploration of the cutting-edge research, methodologies, and advancements in addressing the unique challenges associated with ASR and translation for low-resource languages. This book sheds light on the innovative approaches and techniques developed by researchers and practitioners to overcome the limitations imposed by scarce linguistic resources and data availability.
To start, the book delves into the fundamental concepts of ASR and translation, providing readers with a solid foundation for understanding the subsequent chapters. It then explores the intricacies of low-resource languages, analyzing the factors that contribute to their challenges and the significance of developing tailored solutions to overcome them.
The material contained herein encompasses a wide range of topics, spanning both the theoretical and practical aspects of ASR and translation for low-resource languages. The book discusses data augmentation techniques, transfer learning, and multilingual training approaches that leverage existing linguistic resources to improve accuracy and performance. Additionally, it investigates the possibilities offered by unsupervised and semi-supervised learning, as well as the benefits of active learning and crowdsourcing in enriching the training data.
Throughout the book, emphasis is placed on the importance of considering the cultural and linguistic context of low-resource languages, recognizing the unique nuances and intricacies that influence accurate ASR and translation. Furthermore, we explore the potential impact of these technologies in various domains, such as healthcare, education, and commerce, empowering individuals and communities by breaking down language barriers.
The editors of this book brought together experts, researchers, and enthusiasts from diverse fields to share their knowledge, experiences, and insights in ASR and translation for low-resource languages. We hope that this collaborative effort will contribute to the development of robust and efficient solutions, ultimately fostering inclusive communication and bridging the language divide. We invite readers to embark on this journey of discovery and innovation, gaining a deeper understanding of the challenges, opportunities, and breakthroughs in ASR and translation for low-resource languages. Together, let us pave the way towards a world where language is no longer a barrier, but a bridge that connects individuals, cultures, and ideas.
Dr. L. Ashok Kumar
Professor, PSG College of Technology, India
Dr. D. Karthika Renuka
Professor, PSG College of Technology, India
Dr. Bharathi Raja Chakravarthi
Assistant Professor/Lecturer above-the-Bar, School of Computer Science, University of Galway, Ireland
Dr. Thomas Mandl
Professor, Institute for Information Science and Language Technology, University of Hildesheim, Germany
We bow our heads before “The God Almighty,” who blessed us with the health and confidence to undertake and complete this book successfully. We express our sincere thanks to the Principal and Management of PSG College of Technology, the University of Galway, Ireland, and the University of Hildesheim, Germany, for their constant encouragement and support.
We thank our family and friends who always stood beside us and encouraged us to complete the book.
Dr. L. Ashok Kumar is thankful to his wife, Y. Uma Maheswari, for her constant support during writing. He is also grateful to his daughter, A. K. Sangamithra, for her support; it helped him a lot in completing this work.
Dr. D. Karthika Renuka would like to express gratitude to her parents, for their constant support. Her heartfelt thanks to her husband, Mr. R. Sathish Kumar, and her dear daughter, Ms. P. S. Preethi, for their unconditional love which made her capable of achieving all her goals.
Dr. Bharathi Raja Chakravarthi would like to thank his students.
Dr. Thomas Mandl would like to thank his parents and family as well as all his academic colleagues for their inspiration during cooperation.
We would like to acknowledge the help of all the people involved in this project. First, we would like to thank each of the authors for their contributions. Our sincere gratitude goes to the chapters’ authors, who contributed their time and expertise to this book, for their commitment to this endeavor and their timely response to our incessant requests for revisions.
Second, the editors wish to acknowledge the valuable contributions of the reviewers in improving the quality, coherence, and presentation of the chapters. Next, the editors would like to recognize the contributions of the editorial board in shaping the nature of the chapters in this book. In addition, we wish to thank the editorial staff at Wiley-Scrivener for their professional assistance and patience. Sincere thanks to each one of them.
Dr. L. Ashok Kumar
Professor, PSG College of Technology, India
Dr. D. Karthika Renuka
Professor, PSG College of Technology, India
Dr. Bharathi Raja Chakravarthi
Assistant Professor/Lecturer above-the-Bar, School of Computer Science, University of Galway, Ireland
Dr. Thomas Mandl
Professor, Institute for Information Science and Language Technology, University of Hildesheim, Germany
S. Suhasini1, B. Bharathi2* and Bharathi Raja Chakravarthi3
1Department of Computer Science and Engineering, R. M. D. Engineering College, Tamil Nadu, India
2Department of Computer Science and Engineering, Sri Sivasubramaniya Nadar College of Engineering, Tamil Nadu, India
3School of Computer Science, University of Galway, Galway, Ireland
The process of turning spoken language into written text is called automatic speech recognition (ASR). It is used in many settings and becomes a crucial tool as daily life is digitized. It is well known that it considerably improves the lives of the elderly and of people with disabilities. Minor dysarthria, or slurred speech, is common in elderly people and in those who are physically or mentally challenged, which leads to erroneous transcription of their speech. In this study, we propose a Tamil-language automatic speech recognition system for the elderly. The ASR system must be trained on elderly speakers’ utterances in order to improve its performance on elderly speech, yet no Tamil speech corpus of elderly speakers exists. We therefore recorded elderly and transgender individuals speaking Tamil on the spot. These utterances were gathered from people speaking in open spaces including markets, hospitals, and vegetable shops, and the speech corpus contains utterances from men, women, and transgender people. In this research, an attention-based, end-to-end paradigm is used to construct the ASR system. The proposed system includes two key steps: creating an acoustic model and a language model. A recurrent neural network architecture was used to construct the language model, and an attention-based encoder–decoder architecture was used to construct the acoustic model. The encoder combines a convolutional network with a recurrent network, and the decoder uses an attention-based gated recurrent unit. Word error rate (WER) is used to assess how well the proposed ASR system performs on elderly speech utterances. The outcomes are compared with several pre-trained transformer models. By pretraining a single model on the raw speech waveform in various languages, the pre-trained XLSR models learn cross-lingual speech representations; the Common Voice Tamil corpus is used to fine-tune these pre-trained models. According to the experiments, the proposed attention-based, end-to-end model performs noticeably better than the pre-trained transformer models.
Keywords: Automatic speech recognition (ASR), recurrent neural network (RNN), hidden Markov model (HMM), cross-lingual speech representations (XLSR), word error rate (WER), transformer model, encoder–decoder model
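Because the evaluation in this chapter rests on the word error rate (WER), a brief illustration may help: WER is the word-level edit distance between the reference transcript and the system hypothesis, normalized by the number of reference words. The following is a minimal, self-contained Python sketch of that computation, offered only as an illustration; the example sentences are invented, and established libraries such as jiwer provide equivalent functionality.

def wer(reference: str, hypothesis: str) -> float:
    # Word error rate = word-level edit distance / number of reference words.
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] holds the edit distance between ref[:i] and hyp[:j].
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution or match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Invented example: one substitution and one deletion against a
# four-word reference give WER = 2 / 4 = 0.5.
print(wer("the cat sat down", "the cat sits"))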
Recently, a variety of tasks, including image caption generation, handwriting synthesis, machine translation, and visual object categorization, have been successfully accomplished using attention-based recurrent networks. These models process their input iteratively, selecting the pertinent information at each step, and this fundamental concept allows end-to-end training techniques to be used to build networks with external memory. To apply this idea to speech recognition, we propose adjustments to attention-based recurrent networks. Learning to recognize speech can be viewed as learning to generate one sequence (the transcription) from another sequence (the speech). From this angle, attention-based methods have been found to work well for tasks such as machine translation and handwriting synthesis. Speech recognition, however, involves far longer input sequences than machine translation (thousands of frames as opposed to dozens of words), which makes it challenging to discriminate between similar speech fragments within a single utterance. The input sequence is also noisier and less clearly structured than in handwriting synthesis. These factors make speech recognition an appealing testbed for designing novel attention-based architectures that can process long and noisy inputs. Speech recognition remains an active research area, and attention-based models are needed to achieve completely end-to-end trainable systems. The most popular method is still based on an n-gram language model, a triphone hidden Markov model (HMM), and a deep neural acoustic model. To make these components operate together, manually created phoneme inventories and pronunciation lexicons are needed, as well as a multi-stage training process.
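To make the attention mechanism discussed above more concrete, the sketch below shows a single decoding step with content-based attention over encoded speech frames, written in PyTorch. It is an illustrative simplification with placeholder dimensions and module names, not the exact architecture proposed in this chapter.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionDecoderStep(nn.Module):
    """One decoding step with content-based attention over encoded frames.
    Illustrative sketch only; sizes and layers are placeholder choices."""

    def __init__(self, enc_dim=256, dec_dim=256, vocab_size=64):
        super().__init__()
        self.score = nn.Linear(enc_dim + dec_dim, 1)          # attention energy
        self.rnn = nn.GRUCell(enc_dim + vocab_size, dec_dim)  # decoder GRU
        self.out = nn.Linear(dec_dim, vocab_size)             # output symbols

    def forward(self, enc_states, dec_state, prev_token):
        # enc_states: (T, enc_dim)  encoded speech frames
        # dec_state:  (dec_dim,)    previous decoder hidden state
        # prev_token: (vocab_size,) embedding of the previously emitted symbol
        T = enc_states.size(0)
        query = dec_state.unsqueeze(0).expand(T, -1)
        energy = self.score(torch.cat([enc_states, query], dim=-1)).squeeze(-1)
        weights = F.softmax(energy, dim=0)                    # attention over frames
        context = (weights.unsqueeze(-1) * enc_states).sum(dim=0)
        new_state = self.rnn(
            torch.cat([context, prev_token], dim=-1).unsqueeze(0),
            dec_state.unsqueeze(0),
        ).squeeze(0)
        return self.out(new_state), new_state, weights

# Toy usage with random tensors (shapes only; nothing here is trained).
step = AttentionDecoderStep()
enc = torch.randn(100, 256)                                   # 100 encoded frames
logits, state, attn = step(enc, torch.zeros(256), torch.zeros(64))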
Some elderly persons attempt to obtain information from the internet using voice messages because they are not well-versed in technology [1]. An acoustic model must be created to handle these kinds of audio messages from elderly individuals; the model must recognize their speech and produce the corresponding output from the speech data [1, 5]. One effective approach to automatic speech recognition (ASR) has been the end-to-end (E2E) architecture. In this method, a single network, rather than the pronunciation dictionaries of traditional HMM-based systems [6], directly maps acoustic data into a sequence of characters or subwords [2, 12]. The attention mechanism aligns the input and output sequences deterministically, whereas connectionist temporal classification (CTC) and the recurrent neural network (RNN) transducer [8, 11] treat this alignment as a latent random variable for maximum a posteriori (MAP) inference [3]. In particular, if the discriminative objective function used for training is closely related to the error rate on phones, words, or sentences, the recognition accuracy of a Gaussian mixture model–hidden Markov model (GMM-HMM) system can be further enhanced by discriminative fine-tuning after it has been generatively trained to maximize its probability of generating the observed data [7]. A spell checker’s primary job is to find and fix grammatical errors, missing words, and incorrect words in text documents [9, 10]. In everyday situations, individuals can estimate a speaker’s age from their speech, which indicates that speech has acoustic features that are age-related; different age groups have different acoustic properties of speech [13, 15]. In this study, we built baseline acoustic models using a large amount of data in order to find more efficient ways to train acoustic models for elderly speech and thereby improve speech recognition results [17]. For those with limited hand movement or visual impairment, particularly the elderly, ASR is a useful modality [18, 19]. As part of the PaeLife project, which aimed to create a multimodal virtual personal assistant to support seniors in maintaining an active and social lifestyle, multilingual ASR was implemented [14, 16].
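As an illustration of the CTC criterion mentioned above, the snippet below evaluates PyTorch’s built-in CTC loss on frame-level log-probabilities against character targets. The tensor shapes, the 30-symbol character inventory, and the random inputs are arbitrary stand-ins chosen only to keep the example self-contained; they do not reflect the configuration used in this chapter.

import torch
import torch.nn as nn

# Placeholder sizes: 30 output symbols (index 0 reserved for the CTC blank),
# a batch of 4 utterances, 200 acoustic frames each, targets of up to 25 characters.
num_symbols, batch, frames, max_target_len = 30, 4, 200, 25

# In a real system the log-probabilities come from the acoustic encoder;
# random stand-ins are used here so the example runs on its own.
log_probs = torch.randn(frames, batch, num_symbols, requires_grad=True).log_softmax(dim=-1)
targets = torch.randint(1, num_symbols, (batch, max_target_len))      # avoid the blank index
input_lengths = torch.full((batch,), frames, dtype=torch.long)
target_lengths = torch.randint(10, max_target_len + 1, (batch,), dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()   # in training, gradients would flow back into the acoustic model
print(float(loss))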
Informal Tamil speech data were gathered from the elderly. The recorded speech of older people reveals how they converse in everyday settings, including vegetable shops, jewelry shops, transport areas, and patient wards. The speech data were provided by individuals of three genders (male, female, and transgender). The speech corpus comprises a total of 46 audio samples, amounting to the 6 h, 42 min of speech data that make up the corpus [20