Automatic Speech Recognition and Translation for Low Resource Languages

Description

This book is a comprehensive exploration of the cutting-edge research, methodologies, and advancements addressing the unique challenges associated with ASR and translation for low-resource languages. Automatic Speech Recognition and Translation for Low Resource Languages contains groundbreaking research from experts and researchers sharing innovative solutions that address language challenges in low-resource environments.

The book begins by delving into the fundamental concepts of ASR and translation, providing readers with a solid foundation for understanding the subsequent chapters. It then explores the intricacies of low-resource languages, analyzing the factors that contribute to their challenges and the significance of developing tailored solutions to overcome them. The chapters span both the theoretical and practical aspects of ASR and translation for low-resource languages. The book discusses data augmentation techniques, transfer learning, and multilingual training approaches that leverage existing linguistic resources to improve accuracy and performance. Additionally, it investigates the possibilities offered by unsupervised and semi-supervised learning, as well as the benefits of active learning and crowdsourcing in enriching the training data.

Throughout the book, emphasis is placed on the importance of considering the cultural and linguistic context of low-resource languages, recognizing the unique nuances and intricacies that influence accurate ASR and translation. Furthermore, the book explores the potential impact of these technologies in various domains, such as healthcare, education, and commerce, empowering individuals and communities by breaking down language barriers.

Audience

The book targets researchers and professionals in the fields of natural language processing, computational linguistics, and speech technology. It will also be of interest to engineers, linguists, and individuals in industries and organizations working on cross-lingual communication, accessibility, and global connectivity.




Table of Contents

Cover

Table of Contents

Series Page

Title Page

Copyright Page

Dedication Page

Foreword

Preface

Acknowledgement

1 A Hybrid Deep Learning Model for Emotion Conversion in Tamil Language

1.1 Introduction

1.2 Dataset Collection and Database Preparation

1.3 Pre-Trained CNN Architectural Models

1.4 Proposed Method for Emotion Transformation

1.5 Synthesized Speech Evaluation

1.6 Conclusion

References

2 Attention-Based End-to-End Automatic Speech Recognition System for Vulnerable Individuals in Tamil

2.1 Introduction

2.2 Related Work

2.3 Dataset Description

2.4 Implementation

2.5 Results and Discussion

2.6 Conclusion

References

3 Speech-Based Dialect Identification for Tamil

3.1 Introduction

3.2 Literature Survey

3.3 Proposed Methodology

3.4 Experimental Setup and Results

3.5 Conclusion

References

4 Language Identification Using Speech Denoising Techniques: A Review

4.1 Introduction

4.2 Speech Denoising and Language Identification

4.3 The Noisy Speech Signal is Denoised Using Temporal and Spectral Processing

4.4 The Denoised Signal is Classified to Identify the Language Spoken Using Recent Machine Learning Algorithm

4.5 Conclusion

References

5 Domain Adaptation-Based Self-Supervised ASR Models for Low-Resource Target Domain

5.1 Introduction

5.2 Literature Survey

5.3 Dataset Description

5.4 Self-Supervised ASR Model

5.5 Domain Adaptation for Low-Resource Target Domain

5.6 Implementation of Domain Adaptation on wav2vec2 Model for Low-Resource Target Domain

5.7 Results Analysis

5.8 Conclusion

Acknowledgements

References

6 ASR Models from Conventional Statistical Models to Transformers and Transfer Learning

6.1 Introduction

6.2 Preprocessing

6.3 Feature Extraction

6.4 Generative Models for ASR

6.5 Discriminative Models for ASR

6.6 Deep Architectures for Low-Resource Languages

6.7 The DNN-HMM Hybrid System

6.8 Summary

References

7 Syllable-Level Morphological Segmentation of Kannada and Tulu Words

7.1 Introduction

7.2 Related Work

7.3 Corpus Construction and Annotation

7.4 Methodology

7.5 Experiments and Results

7.6 Conclusion and Future Work

References

8 A New Robust Deep Learning-Based Automatic Speech Recognition and Machine Translation Model for Tamil and Gujarati

8.1 Introduction

8.2 Literature Survey

8.3 Proposed Architecture

8.4 Experimental Setup

8.5 Results

8.6 Conclusion

References

9 Forensic Voice Comparison Approaches for Low-Resource Languages

9.1 Introduction

9.2 Challenges of Forensic Voice Comparison

9.3 Motivation

9.4 Review on Forensic Voice Comparison Approaches

9.5 Low-Resource Language Datasets

9.6 Applications of Forensic Voice Comparison

9.7 Future Research Scope

9.8 Conclusion

References

10 CoRePooL—Corpus for Resource-Poor Languages: Badaga Speech Corpus

10.1 Introduction

10.2 CoRePooL

10.3 Benchmarking

10.4 Conclusion

Acknowledgement

References

11 Bridging the Linguistic Gap: A Deep Learning-Based Image-to-Text Converter for Ancient Tamil with Web Interface

11.1 Introduction

11.2 The Historical Significance of Ancient Tamil Scripts

11.3 Realization Process

11.4 Dataset Preparation

11.5 Convolution Neural Network

11.6 Webpage with Multilingual Translator

11.7 Results and Discussions

11.8 Conclusion and Future Work

References

12 Voice Cloning for Low-Resource Languages: Investigating the Prospects for Tamil

12.1 Introduction

12.2 Literature Review

12.3 Dataset

12.4 Methodology

12.5 Results and Discussion

12.6 Conclusion

References

13 Transformer-Based Multilingual Automatic Speech Recognition (ASR) Model for Dravidian Languages

13.1 Introduction

13.2 Literature Review

13.3 Dataset Description

13.4 Methodology

13.5 Experimentation Results and Analysis

13.6 Conclusion

References

14 Language Detection Based on Audio for Indian Languages

14.1 Introduction

14.2 Literature Review

14.3 Language Detector System

14.4 Experiments and Outcomes

14.5 Conclusion

References

15 Strategies for Corpus Development for Low-Resource Languages: Insights from Nepal

15.1 Low-Resource Languages and the Constraints

15.2 Language Resources Map for the Languages of Nepal

15.3 Unicode Inception and Advent in Nepal

15.4 Speech and Translation Initiatives

15.5 Corpus Development Efforts—Sharing Our Experiences

15.6 Constraints to Competitive Language Technology Research for Nepali and Nepal’s Languages

15.7 Roadmap for the Future

15.8 Conclusion

References

16 Deep Neural Machine Translation (DNMT): Hybrid Deep Learning Architecture-Based English-to-Indian Language Translation

16.1 Introduction

16.2 Literature Survey

16.3 Background

16.4 Proposed System

16.5 Experimental Setup and Results Analysis

16.6 Conclusion and Future Work

References

17 Multiview Learning-Based Speech Recognition for Low-Resource Languages

17.1 Introduction

17.2 Approaches of Information Fusion in ASR

17.3 Partition-Based Multiview Learning

17.4 Data Augmentation Techniques

17.5 Conclusion

References

18 Automatic Speech Recognition Based on Improved Deep Learning

18.1 Introduction

18.2 Literature Review

18.3 Proposed Methodology

18.4 Results and Discussion

18.5 Conclusion

References

19 Comprehensive Analysis of State-of-the-Art Approaches for Speaker Diarization

19.1 Introduction

19.2 Generic Model of Speaker Diarization System

19.3 Review of Existing Speaker Diarization Techniques

19.4 Challenges

19.5 Applications

19.6 Conclusion

References

20 Spoken Language Translation in Low-Resource Language

20.1 Introduction

20.2 Related Work

20.3 MT Algorithms

20.4 Dataset Collection

20.5 Conclusion

References

Index

End User License Agreement

List of Tables

Chapter 1

Table 1.1 Objective evaluation using MCD.

Table 1.2 Subjective evaluation of some pretrained CNN architectural models us...

Table 1.3 Subjective evaluation of GMM, FFNN, and proposed model using MOS tes...

Table 1.4 Subjective evaluation of some pretrained CNN architectural models us...

Table 1.5 Subjective evaluation of GMM, FFNN, and proposed model using ABX tes...

Chapter 3

Table 3.1 Number of speech utterances collected.

Table 3.2 Classification report of dialect identification.

Table 3.3 Accuracy scores of GMMs calculated by varying the mixture components...

Chapter 5

Table 5.1 Results of E-content.

Table 5.2 Results of NPTEL.

Table 5.3 Results of E-content and NPTEL.

Chapter 6

Table 6.1 An excerpt from a Malayalam speech corpus with word level transcript...

Table 6.2 Sample entries in a Malayalam pronunciation lexicon.

Table 6.3 WER of Malayalam ASR model created using CD-DNNHMM compared with var...

Chapter 7

Table 7.1 Sample English words and their segmentation at word level.

Table 7.2 Sample agglutinative words in Kannada and Tulu and their English tra...

Table 7.3 Sample Kannada and Tulu sandhied words with their stem and suffix mo...

Table 7.4 Sample Kannada and Tulu words, their syllables, and consonants and t...

Table 7.5 Sample Kannada compound words, words after removing the duplicate ve...

Table 7.6 Sample annotated Kannada and Tulu corpora.

Table 7.7 Hyperparameters and their values used in the proposed CRFSuite model...

Table 7.8 Features used in the proposed CRF model.

Table 7.9 Results of CRF model for MS of Kannada and Tulu words.

Chapter 9

Table 9.1 Literature review on forensic voice comparison for low-resource lang...

Table 9.2 Low-resource language dataset details [67].

Chapter 10

Table 10.1 CoRePooL v0.1.0: statistics.

Table 10.2 CoRePooL v0.1.0: variations.

Chapter 12

Table 12.1 Mean similarity opinion scores.

Chapter 13

Table 13.1 Proposed hyperparameters for Tamil and Telugu.

Table 13.2 Analysis of model parameters and WER for Tamil and Telugu.

Chapter 15

Table 15.1 Language family-wise distribution of speakers.

Table 15.2 State-wise distribution of language.

Table 15.3 State-wise official language.

Table 15.4 Groups of languages based on vitality.

Table 15.5 Interrelationship between language and ethnicity.

Table 15.6 Major scripts in Nepal.

Table 15.7 Research conducted by the Language Commission.

Table 15.8 Dataset details of NSC.

Table 15.9 Datasets for machine translation in languages of Nepal.

Chapter 16

Table 16.1 Outline data of the Samanatar corpus.

Table 16.2 IndicNLP corpus data.

Table 16.3 Data of Wikipedia articles utilized for preparing linguistic models...

Table 16.4 Corpus statistics for testing, training, and development.

Table 16.5 System requirement.

Table 16.6 BLEU results for English-to-Indian language conversion on Samananta...

Table 16.7 BLEU results for Indian-to-English language conversion on Samananta...

Table 16.8 Description of fluency and adequacy.

Table 16.9 An instance of imprudent conversion for Hindi-to-English.

Table 16.10 Outcomes of manual assessment.

Table 16.11 Excellence of conversion productivities.

Table 16.12 Some example Hindi-to-English translation outputs with adequacy an...

Table 16.13 Comparison of different language pairs using different evaluation ...

Table 16.14 Comparison of proposed DNMT model with google translate.

Chapter 17

Table 17.1 Possible approaches to handling low-resource language issues.

List of Illustrations

Chapter 1

Figure 1.1 Diagram of the five-layer feed-forward neural network model.

Figure 1.2 Block diagram of the proposed model.

Figure 1.3 Block diagram showing testing process.

Figure 1.4 MCD test for objective evaluation.

Figure 1.5 (a) Subjective evaluation of some pre-trained CNN architectural mod...

Figure 1.6 Subjective evaluation using ABX test.

Chapter 2

Figure 2.1 Sample data.

Figure 2.2 Attention-based encoder–decoder RNN network.

Figure 2.3 Working flow of SpeechBrain [4].

Figure 2.4 Architecture of SpeechBrain toolkit [4].

Figure 2.5 Proposed architecture.

Chapter 3

Figure 3.1 Block diagram of the dialect identification system.

Figure 3.2 MFCC features extraction steps.

Chapter 5

Figure 5.1 wav2vec2 Model architecture.

Figure 5.2 Sample audio file.

Figure 5.3 Feature encoder architecture.

Chapter 6

Figure 6.1 Speech recognition—human auditory system vs. ASR system.

Figure 6.2 Speech affected by different noises in each processing stage.

Figure 6.3 Framing.

Figure 6.4 Hamming window.

Figure 6.5 Mel filter bank.

Figure 6.6 Workflow of LPC.

Figure 6.7 Workflow of PLP.

Figure 6.8 Components of a statistical ASR system [25].

Figure 6.9 HMM states of a phoneme corresponding to feature vector X.

Figure 6.10 Types of SVMs.

Figure 6.11 (a) Data plot of vowel i and o, (b) hard margin applied by SVM cla...

Figure 6.12 Data plot of vowels i and e.

Figure 6.13 Introduced slack variable.

Figure 6.14 Multiclass, Malayalam monophthong short vowels.

Figure 6.15 Training time taken by each kernel function.

Figure 6.16 Encoder–decoder model with attention.

Figure 6.17 LSTM-based encoder-decoder with attention [53].

Figure 6.18 The transformer model architecture [54].

Figure 6.19 Architecture of shared hidden layer multilingual DNN.

Figure 6.20 CD-DNN-HMM architecture for ASR.

Chapter 7

Figure 7.1 Distribution of labels in the Kannada and Tulu datasets.

Figure 7.2 Framework of the proposed methodology.

Chapter 8

Figure 8.1 Proposed architecture.

Figure 8.2 Loss of model over the steps.

Figure 8.3 Result of model.

Figure 8.4 ROUGE evaluation of model.

Figure 8.5 Comparison between models.

Chapter 9

Figure 9.1 Forensic audio modulation graph [7].

Figure 9.2 Forensic voice comparison [8].

Figure 9.3 Depicts the forensic voice comparison (FVC) methodology.

Figure 9.4 Using mobile phone speech recordings as evidence in a court of law ...

Figure 9.5 Auditory and acoustic analysis of voice quality variations in norma...

Figure 9.6 Manual voice biometrics in law enforcement of audio forensics [16]....

Figure 9.7 Manual analysis through handwritten generated copies of suspect [17...

Figure 9.8 IKAR Lab 3: forensic audio suite [59].

Figure 9.9 Forensic audio analysis software SIS II [60].

Figure 9.10 Sound Cleaner II [61].

Chapter 10

Figure 10.1 Speech-to-text: evaluation loss and WER for fine-tuning Badaga.

Figure 10.2 Text-to-speech: evaluation loss.

Figure 10.3 Gender identification: evaluation loss and accuracy for fine-tunin...

Figure 10.4 Speaker identification: evaluation loss and accuracy for fine-tuni...

Figure 10.5 Epoch vs. BLEU score (in %).

Figure 10.6 Epoch vs. evaluation loss.

Chapter 11

Figure 11.1 Evolution of ancient Tamil script period-wise [8].

Figure 11.2 Process flow diagram.

Figure 11.3 Pre-processed Tamil characters from the period of the 9th century ...

Figure 11.4 Dataset augmentation technique output image.

Figure 11.5 Original inscription image [20].

Figure 11.6 Rotated image.

Figure 11.7 Pre-processed image.

Figure 11.8 Boxed image of ancient Tamil inscription.

Figure 11.9 CNN architecture.

Figure 11.10 Convolutional layer.

Figure 11.11 Pooling.

Figure 11.12 Web interface of ancient Tamil text translator.

Figure 11.13 Multilanguage choosing option.

Figure 11.14 Accuracy in graphical form.

Figure 11.15 Accuracy in table form.

Figure 11.16 Overall result of the ancient Tamil inscription multilingual tran...

Chapter 12

Figure 12.1 Flow diagram of methodology.

Figure 12.2 Mel spectrogram of original utterance and generated utterance.

Figure 12.3 Mel spectrogram of original utterance and generated utterance.

Chapter 13

Figure 13.1 Block schematic explaining our proposed methodology of transformer...

Figure 13.2 Block schematic explaining mel feature extraction.

Figure 13.3 Block schematic for model architecture.

Figure 13.4 Training loss for Tamil and Telugu.

Figure 13.5 Transcriptions produced by our model on various test cases.

Chapter 14

Figure 14.1 Neural network architecture.

Figure 14.2 Block diagram of (name of the diagram).

Figure 14.3 Data cleaning and preparation workflow.

Figure 14.4 WAV graphic for an audio file.

Figure 14.5 CNN architecture.

Figure 14.6 A glimpse of our unique CNN architecture.

Figure 14.7 Sample data.

Figure 14.8 Extracted features.

Figure 14.9 Features extracted from audio file.

Figure 14.10 Model compiling and fitting.

Figure 14.11 Predicting the class label.

Chapter 15

Figure 15.1 Number of languages and speakers.

Figure 15.2 Development timeline of language technologies in Nepal.

Figure 15.3 Speech corpus development process.

Figure 15.4 Iterative view of the short- and long-term goals.

Chapter 16

Figure 16.1 Structure of DNN.

Figure 16.2 Structure of CNN.

Figure 16.3 Structure of RNN.

Figure 16.4 Structure of DBN.

Figure 16.5 Structure of SAE.

Figure 16.6 Simplified architecture of machine translation.

Figure 16.7 Overview and architecture of the proposed DNMT model.

Figure 16.8 Neural network block diagram.

Figure 16.9 SAE-based encoder–decoder architecture of DNMT.

Figure 16.10 The structure of the DBN autoencoder.

Figure 16.11 Hybrid DNMT model—Working.

Figure 16.12 Experimental architecture [47].

Chapter 17

Figure 17.1 Challenges in low-resource languages.

Figure 17.2 Partition-based multiview learning.

Chapter 18

Figure 18.1 The process of the proposed automatic speech recognition using an ...

Figure 18.2 Process of MFCC.

Figure 18.3 Recurrent neural network model.

Figure 18.4 Comparison of recognition methods with precision.

Figure 18.5 Comparison of recall and recognition methods.

Figure 18.6 F-measures of the proposed and current speech recognition algorith...

Figure 18.7 Comparison of accuracy using recognition methods.

Chapter 19

Figure 19.1 Generic model of speaker diarization system.

Chapter 20

Figure 20.1 Tree structure of low-resource MT [17].

Figure 20.2 General architecture of end-to-end speech translation.

Figure 20.3 Layer freezing approach transformer model [33].

Figure 20.4 GAN and LAC MT system model [35].

Figure 20.5 Multi-pattern text filtering word2vec model for Uyghur language [3...

Figure 20.6 Universal MT LRL [41].

Scrivener Publishing, 100 Cummings Center, Suite 541J, Beverly, MA 01915-6106

Publishers at Scrivener: Martin Scrivener ([email protected]) and Phillip Carmical ([email protected])

Automatic Speech Recognition and Translation for Low Resource Languages

Edited by

L. Ashok Kumar

PSG College of Technology, Coimbatore, India

D. Karthika Renuka

PSG College of Technology, Coimbatore, India

Bharathi Raja Chakravarthi

School of Computer Science, University of Galway, Ireland

and

Thomas Mandl

Institute for Information Science and Language Technology, University of Hildesheim, Germany

This edition first published 2024 by John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA and Scrivener Publishing LLC, 100 Cummings Center, Suite 541J, Beverly, MA 01915, USA

© 2024 Scrivener Publishing LLC

For more information about Scrivener publications please visit www.scrivenerpublishing.com.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.

Wiley Global Headquarters, 111 River Street, Hoboken, NJ 07030, USA

For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.

Limit of Liability/Disclaimer of Warranty

While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials, or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read.

Library of Congress Cataloging-in-Publication Data

ISBN 978-1-394-21358-0

Cover image: Pixabay.com
Cover design by Russell Richardson

Dedication

To my wife, Ms. Y. Uma Maheswari, and daughter, A. K. Sangamithra, for their constant support and love.

Dr. L. Ashok Kumar

To my family and friends who have been highly credible and a great source of inspiration and motivation.

Dr. D. Karthika Renuka

Dr. Bharathi Raja Chakravarthi would like to thank his students.

Dr. Bharathi Raja Chakravarthi

Foreword

Recent advancements in Automatic Speech Recognition (ASR) and Machine Translation (MT) technologies have brought about a new era of hope and possibility for low-resource languages. The convergence of cutting-edge research, powerful algorithms, and increased computational capacity has paved the way for groundbreaking applications that can revolutionize linguistic accessibility and inclusion.

This book stands as a testament to the transformative potential of ASR and MT technologies for marginalized languages. It brings together a diverse group of experts, researchers, and practitioners who have dedicated their efforts to addressing the unique challenges faced by low-resource languages and finding ways to overcome them with ASR and MT.

The chapters herein explore a wide range of topics related to ASR and MT for low-resource languages. The book delves into the theoretical foundations of ASR and MT, providing readers with a comprehensive understanding of the underlying principles and methodologies. It examines the technical intricacies and practical considerations of developing ASR and MT systems that are specifically tailored to low-resource languages, taking into account the scarcity of data and linguistic resources.

Moreover, this book sheds light on the potential applications of ASR and MT technologies beyond mere transcription and translation. It explores how these technologies can be harnessed to preserve endangered languages, facilitate cross-cultural communication, enhance educational resources, and empower marginalized communities. By offering real-world case studies, success stories, and lessons learned, the contributors provide invaluable insights into the impact of ASR and MT on low-resource languages and the people who speak them.

As you embark on this enlightening journey through the pages of this book, you will discover the tremendous potential of ASR and MT technologies to bridge the digital divide and empower low-resource languages. You will witness the strides made in linguistic accessibility and cultural preservation, and you will gain a deeper appreciation for the profound impact these technologies can have on societies, both large and small.

I extend my heartfelt appreciation to the editors and authors who have contributed their expertise, dedication, and passion to this volume. Their collective wisdom and tireless efforts have given rise to a comprehensive resource that will undoubtedly serve as a guiding light for researchers, practitioners, and policymakers committed to advancing the cause of linguistic diversity and inclusivity.

Together, let us embrace the power of ASR and MT technologies as instruments of empowerment and change. Let us work collaboratively to ensure that no language, no matter how small or remote, is left behind in the digital era. Through our collective endeavors, we can unleash the full potential of low-resource languages, fostering a world where linguistic diversity thrives, cultures flourish, and global understanding is truly within reach.

Sheng-Lung Peng

Dean, College of Innovative Design and Management, National Taipei University of Business, Creative Technologies and Product Design, Taiwan

Preface

In today’s interconnected world, effective communication across different languages is vital for fostering understanding, collaboration, and progress. However, language barriers pose significant challenges, particularly for languages that lack extensive linguistic resources and technological advancements. In this context, the field of Automatic Speech Recognition (ASR) and translation assumes paramount importance.

ASR and Translation for Low Resource Languages is a comprehensive exploration into the cutting-edge research, methodologies, and advancements in addressing the unique challenges associated with ASR and translation for low-resource languages. This book sheds light on the innovative approaches and techniques developed by researchers and practitioners to overcome the limitations imposed by scarce linguistic resources and data availability.

To start, the book delves into the fundamental concepts of ASR and translation, providing readers with a solid foundation for understanding the subsequent chapters. Then it explores the intricacies of low-resource languages, analyzing the factors that contribute to their challenges and the significance of developing tailored solutions to overcome them.

The material contained herein encompasses a wide range of topics, spanning both the theoretical and practical aspects of ASR and translation for low-resource languages. The book discusses data augmentation techniques, transfer learning, and multilingual training approaches that leverage the power of existing linguistic resources to improve accuracy and performance. Additionally, it investigates the possibilities offered by unsupervised and semi-supervised learning, as well as the benefits of active learning and crowdsourcing in enriching the training data.

Throughout the book, emphasis is placed on the importance of considering the cultural and linguistic context of low-resource languages, recognizing the unique nuances and intricacies that influence accurate ASR and translation. Furthermore, we explore the potential impact of these technologies in various domains, such as healthcare, education, and commerce, empowering individuals and communities by breaking down language barriers.

The editors of this book brought together experts, researchers, and enthusiasts from diverse fields to share their knowledge, experiences, and insights in ASR and translation for low-resource languages. We hope that this collaborative effort will contribute to the development of robust and efficient solutions, ultimately fostering inclusive communication and bridging the language divide. We invite readers to embark on this journey of discovery and innovation, gaining a deeper understanding of the challenges, opportunities, and breakthroughs in ASR and translation for low-resource languages. Together, let us pave the way towards a world where language is no longer a barrier, but a bridge that connects individuals, cultures, and ideas.

Dr. L. Ashok Kumar

Professor, PSG College of Technology, India

Dr. D. Karthika Renuka

Professor, PSG College of Technology, India

Dr. Bharathi Raja Chakravarthi

Assistant Professor/Lecturer above-the-Bar, School of Computer Science, University of Galway, Ireland

Dr. Thomas Mandl

Professor, Institute for Information Science and Language Technology, University of Hildesheim, Germany

Acknowledgement

We bow our heads before “The God Almighty,” who blessed us with the health and confidence to undertake and complete this book successfully. We express our sincere thanks to the Principal and Management of PSG College of Technology, the University of Galway, Ireland, and the University of Hildesheim, Germany, for their constant encouragement and support.

We thank our family and friends who always stood beside us and encouraged us to complete the book.

Dr. L. Ashok Kumar is thankful to his wife, Y. Uma Maheswari, for her constant support during writing. He is also grateful to his daughter, A. K. Sangamithra, for her support; it helped him a lot in completing this work.

Dr. D. Karthika Renuka would like to express gratitude to her parents, for their constant support. Her heartfelt thanks to her husband, Mr. R. Sathish Kumar, and her dear daughter, Ms. P. S. Preethi, for their unconditional love which made her capable of achieving all her goals.

Dr. Bharathi Raja Chakravarthi would like to thank his students.

Dr. Thomas Mandl would like to thank his parents and family as well as all his academic colleagues for their inspiration during cooperation.

We would like to acknowledge the help of all the people involved in this project. First, our sincere gratitude goes to the chapters’ authors, who contributed their time and expertise to this book; we thank them for their commitment to this endeavor and their timely responses to our incessant requests for revisions.

Second, the editors wish to acknowledge the valuable contributions of the reviewers in improving the quality, coherence, and presentation of the chapters. Next, the editors would like to recognize the contributions of the editorial board in shaping the nature of the chapters in this book. In addition, we wish to thank the editorial staff at Wiley-Scrivener for their professional assistance and patience. Sincere thanks to each one of them.

Dr. L. Ashok Kumar

Professor, PSG College of Technology, India

Dr. D. Karthika Renuka

Professor, PSG College of Technology, India

Dr. Bharathi Raja Chakravarthi

Assistant Professor/Lecturer above-the-Bar, School of Computer Science, University of Galway, Ireland

Dr. Thomas Mandl

Professor, Institute for Information Science and Language Technology, University of Hildesheim, Germany

2 Attention-Based End-to-End Automatic Speech Recognition System for Vulnerable Individuals in Tamil

S. Suhasini1, B. Bharathi2* and Bharathi Raja Chakravarthi3

1Department of Computer Science and Engineering, R. M. D. Engineering College, Tamil Nadu, India

2Department of Computer Science and Engineering, Sri Sivasubramaniya Nadar College of Engineering, Tamil Nadu, India

3School of Computer Science, University of Galway, Galway, Ireland

Abstract

The process of turning spoken language into written text is called automatic speech recognition (ASR). It is used in many settings and becomes a crucial tool as daily life is digitized. It is well known that it considerably improves the lives of the elderly and people with disabilities. Minor dysarthria, or slurred speech, is common in elderly people and those who are physically or mentally challenged, and it leads to erroneous transcription of their speech. In this study, we propose a Tamil-language automatic speech recognition system for the elderly. The ASR system must be trained on elderly people’s speech utterances in order to improve its performance when processing elderly speech, but no Tamil speech corpus of elderly speakers exists. We therefore recorded elderly and transgender individuals speaking Tamil spontaneously. These utterances were gathered from people speaking in open spaces, including markets, hospitals, and vegetable shops, and the speech corpus contains utterances from men, women, and transgender people. In this research, an attention-based, end-to-end paradigm is used to construct the ASR system. The proposed system includes two key steps: creating an acoustic model and a language model. A recurrent neural network architecture was used to construct the language model, and an attention-based encoder–decoder architecture was used to construct the acoustic model. The encoder combines a convolutional network with a recurrent network, and the decoder uses an attention-based gated recurrent unit. Word error rate (WER) is used to assess how well the proposed ASR system performs on elderly speech utterances. The outcomes are compared with several pre-trained transformer models. By pretraining a single model on the raw waveform of speech in various languages, the pre-trained XLSR models learn cross-lingual speech representations. The Common Voice Tamil speech corpus is used to fine-tune the pre-trained models. According to the experiments, the proposed attention-based, end-to-end model performs noticeably better than the pre-trained transformer models.
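For reference, the word error rate (WER) used for evaluation in this chapter is the edit distance between the reference and hypothesis word sequences, normalized by the reference length. The short Python sketch below illustrates the calculation; the example strings are hypothetical and are not drawn from the corpus described in this chapter.

def word_error_rate(reference: str, hypothesis: str) -> float:
    # Levenshtein distance over words, normalized by the reference length.
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # cost of deleting the first i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j          # cost of inserting the first j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Hypothetical example: one substitution and one deletion over six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))  # ~0.33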

Keywords: Automatic speech recognition (ASR), recurrent neural network (RNN), hidden Markov model (HMM), cross-lingual speech representations (XLSR), word error rate (WER), transformer model, encoder–decoder model

2.1 Introduction

Recently, a variety of tasks, including image caption generation, handwriting synthesis, machine translation, and visual object categorization, have been successfully accomplished using attention-based recurrent networks. These models process their input iteratively, attending to the pertinent parts at each step. Building on this fundamental concept, end-to-end training techniques can now be used to build networks with external memory. To facilitate speech recognition, we propose adjustments to attention-based recurrent networks. Learning to recognize speech can be viewed as learning to generate one sequence (the transcription) from another sequence (the speech). From this angle, attention-based methods have been found to work well for tasks such as machine translation and handwriting synthesis. However, speech recognition requires far longer input sequences than machine translation (thousands of frames as opposed to dozens of words), which makes it challenging to discriminate between similar speech fragments within a single utterance. The input is also noisier and less clearly structured than in handwriting synthesis. These factors make speech recognition an appealing testbed for designing novel attention-based architectures that can handle long and noisy inputs. Speech recognition remains an active research area, and attention-based models are needed to achieve completely end-to-end trainable systems. The most popular method is still based on an n-gram language model, a triphone hidden Markov model (HMM), and a deep neural acoustic model. To make these components operate together, manually created phoneme dictionaries and pronunciation lexicons are needed, as well as a multi-stage training process.
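To make the attention mechanism discussed above concrete, the sketch below shows content-based (additive) attention in PyTorch: at each decoder step, every encoder frame receives a score, the scores are normalized with a softmax, and the resulting weights are used to form a context vector. This is a minimal illustration under assumed layer sizes, not the configuration of the proposed system.

import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    # Content-based (Bahdanau-style) attention over encoder output frames.
    def __init__(self, enc_dim: int, dec_dim: int, attn_dim: int) -> None:
        super().__init__()
        self.enc_proj = nn.Linear(enc_dim, attn_dim)
        self.dec_proj = nn.Linear(dec_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, enc_out: torch.Tensor, dec_state: torch.Tensor):
        # enc_out: (batch, time, enc_dim); dec_state: (batch, dec_dim)
        energy = torch.tanh(self.enc_proj(enc_out) + self.dec_proj(dec_state).unsqueeze(1))
        weights = torch.softmax(self.score(energy).squeeze(-1), dim=-1)   # (batch, time)
        context = torch.bmm(weights.unsqueeze(1), enc_out).squeeze(1)     # (batch, enc_dim)
        return context, weights

# Toy usage with assumed dimensions: 100 encoder frames of size 80 and a 256-dim decoder state.
attention = AdditiveAttention(enc_dim=80, dec_dim=256, attn_dim=128)
context, weights = attention(torch.randn(2, 100, 80), torch.randn(2, 256))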

2.2 Related Work

Some elderly persons attempt to retrieve information from the internet using audio messages because they are not well-versed in technology [1]. An acoustic model must be created to handle these types of audio messages from elderly individuals; the model identifies their speech and extracts the output of the speech data [1, 5]. One effective approach for automatic speech recognition (ASR) has been the end-to-end (E2E) architecture. In this method, a single network directly maps acoustic data into a sequence of characters or subwords [2, 12], instead of relying on the pronunciation dictionaries of traditional HMM-based systems [6]. In attention-based models, the input and output sequences are aligned by the attention mechanism, whereas connectionist temporal classification (CTC) and the recurrent neural network (RNN) transducer [8, 11] interpret this alignment as a latent random variable for maximum a posteriori (MAP) inference [3]. In particular, if the discriminative objective function used for training is closely related to the error rate on phones, words, or sentences, the recognition accuracy of a Gaussian mixture model–hidden Markov model (GMM-HMM) system can be further enhanced by discriminative fine-tuning after it has been generatively trained to maximize its probability of generating the observed data [7]. A spell checker’s primary job is to find and fix grammatical errors, missing words, and incorrect words in text documents [9, 10]. In everyday situations, individuals can estimate a speaker’s age from their speech, which indicates that speech carries age-related acoustic features; different age groups have different acoustic speech properties [13, 15]. In this study, baseline acoustic models were built using large amounts of data in order to find more efficient ways to train acoustic models for geriatric speech and thereby improve speech recognition results [17]. For those with limited hand movement or eye impairment, particularly the elderly, ASR is a useful modality [18, 19]. As part of the PaeLife project, which aimed to create a multimodal virtual personal assistant to support seniors in maintaining an active and social lifestyle, multilingual ASR was implemented [14, 16].
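Pre-trained XLSR transformer models of the kind compared against in this chapter are typically applied through greedy CTC decoding. The sketch below, using the Hugging Face transformers and soundfile libraries, shows the general pattern; the checkpoint identifier and audio path are placeholders rather than the exact resources used in this work, and 16 kHz mono audio is assumed.

import soundfile as sf
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

MODEL_ID = "your-org/wav2vec2-xlsr-53-tamil"   # placeholder checkpoint name, not a specific model from this chapter

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
model.eval()

speech, sample_rate = sf.read("elderly_utterance.wav")   # placeholder path; assumed 16 kHz mono
inputs = processor(speech, sampling_rate=sample_rate, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits            # (batch, frames, vocabulary)

predicted_ids = torch.argmax(logits, dim=-1)              # greedy CTC decoding
transcription = processor.batch_decode(predicted_ids)[0]
print(transcription)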

2.3 Dataset Description

Informal Tamil speech data was gathered from elderly speakers. The recorded speech reveals how older people converse in everyday settings such as the vegetable shop, jewel shop, transport area, and patient wards. The speech data was provided by individuals of three genders (male, female, and transgender). The speech corpus comprises a total of 46 audio samples, amounting to 6 h, 42 min of speech data [20