About the Book
India has a rich grammatical tradition, still extant in the form of Pāṇini's grammar as well as the theories of verbal cognition. These two together provide a formal theory of language communication. The formal nature of the theory makes it directly relevant to the new technology called Natural Language Processing.
This book first presents the key concepts from the Indian Grammatical Tradition (IGT) that are necessary for understanding the information flow in a language string and its dynamics. A fresh look at these concepts from the perspective of Natural Language Processing is provided. This is then followed by a concrete application: building a parser for Sanskrit using the framework of the Indian Grammatical Tradition.
This book not only documents the salient pieces of work carried out over the last quarter century under Computational Paninian Grammar, but also provides the first comprehensive exposition of the ideas involved. It fills a gap for students of Computational Linguistics/Natural Language Processing who work on Indian languages using the Pāṇinian Grammatical Framework for developing their computational models and do not have direct access to the texts in Sanskrit.
Similarly, for Sanskrit scholars and students it provides an example of a concrete application of the Indian theories to a contemporary problem.
About the Author
Amba Kulkarni is a computational linguist. Since 1991 she has been engaged in showing the relevance of the Indian Grammatical Tradition to the field of computational linguistics. She has contributed towards the building of Anusaarakas (language accessors) among English and Indian languages. She is the founder-head of the Department of Sanskrit Studies, University of Hyderabad, established in 2006. Since then her research has focused on the use of Indian grammatical theories for the computational processing of Sanskrit texts. Under her leadership, a consortium of institutes developed several computational tools for Sanskrit as well as a prototype of a Sanskrit–Hindi Machine Translation system. In 2015, she was awarded the "Vishishta Sanskrit Sevavrati Sammana" by the Rashtriya Sanskrit Sansthan, New Delhi, for her contribution to studies and research on Sanskrit-based knowledge systems. She was a fellow at the Indian Institute of Advanced Study, Shimla during 2015-17.

Sanskrit Parsing

Based on the Theories of Śābdabodha

Amba Kulkarni

Foreword by

Rajeev Sangal

Cataloging in Publication Data — DK

[Courtesy: D.K. Agencies (P) Ltd. <[email protected]>]

Kulkarni, Amba, author.

Sanskrit parsing : based on the theories of śābdabodha / Amba Kulkarni; foreword by Rajeev Sangal

pages cm

Includes passages in Sanskrit (roman).

Includes bibliographical references and index.

ISBN 9788124610787

1. Sanskrit language – Parsing. 2. Parsing (Computer grammar). 3. Sanskrit language – Semantics. I. Title.

LCC PK435.K85 2019 | DDC 491.20285635 23

ISBN: 978-81-246-1078-7

First published in India, 2021

© Indian Institute of Advanced Study, Shimla

All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage or retrieval system, without prior written permission of both the copyright owner, indicated above, and the publisher.

The views expressed in this volume are those of the author, and are not necessarily those of the publishers.

Published by:

The Secretary

Indian Institute of Advanced Study

Rashtrapati Nivas, Summerhill, Shimla - 171 005

Phones: (0177) 283 1379; Fax: 283 1389

e-mail: [email protected]

Website: www.iias.org

and

D.K. Printworld (P) Ltd.

Regd. office: “Vedaśrī”, F-395, Sudarshan Park

(Metro Station: ESI Hospital), New Delhi - 110015

Phones: (011) 2545 3975; 2546 6019

e-mail: [email protected]

Website: www.dkprintworld.com

Printed by: D.K. Printworld (P) Ltd., New Delhi

Dedicated in memory of my Father and Teacher

Anantpur Bacce Padmanabharao

who introduced me to

the Language of Mathematics and

the mathematically precise grammar of a language,

and was a source of inspiration

for all my endeavours

Foreword

SANSKRIT holds an important place in the development of theories of language. First, as a language it is rich in lexical and grammatical derivational processes, dealing with the word and morpheme level up to the sentence level and beyond. Second, the theories which were developed to analyse the Sanskrit language were themselves rich and awe-inspiring. Their goal was to bring precision and clarity to the utterance, a unique endeavour for its time in 500 BCE. The theories were designed to fix meaning: the meaning of a Sanskrit utterance should be clear, precise, and unambiguous for all time to come. Metaphorically, the Sanskrit grammarians and theorists were solving the Y10K or Y100K problem!1

The study of language played the same role in the Indian civilization as was played by the study of geometry in the Greek civilization. Both encouraged precision of thought and formalization in reasoning. The former was clearly a much tougher domain. The success achieved therein influenced the entire civilization.

These theories of language influenced the study of language and linguistics across the world. The Sanskrit language, its vocabulary and its affixes (pratyaya) were adopted into the Tibetan language a millennium earlier. Later, many of the ideas travelled via the Arab world to Europe. In the nineteenth century, Sanskrit was rediscovered as a wonderful language by Europe, particularly by German and later British scholars. Subsequently, the language typological studies (Greenberg 1963) of the West were influenced by the theories of language and grammar developed by the Sanskrit grammarians.

In the twentieth century, the Pāṇinian model was discovered and rediscovered in a variety of ways. The ideas of formal generative grammar introduced by Chomsky in the 1950s were not only present in Pāṇini but already developed to a high order (Cardona 1976, 1988). At the same time, the Pāṇinian model was much closer to semantics, with, for example, a developed theory of kāraka (arguments of verbs) and samāsa (compounding). It used "case", or the more generalized concept of vibhakti, anticipating Fillmore (1968) by two millennia, and that too complete with the derivational (generative) process. The idea of Minimalism is inherent in the organization of Pāṇini's Aṣṭādhyāyī (Deshpande 1985; Kiparsky 1982). What was not realized earlier was that Pāṇini defined operations on a technical representation and through that showed the computational process in the derivation of sentences of the language, much like what modern computational linguistics was/is trying to do (Bharati et al. 1995).

Today, there is a young new technology called Natural Language Processing (NLP) with applications as wide as Machine Translation, Information Extraction and Retrieval, Question–Answering, Dialogue Systems, etc. Language theories from Sanskrit suddenly find a new fertile ground for their application. This makes the book even more timely.

The Sanskrit theories relate directly to language processing. In this sense, the theories are almost tailor-made for NLP. They deal with information and meaning in a central way. They address the question: how can one go from the information contained in words, and in their coming together in a sentence, to the meaning or vivakṣā (intention) in the speaker's mind? Thus human communication, or the conveying of meaning, comes at the centre.

The author of this book presents the Indian Grammatical Tradition (IGT) very faithfully, with detailed references. Every concept is introduced in the larger setting of information and meaning, is defined with reference to the traditional sources, and is connected with the larger task of language processing. In this fashion, the theories become more lucid and useful at the same time, without sacrificing faithfulness.

After setting the stage in the first chapter, the author introduces śabda-śakti (word meaning) and śābdabodha (theories of verbal cognition) in Chap. 2. These are central to the theories of language in the IGT. The different types of meaning of a word in IGT are not just intuitively easy to comprehend, but also simplify the theory conceptually. The author's treatment is scholarly. Theories of śābdabodha show the different types of concerns in language analysis, from the perspectives of the Vaiyākaraṇas (Grammarians), the Naiyāyikas (Logicians), and the Mīmāṁsakas (Discourse/Pragmaticians).

Śābdabodha in IGT contains the key elements for a program in Computational Linguistics. These are ākāṅkṣā (expectancy), sannidhi (planarity), and yogyatā (congruity). They allow linguistic data to be prepared and parsing to be done elegantly. My hope is that in time to come, these will permit the integration of theory-based approaches with theory-bereft approaches (viz. Statistical and Neural-based NLP). The theory will bring out the essence so that it can be handled by machine almost directly. For the phenomena where the required concomitant knowledge is very hard to compile, the task can be left to the "theory-bereft" approaches. Finally, the author presents the algorithms for parsing, which are the result of complementing the traditional theories with modern efficient algorithms.

The strength of the book lies in its faithful and clear presentation of the theories of language from IGT, the identification of the key elements, and finally their use in constructing efficient algorithms. The book not only documents the salient pieces of work carried out over the last quarter century under Computational Paninian Grammar (CPG), but provides the first comprehensive exposition of the ideas involved. It will serve as an important milestone of achievements so far.

Hopefully, the book will also open up the frontier of applying concepts from Sanskrit parsing to modern Indian languages on a bigger scale, and indeed to all languages of the world.

References

Bharati, Akshar, Vineet Chaitanya and Rajeev Sangal, 1995, Natural Language Processing: A Paninian Perspective, New Delhi: Prentice Hall of India.

Cardona, George, 1976, Panini: A Survey of Research, The Hague: Mouton & Co.

———, 1988, Panini: His Work and Its Traditions, vol. 1: Background and Introduction, Delhi: Motilal Banarsidass.

Deshpande, Madhav M., 1985, Ellipses and Syntactic Overlapping: Current Issues in Paninian Syntactic Theory, Pune: Bhandarkar Oriental Research Institute.

Fillmore, Charles J., 1968, "The Case for Case", in Universals of Linguistic Theory, ed. E. Bach and R.T. Harms, pp. 1-88, New York: Holt, Rinehart and Winston.

Greenberg, Joseph, 1963, Universals of Language, Cambridge, MA: MIT Press.

Kiparsky, P., 1982, Some Theoretical Problems in Pāṇini’s Grammar, Poona: Bhandarkar Oriental Research Institute.

Rajeev Sangal

IIIT Hyderabad

24 April 2019

1 Compare this with the Y2K problem: computer software needed to be fixed to remove the ambiguity caused by the use of only the last two digits of a year. The problem appeared, or would have appeared, in the year 2000, within a span of a mere forty years of the software being written. The contrast between forty years and 10,000 years (Y10K) is too stark not to be noticed.

Preface

THIS book is an outcome of my fellowship at the Indian Institute of Advanced Study, Shimla during 2015-17. I was always fascinated by the rich Indian grammatical tradition, especially the minute attention paid to the coding of information in a language string. In 2006, when I joined the Department of Sanskrit Studies at the University of Hyderabad, I decided to restrict my work to Sanskrit Computational Linguistics, taking as much help as possible from this rich tradition. Immediately after developing a morphological analyser and a sandhi splitter, I decided to venture into the development of a sentential parser. While all the machine translation systems used several other modules, such as a part-of-speech (POS) tagger and a chunker, before calling the parser, I decided to call the parser right after the morphological analyser. The main reason behind this decision was that when I looked at the traditional Indian literature, there was no discussion of any kind of POS tagger or chunker. There were, however, discussions of various factors that help in verbal cognition. The main focus of all these discussions was the flow of information in a sentence. And this was essentially what I was looking for in order to build automatic language processors. So I decided to follow the tradition as closely as possible in the development of my parser, even at the risk of going against the current trend of using machine-learning algorithms, which I believe deserve a place only when one has exhausted almost all the information sources discussed in Indian traditional grammar.

The first parser was developed in 2009 by my student N. Shailaja as a part of her MPhil dissertation. She used the C Language Integrated Production System (CLIPS),1 a tool for building expert systems, for writing her rules. This first parser was further enhanced by Sheetal Pokar, Pavankumar Satuluri and Madhvachar, with funding from the Department of Electronics and Information Technology (DeitY), under its Technology Development for Indian Languages (TDIL) programme. This parser had two components. The first was the formation of a graph, for which we used the CLIPS environment, and the second was a constraint solver. This constraint solver was written in MINION.2 I noticed that the constraint specifications represented in matrix form for the MINION constraint solver resulted in a large sparse matrix, which slowed down the performance of the system. This prompted me to re-examine the design of the constraint solver, which resulted in a graph-based depth-first traversal algorithm implemented in Perl.3 Though I had a working module for a morphological analyser, the coverage of the derivational morphology was not satisfactory. Gérard Huet's "The Sanskrit Heritage Site"4 had good coverage of morphology as well as the best implementation of a sandhi splitter. Therefore I thought of taking advantage of existing resources instead of improving my own morphological analyser and sandhi splitter. When I started interlinking this module with the segmenter of the Heritage site, I thought it would be better to implement my parser in OCaml5 (in which the Heritage platform is developed) for better integration. I also noticed that the depth-first traversal algorithm written in Perl could be improved further by noting down the compatibility conditions at the beginning. This observation, along with the functional aspect of OCaml, led me to redesign my algorithm further so as to make it natural from the functional programming point of view. And this was the fourth avatāra of the parser. My student Sanjeev Panchal encoded various Pāṇinian sūtras in OCaml, while I wrote the constraint solver to extract a dependency tree from the graph following an edge-centric binary join.
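
To make the flavour of this approach concrete, here is a minimal OCaml sketch of a depth-first constraint solver of the kind described above. It is only an illustration under assumed types, data and a toy compatibility test, not the book's actual code: candidate labelled edges are proposed for each word, and the solver assigns one head per dependent, pruning choices that violate the test.

    (* A toy depth-first constraint solver: pick one candidate edge per
       dependent, pruning incompatible partial assignments early.
       All data and the compatibility test are illustrative assumptions. *)

    type edge = { head : int; dep : int; label : string }

    (* Toy compatibility: no self-loops, one head per dependent, and a
       given kaaraka relation of a head may be filled only once. *)
    let compatible chosen e =
      e.head <> e.dep
      && not (List.exists (fun c -> c.dep = e.dep) chosen)
      && not (List.exists (fun c -> c.head = e.head && c.label = e.label) chosen)

    (* Depth-first search: assign one candidate edge to each dependent. *)
    let rec solve candidates deps chosen =
      match deps with
      | [] -> [ List.rev chosen ]
      | d :: rest ->
          List.concat_map
            (fun e ->
              if e.dep = d && compatible chosen e
              then solve candidates rest (e :: chosen)
              else [])
            candidates

    let () =
      (* devadattah (1) odanam (2) pacati (3): Devadatta cooks rice *)
      let candidates =
        [ { head = 3; dep = 1; label = "kartr" };
          { head = 3; dep = 2; label = "kartr" };
          { head = 3; dep = 2; label = "karma" } ]
      in
      List.iter
        (fun parse ->
          List.iter
            (fun e -> Printf.printf "%d -%s-> %d  " e.head e.label e.dep)
            parse;
          print_newline ())
        (solve candidates [ 1; 2 ] [])

On this toy input the solver returns the single compatible assignment; with genuinely ambiguous case markings it would return several parses, mirroring the non-determinism discussed later in the book.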

The development of these parsers is largely influenced by the theories of śābdabodha (verbal cognition). These theories discuss in great detail the encoding of information in a language string. They provided me answers to questions such as: where is the information encoded, how much information is encoded, what do the words signify, and what role do the various significative powers of a word play in the understanding of a text? Different schools in the Indian tradition have discussed these questions. The major challenge before me was to decide which school to follow. Second, the examples discussed were few in number, often just one or two. It was therefore challenging to understand each school's stand on the basis of these examples and the commentaries on these texts. I followed two different approaches. When I knew of a relevant concept discussed in the Śāstras, I would try to understand it and then use it appropriately to solve the problem. When I did not know where to look for the solution, I would first arrive at the solution on the basis of empirical evidence and then look for theoretical support for it in the Śāstras. In the case of ākāṅkṣā and sannidhi, I followed the first approach. But in the case of yogyatā, since I could find hardly one or two pages of material with only one stock example, I came up, with the help of my student Sanjeev, with observations based on the data. These observations provided us clues as to what to look for in the Śāstras, and where. Of course, whichever approach we followed, we tested our implementation on a corpus drawn from various classical Sanskrit texts. In all the grammatical texts we referred to, what I found useful was that the theories of verbal import were objective. And it is this objectivity that guarantees automatic processing.

Students of Sanskrit, especially of Vyākaraṇa, Nyāya and Mīmāṁsā, hear that the theories of śābdabodha are useful computationally. But the lack of any text describing this usefulness leaves them clueless. During the last few years, I travelled all over India delivering lectures on the importance of śābdabodha from a computational point of view, and I found that this generated a new enthusiasm among Sanskrit students. It also made me think of preparing a short monograph describing this importance in detail. I also met several teachers who were interested in offering a course on the contemporary relevance of the Indian theories of śābdabodha but, for lack of any teaching material, could not.

On the other hand, there are students and researchers working in the field of computational linguistics focusing on Indian languages. There are very few grammar books for Indian languages, and hardly any of them is as complete as Pāṇini's grammar for Sanskrit. Since most of the Indo-Aryan languages originated from Sanskrit, Pāṇini's grammar definitely provides good insights for handling their various linguistic problems. After the book by the Akshar Bharati group, Natural Language Processing: A Paninian Perspective, much research took place in this field, but no textbook was produced that could help a student. Texts by Kunjunni Raja, B.K. Matilal, Veluri Subba Rao and Subramania Iyer, to name a few, written from the perspective of providing an overview of the contribution of the Indian grammarians, are useful for researchers. But for students of computational linguistics, they do not provide any direct insights. There are several excellent translations of the original works, such as the one by Mahāmahopādhyāya Ganganath Jha of the Śābara-Bhāṣya on the Mīmāṁsāsūtras, or the translations of Patañjali's Mahābhāṣya, and a lot of secondary literature on these topics. But all this material is beyond the reach of the students of computational linguistics, since these texts are written from a different perspective.

With these two strata of readers in mind, I decided to write down my understanding of these theories from a computational viewpoint. This book is the result of that exercise. The first chapter provides an overview of various computational tools for Sanskrit and then introduces the main theme of this book, viz. dependency parsing. The second chapter briefly introduces the Indian theories of word meaning and sentence meaning, discussing the various conditions for knowing the meaning of a sentence. The third chapter is my main contribution: it interprets the concepts discussed in the second chapter from a computational perspective and provides computational models to implement them. The fourth chapter discusses the three dependency parsing algorithms I built with the help of my students, each one an improvement over the previous one.

I have tried to provide a glimpse of the parallel concepts employed in contemporary computational linguistics so that students of linguistics in general, and of computational linguistics in particular, will find it easy to connect the concepts presented here with what they are familiar with. At the same time, students of Sanskrit grammar should find the third chapter interesting, where they will see the nature of the problems a computational linguist faces and a practical demonstration of the application of the Indian theories to solving contemporary problems. I hope this book will give students and researchers from both disciplines the confidence to access the relevant material of the other discipline.

What I have presented is my understanding of the concepts in the Indian grammatical tradition. If there are any errors in my understanding, or errors in the presentation, I am solely responsible for them. I would, in such cases, like to know about them so that I can rectify them and correct my understanding. I hope this book will be useful to students of computational linguistics and also Sanskrit students to understand the rich Indian grammatical tradition and its relevance to the field of computational linguistics.

1 http://www.clipsrules.net/

2 http://www.constraintmodelling.org/minion

3 http://www.perl.org

4 http://sanskrit.inria.fr

5 https://ocaml.org

Contents

Foreword

Preface

Acknowledgements

1. Introduction

1.1 Sanskrit Computational Tools: Current Status

1.1.1 Word Generators

1.1.2 Word Analysers

1.1.3 Lexical Resources

1.1.4 Tools Based on Data-driven Approaches

1.2 Sanskrit Parser

1.2.1 Constituency Structure

1.2.2 Dependency Structure

1.2.3 Parsing and Theories of Verbal Cognition

2. Understanding Texts: Indian Theories

2.1 Word Meaning

2.1.1 Abhidhā (Primary Denotation)

2.1.2 Lakṣaṇā (Implication)

2.1.3 Vyañjanā (Suggestion)

2.2 Necessary Conditions for Verbal Cognition

2.2.1 Ākāṅkṣā (Expectancy)

2.2.2 Sannidhi (Proximity)

2.2.3 Yogyatā (Congruity)

2.2.4 Tātparya (Purport)

2.3 Vākyārtha (Sentential Meaning)

2.4 Structure of Verbal Cognition

2.5 Understanding Texts: Commentary Tradition

2.5.1 Canonical Word Order

2.6 Conclusion

3. Śābdabodha Theories and Sanskrit Parsing

3.1 Ākāṅkṣā: Establishing Relations

3.1.1 Where is the Information?

3.1.2 What Kind of Information?

3.1.3 Repository of Relations

3.1.4 How is the Information Encoded?

3.2 Sannidhi: Planarity Constraint

3.2.1 Projectivity Principle

3.2.2 Weak Non-projectivity (Planarity)

3.2.3 Empirical Evaluation

3.2.4 Conclusion

3.3 Yogyatā: Semantic Restrictions

3.3.1 Selection Restriction

3.3.2 Śabda-śakti (Level of Signification)

3.3.3 Yogyatā as a Filter

3.3.4 Modelling Yogyatā

3.3.5 Evaluation

3.4 Conclusion

4. Sanskrit Parsing

4.1 Introduction

4.1.1 Dependency Parse Structure

4.2 Design of a Parser

4.2.1 Establishing Directed Edges

4.2.2 Defining the Constraints

4.3 Solving the Constraints

4.3.1 Constraint Satisfaction Problem

4.3.2 Vertex-centric Traversal

4.3.3 Edge-centric Binary Join

4.4 Compact Display of Multiple Solutions

4.5 Conclusion

5. Conclusion

Appendices

A. Evaluation of Parsers: Various Parameters

Measure of Correctness of Parse

A.1 Precision and Recall

B. Classification of Lakṣaṇā

C. List of Relations in Pāṇinian Grammar

D. List of Relations Used in the Sanskrit Parser

Glossary

Bibliography

Index

Acknowledgements

I WAS very fortunate to be a fellow at the Indian Institute of Advanced Study, Shimla during 2015-17. At the outset, I thank the selection committee and the then Director Prof. Chetan Singh for giving me an opportunity to work in this prestigious Institute at the feet of the Himalayas. The conducive atmosphere in the Institute for carrying out research, the serene skies, the evening sunsets and the dense pine and devadāra trees touching the skies provided me the right kind of environment to carry out my work without any disturbances. Thanks are also due to my parent institute, the University of Hyderabad, for granting me leave for the above period.

Some of the material in this book was earlier published in the form of conference/seminar papers in their proceedings, or as a journal article. I thank the publishers for giving me permission to use the relevant parts in this book, with or without modification. The publications from which I have used the material are the following:

1. Kulkarni, A., S. Pokar and D. Shukl, 2010, “Designing a Constraint Based Parser for Sanskrit”, in Fourth International Sanskrit Computational Linguistics Symposium, ed. G.N. Jha, pp. 70-90, Springer-Verlag, LNAI 6465.

2. Bharati, A. and A. Kulkarni, 2010, “Information Coding in a Language: Some Insights from Paninian Grammar”, Dhīmahi, Journal of Chinmaya International Foundation Shodha Sansthan, I(1): 77-91.

3. Kulkarni, A., 2013, "A Deterministic Dependency Parser with Dynamic Programming for Sanskrit", in Proceedings of the Second International Conference on Dependency Linguistics (DepLing 2013), pp. 157-66, Prague: Charles University in Prague, Matfyzpress.

4. Kulkarni, A. and K.V. Ramakrishnamacharyulu, 2013, “Parsing Sanskrit Texts: Some Relation Specific Issues”, in Proceedings of the 5th International Sanskrit Computational Linguistics Symposium, ed. M. Kulkarni, New Delhi: D.K. Printworld.

5. Kulkarni, A.P., P. Shukla, P. Satuluri and D. Shukl, 2015, “How Free Is the ‘Free’ Word Order in Sanskrit”, in Sanskrit Syntax, ed. P. Scharf, pp. 269-304, The Sanskrit Library.

6. Panchal, S. and A. Kulkarni, 2018, "Yogyatā as an Absence of Incongruity", in Computational Sanskrit & Digital Humanities, ed. G. Huet and A. Kulkarni, New Delhi: D.K. Printworld.

I thank all my fellow friends at the Indian Institute of Advanced Study, who participated in my presentations and gave valuable suggestions concerning my work. Special thanks are due to Dr Terry Varma, Prof. Madhavan and Dr Lalitha Raja who read earlier drafts of my manuscripts meticulously, and discussed various aspects, to Prof. Nirmal Sengupta and Prof. Vijay Varma for valuable discussions on related topics, and to Amit Datta and Ayswarya Sankaranarayanan, who shared my study at the Institute, for exposing me to the wonderful creative world of artists.

I also thank the anonymous reviewers who reviewed the intermediate and final drafts of the manuscript and provided useful feedback.

I thank my teachers, colleagues, friends and students who took the trouble to go through my draft manuscript and provided useful suggestions. While there is a danger of leaving somebody out, I would still like to put on record the names: Prof. Rajeev Sangal, Prof. K.V. Ramakrishnamacharyulu, Prof. B.N. Patnaik, Prof. Korada Subrahmanyam, Prof. Peter Scharf, Prof. Dipti Misra Sharma, Prof. Srinivas Varakhedi, Prof. Rajaram Shukla, Prof. Tirumala Kulkarni, Prof. Lalit Kumar Tripathi, Dr Sukhada, Dr Arjuna and Sanjeev Panchal. I thank my research assistants and students Dr Shailaja, Dr Sheetal Pokar, Dr Pavankumar Satuluri and Dr Madhvachar for the implementation of the earlier versions of the ākāṅkṣā module, and Sanjeev Panchal for the current implementation of both the ākāṅkṣā and yogyatā modules. Thanks are also due to Dr Preeti Shukl and Dr Pavankumar Satuluri, who worked with me on the problem of word order and sannidhi violation in Sanskrit.

Prof. Vineet Chaitanya deserves special mention. Most of the concepts discussed here either originated from him, or he was the sounding board for them. Discussions with him on various aspects brought clarity to my thoughts. Prof. Gérard Huet not only went through my ever-evolving manuscript several times, but also raised several intriguing questions which helped me broaden my vision and improve the implementation of the software as well as the content of the manuscript.

I thank Prof. Rajeev Sangal, my teacher and the leader of the computational linguistics community in India, who, along with Prof. Vineet Chaitanya, pioneered the demonstration of the utility of the Indian grammatical theories for Natural Language Processing, and in particular Machine Translation, for writing the foreword.

I also thank Shri Susheel Mittal ji of D.K. Printworld for readily agreeing to print the book and for providing all the necessary assistance.

Finally, I thank my sons Achyut and Kedar who kept me free from any botherations and always provided the needed emotional support for carrying out my work with more enthusiasm.

1

Introduction

LANGUAGE technology plays an important role in the digital era. The last half century has seen exponential growth in the fields of computational linguistics and language technology. Software for machine translation, information retrieval, information extraction, search engines, question-answering systems, etc. is available and constantly being enhanced. With mobile phones becoming as powerful as computers, mobile apps are being developed in several language-related areas, including language learning and language games. Such tools are not only useful for modern languages, but also play a crucial role in making classical language texts easily accessible. For example, the Perseus Digital Library Project1 provides access to a digital corpus of classical languages such as Latin, Greek and Arabic, with support for linguistic analysis and contextual reading. Sanskrit has received good attention from scholars as well as enthusiasts from all over the world. In addition to several websites serving as repositories of Sanskrit texts, in the field of computational linguistics we find individual as well as collaborative efforts during the last two decades, with their linguistic software available online for public use.2

While the earlier efforts towards the development of many such linguistic tools were based on linguistic theories specially developed from the computational perspective, the last decade has seen machine-learning techniques, deep learning and big data replacing linguistic theories. The main reason behind the preference for machine learning and similar techniques over purely linguistic approaches is the cost involved in developing the language resources needed for disambiguation at various levels, and the complexity involved in the representation of such knowledge. Machine-learning and deep-learning techniques, with the help of big data, can capture the nuances of languages very effectively and in recent years have shown promise in the field of word-sense disambiguation, an important module in any machine-translation system. But the disadvantage of these techniques is that whatever a machine learns cannot be controlled, modified or improved manually. In order to improve the performance of such a machine, we end up needing more actionable, relevant and smart data to reduce the unreliability.

When it comes to classical languages such as Sanskrit, we definitely want our software to produce "reliable" and "faithful" translations as well, in addition to quick translations, which may be useful for getting a rough idea of what the text in the source language is about. Quick translations help one judge the relevance of a text before venturing into getting it translated manually. A faithful translator, or an accessor, on the other hand, provides complete access to the original text, giving one confidence about the reliability of the translation. Thus, on the one hand, for classical languages we need a machine-translation system developed using fast, actionable, reliable and smart (FARS) data to provide a quick gist, and on the other hand, we need complete and faithful access to the original text.

1.1 Sanskrit Computational Tools: Current Status

Sanskrit assumes a unique status when it comes to the field of linguistic analysis, with its more than 2,500-year-long and still extant grammatical tradition. Sanskrit grammar enjoys a status in India similar to that of mathematics in the West. Pāṇini's grammar is an important milestone in the Indian grammatical tradition. Unlike the grammars of other languages, it is almost complete and, together with the theories of śābdabodha (verbal understanding), provides a complete system for language analysis as well as generation. It is therefore natural to explore the use of these theories for building computational tools for language analysis that can provide complete and faithful access to original Sanskrit texts. At the same time, we also see tools developed with state-of-the-art technology such as machine learning. In what follows, we give a brief summary of the various efforts in the development of Sanskrit computational tools, both grammar-based and non-grammar-based or data-driven.

1.1.1 WORD GENERATORS

In the recent past there have been several efforts to implement the rules of the Aṣṭādhyāyī computationally, simulating its process of rule selection and derivation: by Goyal, Kulkarni and Behera (2009), Misra (2009), and Subbanna and Varakhedi (2009, 2010). Pavankumar Satuluri (2016) developed a compound generator following the procedure described in the Aṣṭādhyāyī. He was not only interested in being faithful to the derivation process as described in the Aṣṭādhyāyī, but was also interested in its computational complexity. Krishna and Goyal (2016) describe a taddhita (secondary derivative) generator that represents a sūtra from the Aṣṭādhyāyī as an object. They discuss the problems related to multiple inheritance and conflict resolution techniques. Patel and Katuri (2016) describe an "NLP order of sūtras" and implement the subanta (nominal inflection) generation rules as arranged in the Siddhāntakaumudī. In a similar effort, Swami Shivamurthy Taralabalu3 has developed a noun generator. Scharf et al. (2015) programmatically determine ātmanepada vs parasmaipada verbal terminations. In recent developments, Sohoni and M.A. Kulkarni (2018) have developed a simulator in which they translate each Pāṇinian sūtra as a Haskell module. In another effort, Sarada Susarla, Tilak Rao and Sai Susarla (2018) have developed an interpreter for the sūtras of the Aṣṭādhyāyī, where each sūtra is represented as a record in JSON format. Scharf (2009a) has examined how to model various features of Pāṇinian grammar. Recently, Scharf (2016) described an XML annotation scheme to represent the interpretation of sūtras in an unambiguous way, so that one can translate them into a computer program to build a simulator. All these efforts tried to follow the grammar in toto, and for some of them the motive behind the development of the software was also to interpret Pāṇini's grammar from the computational point of view and implement it programmatically.

There are some other efforts in which we notice a deviation from Pāṇini in the implementation of generators for Sanskrit words. The Heritage engine developed by Huet (2016) is one such instance: he is interested in representing the tight coupling of word and meaning in the derivation process, but his system deviates from Pāṇini in the implementation. Kulkarni and Shukl (2009) follow a paradigm model, as used in pedagogy, for noun generation. For verb generation they use ready-made verb-form tables, and an efficient finite-state transducer4 is used for computational processing.

1.1.2 WORD ANALYSERS

A word, for computational purposes, is defined as a string of characters separated by white spaces. In Sanskrit, owing to the influence of the oral tradition, consecutive words are joined together. At the junction, the phonemes optionally undergo a change termed sandhi.* So the task of word analysis, in the case of Sanskrit, is twofold. The first task is to identify the word boundaries and undo the sandhi operation; the second is to analyse the split words. The sketch below illustrates the first task.
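
The following OCaml sketch is a toy illustration of undoing sandhi: wherever the surface string contains the output of a rule, it proposes the corresponding split and keeps it if both parts are attested in a lexicon. The rule table and the tiny lexicon are assumptions for the example; this is not the actual rule set, nor the Heritage segmenter's algorithm.

    (* Each rule: (surface form, (final of first word, initial of second)) *)
    let rules = [ ("e", ("a", "i")); ("o", ("a", "u")) ]

    (* An assumed two-word lexicon, just enough for the example below. *)
    let lexicon = [ "ca"; "iha" ]

    (* Try every position; wherever a rule's surface form occurs, propose
       the corresponding split and keep it if both parts are in the lexicon. *)
    let splits (s : string) : (string * string) list =
      let n = String.length s in
      let out = ref [] in
      for i = 0 to n - 1 do
        List.iter
          (fun (surf, (l, r)) ->
            let m = String.length surf in
            if i + m <= n && String.sub s i m = surf then begin
              let first = String.sub s 0 i ^ l in
              let second = r ^ String.sub s (i + m) (n - i - m) in
              if List.mem first lexicon && List.mem second lexicon then
                out := (first, second) :: !out
            end)
          rules
      done;
      List.rev !out

    let () =
      (* "ceha" = ca + iha, by the vowel sandhi a + i -> e *)
      List.iter (fun (a, b) -> Printf.printf "%s + %s\n" a b) (splits "ceha")

On the input "ceha" this recovers ca + iha. On realistic input, many positions match some rule and both parts are often valid words, so the number of candidate splits grows quickly; this is the non-determinism taken up below.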

A generative grammar of any language provides rules for generation. For analysis, we require a mechanism by which we can use these rules in reverse. The reversal in some cases is easy and also deterministic. For example, subtraction is the inverse operation of addition and is deterministic. The reversal, however, may not always be deterministic. Let us see a simple example of non-deterministic reversal with which all of us are familiar. The multiplication tables, or the simple method of repeated addition, provide a mechanical way of multiplying. Given a product, finding its factors is the reverse process. Multiplication of two numbers, say 4 and 3, produces a unique number, 12. But the decomposition of 12 into two factors is not unique: 12 may be decomposed as either {6, 2} or {4, 3}, in addition to the trivial decomposition {12, 1}. Thus the inverse process may at times involve non-determinism. Depending upon the context, if one factor is known, the other factor gets fixed. For example, if you are interested in distributing 12 apples among 2 children, then, one of the factors being 2, the other factor, viz. 6, is determined uniquely.
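
The factorization example can be stated in a few lines of OCaml; the function below enumerates the unordered factor pairs of an integer, making the one-to-many nature of the reversal explicit.

    (* Enumerate the unordered factor pairs of n: each pair is one way
       of reversing a single multiplication. *)
    let factor_pairs n =
      let rec go d acc =
        if d * d > n then List.rev acc
        else go (d + 1) (if n mod d = 0 then (d, n / d) :: acc else acc)
      in
      go 1 []

    let () =
      (* prints {1, 12} {2, 6} {3, 4}: three ways to reverse one product *)
      List.iter (fun (a, b) -> Printf.printf "{%d, %d} " a b) (factor_pairs 12);
      print_newline ()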

This is true of a generative grammar as well. To give an example, look at the following two sūtras of Pāṇini.

• anabhihite (A 2.3.15)

• kartr̥karaṇayos tr̥tīyā (A 2.3.18)

These two sūtras together, in case of a passive voice (karmaṇi prayogaḥ), assign a third case suffix (vibhakti) to both the kartr̥ (agent) as well as karaṇa (instrument) kāraka. Here is an illustrative sentence:

Skt: rāmeṇa bāṇena vāliḥ ahanyata। (1)

Gloss: By_Rāma with_an_arrow Vāli was_killed.

Eng: Vāli was killed by Rāma with an arrow.

Now, when a hearer (who knows Sanskrit grammar) listens to this utterance, he notices two words ending in the third case suffix and that the construction is in the passive voice. But unless he knows that rāma (Rāma) is the name of a person and that a bāṇa (arrow) is used as an instrument, he may fail to get the correct reading. In the absence of such "extralinguistic" knowledge, there are two possible interpretations: either rāma is kartr̥ and bāṇa is karaṇa, or bāṇa is kartr̥ and rāma is karaṇa, leading to non-determinism.6
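
The two readings can be enumerated mechanically, as in the following OCaml sketch. The animacy test stands in for the "extralinguistic" knowledge just mentioned, in the spirit of the yogyatā filter discussed in Chapter 3; the specific test and data are assumptions for illustration, not the book's model.

    (* The two readings of example (1): the two third-case words compete
       for the roles kartr and karana. *)
    let readings =
      [ [ ("kartr", "raama"); ("karana", "baana") ];
        [ ("kartr", "baana"); ("karana", "raama") ] ]

    (* Assumed world knowledge: Raama is animate; an arrow is not. *)
    let animate w = w = "raama"

    (* A yogyataa-style check: the kartr of "kill" should be animate. *)
    let congruous reading = animate (List.assoc "kartr" reading)

    let () =
      List.iter
        (fun r ->
          List.iter (fun (role, w) -> Printf.printf "%s = %s; " role w) r;
          print_endline (if congruous r then "accepted" else "filtered out"))
        readings

With the animacy check in place, only the reading with rāma as kartr̥ survives; without it, both readings remain, which is exactly the non-determinism noted above.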

We come across non-determinism in the process of segmentation as well. As mentioned earlier, in the case of Sanskrit compound words, sandhi between the components of a compound is mandatory. Further, there is a tendency to write Sanskrit text as a continuous string (saṁhitāpāṭha