Computer-assisted translation (CAT) has always relied on translation memories, which require the translator to have a corpus of previous translations that the CAT software can use to generate bilingual lexicons. This is problematic when the translator has no such corpus, for instance when the text belongs to an emerging field. To address this issue, CAT research has looked into leveraging comparable corpora, i.e. sets of texts, in two or more languages, which deal with the same topic but are not translations of one another. This work has two primary objectives. The first is to assess the input of lexicons extracted from comparable corpora in the context of a specialized human translation task. The second is to identify the bilingual-lexicon-extraction methods which best match translators' needs, determining the current limits of these techniques and suggesting improvements. The author focuses, in particular, on the identification of fertile translations, the management of multiple morphological structures, and the ranking of candidate translations. The experiments are carried out on two language pairs (English-French and English-German) and on specialized texts dealing with breast cancer. This research puts significant emphasis on applicability: methodological choices are guided by the needs of the final users. The book is organized in two parts: the first presents the applicative and scientific context of the research, and the second is given over to efforts to improve compositional translation. The research work presented in this book received the 2014 PhD Thesis award from the French association for natural language processing (ATALA).
Page count: 387
Year of publication: 2014
Contents
Acknowledgments
Introduction
I.1. Socio-economic stakes of multilingualism management
I.2. Motivation and goals
I.3. Outline
PART 1 Applicative and Scientific Context
1 Leveraging Comparable Corpora for Computer-assisted Translation
1.1. Introduction
1.2. From the beginnings of machine translation to comparable corpora processing
1.3. Term alignment from comparable corpora: a state-of-the-art
1.4. CAT software prototype for comparable corpora processing
1.5. Summary
2 User-Centered Evaluation of Lexicons Extracted from Comparable Corpora
2.1. Introduction
2.2. Translation quality evaluation methodologies
2.3. Design and experimentation of a user-centered evaluation
2.4. Discussion
3 Automatic Generation of Term Translations
3.1. Introduction
3.2. Compositional approaches
3.3. Data-driven approaches
3.4. Evaluation of term translator generation methods
3.5. Research perspectives
PART 2 Contributions to Compositional Translation
4 Morph-Compositional Translation: Methodological Framework
4.1. Introduction
4.2. Morpho-compositional translation method
4.3. Issues addressed and contributions
4.4. Evaluation methodology
4.5. Conclusion
5 Experimental Data
5.1. Introduction
5.2. Comparable corpora
5.3. Source terms
5.4. Reference data for translation generation evaluation
5.5. Translation ranking training and evaluation data
5.6. Linguistic resources
5.7. Summary
6 Formalization and Evaluation of Candidate Translation Generation
6.1. Introduction
6.2. Translation generation algorithm
6.3. Morphological splitting evaluation
6.4. Translation generation evaluation
6.5. Discussion
7 Formalization and Evaluation of Candidate Translation Ranking
7.1. Introduction
7.2. Ranking criteria
7.3. Criteria combination
7.4. Evaluation
7.5. Discussion
Conclusion and Perspectives
C1.1. Results review
C1.2. Work limitations and perspectives
PART 3 Appendices
Appendix 1 Measures
A1.1. Vectors normalization
A1.2. Vectors similarity
A1.3. Corpus comparability [LI 10, p. 645–646]
A1.4. Values standardization [GEN 77, p. 48–50]
A1.5. Evaluation measures
A1.6. Annotators agreement
Appendix 2 Data
A2.1. Comparable corpora
A2.2. Texts to be translated and reference translations
A2.3. Linguistic resources
A2.4. Source terms
A2.5. Reference data for translation generation evaluation
A2.6. Translation ranking training and evaluation data
Appendix 3 Comparable Corpora Lexicons Consultation Interface
List of Tables
List of Figures
List of Algorithms
List of Extracts
Bibliography
Index
To Elia
First published 2014 in Great Britain and the United States by ISTE Ltd and John Wiley & Sons, Inc.
Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms and licenses issued by the CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the undermentioned address:
ISTE Ltd, 27-37 St George’s Road, London SW19 4EU, UK
www.iste.co.uk
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
www.wiley.com
© ISTE Ltd 2014
The rights of Estelle Maryline Delpech to be identified as the author of this work have been asserted by her in accordance with the Copyright, Designs and Patents Act 1988.
Library of Congress Control Number: 2014936484
British Library Cataloguing-in-Publication Data
A CIP record for this book is available from the British Library
ISBN 978-1-84821-689-1
Acknowledgments
I would like to give Béatrice Daille and Emmanuel Morin my deepest thanks for supervising and co-supervising this doctoral research. I felt honored to work and learn by their sides. They both showed an effective combination of academic rigor and pedagogy, which helped me to progress over the past three years. Béatrice, thank you for suggesting this thesis to me – it was a very enriching experience. Manu, thanks again for being there whenever I needed you.
I extend my warm thanks to Nabil Hathout, Élisabeth Lavault-Olléon, Emmanuel Planas and Michel Simard for honoring me by attending my viva and being part of my committee in spite of their busy schedules. Their constructive comments were especially useful. I am glad to have profited from so many complementary points of view on my work. Special thanks to Michel Simard for coming to Nantes all the way from Canada.
I am particularly thankful to Emmanuel Planas, the former Scientific Director of Lingua et Machina, for trusting me and hiring me as a research engineer five years ago. Otherwise, I would probably not have had the chance to carry out a dissertation with LINA or to work on such a fascinating research subject within such a stimulating industrial environment.
Several people contributed to the work presented in this book. I would first like to thank Claire Lemaire from Stendhal University in Grenoble, for being an amazing colleague and fellow doctoral candidate, and for creating the resources needed to process and assess the German language. This work would not have been possible without her, and I am very thankful.
I would also like to thank Geoffrey Williams and Pierre Zweigenbaum for agreeing to be a part of my doctorate monitoring committee. Their shrewd comments and advice guided me in the right direction during this research project.
I would also like to thank Léa Laporte of the Toulouse Information Technology Research Institute (Institut de Recherche en Informatique de Toulouse (IRIT)) and Damien François from the Catholic University of Louvain for answering my questions on statistical data processing. Thanks also to Van Dang, from the University of Massachusetts, for answering my questions on the use of learning-to-rank algorithms.
I am immensely grateful to Clémence de Baudus, Kiril Isakov, Mathieu Delage from the Higher Education Institute of Translation and Interpreting (ISIT) and Nicolas Auger, for their extremely detailed annotation, without which it would not have been possible to evaluate the translation system.
My thanks go also to my colleagues at Lingua et Machina, François, Étienne and Jean-François, with whom I learned a lot and who I thank for their support. François’s advice and experience were invaluable during the last year of my doctorate.
Unfortunately, I did not have the chance to spend a lot of time in the laboratory but it was always a pleasure to come to the team meetings. The welcome and atmosphere of the LINA is fantastic, and I greatly enjoyed talking with my colleagues – especially Amir Hazem and Prajol Shrestha, who were delightful fellow doctoral candidates.
Finally, I would like to thank my partner, Nicolas, for his unfailing support; my friends Émilie and Nathalie and my sister Laureen for their understanding about my lack of availability, as well as for their presence and logistical support on the day of the viva. Thanks to Loki, who is a wonderful alarm clock.
In this era of globalized exchange, multilingualism is an undeniable socio-cultural asset, but it also presents many challenges to our society.
First of all, the lack of knowledge of a language is often synonymous with limited access to information, and it is generally linguistic communities with little economic power, or whose language is not a prestigious one, who suffer as a result.
The case of the Internet is a good example: English – the most represented language on the web (54.8%)1 – is the first language of only 26.8% of web users2 whereas Chinese – the first language of 24.2% of the web users – is only sixth in terms of presence on the Internet (4%).
A significant portion of web-based information is therefore unavailable to many web users because of the language barrier.
In countries which are officially bilingual or multilingual, or in international organizations such as the European Union, managing multilingualism falls within the remit of democracy: it is meant to ensure that every citizen has access to administrative services and legal texts in their own first language, so that they know their rights and can benefit from the government’s services in a language they speak fluently. This comes at a considerable cost: the European Union spends 1 billion Euros every year on translation and interpretation [FID 11].
Multilingualism also has an impact on our economy: the ELAN report [HAG 06] claimed that in 2006 the lack of language skills had cost a European SMB an average of 325,000 Euros over three years.
To deal with this social and economic cost, research has been performed to speed up and improve the process of human translation. Today, there is a whole industry devoted to this issue. The language industry provides both human translation services and a wide range of software packages intended to bring translation costs down: translation memories, bilingual terminology-extraction and management software, localization software, etc.
This is the framework of research and development in computer-assisted translation (CAT) within which my doctoral research has taken place. This research was partially funded by Lingua et Machina3 – a company specializing in multilingual content management in a corporate environment, and by the ANR project Metricc,4 devoted to the leveraging of comparable corpora.
CAT has always used translation memories. This technique requires the translator to have a corpus of previous translations available, which the CAT software can use, for example, to generate bilingual lexicons. This is problematic when the translator does not have such a corpus, as happens when the texts to be translated belong to an emerging field or to languages for which few resources are available. To solve this issue, CAT research has looked into the leveraging of comparable corpora, i.e. sets of texts, in two or more languages, which deal with the same topic but are not translations of one another.
Comparable corpora have been the focus of academic research since the 1990s [FUN 95, RAP 99], and the existence of the Workshop on Building and Using Comparable Corpora (BUCC), organized every year since 2008 on the fringe of major conferences, shows the dynamism of this research topic.
The current research mainly aims at extracting aligned pairs of terms or sentences, which are then used in cross-lingual information retrieval (CLIR) systems [REN 03, CHI 04, LI 11] or in machine translation (MT) systems [RAU 09, CAR 12]. While CAT is often mentioned as a potential applicative field, the input of comparable corpora has not, to our knowledge, been genuinely studied within this application framework. Yet this framework raises several issues, such as scalability and adaptation to the needs of the final users.
This book has two primary objectives. The first is to assess the input of lexicons extracted from comparable corpora in the context of a specialized human translation task. Care has been taken to highlight the needs of translators and to understand how comparable corpora can best be leveraged for CAT.
The second objective is to identify bilingual-lexicon-extraction methods which best match the translators’ needs; determining the current limits of these techniques and suggesting improvements is the focus of this research. We will focus, in particular, on the identification of fertile translations (cases in which the target term has more words than the source term), the management of multiple morphological structures, and the ranking of candidate translations (the algorithms usually return several candidate translations for a single source term).
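As an illustration of the fertility notion just defined, whether a candidate translation is fertile can be checked with a simple word-count comparison. The sketch below is purely illustrative; the term pairs are hypothetical examples, not taken from the book's data:

```python
def is_fertile(source_term: str, candidate: str) -> bool:
    """A candidate translation is 'fertile' when the target term
    contains more words than the source term."""
    return len(candidate.split()) > len(source_term.split())

# A single-word source term rendered by a multi-word target term is fertile
# (hypothetical English->French pairs from the medical domain):
print(is_fertile("cytotoxic", "toxique pour les cellules"))  # True
print(is_fertile("tumor", "tumeur"))                         # False
```

In practice, tokenization and function words complicate this naive count, which is part of why fertile translations are hard for classical alignment techniques.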
The experiments are carried out on two language pairs (English–French and English–German) and on specialized texts dealing with breast cancer. This research places significant emphasis on applicability, and our methodological choices are guided by the needs of the final users.
This book is organized in two parts:
Part 1 presents the applicative and scientific context of the research. Chapter 1 gives a historical overview of the beginnings of MT and shows how research efforts gradually turned toward CAT and the leveraging of comparable corpora. It presents the current techniques for extracting bilingual lexicons and details the way in which we created the prototype of a CAT tool meant to leverage comparable corpora. Chapter 2 is devoted to the applicative assessment of this tool: we observe how the lexicons thus extracted enable translators to work more efficiently. This assessment highlights specific needs of human translation which are not addressed by the classical techniques of term alignment. This is why this research took a different path, toward methods which aim to generate the translations of terms and then filter them using the corpus, rather than to align terms previously extracted from the corpora. These techniques are described in Chapter 3, where the focus is mainly on the so-called compositional approaches. Their limits are explored, and Part 1 concludes with an indication of possible fruitful avenues for future research.
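The generate-then-filter idea just described can be sketched in a few lines: translate each component of the source term with a bilingual lexicon, permute the translated components, and keep only the candidates attested in the target corpus. This is a deliberately naive sketch with hypothetical toy data (a real system also handles morphological variation, function words and fertility):

```python
from itertools import permutations

def compositional_candidates(source_term, lexicon, target_vocab):
    """Generate candidate translations of a multi-word term compositionally:
    translate each component, permute the translated components, then keep
    only permutations attested as n-grams in the target corpus."""
    parts = source_term.split()
    if not all(p in lexicon for p in parts):
        return []  # this naive sketch requires every component in the lexicon
    translated = [lexicon[p] for p in parts]
    candidates = {" ".join(perm) for perm in permutations(translated)}
    return sorted(c for c in candidates if c in target_vocab)

# Hypothetical toy data (illustration only):
lexicon = {"breast": "sein", "cancer": "cancer"}
target_vocab = {"cancer sein", "cancer du sein"}  # attested target n-grams
print(compositional_candidates("breast cancer", lexicon, target_vocab))
# ['cancer sein']
```

Note that the toy filter misses the correct French term "cancer du sein" because the inserted preposition makes the target longer than the source: exactly the fertility problem this book addresses.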
Part 2 of the book is given over to the efforts to improve compositional translation. Chapter 4 presents the methodological framework of the research: it describes the principle behind the approach and highlights the contributions this work makes to compositional translation in terms of fertility, variety of the morphological structures processed and ranking of the candidate translations. The assessment methodology is also presented. Chapter 5 describes the data used for experimenting with the translation method: its origin, nature, size and acquisition method. Chapter 6 details the implementation: the translation generation algorithm is presented, and the translation generation method is then assessed from a variety of angles (contribution of the resources, of the translation strategies, of fertile translations, etc.). Finally, Chapter 7 formalizes and experiments with several ranking methods for the generated translations.
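As a foretaste of the ranking step, one simple criterion is the frequency of each candidate in the target comparable corpus; Chapter 7 combines several such criteria. The frequencies below are made-up toy values for illustration only:

```python
def rank_candidates(candidates, target_freq):
    """Order candidate translations of a source term by decreasing
    frequency in the target comparable corpus (one ranking criterion
    among several that can be combined)."""
    return sorted(candidates, key=lambda c: target_freq.get(c, 0), reverse=True)

# Hypothetical toy frequencies (illustration only):
freqs = {"cancer du sein": 120, "cancer mammaire": 35, "sein cancer": 1}
print(rank_candidates(["sein cancer", "cancer mammaire", "cancer du sein"], freqs))
# ['cancer du sein', 'cancer mammaire', 'sein cancer']
```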
This dissertation finishes with an assessment of the work carried out and suggestions of several research perspectives. The Appendices include an index of the measurements used throughout the book as well as extracts of the experimental data.
1 In May 2011, according to Web Technology Surveys: http://w3techs.com/technologies/overview/content_language/all.
2 http://www.internetworldstats.com/stats7.htm.
3 http://www.lingua-et-machina.com.
4 http://www.metricc.com.
PART 1
Applicative and Scientific Context
This chapter starts with a historical approach to computer-assisted translation (section 1.2): we will retrace the beginnings of machine translation and explain how computer-assisted translation has developed so far, with the recent appearance of the issue of comparable-corpus leveraging. Section 1.3 explains the current techniques to extract bilingual lexicons from comparable corpora. We provide an overview of the typical performances, and discuss the limitations of these techniques. Section 1.4 describes the prototyping of the computer-assisted translation (CAT) tool meant for comparable corpora and based on the techniques described in section 1.3.
From the beginning, scientific research in computer science has tried to use machines to accelerate or replace human translation. According to [HUT 05], the first research in machine translation was carried out in the United States between 1959 and 1966. Here, machine translation (MT) refers to the translation of a text by a machine without any human intervention. Up until 1966, several research groups were created, and two types of approach can be identified:
– On the one hand, there were the pragmatic approaches combining statistical information with trial-and-error development methods1 and whose goal was to create an operational system as quickly as possible (University of Washington, Rand Corporation and University of Georgetown). This research applied the direct translation method2 and this gave rise to the first generation of machine translation systems.
– On the other hand, theoretical approaches emerged, involving fundamental linguistics and considering research in the long term (MIT, Cambridge Language Research Unit). These projects were more theoretical and created the first versions of interlingual systems.3
In 1966, a report from the Automatic Language Processing Advisory Committee [ALP 66], which assessed machine translation purely on the basis of the needs of the American government – i.e. the translation of Russian scientific documents – announced that, after several years of research, it was not possible to obtain a translation entirely carried out by a computer and of human quality. Only post-editing would make it possible to reach a good quality of translation. Yet the value of post-editing is not self-evident: a study mentioned in the appendix of that report points out that “most translators found postediting tedious and even frustrating”, though many found that “the output served as an aid... particularly with regard to technical terms” [HUT 96].
