This book summarizes the main problems posed by the design of a man-machine dialogue system and offers ideas on how to continue along the path towards efficient, realistic and fluid communication between humans and machines. A culmination of ten years of research, it is based on the author's development, investigation and experimentation covering a multitude of fields, including artificial intelligence, automated language processing, man-machine interfaces and notably multimodal or multimedia interfaces.
Number of pages: 411
Year of publication: 2013
Table of Contents
Preface
Introduction
PART 1: HISTORICAL AND METHODOLOGICAL LANDMARKS
Chapter 1: An Assessment of the Evolution of Research and Systems
1.1. A few essential historical landmarks
1.2. A list of possible abilities for a current system
1.3. The current challenges
1.4. Conclusion
Chapter 2: Man-Machine Dialogue Fields
2.1. Cognitive aspects
2.2. Linguistic aspects
2.3. Computer aspects
2.4. Conclusion
Chapter 3: The Development Stages of a Dialogue System
3.1. Comparing a few development progresses
3.2. Description of the main stages of development
3.3. Conclusion
Chapter 4: Reusable System Architectures
4.1. Run-time architectures
4.2. Design-time architectures
4.3. Conclusion
PART 2: INPUTS PROCESSING
Chapter 5: Semantic Analyses and Representations
5.1. Language in dialogue and in man–machine dialogue
5.2. Computational processes: from the signal to the meaning
5.3. Enriching meaning representation
5.4. Conclusion
Chapter 6: Reference Resolution
6.1. Object reference resolution
6.2. Action reference resolution
6.3. Anaphora and coreference processing
6.4. Conclusion
Chapter 7: Dialogue Acts Recognition
7.1. Nature of dialogue acts
7.2. Identification and processing of dialogue acts
7.3. Multimodal dialogue act processing
7.4. Conclusion
PART 3: SYSTEM BEHAVIOR AND EVALUATION
Chapter 8: A Few Dialogue Strategies
8.1. Natural and cooperative aspects of dialogue management
8.2. Technical aspects of dialogue management
8.3. Conclusion
Chapter 9: Multimodal Output Management
9.1. Output management methodology
9.2. Multimedia presentation pragmatics
9.3. Processes
9.4. Conclusion
Chapter 10: Multimodal Dialogue System Assessment
10.1. Dialogue system assessment feasibility
10.2. Multimodal system assessment challenges
10.3. Methodological elements
10.4. Conclusion
Conclusion
Bibliography
Index
First published 2013 in Great Britain and the United States by ISTE Ltd and John Wiley & Sons, Inc.
Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms and licenses issued by the CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the undermentioned address:
ISTE Ltd
27-37 St George’s Road
London SW19 4EU
UK
www.iste.co.uk
John Wiley & Sons, Inc.
111 River Street
Hoboken, NJ 07030
USA
www.wiley.com
© ISTE Ltd 2013
The rights of Frédéric Landragin to be identified as the author of this work have been asserted by him in accordance with the Copyright, Designs and Patents Act 1988.
Library of Congress Control Number: 2013939407
British Library Cataloguing-in-Publication Data
A CIP record for this book is available from the British Library
ISBN: 978-1-84821-457-6
Preface
This book was prepared while I was working toward an accreditation to supervise research. It is a synthesis covering the past 10 years of research, since my doctorate [LAN 04], in the field of man–machine dialogue. The goal is to outline the theories, methods, techniques and challenges involved in the design of computer programs that are able to understand and produce speech. This synthesis presents important works in the field as well as a more personal approach, visible, for example, in the choice of themes explored. How can a machine talk, understand what is said and carry out a conversation close to a natural conversation between two human beings? What are the design stages of a man–machine dialogue system? What understanding, reasoning and interaction abilities are expected from such systems? How should they be implemented? How can we get closer to the realistic and fluid aspects of human dialogue? Can a dialogue system lie?
These questions are at the origin of my path, which has oscillated between linguistics and computer science, between pure research and development, and between public and private research laboratories: INRIA, then Thales and currently the CNRS. They are also questions that second-year Masters students asked me during the man–machine dialogue class that I taught at Paris Diderot University for a few years. This book thus draws inspiration in part from the preparation of that class and aims to be accessible to readers with some notions of linguistics and natural language processing, but not necessarily any knowledge of the man–machine dialogue domain.
The goal here is to explain the main issues raised by each stage of the design of a man–machine dialogue system, and to show a few theoretical and technical paths used to deal with these issues. The presentation cannot cover the full wealth of existing work; rather, it aims to give readers a glimpse of the field that may make them want to know more.
The goal is also to show that there still is a French school of man–machine dialogue today, one that has been especially active in the past few years, even if progress was at times slower and man–machine dialogue at times seemed to be an aporia. The French school is characterized by its multidisciplinary approach and its involvement in different areas, such as system development (university prototypes, general public systems, as well as – and we tend to forget them since they are confidential – military systems), the implementation of assessment methods and campaigns, and software architecture design. There is a French school for multimodal dialogue, for ergonomics, for embodied conversational agents, and even for the application of machine learning techniques to man–machine dialogue. Not all the links between these specialties are fully established, but the general dynamics are undeniable and encouraging.
As is usual in research work, what is presented in this book is indebted to the encouragement, advice and, more generally speaking, the sharing of an efficient and enjoyable work environment. For their institutional as well as scientific and human encouragement, I would like to thank Francis Corblin, Catherine Fuchs, Valérie Issarny, Jean-Marie Pierrel, Laurent Romary, Jean-Paul Sansonnet, Catherine Schnedecker, Jacques Siroux, Mariët Theune, Bernard Victorri and Anne Vilnat. For the incredibly enriching Ozone experiment during my postdoctoral fellowship at INRIA, I would particularly like to thank Christophe Cérisara, Yves Laprie and especially Alexandre Denis, on whom I was able to rely to implement a memorable demonstrator. For the equally memorable experience at Thales R & T, I would like to thank, more specifically, Claire Fraboulet-Laudy, Bénédicte Goujon, Olivier Grisvard, Jérôme Lard and Célestin Sedogbo. For the wonderful workplace that is the Lattice laboratory, a Joint Research Unit of the CNRS, I would like to thank, without repeating those whom I have already mentioned, Michel Charolles for our very enriching exchanges on reference, Shirley Carter-Thomas and Sophie Prévost for information structure, Thierry Poibeau and Isabelle Tellier for natural language processing, my successive colleagues Sylvain, Laure and Frédérique, as well as Benjamin, Denis, Fabien, Jeanne, Julie, Marie-Josèphe, Noalig, Paola, Paul, Pierre and Sylvie. I would also like to thank those with whom I was able to interact through Atala (I am thinking more specifically of Frédérique, Jean-Luc and Patrick) and within my man–machine dialogue classes, as well as those with whom I started collaborations, even if some of them did not come to fruition. Many thanks then go to Ali, Anne, Gaëlle, Jean-Marie, Joëlle, Meriam, Nathalie and Tien. Finally, I would like to thank Céline for her constant encouragement and unending support.
Introduction
The Ozone [ISS 05] system mentioned in the Preface was a demonstrator for a train ticket reservation service within the framework of the European Ozone project. This is a recurring application (or task) in man–machine dialogue, and it is the framework we will use to provide examples throughout the book. The computer program behind the demonstrator was able to process an audio input, transcribe the captured speech into text and understand that text in order to provide an adequate answer. The task required the system to know the timetables of a set of trains in a given region, so a database was implemented: it allowed the dialogue system to find the information crucial for its answers, which, as in a human dialogue, were given orally. Up to this point, we remain within the framework of spoken man–machine dialogue, with vocal inputs and outputs. This type of system can be used on the phone, with no visual channel. Ideally, the system is quick, comprehensive and provides relevant answers, so that the user has the impression of talking spontaneously, as with a human interlocutor.
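The processing chain just described (from captured audio to a spoken answer backed by a timetable database) can be pictured as a simple pipeline. The sketch below is only an illustration of that chain under assumed names (transcribe, understand, TimetableDatabase and generate_answer are all hypothetical); it does not reproduce the Ozone implementation.

# Minimal sketch of the spoken dialogue loop described above.
# All names are illustrative stubs, not the Ozone demonstrator's code.

def transcribe(audio: bytes) -> str:
    """Speech recognition: audio signal -> text (stubbed)."""
    return "I would like to go to Paris"

def understand(text: str) -> dict:
    """Language understanding: text -> a task-oriented request (stubbed)."""
    return {"task": "find_itinerary", "destination": "Paris"}

class TimetableDatabase:
    """Stands in for the train timetable database mentioned above."""
    def query(self, request: dict) -> list:
        return [{"to": request["destination"], "duration_min": 20}]

def generate_answer(results: list) -> str:
    """Answer generation: database results -> a sentence to be spoken."""
    return f"Here are your {len(results)} possible itineraries."

def dialogue_turn(audio: bytes, db: TimetableDatabase) -> str:
    text = transcribe(audio)
    request = understand(text)
    results = db.query(request)
    return generate_answer(results)

print(dialogue_turn(b"<audio signal>", TimetableDatabase()))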
However, we had set ourselves an additional specification: that of creating a multimodal system able to manage both speech and pointing gestures carried out on a touch screen. The system was able to recognize pointing gestures and to link these gestures with the words pronounced simultaneously. What was true of the system’s input had to be true of its output as well, so we designed a system able to manage output multimodality, which meant that it could produce both a vocal utterance and a display on the screen. In other words, once the system had decided on an answer to give the user, it could choose to verbalize its answer, to display it on the screen or, better yet, to verbalize part of it and display the rest. This is what we call a multimedia information presentation. Going beyond the issues of oral dialogue, we have reached the issues of multimodal dialogue. The systems in question involve a communication situation shared between the human user and the machine. This shared situation brings together a visual context (what appears on the computer’s screen) and gestures (which remain very simple for now, since they are limited to contact with the screen). With this communication situation, we get closer to in-person human dialogue: the user faces the machine when speaking and sees a visual display that the machine also “sees”.
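To make the input fusion and output splitting just described more concrete, here is a toy sketch. The data structures and function names (PointingGesture, fuse, fission) are assumptions made for illustration; they do not correspond to the demonstrator's actual components.

# Toy illustration of the multimodal behavior described above:
# fusing a spoken utterance with a simultaneous pointing gesture (input),
# and splitting an answer between speech and screen display (output).

from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class PointingGesture:
    target_id: str      # identifier of the object touched on the screen
    timestamp: float    # seconds, used to align the gesture with the speech

def fuse(utterance: str, gesture: Optional[PointingGesture]) -> dict:
    """Link the words with a gesture produced at (roughly) the same time."""
    request = {"utterance": utterance, "referent": None}
    if gesture is not None and "this" in utterance.lower():
        request["referent"] = gesture.target_id
    return request

def fission(answer: dict) -> Tuple[str, List[str]]:
    """Split the answer: a short part is verbalized, details are displayed."""
    return answer["summary"], answer["details"]

g = PointingGesture(target_id="itinerary_2", timestamp=12.4)
print(fuse("How long with this itinerary?", g))
print(fission({"summary": "Twenty minutes.",
               "details": ["itinerary_2: 20 min, direct"]}))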
To work, the program thus had to run on a computer equipped with at least a microphone, a speaker and a touch screen, which was much less common in 2004 than it is now. Table I.1 shows an example of the dialogue that the system could have with a user. The successive speaking turns are labeled with a letter (U for user, S for system) and a number, to make the analyses and discussions easier to follow.
Table I.1. Man–machine dialogue example

Utterance | Action on the screen
S1: "Hello, I am the train ticket reservation system." | Display of a map on screen
U1: "Hello, I would like to go to Paris." | –
S2: "Here are your possible itineraries." | Two itineraries appear
U2: "How long with this itinerary which seems shorter?" | Gesture pointing to one of the itineraries
S3: "Twenty minutes." | Highlighting the chosen itinerary
U4: "Very well, I would like to book a single journey." | –
S4: … | …
A dialogue like this one is a type of discourse – that is, a series of sentences linked to each other – with the specificity that it involves two speakers and not just one. When a dialogue involves more than two speakers, we can refer to it as a multilog. If we take the succession of words “here are your possible itineraries”, we use the term sentence as long as we take these words, their organization and their meaning out of context, and the term utterance if we take the context into account, that is the fact that this sentence was uttered by the system S at a specific moment in the dialogue and, in this case, at the same time as a display action (which gives the word “here” a specific meaning, this word being used to present multimedia information). Depending on the context, a single sentence can thus be the source of various utterances.
The example in Table I.1 is an interaction, according to the terminology adopted. In S1, the system presents itself; then, from U1 to U4, the dialogue focuses on the purchase of a train ticket. The extract from U1 to U4 is an exchange: the goal defined in U1 is reached in U4, which closes the exchange without putting an end to the interaction. An exchange necessarily involves two speakers and comprises several speaking turns, at least two. S1, U1 … U4 are interventions that match the speaking turns. An intervention only involves a single speaker and is defined as the largest monologal unit in an exchange. An intervention can consist of a single speech act (an action performed by speech, such as giving an order or answering a question), as in S2 or S3, or of several speech acts, as in S1 or U1, where the first act is a greeting and the second act is the transmission of information.
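The nested units just defined (interaction, exchange, intervention, speech act) can be pictured as a simple containment hierarchy. The sketch below is a purely illustrative data structure under assumed class names; it is not a formalization taken from the literature.

# Illustrative hierarchy for the dialogue units defined above: an interaction
# contains exchanges, an exchange contains interventions (one per speaking
# turn), and an intervention carries one or more speech acts.

from dataclasses import dataclass, field
from typing import List

@dataclass
class SpeechAct:
    act_type: str        # e.g. "greeting", "question", "inform"
    content: str

@dataclass
class Intervention:
    speaker: str         # "U" (user) or "S" (system)
    acts: List[SpeechAct]

@dataclass
class Exchange:
    interventions: List[Intervention] = field(default_factory=list)

@dataclass
class Interaction:
    exchanges: List[Exchange] = field(default_factory=list)

# U1 from Table I.1: a greeting followed by a transmission of information.
u1 = Intervention("U", [SpeechAct("greeting", "Hello"),
                        SpeechAct("inform", "I would like to go to Paris")])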
Because it relies on language (natural language, as opposed to the artificial languages of computer science), dialogue is studied with the notions of linguistics. The analysis of utterances thus falls within the field of pragmatics, the study of language in use. The analysis of the sentences themselves falls within the field of linguistics proper. More specifically, the analysis of the meaning of sentences and of the concepts involved falls within the field of semantics. At the level of sentence construction, we focus on words, on the units that make up the lexicon, on groups of words, on the order in which they appear and on the relations between them, which is syntax. In an oral dialogue, we also focus on the phonic materialization of sentences, the prominences, the rhythm and the melody, which fall within the field of prosody. To all these levels of analysis, we can add all the phenomena characterizing natural language, especially the fact that there are many ways to express a single meaning, or that language is in essence vague and imprecise, which can lead to ambiguities (more than one interpretation of an utterance is possible) and underspecification (the interpretation of an utterance can be incomplete). This is the wealth and diversity of language, which a natural language dialogue system needs to take into account if it is to be comprehensive. Language in a dialogue situation is also characterized by a wealth and diversity that show notably in utterance combination, that is the way in which an utterance is linked to the previous one and the way in which several successive utterances create an exchange, and, more generally, in the dialogue structure, which builds itself along with the interaction and is also an object of analysis. When this structure does not reflect the rigidity of a codified protocol but a natural use of language, we reach a final definition, that of natural dialogue in natural language.
This is the field of research and development covered in this book, and it has already been explored in many books, whether in the form of system presentations or of sufficiently formal theories that ultimately allow computer implementation. As an example, and in chronological order, we will mention a set of books whose reading is useful, even crucial, for any specialist in the field of man–machine dialogue: [REI 85], [PIE 87], [SAB 89], [CAR 90], [BIL 92], [KOL 93], [LUZ 95], [BER 98], [REI 00], [ASH 03], [COH 04], [HAR 04], [MCT 04], [LOP 05], [CAE 07], [JUR 09], [JOK 10], [RIE 11], [GIN 12] and [KÜH 12]. To provide the reader with a few points of reference and to approach the main aspects of the field, Chapter 1 will give a chronological outline of the field’s history.
The field of man–machine dialogue covers various scientific disciplines. We have mentioned computer science and language sciences, but we will also see in Chapter 2 that other disciplines can provide theories and complementary points of view. With the aim of designing a machine that has abilities close to those of a human being (we try to get as close to human abilities as possible, without simulating them), we can find inspiration in all kinds of studies of language and dialogue, so as to model them in a computational framework that allows their use in man–machine dialogue.
The field of man–machine dialogue (from now on MMD) has links with other fields, such as natural language processing (NLP), of which it is an essential application; artificial intelligence (AI), from which it arises and which completes the linguistic aspects with reasoning and decision-making aspects; man–machine interfaces (MMIs), which it helps enrich by offering vocal interaction possibilities in addition to graphical and touch screen interactions; and, more recently, question–answering systems (QAS) and embodied conversational agents (ECAs), two of its aspects – the first focusing on the natural language interrogation of large databases and the second on the visual and vocal rendering of the avatar representing the machine interlocutor – that have become fully fledged research fields. The MMD field thus brings together various issues that can be separated into three major categories:
According to the type of system considered (tool versus partner or, to put it differently, offering the user a logic of doing versus a logic of having things done), the communication modalities between the user and the system (written versus oral dialogue), the role given to the task underpinning the dialogue (open-domain versus closed-domain dialogue) and the importance given to language (dialogue favoring the task versus dialogue favoring linguistic fluidity and realism), these issues give rise to many approaches and ways of implementing them. The approaches can be theoretical, for example extending and testing a particular syntactic or pragmatic theory, or practical (favoring robustness). The implementations can be symbolic or statistical, etc. Chapter 3 will analyze these aspects while describing the development stages of an MMD system. As for the question of software architecture, Chapter 4 will complete the first part of the book with crucial challenges such as reusability and the design of generic models, in line with what is being done in the field of MMI.
Processing utterances at the system’s input is the focus of the second part of the book, with Chapter 5 covering the fundamental lexical, syntactic, prosodic and semantic aspects, Chapter 6 analyzing the issue of resolving contextual references and Chapter 7 discussing the recognition and interpretation of speech acts in the context of a dialogue. We will quickly go over the questions of automatic speech recognition and the so-called low-level processes in order to focus on the high-level processes that revolve around the meaning of utterances: semantics, reference and speech acts. Based on the example of U2 in Table I.1, Chapter 5, which focuses on semantic analysis, will show how to represent the meaning of the sentence “how long with this itinerary which seems shorter?”. The problem is complex because the sentence has a main clause and a subordinate clause, and because the main clause has no verb. Without such a linguistic analysis, an MMD system can hardly be called comprehensive. Chapter 6 focuses on reference and will show how the utterance and the pointing gesture of U2 allow us to provide the demonstrative referential expression “this itinerary” with a referent, in this case a specific train journey. Without this ability to resolve references, an MMD system can hardly know what is being referred to in the dialogue. Chapter 7, which focuses on speech acts, will show how the U2 intervention can be interpreted as a set of two speech acts, the first being a question and the second commenting on the train journey referred to, a comment that the system can process in different ways, for example depending on whether or not it is indeed the shortest itinerary. Here again, Chapter 7 will highlight an essential aspect of an MMD system: without the ability to identify speech acts, a system can hardly know how to react and answer the user.
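To fix ideas before these chapters, here is a deliberately simplified picture of the three processing steps applied to U2. The frame-like semantic representation, the resolution rule and the act labels are assumptions made for illustration, not the methods detailed in Chapters 5 to 7.

# Simplified view of the processing of U2
# ("how long with this itinerary which seems shorter?").

# 1. Semantic analysis: a verbless main clause asking for a duration,
#    plus a subordinate clause commenting on the same referent.
semantics = {
    "main": {"act": "ask", "attribute": "duration", "of": "?x"},
    "subordinate": {"act": "comment", "predicate": "seems_shorter", "of": "?x"},
    "referring_expression": {"variable": "?x", "form": "this itinerary"},
}

# 2. Reference resolution: the demonstrative "this itinerary" plus the
#    pointing gesture pick out one object in the visual context.
visual_context = {"itinerary_1": {"duration_min": 35},
                  "itinerary_2": {"duration_min": 20}}
pointed_object = "itinerary_2"          # provided by the touch screen
referent = pointed_object if pointed_object in visual_context else None

# 3. Speech act identification: a question plus a comment the system may
#    or may not react to (e.g. confirm that it is indeed the shortest).
speech_acts = [
    {"type": "question", "about": ("duration", referent)},
    {"type": "comment", "claim": ("shorter", referent)},
]
print(speech_acts)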
The system’s internal and output processing determines its behavior and is the focus of the third part of this book. In Chapter 8, we will see how identifying speech acts allows the system to reason according to the acts identified, the task and the dialogue already carried out. This question highlights the issue of putting the user’s utterance into perspective and determining the appropriate reaction in return. Beyond all the processes studied in the second part of the book, this is where we have to reason not at the level of a single utterance, but at that of the dialogue as a whole. We will thus speak of dialogue management. In Chapter 9, we will see how a system can carry out the reaction it has decided on. This is the question of automatic message generation, a question that takes a specific turn when we take avatars into account (here we join the field of ECAs) or even, much more simply, as mentioned before, the possibility of presenting information on a screen at the same time as a message is verbally expressed.
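As a minimal illustration of this "decide how to react" step, the rule-based fragment below maps an identified speech act, a task state and the dialogue history to a reaction split between speech and display. All names and rules are hypothetical; Chapter 8 discusses actual dialogue management strategies.

# Very small rule-based dialogue manager, for illustration only.

def decide_reaction(act: dict, task_state: dict, history: list) -> dict:
    """Choose what to say and what to display, given the identified act."""
    if act["type"] == "question" and act["about"][0] == "duration":
        referent = act["about"][1]
        duration = task_state["itineraries"][referent]["duration_min"]
        return {"say": f"{duration} minutes.",
                "display": {"highlight": referent}}
    if act["type"] == "request" and "destination" not in task_state:
        return {"say": "Where would you like to go?", "display": None}
    return {"say": "Could you rephrase that?", "display": None}

task_state = {"itineraries": {"itinerary_2": {"duration_min": 20}}}
act = {"type": "question", "about": ("duration", "itinerary_2")}
print(decide_reaction(act, task_state, []))   # cf. S3 in Table I.1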
Finally, Chapter 10 will deal with an aspect that concerns the design stages as well as the final system once the computer implementation is finished. This is the question of evaluation, a delicate one inasmuch as an MMD system integrates components with varied functionalities and, as we have seen, the system types that can be considered themselves have highly varied priorities and characteristics. This question will lead us to conclude on the field of MMD, its current state of achievement and the challenges for the years to come.
Man–machine dialogue (MMD) systems appear more present in works of science fiction than in reality. How many movies do we know that show computers, robots, or even fridges and children’s toys that can talk and understand what they are told? The reality is more complex: some products of new technologies, such as cell phones or companion robots, talk and understand a few words, but they are far from the natural dialogue that science fiction has been promising for years.
The ideas for applications are not lacking. Implementing a dialogue with a machine could be useful for getting targeted information, and this for any type of information: transportation [LAM 00], various stores, tourist or recreational activities [SIN 02], library collections, financial or administrative procedures [COH 04], etc.; see [GAR 02] and [GRA 05]. Dialogue is indeed suited to the step-by-step elaboration of a request, a request that would be difficult to hold in a single utterance or in a command expressed in a computer language. The first field of application of MMD, which includes question–answering systems (QAS), is sometimes defined as . When the dialogue only concerns a single topic, for example railway information, we talk of closed-domain dialogue. When the dialogue can be about pretty much anything, for example the querying of an encyclopedic database as IBM Watson recently did in a TV quiz show task, we talk of open-domain dialogue [ROS 08]. If we reuse the example of the introduction, a single utterance with no dialogue could be as follows: “I would like to book a single journey to Paris taking the shortest itinerary as long as it takes less than half an hour (otherwise I do not wish to make a reservation)”. The elaboration of a natural dialogue is much more flexible: it allows the user to express a first simple request and then refine it according to the machine’s answer; it allows the machine to transfer information for a future action, and to confirm or negate along the way [PIE 87]. The total number of words needed to arrive at the same result might be greater, but the spontaneity of the utterances, their speed and their ease of production more than make up for this. The example of querying a yellow-pages-style directory [LUZ 95] shows another advantage of dialogue: the user can obtain the address of a taxidermist even when he/she does not know the name of this profession. Through the conversation, the dialogue, the user gets the machine to understand exactly what he/she is looking for. There is a joint construction of a concept common to both interlocutors, and this joint construction is the point of dialogue compared to the single utterance or the computer language request.
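The step-by-step elaboration described above can be pictured as constraints accumulating turn by turn, in the manner of slot filling. The sketch below is a hypothetical illustration of that incremental refinement; the slot names are invented.

# Constraints accumulate turn by turn instead of being packed into
# one long utterance (cf. the single-utterance reservation example above).

request: dict = {}
turns = [
    {"destination": "Paris"},        # "I would like to go to Paris."
    {"max_duration_min": 30},        # "...as long as it takes less than half an hour"
    {"ticket": "single"},            # "I would like to book a single journey."
]
for constraints in turns:
    request.update(constraints)      # each turn refines the same request
    print("request so far:", request)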
