Textual Information Access
Description

This book presents statistical models that have recently been developed within several research communities to access information contained in text collections. The problems considered are linked to applications aiming at facilitating information access:

  • information extraction and retrieval;
  • text classification and clustering;
  • opinion mining;
  • comprehension aids (automatic summarization, machine translation, visualization).

In order to give the reader as complete a description as possible, the focus is placed on the probability models used in the applications concerned, by highlighting the relationship between models and applications and by illustrating the behavior of each model on real collections.

Textual Information Access is organized around four themes: information retrieval and ranking models; classification and clustering (logistic regression, kernel methods, Markov fields, etc.); multilingualism and machine translation; and emerging applications such as information exploration.




Table of Contents

Introduction

PART 1: INFORMATION RETRIEVAL

Chapter 1. Probabilistic Models for Information Retrieval

1.1. Introduction

1.2. 2-Poisson models

1.3. Probability ranking principle (PRP)

1.4. Language models

1.5. Informational approaches

1.6. Experimental comparison

1.7. Tools for information retrieval

1.8. Conclusion

1.9. Bibliography

Chapter 2. Learnable Ranking Models for Automatic Text Summarization and Information Retrieval

2.1. Introduction

2.2. Application to automatic text summarization

2.3. Application to information retrieval

2.4. Conclusion

2.5. Bibliography

PART 2: CLASSIFICATION AND CLUSTERING

Chapter 3. Logistic Regression and Text Classification

3.1. Introduction

3.2. Generalized linear model

3.3. Parameter estimation

3.4. Logistic regression

3.5. Model selection

3.6. Logistic regression applied to text classification

3.7. Conclusion

3.8. Bibliography

Chapter 4. Kernel Methods for Textual Information Access

4.1. Kernel methods: context and intuitions

4.2. General principles of kernel methods

4.3. General problems with kernel choices (kernel engineering)

4.4. Kernel versions of standard algorithms: examples of solvers

4.5. Kernels for text entities

4.6. Summary

4.7. Bibliography

Chapter 5. Topic-Based Generative Models for Text Information Access

5.1. Introduction

5.2. Topic-based models

5.3. Topic models

5.4. Term models

5.5. Similarity measures between documents

5.6. Conclusion

5.7. Appendix: topic model software

5.8. Bibliography

Chapter 6. Conditional Random Fields for Information Extraction

6.1. Introduction

6.2. Information extraction

6.3. Machine learning for information extraction

6.4. Introduction to conditional random fields

6.5. Conditional random fields

6.6. Conditional random fields and their applications

6.7. Conclusion

6.8. Bibliography

PART 3: MULTILINGUALISM

Chapter 7. Statistical Methods for Machine Translation

7.1. Introduction

7.2. Probabilistic machine translation: an overview

7.3. Phrase-based models

7.4. Modeling reorderings

7.5. Translation: a search problem

7.6. Evaluating machine translation

7.7. State-of-the-art and recent developments

7.8. Useful resources

7.9. Conclusion

7.10. Acknowledgments

7.11. Bibliography

PART 4: EMERGING APPLICATIONS

Chapter 8. Information Mining: Methods and Interfaces for Accessing Complex Information

8.1. Introduction

8.2. The multidimensional visualization of information

8.3. Domain mapping via social networks

8.4. Analyzing the variability of searches and data merging

8.5. The seven types of evaluation measures used in IR

8.6. Conclusion

8.7. Acknowledgments

8.8. Bibliography

Chapter 9. Opinion Detection as a Topic Classification Problem

9.1. Introduction

9.2. The TREC and TAC evaluation campaigns

9.3. Cosine weights - a second glance

9.4. Which components for opinion vectors?

9.5. Experiments

9.6. Extracting opinions from speech: automatic analysis of phone polls

9.7. Conclusion

9.8. Bibliography

Appendix A. Probabilistic Models: An Introduction

A.1. Introduction

A.2. Supervised categorization

A.3. Unsupervised learning: the multinomial mixture model

A.4. Markov models: statistical models for sequences

A.5. Hidden Markov models

A.6. Conclusion

A.7. A primer of probability theory

A.8. Bibliography

List of Authors

Index

First published 2012 in Great Britain and the United States by ISTE Ltd and John Wiley & Sons, Inc.

Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms and licenses issued by the CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the undermentioned address:

ISTE Ltd
27-37 St George’s Road
London SW19 4EU
UK
www.iste.co.uk

John Wiley & Sons, Inc.
111 River Street
Hoboken, NJ 07030
USA
www.wiley.com

© ISTE Ltd 2012

The rights of Eric Gaussier and François Yvon to be identified as the authors of this work have been asserted by them in accordance with the Copyright, Designs and Patents Act 1988.

Library of Congress Cataloging-in-Publication Data

Textual information access: statistical models/edited by Eric Gaussier, François Yvon.

p. cm.

Includes bibliographical references and index.

  ISBN 978-1-84821-322-7

 1. Text processing (Computer science)--Statistical methods. 2. Automatic indexing. 3. Discourse analysis--Data processing. I. Gaussier, Eric. II. Yvon, François.

  QA76.9.T48T56 2011

  005.52--dc23

2011041292

British Library Cataloguing-in-Publication Data

A CIP record for this book is available from the British Library

ISBN: 978-1-84821-322-7

Introduction¹

The information society in which we live produces a constantly changing flow of diverse types of data which needs to be processed quickly and efficiently, be it for professional or leisure purposes. Our capacity to evolve in this society depends more and more on our capacity to find the information most suitable for our needs, to filter this information so as to extract the main topics, snippets, tendencies and opinions, and also to visualize, summarize and translate this information. These various processes raise two important issues: on the one hand, the development of complex mathematical models which fully take into account the data to be processed, and, on the other hand, the development of efficient algorithms associated with these models, capable of processing large quantities of data and of providing practical solutions to the above problems.

In order to meet these requirements, several scientific communities have turned to probabilistic models and statistical methods which allow both richness in modeling and robustness in processing large quantities of data. Such models and methods can furthermore adapt to the evolution of data sources. Scientific communities involved in text processing have not stayed away from this movement, and methods and tools for natural language processing or information retrieval are largely based today on complex statistical models which have been developed over several years.

Students, engineers and researchers entering the area of textual information access are faced with an abundant literature which exploits different statistical models, sometimes difficult to grasp and not always presented in detail. The main objective of this book is to present, as explicitly as possible, the statistical models used to access textual information. The problems we address are linked to traditional applications of information access:

– information extraction and information retrieval;

– text classification and text clustering;

– comprehension aid via automatic summarization, machine translation and visualization tools;

– opinion detection.

Beyond these applications, which have all been the subject of specifically dedicated work, be it in information retrieval, data mining, machine learning or natural language processing, we have tried here to focus on the main statistical and probabilistic models underlying them so as to propose a homogeneous and synthetic view of the fundamental methods of the tools used for textual information access. Such a summary seems all the more desirable as the different communities concerned converge, as we will see throughout the various chapters of this book, on numerous points. These points relate to the numeric representations derived from textual data as well as to the development of models for large data sets and to the reliance on standard benchmarks and evaluation campaigns for evaluation purposes. The scope of the models presented here actually goes beyond text-based applications, and readers are likely to re-use in other domains, such as image/video processing or recommendation systems, what they will learn in this book.

That said, we have wished to maintain a strong relationship between models and their main applications. For this reason, each chapter presents, for one or several problems, an associated group of models and algorithms (for learning and/or inference). The links between applications on the one hand and models on the other are explained and illustrated, as much as possible, on real collections.

For the sake of readability, the contributions to this work are organized into 4 parts. Part 1 concerns Information Retrieval and comprises two chapters. The first one, entitled “Probabilistic Models for Information Retrieval”, written by S. Clinchant and E. Gaussier, presents an overview of probabilistic models used for information retrieval, from the binary independence model to the more recent models founded on the concepts of information theory. The mathematical grounds and hypotheses on which these models rely are described in detail and the chapter concludes by comparing the performance of these models. The second chapter, entitled “Learnable Ranking Models for Automatic Text Summarization and Information Retrieval”, written by M.-R. Amini, D. Buffoni, P. Gallinari, T.V. Truong, and N. Usunier, presents learning models for ranking functions, which have recently gained attention from several research communities, since (a) they allow a complete modeling of certain information access problems, and (b) yield very good performance in several practical settings. These models are presented here in relation to their application to automatic text summarization and information retrieval.

Part 2 of this work concerns text classification and clustering, and comprises four chapters. The first one, entitled “Logistic Regression and Text Classification”, written by S. Aseervatham, E. Gaussier, A. Antoniadis, M. Burlet, and Y. Denneulin, concerns a family of models among which the simplest and most popular is the logistic regression model. After a review of generalized linear models and the IRLS algorithm, the logistic regression model (under both its binomial and multinomial forms) is presented in detail along with associated regularization methods (ridge, LASSO, and selected ridge). This model is illustrated through a document classification task involving a large number of categories. The second chapter, entitled “Kernel Methods for Textual Information Access”, is written by J.-M. Renders. It presents different kernels which have been proposed for text processing. After a review of the fundamentals of kernel methods and of the contexts in which kernels are used (logistic regression, large margin separators, principal component analysis), this chapter describes the kernels associated with different representations of texts: kernels for bags of words, for chains (of characters or words), for trees, for graphs, or for probability distributions. The following chapter, entitled “Topic-based Generative Models for Text Information Access”, is a contribution by J.-C. Chappelier. It describes topic models, focusing on PLSI (Probabilistic Latent Semantic Indexing) and LDA (Latent Dirichlet Allocation), which constitute the basis of most of the topic models used today. Topic models aim at automatically discovering the topics underlying (hence the term latent) a document collection, and are used in applications such as clustering, classification and information extraction. This chapter provides several illustrations of the topics discovered by these models, along with a detailed description of two (generic) methods to estimate their parameters: variational approximation and Gibbs sampling. The last chapter of this part, entitled “Conditional Random Fields for Information Extraction”, written by I. Tellier and M. Tommasi, gives a detailed description of a particular graphical model, conditional random fields. This model was recently introduced in machine learning and natural language processing in order to account for the complex dependencies between elements of a sequence, or more generally between subparts of structured representations. Conditional random fields generalize hidden Markov models, which have been (and still are) widely used to label word sequences. The targeted applications here are information extraction and named entity recognition, but these models can also be applied to information retrieval in structured documents such as XML documents.

Part 3 of this work concerns Multilingualism. The reader will find a single but substantial chapter dedicated to “Statistical Methods for Machine Translation”, written by A. Allauzen and F. Yvon. The main probabilistic models used in machine translation systems are described in detail, as are the different modules that make up a statistical translation system. The authors also discuss the evaluation of translation systems as well as recent research directions.

Finally, the last part of this work is dedicated to Emerging Applications, and comprises two chapters. The first one, entitled “Information Mining: Methods and Interfaces for Accessing Complex Information”, is written by J. Mothe, K. Englmeier and F. Murtagh. It introduces numerous tools for visualizing information, for building maps of a domain from data sets along different dimensions, and for analyzing the variability of different results (e.g. for data fusion, or to analyze the dependencies between different evaluation measures such as those used in information retrieval). The second chapter, entitled “Opinion Detection as a Topic Classification Problem”, is a contribution written by J.-M. Torres-Moreno, M. El-Bèze, P. Bellot, and F. Béchet. It describes in detail the task of detecting opinions in text documents. After a review of the problems encountered in this task and of the associated evaluation campaigns, the authors present several systems developed by different teams. The performance of most systems on standard data sets is also reported. Finally, the authors investigate opinion detection on audio streams, thus widening the scope of this book.

In addition to these contributions, which aim at presenting in detail the main statistical models used to access textual information, the reader will find in the Appendix a general introduction to probability for text mining, written by F. Yvon. This Appendix presents in detail the probabilistic models used for the statistical analysis of texts, the objective being to give a theoretical foundation to the probabilistic models presented throughout this book.

As one can see, this book contains rich and varied contributions. Even if the models considered are sometimes sophisticated, we have tried (a) to present them with a constant concern for precision, and (b) to illustrate them within the framework of standard information access applications. The appendix should, we hope, allow readers not familiar with probabilistic reasoning to acquire the foundations needed to understand the different chapters. We also hope this book will be a valuable reference for researchers and engineers who wish to review the different statistical models and methods used in textual information access, as well as for Masters and engineering school students who want to study specific models in this domain further.

To conclude, we want to thank the contributors to this book for having followed us in this enterprise and for having dedicated a huge amount of work, with no guarantee of profit, to presenting their research as clearly as possible and to relating it to the work conducted in connected areas. We are convinced that all this effort was worth it, and that the result will benefit many.

Notations

Wherever possible, we have tried to use the following notations throughout the entire book. In some chapters, additional and/or alternative notations have sometimes been chosen to adhere to the conventions of a specific sub-domain. These alternative notations are introduced, when needed, at the beginning of the corresponding chapter(s).

Table 1. Notations

¹ Introduction written by Eric GAUSSIER and François YVON.

PART 1

Information Retrieval

Chapter 1

Probabilistic Models for Information Retrieval¹

In this chapter, we present the main probabilistic models for information retrieval. We recall that an information retrieval system is characterized by three components:

1) a module for indexing queries;

2) a module for indexing documents;

3) a module for matching documents and queries.

Here, we are not interested in the indexing modules, which are developed elsewhere (see for example [SAV 10]), but only in the matching module. In addition, among all the information retrieval models, we concentrate on the probabilistic models, as they are considered to be the strongest performers in information retrieval and have been the subject of a large number of developments over recent years.
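To fix ideas, the following Python sketch caricatures this decomposition (the function names and the term-overlap score are invented for illustration; they merely stand in for the probabilistic scoring functions developed in this chapter):

def index(text):
    # Indexing module, shared here by queries and documents: map raw
    # text to a set of terms. Real systems add stop-word filtering,
    # normalization and occurrence counting, as described below.
    return set(text.lower().split())

def match(query, documents):
    # Matching module: score each document against the query and
    # return the documents ranked by decreasing score. The crude
    # term-overlap count used here is only a placeholder.
    q = index(query)
    scored = [(len(q & index(d)), d) for d in documents]
    return [d for _, d in sorted(scored, key=lambda p: p[0], reverse=True)]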

1.1. Introduction

Information Retrieval (IR) organizes collections of documents and responds to user queries by supplying a list of documents deemed relevant for the user’s requirements. In contrast to databases, information retrieval systems (a) process unstructured information, such as the contents of text documents, and (b) fit well within a probabilistic framework, which is generally based on the following assumption:

Assumption 1. The words and their frequencies in a single document or a collection of documents can be considered as random variables. Thus, it is possible to observe the frequency of a word in a corpus and to study it as a random phenomenon. In addition, it is possible to view a document or a query as the result of a random process.

Initial IR models considered words as predicates of first-order logic. From this point of view, a document is considered relevant if it implies, in the logical sense, the query. Later, vector space models represented documents in vector spaces whose axes correspond to the different indexing terms. The similarity between a document and a query can then be calculated from the angle between the two associated vectors. Beyond the Boolean and vector representations, the probabilistic representation provides a paradigm that is very rich in models: for example, different probability laws can be used to model the frequency of words.
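As a concrete illustration of the vector space view, here is a minimal Python sketch of the cosine of the angle between a document vector and a query vector (the two toy vectors are invented for illustration):

import math

def cosine(d, q):
    # cos(theta) = <d, q> / (||d|| ||q||): 1 when the vectors are
    # collinear, 0 when they share no non-zero coordinate.
    dot = sum(x * y for x, y in zip(d, q))
    norms = math.sqrt(sum(x * x for x in d)) * math.sqrt(sum(y * y for y in q))
    return dot / norms if norms else 0.0

# Toy occurrence counts over three index terms, say (crow, cheese, fox):
document = [2, 1, 1]
query = [1, 0, 1]
print(cosine(document, query))  # approx. 0.866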

In all these models, a pre-processing stage is necessary to achieve a useful representation of the documents. This pre-processing consists of filtering out very frequent words (stop words), then normalizing the surface form of the remaining words (removing, for example, conjugation and plural marks), and finally counting, for each term, its number of occurrences in a document. Consider for example the following document (extracted from “The Crow and the Fox”, by Jean de la Fontaine):

“Mr Crow, perched on a tree,

Holding a cheese in his beak.

Mr Fox, enticed by the smell,

This is what he said:

Well, hello, Mr Crow

How lovely you are! How handsome you seem!”

The filtering of stop words leads to the removal of words such as “a” and “the”. Afterward, the word occurrences are counted: the term Crow occurs twice in this document, whereas the term cheese appears once. We can thus represent a document by a vector whose coordinates correspond to the numbers of occurrences of the different terms, and a collection of documents by a group of such vectors, in matrix form. The sketch below illustrates this counting step.
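Here is a minimal Python sketch of this counting step, applied to the excerpt above (the stop-word list is purely illustrative; real systems use much larger lists and genuine stemming):

import re
from collections import Counter

STOP_WORDS = {"a", "the", "on", "in", "his", "by", "this", "is",
              "what", "he", "well", "hello", "how", "you", "are"}

def bag_of_words(text):
    # Crude lowercasing and tokenization stand in for the full
    # pre-processing chain; stop words are dropped, the rest counted.
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

excerpt = ("Mr Crow, perched on a tree, Holding a cheese in his beak. "
           "Mr Fox, enticed by the smell, This is what he said: "
           "Well, hello, Mr Crow How lovely you are! How handsome you seem!")
counts = bag_of_words(excerpt)
print(counts["crow"], counts["cheese"])  # 2 and 1, as noted above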

In all the models, we shall see that the numbers of occurrences of different words are considered to be statistically independent. Thus, we can suppose that the random variable corresponding to the number of occurrences of cheese is independent of the random variable for Crow. We denote by Xw the random variable associated with the word w; a document is then a multivariate random variable denoted Xd. The definitions used in this chapter are summarized in Table 1.1. These definitions represent those most commonly (and most recently) used in information retrieval. We will often refer to a probability law predicting numbers of occurrences as a frequency law.
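Spelled out, this independence assumption means that the joint law of a document factorizes over the vocabulary. In LaTeX notation (with V standing for the vocabulary, a notation assumed here for illustration):

P\bigl(X_d = (x_{w_1}, \dots, x_{w_{|V|}})\bigr) = \prod_{w \in V} P(X_w = x_w)

For instance, the probability of observing Crow twice and cheese once is simply the product P(X_{Crow} = 2) \cdot P(X_{cheese} = 1), whatever frequency law is chosen for each word.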

Table 1.1. Notations

Historically, we can classify the probabilistic models for information retrieval under three main categories:

1) Probability ranking principle

Read on in the full edition!