Geographical Information Retrieval in Textual Corpora - Christian Sallaberry - E-Book

Geographical Information Retrieval in Textual Corpora E-Book

Christian Sallaberry

Description

This book addresses the field of geographic information extraction and retrieval from textual documents. Geographic information retrieval is a rapidly emerging subject, a trend fostered by the growing power of the Internet and the emerging possibilities of data dissemination.

After positioning his work in this field in Chapter 1, the author makes proposals in the following two chapters. Chapter 2 focuses on spatial and temporal information indexing and retrieval in corpora of textual documents. Propositions for both spatial and temporal information retrieval (IR) are made. Chapter 3 tackles the use of generalized spatial and temporal indexes, which are produced from there in the framework of multi-criteria IR. Geographic IR (GIR) is discussed at length, since this IR combines the criteria of spatial, temporal and thematic research.

The author provides a rich bibliographical study of the current approaches focused on the modeling and retrieval of spatial and temporal information in textual documents, and similarity measures developed thus far in the literature.

The book concludes with a broad perspective of the remaining scientific challenges. Several areas of research are discussed, such as integration of a domain-based ontology, modeling of spatial footprints from the interpretation of spatial relation, and parsing of relations between features deemed relevant within a document resulting from a GIR process.


Number of pages: 236

Year of publication: 2013

Contents

Foreword

Acknowledgments

Introduction

1 Access by Geographic Content to Textual Corpora: What Orientations?

1.1. Introduction

1.2. Access by geographic content to textual corpora

1.3. Reinforcement of GIR by contributions from NLP, reasoning and multicriteria IR

1.4. Toward the construction of a multicriteria IR engine

2 Spatial and Temporal Information Retrieval in Textual Corpora

2.1. Introduction

2.2. Review of challenges, hypotheses and research objectives

2.3. Spatial and temporal information in textual documents: literature review

2.4. Proposition for spatial and temporal information indexing and retrieval in textual corpora

2.5. Summary

3 Multicriteria Information Retrieval in Textual Corpora

3.1. Introduction

3.2. Review of challenges, hypotheses and research objectives

3.3. Standardization and combination of criteria: literature review

3.4. Proposition for indexing by tiling and multicriteria IR in textual corpora

3.5. Evaluation and discussion

3.6. Summary

4 General Conclusion

4.1. Summary

4.2. Perspectives

Bibliography

Index

First published 2013 in Great Britain and the United States by ISTE Ltd and John Wiley & Sons, Inc.

Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms and licenses issued by the CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the undermentioned address:

ISTE Ltd 27-37 St George’s Road London SW19 4EU UK www.iste.co.uk

John Wiley & Sons, Inc. 111 River Street Hoboken, NJ 07030 USA www.wiley.com

© ISTE Ltd 2013

The rights of Christian Sallaberry to be identified as the author of this work have been asserted by him in accordance with the Copyright, Designs and Patents Act 1988.

Library of Congress Control Number: 2013940049

British Library Cataloguing-in-Publication Data

A CIP record for this book is available from the British Library

ISSN: 2051-2481 (Print)

ISSN: 2051-249X (Online)

ISBN: 978-1-84821-596-2

Foreword

This very well-documented book addresses the field of geographic information extraction and retrieval from textual documents. Geographic information retrieval from documents is, indeed, a rapidly emerging subject, a trend fostered by the growing power of the Internet and the emerging possibilities of data dissemination. The book covers the full processing chain: the identification of spatial and temporal features in textual documents, data indexing and handling of the relevance of identified items, multicriteria retrieval, and the evaluation of query results through the development of several prototypes.

The author first introduces the principles of document retrieval and then illustrates the roles and importance of spatial and temporal information in textual documents. The addressed scientific challenges lie at the intersection of information retrieval techniques, natural language processing and qualitative spatial reasoning. The contributions presented address the development of spatial and temporal data models, geographic information extraction and analysis as well as symbolic annotations. Christian Sallaberry develops several of his recent contributions oriented around the development of spatial and temporal information indexing and textual document retrieval, these propositions being, by themselves, a worthwhile contribution of this monograph.

The book is usefully completed by a rich bibliographical study of current approaches focused on the modeling and retrieval of spatial and temporal information in textual documents and similarity measures developed so far in published literature. This allows Christian Sallaberry to develop a contribution in which the linguistic annotations, as well as the developed framework, enable us to identify, interpret and retrieve spatiotemporal information. This approach is typically qualitative in the sense that the spatial and temporal features identified in a corpus can be described from spatial and temporal relationships. These relationships play an important role in the derivation of spatial and temporal indexes and the execution of information retrieval processes, where spatial and temporal similarity measures allow us to trigger and rank query results.

The framework is completed by a multicriteria information retrieval approach. To develop and present his contribution, Christian Sallaberry introduces a useful literature review of spatiotemporal query homogenization. He introduces a spatial and temporal indexing approach based on the concepts of tiling and relevance scores, and different degrees of preference levels.

The conclusion of this book provides a broad perspective on the remaining scientific challenges. Several areas of research are discussed: integration of a domain-based ontology, modeling spatial relations in the interpretation of spatial features, generalization of these approaches in relation to the temporal and semantic dimensions, and semantic enrichment from annotations. All these domains are challenging and very attractive areas of research.

Overall, this book constitutes a very well-documented contribution, original and useful in a domain undergoing rapid development. The approach is original and contributes to the field of geographic information extraction and retrieval from textual documents. It should raise wide interest among researchers in the fields of geographic and textual information processing as well as among developers of information and Web data processing systems. I hope it will generate many new vocations!

Dr Christophe CLARAMUNT
Professor in Computer Science
Naval Academy Research Institute
Lanveoc-Poulmic, June 2013

Acknowledgments

This book has its origins in my accreditation to direct research. Thus, my acknowledgments go first to Mauro Gaio, Professor at the UPPA1, for his help in this long preparation for the accreditation to direct research. I would also like to thank my reviewers Mohand Boughanem, Professor at the Paul Sabatier University in Toulouse, Christophe Claramunt, Professor at the Naval Academy of Brest, and Ross Purves, Professor at the University of Zurich, for their expert reports and numerous pieces of advice that have enabled me to improve the original manuscript. Finally, thanks go to Marie-Aude Aufaure, Professor at the Ecole Centrale de Paris, Florence Le Ber, PhD Supervisor at the Ecole nationale du génie de l’eau et de l’environnement de Strasbourg, and Thierry Nodenot, Professor at the UPPA, who have also carefully examined the manuscript of my accreditation to direct research.

I will present the results of the work conducted as a team within the LIUPPA2 laboratory. Therefore, I would like to thank once again Mauro Gaio for associating me with his research on natural language processing and reasoning aimed at marking and analyzing spatial and temporal information in bodies of text. My thanks go equally to my colleagues at the LIUPPA. I would like to mention, in particular, my colleagues Marie-Noëlle Bessagnet, Annig Lacayrelle and Albert Royer as well as the four doctoral students Pierre Laforcade, Julien Lesbegueries, VanTien NGuyen and Damien Palacio whose work I have been able to co-supervise: they have offered me the possibility to share in fruitful collaborations that have contributed a lot to this research. I also extend my gratitude to all other colleagues with whom I have had the chance to work in the context of different research projects. I would like to specifically thank my colleagues at the IRIT3 institute of Toulouse, Guillaume Cabanac and Gilles Hubert, for their confidence and their pertinent proposals that have contributed a lot to these results.

Without all these meetings, the work presented in this book would not have seen the light of day.

1 University of Pau and Pays de l’Adour: www.univ-pau.fr/.

2 Computer Science Laboratory of the University of Pau and Pays de l’Adour: liuppa.univ-pau.fr/.

3 Institut de recherche en informatique de Toulouse: www.irit.fr/.

Introduction

I.1. Geographic information retrieval

The work presented in this book lies within the field of geographic information retrieval (GIR). Information retrieval (IR) is finding documents which satisfy an information need from within a collection of documents generally stored on the Internet [MAN 08b]. GIR, first named and defined by Ray Larson [LAR 96], aims at retrieving documents which satisfy geographic characteristics: thus, the geographic zones featured in documents resulting from a GIR partially or entirely cover those expressed in the query. The series of GIR conferences1 that began in 2004 [PUR 04] has heavily contributed to the development of GIR. GIR focuses on the spatial dimension in the first place and then, for textual documents, is extended by the thematic dimension conveyed by meaningful terms (other than spatial). We find the spatio-textual search dimension in the series of SSTD2 conferences [JEN 01] beginning in 2005 [VAI 05], GeoCLEF3 [GEY 05] beginning in 2005 [BUC 05] and GIS4 [PIS 93] beginning in 2007 [LIE 07]. More notably, it is the series of RIAO5 [ARS 85], GIR and GIS conferences that associated the temporal dimension with the spatial dimension and/or the thematic dimension: spatio-temporal-textual search, respectively, in 2004 [WID 04], in 2007 [MAR 07] and in 2010 [LIU 10].

Numerous research publications discuss these dimensions of GIR. We can name the books Georeferencing [HIL 06]; The Geospatial Web [SCH 07]; Linguistique et recherche d’information, la problématique du temps (Linguistics and information retrieval, the temporal issue) [BAT 11] or the theses “Toponym resolution by text” [LEI 07]; “Geographic aware web text mining” [MAR 08a]; “Temporal information retrieval” [ALO 08]; “Geographic information retrieval: classification, disambiguation and modeling” [OVE 09], “Geographically constrained information retrieval” [AND 10], “Traitement automatique du langage pour l’indexation thématique et l’extraction d’informations temporelles” (Natural Language Processing for the extraction and indexing of thematic and temporal information) [KEV 11]. These works mainly target GIR in textual or multimedia documents consisting of a few lines to a number of pages available on the Web.

The work presented in this book focuses on digital libraries (DL) and, in particular, textual corpora as the application domain. We can refer to the Google Books6 project with more than 10 million books digitized to date, the World Digital Library7 with 6,142 digital objects at the moment, the Europeana8 project with 10 million digitized objects to date or the Gallica9 project of the National Library of France (BnF) with more than a million textual documents (books, periodicals, reviews and journals) digitized. Similar to many libraries and multimedia libraries, the MIDR10 of Pau Pyrénées digitizes various kinds of documents (literary works, travelogues, newspapers, old geographical maps, lithographs, postcards, etc.), which have a common attribute of dealing with a small territory (the Pyrénées11) in a given period of history (mainly the 18th and 19th Centuries). This kind of document repository contains a great deal of references to history, geography, heritage; in other words to the territory [KER 11]. The objective of these different projects is to provide, to the widest audience, new means of accessing document repositories now available in digital formats. Thus, these projects implement processes of marking information, constructing indexes and querying by using these indexes.

The documents composing the corpus of MIDR are of particular importance in their richness in geographical indications relative to the Pyrenean territory. User categories such as “tourist”, “student”, “pedagogue”, “scholar” and “librarian” have been identified by the staff of MIDR. These users intend to take advantage of the corpus by using an adapted information system capable, in particular, of offering search possibilities from the viewpoint of the territory represented by the corpus. As stated by Jihad Farhat and Luc Girard [FAR 04], document management systems (DMS) and search engines complement each other in order to support the activities of library users and professionals. We propose extending the functionalities of these systems through specific services dedicated to the processing of the spatial, temporal and thematic dimensions of information. Thus, in comparison to content on the Web, we only consider document repositories such as those of MIDR that are stable (the content of a book does not change over the course of time) and homogeneous to allow thorough indexing relative to each of these three dimensions.

I.2. From spatial and temporal information indexing to multicriteria information retrieval

Literature relative to GIR in textual corpora presents the following challenges:

1) the recognition and interpretation of the spatial and temporal named entities;
2) the spatial and temporal indexation for purposes of IR;
3) the matching of document/query couples and the calculation of relevance scores dedicated to spatial IR on the one hand and temporal IR on the other hand;
4) the multicriteria IR combining the spatial, temporal and thematic dimensions;
5) the evaluation of such GIR systems.

In the laboratory of LIUPPA12, within the T2I13 team, the work corresponding to point 1, under the direction of Mauro Gaio [GAI 08], constitutes the basis of the work relative to points 2–5 [PAL 12a].

Thus, the recognition and determination of the spatial and temporal named entities [LEI 11, BAT 11] in textual documents is supported by two main classes of approaches. The first class relies on a set of rules, established by experts, allowing an interpreter to determine whether a term is a named entity or not. The second class is based on a manually annotated learning corpus allowing, after statistical processing, the automatic construction of rules for the discovery of named entities which can be applied to larger corpora. In line with the first class of approaches, our team proposes a set of manually built rules dedicated to the expression of space and time in a corpus composed of travelogues: these rules allow the marking as well as the first symbolic interpretation of the detected entities (classification followed by an analysis of any associated spatial or temporal relation). Following this interpretation, we have distinguished absolute entities such as “the City of Pau” and “the year 2000” from the so-called relative entities such as “the surroundings of the City of Pau” and “at the beginning of the year 2000”. Let us recall that we are only dealing with the textual contents of documents, regardless of their structure or associated meta-descriptions.
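The distinction between absolute and relative entities can be illustrated with a minimal Python sketch. It is not the rule set built by the team: the trigger phrases in `RELATIVE_MARKERS` and the function `classify_entity` are hypothetical names chosen for illustration, and a real rule base would be far richer.

```python
import re

# Hypothetical trigger phrases that introduce a relative entity (illustrative only).
RELATIVE_MARKERS = re.compile(
    r"^(the surroundings of|near|north of|south of|"
    r"at the beginning of|at the end of)\s+",
    re.IGNORECASE,
)


def classify_entity(expression: str) -> dict:
    """Split an expression into an optional relation and its absolute core."""
    text = expression.strip()
    match = RELATIVE_MARKERS.match(text)
    if match:
        return {
            "kind": "relative",
            "relation": match.group(1).lower(),
            "core": text[match.end():],  # the absolute entity the relation applies to
        }
    return {"kind": "absolute", "relation": None, "core": text}


if __name__ == "__main__":
    for phrase in ("the City of Pau", "the surroundings of the City of Pau",
                   "at the beginning of the year 2000"):
        print(phrase, "->", classify_entity(phrase))
```

The sketch only separates the relation from its core; interpreting the relation (for instance, turning "surroundings" into a geometry) is a later step.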

The indexation associates a numerical interpretation (geometry, calendar period) with the detected spatial and temporal entities in the texts. The organization of the indexes can, for example, completely dissociate spatial and thematic references into independent indexes or, conversely, combine these two dimensions in specific structures stored in a single index [VAI 05]. Like Clough et al. [CLO 06], we have chosen to work on independent spatial, temporal and thematic indexes. The algorithms that interpret the symbolic representation of entities take the absolute and relative aspects of their description into account. For absolute entities, for example, the resulting numerical representation is the outcome of a lookup in resources such as gazetteers.
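A minimal sketch of this numerical interpretation step, in Python, may help fix ideas. Everything specific in it is an assumption made for illustration: the one-entry `GAZETTEER` with rough coordinates, the `margin` used to approximate "surroundings", and the choice of the first quarter for "the beginning of" a year. It is not the interpretation algorithm described in the book.

```python
from datetime import date

# Hypothetical one-entry gazetteer: name -> (min_lon, min_lat, max_lon, max_lat).
# The coordinates are rough illustrative values, not authoritative data.
GAZETTEER = {"Pau": (-0.42, 43.28, -0.32, 43.34)}


def spatial_footprint(name, relation=None, margin=0.1):
    """Return a bounding box for an absolute or relative spatial entity."""
    box = GAZETTEER.get(name)
    if box is None:
        return None  # unknown place: nothing to index
    if relation == "surroundings":
        # Assumption: "surroundings" is approximated by enlarging the box.
        min_lon, min_lat, max_lon, max_lat = box
        return (min_lon - margin, min_lat - margin,
                max_lon + margin, max_lat + margin)
    return box


def temporal_footprint(year, relation=None):
    """Return a (start, end) calendar period for an absolute or relative year."""
    if relation == "beginning":
        # Assumption: "the beginning of" a year is taken as its first quarter.
        return (date(year, 1, 1), date(year, 3, 31))
    return (date(year, 1, 1), date(year, 12, 31))


# Independent indexes: one per dimension, keyed by document unit.
spatial_index = {"doc1": [spatial_footprint("Pau", "surroundings")]}
temporal_index = {"doc1": [temporal_footprint(2000, "beginning")]}
```

Keeping `spatial_index` and `temporal_index` separate mirrors the choice of independent indexes; a combined structure would instead store both footprints under a single key.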

The matching and calculation of the relevance scores have equally been the subject of numerous propositions in spatial IR [AND 10] as well as in temporal IR [ALO 08, BAT 11]. As in the majority of these propositions, we have developed a spatial and a temporal IR supported by ad hoc formulas adapted to our corpus.

The combination of the spatial, temporal and/or thematic dimensions in GIR is generally implemented using filtering approaches [VAI 05, LIE 07]. For a greater power of expression in the querying process, we have introduced requirement and preference operators that can be associated with each search criterion. Taking into account the different levels of requirements as expressed in the query, we have developed a method of aggregation of the results coming from different IR systems (IRSs). This method is inspired by the aggregation approaches established in decision support systems [MAR 99] as well as in multicriteria information retrieval systems [FAR 08].
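To make the idea of requirement and preference operators concrete, here is a hedged Python sketch. It swaps in a plain weighted mean as the aggregation function, which is an assumption and not the aggregation method developed in the book; the function name `aggregate`, the `required` parameter and the `weights` mapping are likewise hypothetical.

```python
def aggregate(scores_by_criterion, required=(), weights=None):
    """Combine per-criterion scores into one ranked list.

    scores_by_criterion: {criterion: {doc_id: score in [0, 1]}}
    required: criteria a document must match (score > 0) to be kept.
    weights: {criterion: preference weight}; defaults to equal weights.
    """
    if weights is None:
        weights = {criterion: 1.0 for criterion in scores_by_criterion}
    total_weight = sum(weights[c] for c in scores_by_criterion)
    if total_weight <= 0:
        raise ValueError("at least one criterion needs a positive weight")

    docs = set()
    for scores in scores_by_criterion.values():
        docs.update(scores)

    ranked = []
    for doc in docs:
        # Requirement operator: drop documents that miss a required criterion.
        if any(scores_by_criterion[c].get(doc, 0.0) == 0.0 for c in required):
            continue
        # Preference operator: weighted mean over all criteria.
        combined = sum(
            weights[c] * scores_by_criterion[c].get(doc, 0.0)
            for c in scores_by_criterion
        ) / total_weight
        ranked.append((doc, combined))

    # Best score first; ties broken by document identifier for determinism.
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))
    return ranked
```

Compared with pure filtering, a document that is weak on a preferred but non-required criterion is ranked lower rather than discarded.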

The implementation of the first GIR prototypes emphasizes the necessity to evaluate such systems [CAR 11, MAN 11]. However, with the exception of such campaigns as TEMPEVAL [VER 09] devoted to time and GEOCLEF [GEY 05] devoted to space and theme, there are, to our knowledge, no evaluation frameworks of GIR systems that combine the spatial, temporal and thematic dimensions of information. We have therefore proposed an experimental framework devoted to this type of evaluation. We have established a testing collection as well as an experimentation protocol which we implement for the evaluation of our prototypes.

To deal with these different lines of research, the book is organized into two main chapters (Chapters 2 and 3). Chapter 2 details the indexation and the retrieval of spatial and temporal information in textual corpora. We deal with spatial IR on the one hand and temporal IR on the other hand. Chapter 3 looks at the handling of the spatial and temporal indexes obtained earlier in the context of multicriteria information retrieval. We broadly discuss the GIR here because this is an IR that combines the spatial, temporal and thematic search criteria.

The indexing of spatial and temporal information in textual documents constitutes the basis of this work. In this indexing part, the quality of the recognition and interpretation of spatial and temporal named entities is paramount. The following processes use the results of this indexation for the purposes of separate spatial IR, separate temporal IR or multicriteria IR combining the three geographic dimensions.

Indexing of spatial and temporal information in textual documents

In Chapter 2, we first look at the modeling of the spatial and temporal information in the context of specialized information retrieval devoted to non-structured textual corpora (section 2.3.3). We propose spatial and temporal core models [LES 06, GAI 08] (section 2.4.2) devoted to such information interpretation (section 2.4.3) and representation in the indexes in order to implement matching calculations in the research phase. We design and experiment with a first method of extraction and indexation of spatial information (section 2.4.4) based on our core model and a specific semantic processing [LES 06]. We adopt a similar approach in order to propose a method of extraction and indexation (section 2.4.4) of temporal information based on our core model and a specific semantic processing [LEP 07].

Retrieval of spatial and temporal information in textual documents

In Chapter 2, we also describe the IR approaches implemented in the systems devoted to spatial and temporal information (sections 2.3.1 and 2.3.6). We propose a method of spatial information retrieval (section 2.4.5.1) using functions of geographic information systems (GISs) in order to calculate geo-referenced representations of spatial entities and implement spatial relevance calculations [SAL 07a]. Using a similar approach, we propose a method of dedicated temporal information retrieval [LEP 07] (section 2.4.5.2).
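A spatial relevance calculation of this kind can be sketched without a GIS by working on bounding boxes. The Python functions below are an illustrative assumption, not the formulas used in the book: relevance is taken as the share of the query zone covered by the document zone, which reflects the idea that retrieved zones partially or entirely cover those of the query.

```python
def overlap_area(a, b):
    """Area of the intersection of two boxes (min_x, min_y, max_x, max_y)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return width * height if width > 0 and height > 0 else 0.0


def spatial_relevance(doc_box, query_box):
    """Fraction of the query box covered by the document box, in [0, 1]."""
    query_area = (query_box[2] - query_box[0]) * (query_box[3] - query_box[1])
    if query_area <= 0:
        return 0.0  # degenerate query zone: no meaningful coverage ratio
    return overlap_area(doc_box, query_box) / query_area


if __name__ == "__main__":
    query = (0, 0, 2, 2)
    print(spatial_relevance((1, 1, 3, 3), query))  # partial cover
    print(spatial_relevance((0, 0, 4, 4), query))  # full cover
```

A GIS library would replace the box arithmetic with true polygon intersection; the ratio itself would be computed the same way.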

Generalization of data representations for multicriteria information retrieval

In Chapter 3, we deal with each dimension of the geographic information in a specific way and then combine them in IR scenarios. To avoid possible biases, it is important, before any combination, to standardize the representation of the data as well as the approaches of processing data relative to the different dimensions (section 3.3.1). We propose a generic approach comparable to the generalization by truncation or lemmatization of terms in classic approaches to IR. Thus, from indexed representations of spatial and temporal information, we build higher-level indexes (section 3.4.1) appropriate for the implementation of proven IR models [PAL 10c, SAL 11].
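One way to picture such higher-level indexes is to treat tiles as the spatial and temporal counterpart of stemmed terms. The Python sketch below assumes a uniform grid for space and one tile per calendar year for time; the tile naming scheme, the `tile_size` parameter and the function names are hypothetical, and the tiling actually proposed in Chapter 3 may differ.

```python
import math
from collections import Counter


def spatial_tiles(box, tile_size=1.0):
    """List the grid tiles touched by a box (min_x, min_y, max_x, max_y)."""
    min_x, min_y, max_x, max_y = box
    cols = range(math.floor(min_x / tile_size), math.floor(max_x / tile_size) + 1)
    rows = range(math.floor(min_y / tile_size), math.floor(max_y / tile_size) + 1)
    return [f"s_{col}_{row}" for col in cols for row in rows]


def temporal_tiles(start_year, end_year):
    """List one tile per calendar year in an inclusive year range."""
    return [f"t_{year}" for year in range(start_year, end_year + 1)]


def tile_frequencies(spatial_boxes, year_ranges, tile_size=1.0):
    """Count how often each tile is referenced by a document's entities."""
    counts = Counter()
    for box in spatial_boxes:
        counts.update(spatial_tiles(box, tile_size))
    for start_year, end_year in year_ranges:
        counts.update(temporal_tiles(start_year, end_year))
    return counts
```

Because each document is reduced to tile identifiers with frequencies, the same weighting models used for terms can, in principle, be applied to the spatial and temporal dimensions. Note that a box whose upper edge falls exactly on a grid line is also assigned the neighboring tile in this sketch.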

Multicriteria information retrieval

In Chapter 3, we also discuss the multicriteria information retrieval approaches (section 3.3.2). We propose submitting each search criterion to the appropriate IR system of spatial, temporal or thematic dimension (section 3.4.3). It should be noted that, for the thematic dimension, we limit ourselves to the approaches implemented for terms in classic IR. We offer several approaches for the combination of results from different indexes and IR systems. We also propose, according to the type of user involved, new operators with the aim of associating a higher level of expressiveness with each criterion of the query and, consequently, improving the quality of the results [PAL 10b, PAL 10c, PAL 11, PAL 12a].

I.3. Organization of the book

This book is divided into the following chapters:

– Chapter 1 presents the positioning of the work in the field of GIR.

– Chapter 2 is devoted to spatial and temporal information in textual documents. It describes our propositions relative to indexation and to spatial and temporal information retrieval in textual documents.

– Chapter 3 deals with the generalization of data representation with the aim of preparing the combination of results from multi-dimensional (spatial, temporal and thematic) and multicriteria information retrieval. This chapter describes our propositions for multi-dimensional and multicriteria information retrieval.

– Chapter 4 provides a first summary of the work, followed by a set of perspectives for extending it in the field of GIR.

1 http://www.geo.unizh.ch/~rsp/gir10/.

2 http://dblab.cs.ucr.edu/conferences/sstd01/.

3 http://ir.shef.ac.uk/geoclef/2005/.

4 http://www.informatik.uni-trier.de/~ley/db/conf/gis/.

5 http://www.informatik.uni-trier.de/~ley/db/conf/riao/.

6 http://books.google.fr/books/.

7 http://www.wdl.org.

8 http://www.europeana.eu/portal/.

9 http://gallica.bnf.fr/.

10 Médiathèque Intercommunale à Dimension Régionale de Pau Pyrénées – http://www.agglopau.fr/.

11 http://en.wikipedia.org/wiki/Pyrénées/.

12 Laboratoire informatique de l’université de Pau et des Pays de l’Adour: liuppa.univ-pau.fr/.

13 Processing of spatial, temporal and thematic information for the adaptation of contextual and user interaction (Traitement des Informations spatiales, temporelles et thématiques pour l’adaptation de l’Interaction au contexte et à l’utilisateur): http://liuppa.univ-pau.fr/live/EquipesdeRecherche/Equipe_T2I/.

1

Access by Geographic Content to Textual Corpora: What Orientations?

1.1. Introduction

The volume of digital corpora is constantly rising and the retrieval of relevant documents is an increasingly delicate task. The ambiguity of natural language terms contributes to this difficulty, both in the automatic interpretation of the expression of the need for information and in the automatic evaluation of the correspondence between documents and needs. The multiple meanings of terms and their numerous uses in varied contexts indeed make the task of information retrieval delicate. Our working hypothesis therefore consists of distinguishing the spatial, temporal and thematic dimensions in order to implement dedicated approaches in the processes of indexing and information retrieval (IR). The objective is to contribute to a better content analysis of textual corpora as well as to a better grasp of the search criteria expressed in a query. Let us recall that we are studying digitized textual corpora with “territorial” denotations, to which character recognition has been applied but whose logical structure has not been preserved.

This chapter is organized as follows. Section 1.2 presents the general context related to geographic information retrieval (GIR). Section 1.3 introduces privileged fields of research as well as the position of our study. Section 1.4 gives a rough sketch of our research approach in the construction of spatial, temporal and multicriteria search engines.

1.2. Access by geographic content to textual corpora

The study concerning the processing of information in text is mainly detailed in theses [BAZ 05, LES 07, PAL 10a, KER 11]. Following a number of reminders related to document retrieval and textual corpora, we will describe the characteristics of corpora with “territorial” denotations and their uses. This category of corpora will constitute the field of experimentation for our propositions.

1.2.1. Document retrieval and textual corpora

Document retrieval or information retrieval [BAE 99, BOU 08] is traditionally defined as a set of techniques allowing us to select, from a collection of documents, information that is likely to meet the needs of the user.

A collection of documents (document repository or corpus) is the information accessible via the document retrieval system (or information retrieval system, IRS). It consists of documents, unit elements. Textual documents are represented by a set of descriptors (terms, for example) stored in files of descriptive instructions (metadata) or indexes whose structure can be more complex [BES 04]. However, the notion of document in itself is vague. Generally defined by its container (e.g. a book, the physical object that contains the text), it often varies and the expected result of a query may not be an entire book but one or more particularly relevant fragments. This is indeed the reason why we use the expression “document unit” or “document fragment” to define the unit of text returned to the user [BAZ 05].

Finally, a query corresponds to the expression of the information needs of the user. It constitutes the input parameter to the retrieval system and is expressed in a query language that is often simple: a choice of keywords and logical operators, for instance. Nevertheless, other languages are presented in literature: natural language, graphical language, etc. [GOK 09].

1.2.2. Textual corpora with “territorial” denotations

A textual corpus with “territorial” denotations is composed of travelogues, stories, newspapers, novels, poems, etc. These documents describe/discuss a territory. As detailed in [KER 11], the territorial dimension is symbolized in textual documents by a significant frequency of toponyms, outlined facts or described observations. Toponyms denote, for example, streams, cities and buildings. The facts describe, for example, political or sport-related events as well as various other events. The observations refer to architecture, botany, geology, agriculture, etc. These categories of information are, in a general way, linked to a location or a period of time.

– Territory: The Longman Dictionary defines the term territory as “an area for which one person or branch of an organization is responsible”. Kergosien [KER 11] presents a consistent overview of the notion of territory. Among the different definitions proposed, we will retain the following two [KER 11, p. 70]: “A globally accepted definition in geography describes territory as a space on which an authority is exercised and is limited by political and administrative borders. This definition is subject to debate, however the notion of territory generally integrates a geographic space composed of places (spatial component) as well as relations with different subjects (thematic component) and/or references to a period (temporal component)”. It also describes a second point of view, that of geomatics. “Geomatics is the scientific field hovering between geography and computer science which mainly deals with problems of storage, processing and diffusion of geographic information. The characterization of geographic information in a particular territory is defined in the form of geographic entities (GEs) composed of spatial (SEs), temporal (TEs) and thematic entities. It should be noted that each one of these entities is not always specified or can be implicit”. Kergosien [KER 11] proposes an approach of ontology construction as a tool for the structured representation of a territory but also as a support to IR and to the browsing of document repositories.

– Examples of corpora: Territory is at the heart of numerous types of corpora. We can quote, for example, the French-speaking corpus of archives, mainly composed of texts, maps and lithographies related to the city of Saint-Étienne and to its river Furan1; the multi-lingual corpus (German and French) of the Swiss Alpine Club2, composed of reports, accounts, essays and thoughts under the theme of mountaineering; tourist guides such as the different ranges of Lonely Planet3 books or of the Michelin guide4; and the equally numerous hiking guides5 and other travel blogs6.

These corpora have the principal characteristic of containing a very large number of place names (spatial named entities will be defined further on); the places referred to in such a way generally have a fine level of detail in a relatively confined space (a river, a city and mountain range, for example). The Geotopia7 and Text+Berg digital8 projects are good examples of this. The objective of the first is to experiment with georeferencing techniques in order to help organize, transmit, share and interpret archival data [JOL 11]. The second aims to digitize and promote a corpus of alpine literature [VOL 10].

– The corpus of MIDR: MIDR9, from a perspective of cultural heritage promotion, has digitized its heritage document repository and applied optical character recognition to it with the aim of indexing it into a document retrieval system. This way, the digitized documents can benefit from a renewed visibility and be exploited by a larger public. It should be noted that, given the cost of the operation, this digitization has been carried out by a provider without the correction of errors or the recovery of the documents’ structure, with the exception of their division into paragraphs.

Let us recall that this corpus is composed of documents of different types (literary studies, travelogues, newspapers, old geographic maps, lithographies, postcards, etc.), which have the common denominator of dealing with the Pyrénées territory in the 18th and 19th Centuries. A preliminary study of the corpus has revealed a predominant geographic connotation in the documents, as much in the literary studies dealing with travelogues as in the local periodicals whose articles relate to information about the territory. An experiment has allowed us, for example, to extract almost 10,000 spatial named entities from 10 books within the corpus (i.e. 600,000 terms).

Indeed, a large amount of information makes reference to places, spatial indications as well as descriptions of landscape, temporal indicators and dates, implying a significant importance of these documents for the geographic aspect. Let us consider, as an example, travelogues (see the excerpt in