Named Entities for Computational Linguistics

Damien Nouvel

Publisher: John Wiley & Sons
Category: Science and New Technologies
Language: English

Description

One of the challenges brought on by the digital revolution of recent decades is finding mechanisms for extracting the information carried by texts so that their contents can be accessed.

The processing of named entities remains a very active area of research, which plays a central role in natural language processing technologies and their applications. Named entity recognition, a tool used in information extraction tasks, focuses on recognizing small pieces of information in order to extract information on a larger scale.

The authors use written text and examples in French and English to present the necessary elements for the readers to familiarize themselves with the main concepts related to named entities and to discover the problems associated with them, as well as the methods available in practice for solving these issues.

Details

Pages: 265

Year of publication: 2016

Sample

Cover

Title

Introduction

1 Named Entities for Accessing Information

1.1. Research program history

1.2. Task using named entities as a basic representation

1.3. Conclusion

2 Named Entities, Referential Units

2.1. Issues with the named entity concept

2.2. The notions of meaning and reference

2.3. Proper names

2.4. Definite descriptions

2.5. The meaning and referential functioning of named entities

2.6. Conclusion

3 Resources Associated with Named Entities

3.1. Typologies: general and specialist domains

3.2. Corpora

3.3. Lexicons and knowledge databases

3.4. Conclusion

4 Recognizing Named Entities

4.1. Detection and classification of named entities

4.2. Indicators for named entity recognition

4.3. Rule-based techniques

4.4. Data-driven and machine-learning systems

4.5. Unsupervised enrichment of supervised methods

4.6. Conclusion

5 Linking Named Entities to References

5.1. Knowledge bases

5.2. Formalizing polysemy in named entity mentions

5.3. Stages in the named entity linking process

5.4. System performance

6 Evaluating Named Entity Recognition

6.1. Classic measurements: precision, recall and F-measures

6.2. Measures using error counts

6.3. Evaluating associated tasks

6.4. Evaluating preprocessing technologies

6.5. Conclusion

Conclusion

Appendices

Appendix 1: Glossary

Appendix 2: Named Entities: Research Programs

Appendix 3: Summary of Available Corpora

Appendix 4: Annotation Formats

Appendix 5: Named Entities: Current Definitions

Bibliography

Index

End User License Agreement

List of Tables

1 Named Entities for Accessing Information

Table 1.1. An example of a document and an MUC-3 form, from [GRI 97]

3 Resources Associated with Named Entities

Table 3.1. NE typology in the context of MUC. (*) denotes examples added in the course of MUC-7

Table 3.2. ACE typology (V5.6)

Table 3.3. Examples using ACE typology (V5.6)

Table 3.4. ESTER-2 typology

Table 3.5. Types (in bold) and subtypes (in italics) used in the QUAERO typology

Table 3.6. Transverse and specific components

Table 3.7. Examples using the QUAERO typology

Table 3.8. Extract from the typology proposed by Sekine. Taken from version 7.1, described in full at http://nlp.cs.nyu.edu/ene/version710Beng.html

Table 3.9. Annotation of a statement using a variety of typologies

Table 3.10. Description of the training corpora produced in the context of ACE 2005. Broadcast news includes both radio and televised news programs, broadcast conversation covers radio or televised conversations (such as interviews), newswire refers to newspaper articles, Weblog refers to Website data and usenet refers to forum data. Volumes are given as a number of words

Table 3.11. Description of the corpus produced in the context of the ESTER-2 campaign

Table 3.12. Description of the transcribed speech corpus from the Quaero program

Table 3.13. Description of the historic press corpus from the Quaero program

Table 3.14. Description of the Etape corpus

Table 3.15. Description of the GermEval 2014 corpus

Table 3.16. Description of the Harem corpus

4 Recognizing Named Entities

Table 4.1. Textual annotation modes

5 Linking Named Entities to References

Table 5.1. Extracting entity variants from a resource

List of Illustrations

1 Named Entities for Accessing Information

Figure 1.1. Multi-level annotation with types (func.ind, pers.ind and org.adm) and components (qualifier, kind, name, first.name and last.name)

3 Resources Associated with Named Entities

Figure 3.1. Example of full annotation using the QUAERO typology. The types are org.ent, prod.rule, loc.adm.sup and loc.phys.geo; the components are kind and name

4 Recognizing Named Entities

Figure 4.1. Unitex automaton, from [FRI 04]. The automaton begins by differentiating broad types of medical establishments: hospitals, clinics, hospices, sanatoria, mortuaries, institutes, etc. It then divides these establishments into sub-types, where differentiation is required, e.g. by medical specialism, with or without passing through auxiliary nodes (“of”, “teaching”, “public”, etc.)

Figure 4.2. Comparison of symbolic and data-driven models

Figure 4.3. Majority class model

Figure 4.4. Hidden state Markov model

Figure 4.5. Indicator use in tokens

Figure 4.6. Illustration of a CRF model

5 Linking Named Entities to References

Figure 5.1. Mentions in a text and references from a knowledge base

6 Evaluating Named Entity Recognition

Figure 6.1. Top: slot-based alignment. Bottom: tree-based alignment (from [BEN 15a])

Figure 6.2. System comparison for structured entities

Figure 6.3. Illustration of the entity linking evaluation process in TAC-KBP 2014. The rectangles represent clusters, while the different levels of colors refer to a kb_id. The different shapes refer to different entity types, and the numbers refer to the doc_id (from [JI 14])

Figure 6.4. Illustration of the entity clustering evaluation process (i.e. without references in the knowledge base) used in TAC-KBP 2014. Top: precision evaluation; bottom: recall. The rectangles represent clusters, while different shapes represent entity types and numbers refer to the doc_id. Levels of colors are ignored (from [JI 14])

Appendix 4: Annotation Formats

Figure A4.1. Tag-based annotation

Figure A4.2. BIO (Begin, Inside, Outside) annotation

Figure A4.3. BILOU (Begin, Inside, Last, Outside, Unique) annotation

Named Entities for Computational Linguistics

Damien Nouvel

Maud Ehrmann

Sophie Rosset

FOCUS SERIES

Patrick Paroubek

First published 2016 in Great Britain and the United States by ISTE Ltd and John Wiley & Sons, Inc.

Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms and licenses issued by the CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the undermentioned address:

ISTE Ltd, 27-37 St George’s Road, London SW19 4EU, UK

www.iste.co.uk

John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA

www.wiley.com

© ISTE Ltd 2016

The rights of Damien Nouvel, Maud Ehrmann and Sophie Rosset to be identified as the authors of this work have been asserted by them in accordance with the Copyright, Designs and Patents Act 1988.

Library of Congress Control Number: 2015959094

British Library Cataloguing-in-Publication Data

A CIP record for this book is available from the British Library

ISSN 2051-2481 (Print)

ISSN 2051-249X (Online)

ISBN 978-1-84821-838-3

Introduction

The digital revolution of recent decades, resulting from the combination of data digitization and its spread on a global scale, touches on a fundamental of humanity, namely communication, and affects all human activities. Today we clearly live in a digitized and connected world, and our lifestyles, whether studying, working, enjoying entertainment or being a citizen, have changed drastically. In their early days, these changes could be characterized by an unprecedented increase in the number of published documents, which led to an overwhelming flood of diverse data resulting from interactive content (with Web 2.0) and novel modes of publishing and sharing knowledge (with the Semantic Web). In this context, data processing systems allow us not only to store data but also to leverage it. Starting from raw data, one objective is to extract and structure information in order to automatically develop and exploit knowledge. Natural Language Processing (NLP) is part of this effort, addressing data that is linguistic in nature.
