One of the challenges brought on by the digital revolution of recent decades is finding mechanisms for extracting the information carried by texts so that their contents can be accessed.
The processing of named entities remains a very active area of research, which plays a central role in natural language processing technologies and their applications. Named entity recognition, a tool used in information extraction tasks, focuses on recognizing small pieces of information in order to extract information on a larger scale.
The authors use written text and examples in French and English to present the necessary elements for the readers to familiarize themselves with the main concepts related to named entities and to discover the problems associated with them, as well as the methods available in practice for solving these issues.
Number of pages: 265
Year of publication: 2016
Cover
Title
Copyright
Introduction
1 Named Entities for Accessing Information
1.1. Research program history
1.2. Task using named entities as a basic representation
1.3. Conclusion
2 Named Entities, Referential Units
2.1. Issues with the named entity concept
2.2. The notions of meaning and reference
2.3. Proper names
2.4. Definite descriptions
2.5. The meaning and referential functioning of named entities
2.6. Conclusion
3 Resources Associated with Named Entities
3.1. Typologies: general and specialist domains
3.2. Corpora
3.3. Lexicons and knowledge databases
3.4. Conclusion
4 Recognizing Named Entities
4.1. Detection and classification of named entities
4.2. Indicators for named entity recognition
4.3. Rule-based techniques
4.4. Data-driven and machine-learning systems
4.5. Unsupervised enrichment of supervised methods
4.6. Conclusion
5 Linking Named Entities to References
5.1. Knowledge bases
5.2. Formalizing polysemy in named entity mentions
5.3. Stages in the named entity linking process
5.4. System performance
6 Evaluating Named Entity Recognition
6.1. Classic measurements: precision, recall and F-measures
6.2. Measures using error counts
6.3. Evaluating associated tasks
6.4. Evaluating preprocessing technologies
6.5. Conclusion
Conclusion
Appendices
Appendix 1: Glossary
Appendix 2: Named Entities: Research Programs
Appendix 3: Summary of Available Corpora
Appendix 4: Annotation Formats
Appendix 5: Named Entities: Current Definitions
Bibliography
Index
End User License Agreement
1 Named Entities for Accessing Information
Table 1.1. An example of a document and an MUC-3 form, from [GRI 97]
3 Resources Associated with Named Entities
Table 3.1. NE typology in the context of MUC. (*) denotes examples added in the course of MUC-7
Table 3.2. ACE typology (V5.6)
Table 3.3. Examples using ACE typology (V5.6)
Table 3.4. ESTER-2 typology
Table 3.5. Types (in bold) and subtypes (in italics) used in the QUAERO typology
Table 3.6. Transverse and specific components
Table 3.7. Examples using the QUAERO typology
Table 3.8. Extract from the typology proposed by Sekine. Taken from version 7.1, described in full at http://nlp.cs.nyu.edu/ene/version710Beng.html
Table 3.9. Annotation of a statement using a variety of typologies
Table 3.10. Description of the training corpora produced in the context of ACE 2005. Broadcast news includes both radio and televised news programs, broadcast conversation covers radio or televised conversations (such as interviews), newswire refers to newspaper articles, Weblog refers to Website data and usenet refers to forum data. Volumes are given as a number of words
Table 3.11. Description of the corpus produced in the context of the ESTER-2 campaign
Table 3.12. Description of the transcribed speech corpus from the Quaero program
Table 3.13. Description of the historic press corpus from the Quaero program
Table 3.14. Description of the Etape corpus
Table 3.15. Description of the GermEval 2014 corpus
Table 3.16. Description of the Harem corpus
4 Recognizing Named Entities
Table 4.1. Textual annotation modes
5 Linking Named Entities to References
Table 5.1. Extracting entity variants from a resource
1 Named Entities for Accessing Information
Figure 1.1. Multi-level annotation with types (func.ind, pers.ind and org.adm) and components (qualifier, kind, name, first.name and last.name)
3 Resources Associated with Named Entities
Figure 3.1. Example of full annotation using the QUAERO typology. The types are org.ent, prod.rule, loc.adm.sup and loc.phys.geo; the components are kind and name
4 Recognizing Named Entities
Figure 4.1. Unitex automaton, from [FRI 04]. The automaton begins by differentiating broad types of medical establishments: hospitals, clinics, hospices, sanatoria, mortuaries, institutes, etc. It then divides these establishments into sub-types, where differentiation is required, e.g. by medical specialism, with or without passing through auxiliary nodes (“of”, “teaching”, “public”, etc.)
Figure 4.2. Comparison of symbolic and data-driven models
Figure 4.3. Majority class model
Figure 4.4. Hidden state Markov model
Figure 4.5. Indicator use in tokens
Figure 4.6. Illustration of a CRF model
5 Linking Named Entities to References
Figure 5.1. Mentions in a text and references from a knowledge base
6 Evaluating Named Entity Recognition
Figure 6.1. Top: slot-based alignment. Bottom: tree-based alignment (from [BEN 15a])
Figure 6.2. System comparison for structured entities
Figure 6.3. Illustration of the entity linking evaluation process in TAC-KBP 2014. The rectangles represent clusters, while the different levels of colors refer to a kb_id. The different shapes refer to different entity types, and the numbers refer to the doc_id (from [JI 14])
Figure 6.4. Illustration of the entity clustering evaluation process (i.e. without references in the knowledge base) used in TAC-KBP 2014. Top: precision evaluation; bottom: recall. The rectangles represent clusters, while different shapes represent entity types and numbers refer to the doc_id. Levels of colors are ignored (from [JI 14])
Appendix 4: Annotation Formats
Figure A4.1. Tag-based annotation
Figure A4.2. BIO (Begin, Inside, Outside) annotation
Figure A4.3. BILOU (Begin, Inside, Last, Outside, Unique) annotation
Damien Nouvel
Maud Ehrmann
Sophie Rosset
FOCUS SERIES
Patrick Paroubek
First published 2016 in Great Britain and the United States by ISTE Ltd and John Wiley & Sons, Inc.
Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms and licenses issued by the CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the undermentioned address:
ISTE Ltd, 27-37 St George’s Road, London SW19 4EU, UK
www.iste.co.uk
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
www.wiley.com
© ISTE Ltd 2016
The rights of Damien Nouvel, Maud Ehrmann and Sophie Rosset to be identified as the authors of this work have been asserted by them in accordance with the Copyright, Designs and Patents Act 1988.
Library of Congress Control Number: 2015959094
British Library Cataloguing-in-Publication Data
A CIP record for this book is available from the British Library
ISSN 2051-2481 (Print)
ISSN 2051-249X (Online)
ISBN 978-1-84821-838-3
The digital revolution of recent decades, resulting from the combination of data digitization and its spread on a global level, touches on a fundamental aspect of humanity, namely communication, and affects all human activities. Today we obviously live in a digitized and connected world, and our lifestyles, be it studying, working, enjoying entertainment or being a citizen, have changed drastically. In their early days, these changes could be characterized by an unprecedented increase in the number of published documents, which led to an overwhelming flood of diverse data resulting from interactive content (with Web 2.0) and novel modes of publishing and sharing knowledge (with the Semantic Web). In this context, data processing systems allow us not only to store data but also to leverage it. Starting from raw data, one objective is to extract and structure information in order to automatically develop and exploit knowledge. Natural Language Processing (NLP) contributes to this objective wherever the data are linguistic in nature.
