A comprehensive compilation of new developments in data linkage methodology
The increasing availability of large administrative databases has led to a dramatic rise in the use of data linkage, yet the standard texts on linkage are still those which describe the seminal work from the 1950-60s, with some updates. Linkage and analysis of data across sources remains problematic due to lack of discriminatory and accurate identifiers, missing data and regulatory issues. Recent developments in data linkage methodology have concentrated on bias and analysis of linked data, novel approaches to organising relationships between databases and privacy-preserving linkage.
Methodological Developments in Data Linkage brings together a collection of contributions from members of the international data linkage community, covering cutting edge methodology in this field. It presents opportunities and challenges provided by linkage of large and often complex datasets, including analysis problems, legal and security aspects, models for data access and the development of novel research areas. New methods for handling uncertainty in analysis of linked data, solutions for anonymised linkage and alternative models for data collection are also discussed.
This book will be of core interest to academics, government employees, data holders, data managers, analysts and statisticians who use administrative data. It will also appeal to researchers in a variety of areas, including epidemiology, biostatistics, social statistics, informatics, policy and public health.
Page count: 582
Year of publication: 2015
Cover
Title Page
Foreword
Contributors
1 Introduction
1.1 Introduction: data linkage as it exists
1.2 Background and issues
1.3 Data linkage methods
1.4 Linkage error
1.5 Impact of linkage error on analysis of linked data
1.6 Data linkage: the future
2 Probabilistic linkage
2.1 Introduction
2.2 Overview of methods
2.3 Data preparation
2.4 Advanced methods
2.5 Concluding comments
3 The data linkage environment
3.1 Introduction
3.2 The data linkage context
3.3 The tools used in the production of functional anonymity through a data linkage environment
3.4 Models for data access and data linkage
3.5 Four case study data linkage centres
3.6 Conclusion
4 Bias in data linkage studies
4.1 Background
4.2 Description of types of linkage error
4.3 How linkage error impacts research findings
4.4 Discussion
5 Secondary analysis of linked data
5.1 Introduction
5.2 Measurement error issues arising from linkage
5.3 Models for different types of linking errors
5.4 Regression analysis using complete binary-linked data
5.5 Regression analysis using incomplete binary-linked data
5.6 Regression analysis with multi-linked data
5.7 Conclusion and discussion
6 Record linkage
6.1 Introduction
6.2 Probabilistic Record Linkage (PRL)
6.3 Multiple Imputation (MI)
6.4 Prior-Informed Imputation (PII)
6.5 Example 1: Linking electronic healthcare data to estimate trends in bloodstream infection
6.6 Example 2: Simulated data including non-random linkage error
6.7 Discussion
Acknowledgements
Appendix A
7 Using graph databases to manage linked data
7.1 Summary
7.2 Introduction
7.3 Graph approach
7.4 Methodologies
7.5 Algorithm implementation
7.6 New approaches facilitated by graph storage approach
7.7 Conclusion
Acknowledgements
8 Large-scale linkage for total populations in official statistics
8.1 Introduction
8.2 Current practice in record linkage for population censuses
8.3 Population-level linkage in countries that operate a population register: register-based censuses
8.4 New challenges in record linkage: the Beyond 2011 Programme
8.5 Summary
9 Privacy-preserving record linkage
9.1 Introduction
9.2 Chapter outline
9.3 Linking with and without personal identification numbers
9.4 PPRL approaches
9.5 PPRL for very large databases: blocking
9.6 Privacy considerations
9.7 Hardening Bloom filters
9.8 Future research
9.9 PPRL research and implementation with national databases
Acknowledgements
10 Summary
10.1 Introduction
10.2 Part 1: Data linkage as it exists today
10.3 Part 2: Analysis of linked data
10.4 Part 3: Data linkage in practice: new developments
10.5 Concluding remarks
References
Index
Advert page
End User License Agreement
Chapter 02
Table 2.1 Elementary examples of matching pairs of records (dependent on context).
Table 2.2 Matching results via matching strategies (0.2% false matches among designated matches).
Table 2.3 Examples of name parsing.
Table 2.4 Examples of address parsing.
Table 2.5 Summary of three pairs of files.
Table 2.6 ‘Pseudo-truth’ data with actual error rates.
Table 2.7 Bringing together two files.
Chapter 04
Table 4.1 Summary of included studies.
Table 4.2 Quantitative measures for assessing linkage quality.
Chapter 05
Table 5.1 Simulation results for linear regression.
Table 5.2 Simulation values of relative bias and relative RMSE (both expressed in percentage terms) for linear model parameter estimates under uncorrelated sample to population multi-linkage.
Chapter 07
Table 7.1 Sample data for comparison.
Table 7.2 Pairwise comparison of sample data.
Table 7.3 ‘Master linkage table’ of sample data.
Table 7.4 Product purchases as relational tables.
Table 7.5 Initial records for clerical review.
Table 7.6 Grouped records after clerical review.
Table 7.7 Decision link lookup table for clerical review.
Chapter 08
Table 8.1 Process-level view of Census to CCS automatic linkage.
Table 8.2 Examples of score-based auto-linking and clerical linking.
Table 8.3 Examples of name and date strings transformed into hash values.
Table 8.4 Uniqueness of link-keys derived from the Patient Register (PR) and the inconsistencies they resolve between true link pairs.
Table 8.5 Agreement scores to be used in score-based linking.
Table 8.6 Example of score data for link candidates derived from similarity tables.
Table 8.7 Table of link results: Census QA comparison.
Chapter 02
Figure 2.1 Log frequency versus weight, matches and non-matches combined. L, lower; U, upper.
Figure 2.2 Estimates versus truth: (a–c) Cumulative matches (tail of distribution, independent EM, λ = 0.2); (d–f) Cumulative false-match rates by weight (independent EM, λ = 0.2); (g–i) Cumulative false matches (independent EM, λ = 0.99, small sample).
Chapter 03
Figure 3.1 Record linkage within a trusted third-party mechanism.
Figure 3.2 Single-centre model – solid (heavy) weighted arrow shows movement of data, and the dashed line indicates a remote ‘view’ of the data but no transfer of actual data.
Figure 3.3 Functions separated within a single research centre through the use of firewalls.
Figure 3.4 Functions separated spatially and organisationally.
Figure 3.5 Secure multiparty computation.
Chapter 04
Figure 4.1 Classification of missed and false matches; *positive predictive value = true positives/total links; **negative predictive value = true negatives/total non-links.
Figure 4.2 Flowchart of study inclusion criteria.
Figure 4.3 Hypothetical data linkage inclusion flowchart for cross-sectional data linkage.
Chapter 06
Figure 6.1 Primary data file with four records where the set B variables are recorded and the set A variables are located in a linking file. X represents a recorded variable value and 0 a missing value. Initially, all the set A variables are missing and also some set B variables are missing as shown.
Figure 6.2 Primary data file with four records where the primary record file set B variables are recorded and the set A variable values for records 2 and 3 have been correctly transferred, unequivocally, via deterministic linkage with a linking file. X represents a variable with known value and 0 a missing value.
Figure 6.3 Comparison of prior-informed imputation (PII) and probabilistic record linkage (PRL) with gold-standard data: rates of infection by quarter of calendar year.
Figure 6.4 Generation of simulated linking files (for 50% match rate example).
Figure 6.5 Comparison of prior-informed imputation (PII) and probabilistic record linkage (PRL) with simulated data and 10% match rate: rates of infection.
Figure 6.6 Comparison of prior-informed imputation (PII) and probabilistic record linkage (PRL) with simulated data and 50% match rate: rates of infection.
Figure 6.7 Comparison of prior-informed imputation (PII) and probabilistic record linkage (PRL) with simulated data and 70% match rate: rates of infection.
Figure 6.8 Comparison of prior-informed imputation (PII) and probabilistic record linkage (PRL) with simulated data and 70% match rate: difference in adjusted rates of infection by hospital.
Chapter 07
Figure 7.1 Alice, Bob, Carol and Dave and their relationships.
Figure 7.2 Alice, Bob and Dave's ‘likes’.
Figure 7.3 Bob, Carol and Dave's travel experience.
Figure 7.4 Node and edge properties.
Figure 7.5 Product purchases as a graph.
Figure 7.6 Traversing the purchase graph to find similar purchases.
Figure 7.7 Pairwise comparisons in graph form.
Figure 7.8 Composing a project from records and links.
Figure 7.9 Linkage without (left) and with (right) equivalence links.
Figure 7.10 Weighted edge with weight vector.
Figure 7.11 Match weight histogram with cut-off thresholds.
Figure 7.12 Comparison graph with review links.
Figure 7.13 Reviewed cluster boundary and clerical review links.
Chapter 08
Figure 8.1 Beyond 2011 linkage strategy.
Figure 8.2 Example of link-key creation for linking records.
Figure 8.3 Forename extraction from dataset 1.
Figure 8.4 Forename extraction from dataset 2.
Figure 8.5 String comparison and hashing. The hash values in this example are for illustrative purposes only.
Figure 8.6 Undertake three-way join using similarity tables.
Chapter 09
Figure 9.1 Standard model for linking two databases with a trusted third party.
Figure 9.2 PPRL in a decentralised healthcare system [pprlhealthorg2.pdf].
Figure 9.3 Example of the encoding of SMITH and SMYTH with two different cryptographic functions into 30-bit Bloom filters A and B.
Figure 9.4 F-measure for canopy clustering (CC), sorted neighbourhood (SN) and Multibit trees (MBT). Errors in 5, 10, 15 and 20% of the records. Results for CC and MBT are nearly identical.
Figure 9.5 Time in minutes for finding best matching pairs (1 000–1 000 000 records) for canopy clustering (CC), sorted neighbourhood (SN) and Multibit trees (MBT), 10% errors.
Figure 9.6 The effect of adding random bits to files with 10 000 Bloom filters each on the F-score depending on the similarity threshold. Initial identifiers have no errors.
Figure 9.7 The effect of adding random bits to files with 10 000 Bloom filters each on the F-score depending on the similarity threshold. Twenty per cent errors in the identifiers.
Figure 9.8 The effect of adding random bits to files with 10 000 Bloom filters each on the F-score depending on the similarity threshold, errors in identifiers and different overlap. First row: 10% errors in the identifiers, second row: 20% errors in the identifiers. First column: 50% overlap, second column 75% overlap, third column 100% overlap.
Figure 9.9 Rehashing a CLK of length l = 1000 with a window of size w = 8. The bits covered by the window are used as integer seed for the generation of a set of k_re random numbers.
Figure 9.10 F-scores for PPRL of two files with 10 000 records after rehashing of CLKs with k = 1, 2, 4, 8, 12 hash functions, window size w = 8, 10, 12, 16 and step size s equal to window size (s = w).
Established by WALTER A. SHEWHART and SAMUEL S. WILKS
Editors: David J. Balding, Noel A. C. Cressie, Garrett M. Fitzmaurice, Geof H. Givens, Harvey Goldstein, Geert Molenberghs, David W. Scott, Adrian F. M. Smith, Ruey S. Tsay, Sanford Weisberg
Editors Emeriti: J. Stuart Hunter, Iain M. Johnstone, Joseph B. Kadane, Jozef L. Teugels
A complete list of the titles in this series appears at the end of this volume.
Edited by
Katie Harron
London School of Hygiene and Tropical Medicine, UK
Harvey Goldstein
University of Bristol and University College London, UK
Chris Dibben
University of Edinburgh, UK
This edition first published 2016
© 2016 John Wiley & Sons, Ltd
Registered Office
John Wiley & Sons, Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom
For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com.
The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.
Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. It is sold on the understanding that the publisher is not engaged in rendering professional services and neither the publisher nor the author shall be liable for damages arising herefrom. If professional advice or other expert assistance is required, the services of a competent professional should be sought.
The advice and strategies contained herein may not be suitable for every situation. In view of ongoing research, equipment modifications, changes in governmental regulations, and the constant flow of information relating to the use of experimental reagents, equipment, and devices, the reader is urged to review and evaluate the information provided in the package insert or instructions for each chemical, piece of equipment, reagent, or device for, among other things, any changes in the instructions or indication of usage and for added warnings and precautions. The fact that an organization or Website is referred to in this work as a citation and/or a potential source of further information does not mean that the author or the publisher endorses the information the organization or Website may provide or recommendations it may make. Further, readers should be aware that Internet Websites listed in this work may have changed or disappeared between when this work was written and when it is read. No warranty may be created or extended by any promotional statements for this work. Neither the publisher nor the author shall be liable for any damages arising herefrom.
Library of Congress Cataloging-in-Publication data applied for
ISBN: 9781118745878
A catalogue record for this book is available from the British Library.
Cover image: Cover photograph courtesy of Harvey Goldstein.
The methodology of data linkage is emerging as an important area of scientific endeavour as the opportunities for harnessing big data expand. Linkage of administrative data offers an efficient tool for research to improve understanding of people’s lives and for building a stronger evidence base for policy and service development. Linkage between data collected in primary research studies and administrative data could transform the scope, design and efficiency of primary studies such as surveys and clinical trials. To realise these potential advantages, we need to address the core methodological challenge of how to minimise the errors associated with linking imperfect or incomplete personal identifiers while preserving individual privacy. A second challenge is the need for methods to adapt to the widening opportunities for combining ‘big data’ and developments in computing systems.
A third challenge is for data governance to keep pace with new developments while addressing public concerns and perceived benefits and threats. As research needs public support, strict controls on use of linked data for research are a key tool for allaying public concerns, despite being burdensome and sometimes restricting the scope of research. In contrast, outside of research, linkage of large data sources captured as part of routine administration of services or transactions is fast becoming a normal feature of daily life. Data linkage is now essential for many daily tasks. If we buy a car, take out insurance, apply to university, use a reward card or obtain a driving licence, consent to linkage to additional information is often conditional – no consent, no service. Governance of linkage for these disparate purposes – research and the running of services – could be brought closer together. Moreover, they share a need for methodological developments to reduce the harms of linkage error and minimise threats to privacy.
A fourth challenge for linkage methodology is that opportunities for research into data linkage tend to be restricted to the few analysts or researchers able to access identifiable data. Data linkers and analysts evaluating linkage error are often constrained by commercial or service interests or by Data Protection Regulations. Publication of linkage algorithms or of evaluations demonstrating linkage error may be bad for business, reveal problems in services or threaten security. However, transparency about linkage is essential for users of linked data – governments, businesses, service providers and users – to be able to interpret results from data linkage and take account of potential biases. It has long been recognised that even small amounts of error in the linkage process can produce substantially biased results, particularly when errors are more likely to occur in records belonging to specific groups of people. This book is important because it opens up the black box surrounding data linkage and should encourage more transparency about linkage processes and their potential impacts.
The book draws together methods arising from a range of contexts, including statistical and computer science perspectives, with detailed coverage of novel methods such as graph databases, and of new applications for linkage, such as the Beyond 2011 evaluation of whether the UK census could be replaced by linkage of administrative data sources. It combines excellent coverage of the current state of the science of data linkage methods with discussion of potential future developments.
This book is essential reading for data linkers, for users of linked data and for those who determine governance frameworks within which linkage can take place. There are few settings across the world where administrative data contains complete and wholly accurate identifiers and where linkage error is not a problem. This book should help to develop methods to improve linkage quality and transparency about linkage processes and errors, to ensure that the power of data linkage for services and research is realised.
Professor Ruth Gilbert
Institute of Child Health, University College London, UK
Owen Abbott
Office for National Statistics
London, UK
Megan Bohensky
Department of Medicine, Melbourne EpiCentre
University of Melbourne
Parkville, Victoria, Australia
Raymond Chambers
National Institute for Applied Statistics Research Australia (NIASRA)
School of Mathematics and Applied Statistics
University of Wollongong
Wollongong, New South Wales, Australia
Chris Dibben
University of Edinburgh
Edinburgh, UK
Mark Elliot
University of Manchester
Manchester, UK
James M. Farrow
SANT Datalink
Adelaide, South Australia, Australia
and
Farrow Norris
Sydney, New South Wales, Australia
Harvey Goldstein
Institute of Child Health
University College London
London, UK
and
Graduate School of Education
University of Bristol
Bristol, UK
Heather Gowans
University of Oxford
Oxford, UK
Katie Harron
London School of Hygiene and Tropical Medicine
London, UK
Peter Jones
Office for National Statistics
London, UK
Gunky Kim
National Institute for Applied Statistics Research Australia (NIASRA)
School of Mathematics and Applied Statistics
University of Wollongong
Wollongong, New South Wales, Australia
Darren Lightfoot
University of St Andrews
St Andrews, UK
Martin Ralphs
Office for National Statistics
London, UK
Rainer Schnell
Research Methodology Group, University Duisburg-Essen
Duisburg, Germany
William E. Winkler
Center for Statistical Research and Methodology
U.S. Bureau of the Census
Suitland, MD, USA
Katie Harron1, Harvey Goldstein2,3 and Chris Dibben4
1London School of Hygiene and Tropical Medicine, London, UK
2Institute of Child Health, University College London, London, UK
3Graduate School of Education, University of Bristol, Bristol, UK
4University of Edinburgh, Edinburgh, UK
The increasing availability of large administrative databases for research has led to a dramatic rise in the use of data linkage. The speed and accuracy of linkage have much improved over recent decades with developments such as string comparators, coding systems and blocking, yet the methods still underpinning most of the linkage performed today were proposed in the 1950s and 1960s. Linkage and analysis of data across sources remain problematic due to a lack of identifiers that are both accurate and discriminatory, missing data and regulatory issues, especially those concerning privacy.
In this context, recent developments in data linkage methodology have concentrated on bias in the analysis of linked data, novel approaches to organising relationships between databases and privacy-preserving linkage. Methodological Developments in Data Linkage brings together a collection of chapters on cutting-edge developments in data linkage methodology, contributed by members of the international data linkage community.
The first section of the book covers the current state of data linkage, methodological issues that are relevant to linkage systems and analyses today and case studies from the United Kingdom, Canada and Australia. In this introduction, we provide a brief background to the development of data linkage methods and introduce common terms. We highlight the most important issues that have emerged in recent years and describe how the remainder of the book attempts to deal with these issues. Chapter 2 summarises the advances in linkage accuracy and speed that have arisen from the traditional probabilistic methods proposed by Fellegi and Sunter. The first section concludes with a description of the data linkage environment as it is today, with case study examples. Chapter 3 describes the opportunities and challenges provided by data linkage, focussing on legal and security aspects and models for data access and linkage.
The middle section of the book focusses on the immediate future of data linkage, in terms of methods that have been developed and tested and can be put into practice today. It concentrates on analysis of linked data and the difficulties associated with linkage uncertainty, highlighting the problems caused by errors that occur in linkage (false matches and missed matches) and the impact that these errors can have on the reliability of results based on linked data. This section of the book discusses two methods for handling linkage error, the first relating to regression analyses and the second to an extension of the standard multiple imputation framework. Chapter 7 presents an alternative to relational databases for data storage that provides significant benefits for linkage.
The final section of the book tackles an aspect of the potential future of data linkage. Ethical considerations relating to data linkage and research based on linked data are a subject of continued debate. Privacy-preserving data linkage attempts to avoid the controversial release of personal identifiers by providing means of linking and performing analysis on encrypted data. This section of the book describes the debate and provides examples.
The establishment of large-scale linkage systems has provided new opportunities for important and innovative research that until now has not been possible, but these systems also present unique methodological and organisational challenges. New linkage methods are now emerging that take a different approach to the traditional methods that have underpinned much of the research performed using linked data in recent years, leading to new possibilities in terms of speed, accuracy and transparency of research.
A statistical definition of data linkage is ‘a merging that brings together information from two or more sources of data with the object of consolidating facts concerning an individual or an event that are not available in any separate record’ (Organisation for Economic Co-operation and Development (OECD)). Data linkage has many different synonyms (record linkage, record matching, re-identification, entity heterogeneity, merge/purge) within various fields of application (computer science, marketing, fraud detection, censuses, bibliographic data, insurance data) (Elmagarmid, Ipeirotis and Verykios, 2007).
The term ‘record linkage’ was first applied to health research in 1946, when Dunn described linkage of vital records from the same individual (birth and death records) and referred to the process as ‘assembling the book of life’ (Dunn, 1946). Dunn emphasised the importance of such linkage to both the individual and health and other organisations. Since then, data linkage has become increasingly important to the research environment.
The development of computerised data linkage meant that valuable information could be combined efficiently and cost-effectively, avoiding the high cost, time and effort associated with setting up new research studies (Newcombe et al., 1959). This led to a large body of research based on enhanced datasets created through linkage. Internationally, large linkage systems of note are the Western Australia Record Linkage System, which links multiple datasets (over 30) for up to 40 years at a population level, and the Manitoba Population-Based Health Information System (Holman et al., 1999; Roos et al., 1995). In the United Kingdom, several large-scale linkage systems have also been developed, including the Scottish Health Informatics Programme (SHIP), the Secure Anonymised Information Linkage (SAIL) Databank and the Clinical Practice Research Datalink (CPRD). As data linkage becomes a more established part of research relating to health and society, there has been an increasing interest in methodological issues associated with creating and analysing linked datasets (Maggi, 2008).
Data linkage brings together information relating to the same individual that is recorded in different files. A set of linked records is created by comparing records, or parts of records, in different files and applying a set of linkage criteria or rules to determine whether or not records belong to the same individual. These rules utilise the values on ‘linking variables’ that are common to each file. The aim of linkage is to determine the true match status of each comparison pair: a match if records belong to the same individual and a non-match if records belong to different individuals.
As the true match status is unknown, linkage criteria are used to assign a link status for each comparison pair: a link if records are classified as belonging to the same individual and a non-link if records are classified as belonging to different individuals.
In a perfect linkage, all matches are classified as links, and all non-matches are classified as non-links. If comparison pairs are misclassified (false matches or missed matches), error is introduced. False matches occur when records from different individuals link erroneously; missed matches occur when records from the same individual fail to link.
In deterministic linkage, a set of predetermined rules is used to classify pairs of records as links and non-links. Typically, deterministic linkage requires exact agreement on a specified set of identifiers or matching variables. For example, two records may be classified as a link if their values of National Insurance number, surname and sex agree exactly. Modifications of strict deterministic linkage include ‘stepwise’ deterministic linkage, which uses a succession of rules; the ‘n−1’ deterministic procedure, which allows a link to be made if all but one of a set of identifiers agree; and ad hoc deterministic procedures, which allow partial identifiers to be combined into a pseudo-identifier (Abrahams and Davy, 2002; Maso, Braga and Franceschi, 2001; Mears et al., 2010). For example, a combination of the first letter of surname, month of birth and postcode area (e.g. H01N19) could form the basis for linkage.
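As an illustration only (the field names, the stepwise ordering and the handling of missing values are assumptions for the sketch rather than a prescribed implementation), a stepwise deterministic rule of this kind might look as follows:

```python
# Illustrative sketch of stepwise deterministic linkage (hypothetical fields).
def deterministic_link(rec_a, rec_b):
    """Classify a pair of records as a link (True) or non-link (False)."""
    # Step 1: strict rule - require exact, non-missing agreement on
    # National Insurance number, surname and sex.
    strict_fields = ("ni_number", "surname", "sex")
    if all(rec_a[f] and rec_a[f] == rec_b[f] for f in strict_fields):
        return True
    # Step 2: fall back to a pseudo-identifier built from partial identifiers:
    # first letter of surname + month of birth + postcode area (e.g. 'H01N19').
    def pseudo_id(rec):
        return rec["surname"][:1].upper() + rec["dob_month"] + rec["postcode_area"]
    return pseudo_id(rec_a) == pseudo_id(rec_b)

a = {"ni_number": "", "surname": "Harris", "sex": "F",
     "dob_month": "01", "postcode_area": "N19"}
b = {"ni_number": "", "surname": "HARRIS", "sex": "F",
     "dob_month": "01", "postcode_area": "N19"}
print(deterministic_link(a, b))  # True: linked via the pseudo-identifier H01N19
```

A real implementation would also standardise the identifiers (case, whitespace, spelling variants) before any comparison is made.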
Strict deterministic methods that require identifiers to match exactly often have a high rate of missed matches, as any recording errors or missing values can prevent identifiers from agreeing. Conversely, the rate of false matches is typically low, as the majority of linked pairs are true matches (records are unlikely to agree exactly on a set of identifiers by chance) (Grannis, Overhage and McDonald, 2002). Deterministic linkage is a relatively straightforward and quick linkage method and is useful when records have highly discriminative or unique identifiers that are well completed and accurate. For example, the community health index (CHI) is used for much of the linkage in the Scottish Record Linkage System.
Newcombe was the first to propose that comparison pairs could be classified using a probabilistic approach (Newcombe et al., 1959). He suggested that a match weight be assigned to each comparison pair, representing the likelihood that two records are a true match, given the agreement of their identifiers. Each identifier contributes separately to an overall match weight. Identifier agreement contributes positively to the weight, and disagreement contributes a penalty. The size of the contribution depends on the discriminatory power of the identifier, so that agreement on name makes a larger contribution than agreement on sex (Zhu et al., 2009). Fellegi and Sunter formalised Newcombe’s proposals into the statistical theory underpinning probabilistic linkage today (Fellegi and Sunter, 1969). Chapter 2 provides details on the match calculation.
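For orientation only, the weight calculation detailed in Chapter 2 typically takes the standard Fellegi–Sunter form: for each matching variable $i$, let $m_i$ be the probability that the variable agrees given that the pair is a true match, and $u_i$ the probability that it agrees given a non-match; then

$$
w_i =
\begin{cases}
\log_2\!\left(\dfrac{m_i}{u_i}\right) & \text{if variable } i \text{ agrees},\\[6pt]
\log_2\!\left(\dfrac{1-m_i}{1-u_i}\right) & \text{if variable } i \text{ disagrees},
\end{cases}
\qquad
W = \sum_i w_i .
$$

Agreement on a highly discriminating variable (small $u_i$) thus contributes a large positive weight, while disagreement contributes a negative penalty; summing the $w_i$ assumes the variables agree independently, and the total weight $W$ is compared against the thresholds described below.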
In probabilistic linkage, link status is determined by comparing match weights to a threshold or cut-off match weight in order to classify each pair as a link or a non-link. In addition, manual review of record pairs is often performed to aid choice of threshold and to deal with uncertain links (Krewski et al., 2005). If linkage error rates are known, thresholds can be selected to minimise the total number of errors, so that the numbers of false matches and missed matches balance. However, error rates are usually unknown. The subjective process of choosing probabilistic thresholds is a limitation of probabilistic linkage, as different linkers may choose different thresholds. This can result in multiple possible versions of the linked data.
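A minimal sketch of the resulting decision rule (the cut-off values and the three-way split into link, clerical review and non-link are illustrative assumptions, not values taken from this book):

```python
# Sketch of a threshold-based decision rule for probabilistic linkage.
# The lower/upper cut-offs are hypothetical and would in practice be chosen
# (often subjectively) by the data linker.
def classify_pair(total_weight, lower=8.0, upper=15.0):
    """Return 'link', 'review' or 'non-link' for a comparison pair."""
    if total_weight >= upper:
        return "link"
    if total_weight <= lower:
        return "non-link"
    return "review"  # uncertain pairs between the thresholds go to manual review

for w in (18.2, 11.0, 3.5):
    print(w, classify_pair(w))
```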
There are certain problems with the standard probabilistic procedure. The first is the assumption of independence for the probabilities associated with the individual matching variables. For example, observing an individual in any given ethnic group category may be associated with certain surname structures, and hence, the joint probability of agreeing across matching variables may not simply be the product of the separate probabilities. Ways of dealing with this are suggested in Chapters 2 and 6. A second typical problem is that records with match weights that do not reach the threshold are excluded from data analysis, reducing efficiency and introducing bias if this is associated with the characteristics of the variables to be analysed. Chapter 6 suggests a way of dealing with this using missing data methods. A third problem occurs when the errors in one or more matching variables are associated with the values of the secondary data file variables to be transferred for analysis. This non-random linkage error can lead to biases in the estimates from subsequent analyses, and this is discussed in Chapters 4–6. Chapter 4 reviews the literature and sets out the situations where linkage bias of any kind can arise, including the important case when individual consent to linkage may be withheld so leading to incomplete administrative registers. Chapter 5 looks explicitly at regression modelling of linked data files when different kinds of errors are present, and Chapter 6 proposes a Bayesian procedure for handling incomplete linkages.
One of the features of traditional probabilistic methods is that once weights have been computed, the full pattern of similarities that give rise to these weights, based upon the matching variables, is either discarded or stored in a form that requires any future linkage to repeat the whole process. In Chapter 7, a graphical approach to data storage and retrieval is proposed that would give the data linker efficient access to such patterns from a graph database. In particular, it would give the linker the possibility to readily modify her algorithm or update files as further information becomes available. Chapter 7 discusses implementation details.
Quality of data linkage ultimately depends on the quality of the underlying data. If datasets to be linked contained sufficiently accurate, complete and discriminative information, data linkage would be a straightforward database merging process. Unfortunately, many administrative datasets contain messy, inconsistent and missing data. Datasets also vary in structure, format and content. The way in which data are entered can influence data quality. For example, errors may be more likely to occur in identifiers that are copied from handwritten forms, scanned or transcribed from a conversation. These issues mean that techniques to handle complex and imperfect data are required. Although data preparation is an important concern when embarking on a linkage project, we do not attempt to cover this in the current volume. A good overview can be found in Christen (2012a).
Linkage error occurs when record pairs are misclassified as links or non-links. Errors tend to occur when there is no unique identifier (such as NHS number or National Insurance number) or when available unique identifiers are prone to missing values or errors. This means that linkage relies on partial identifiers such as sex, date of birth or surname (Sariyar, Borg and Pommerening, 2012).
False matches, where records from different individuals link erroneously, occur when different individuals have similar identifiers. These errors occur more often when there is a lack of discriminative identifiers and file sizes are large (e.g. different people sharing the same sex, date of birth and postcode). For records that have more than the expected number of candidate records in the linking file, the candidate(s) with the most agreeing identifiers or with the highest match weight is typically accepted as a link. This may not always be the correct link.
Missed matches, where records from the same individual fail to link, occur where there are errors in identifiers. This could be due to misreporting (e.g. typographical errors), changes over time (e.g. married women’s surnames) or missing/invalid data that prevent records from agreeing.
Many linkage studies report the proportion of records that were linked (match rate). Other frequently reported measures of linkage quality are sensitivity and specificity (Christen and Goiser, 2005). These measures are directly related to the probability of false matches and missed matches. However, interpretation of these measures is not always straightforward. For example, match rate is only relevant if all records are expected to be matched. Furthermore, such measures of linkage error can be difficult to relate to potential bias in results.
Derivation of measures of linkage error can also be difficult, as estimation requires that either the error rate is known or that the true match status of comparison pairs is known. A common method for measuring linkage error is the use of a gold-standard dataset. Gold-standard data may be an external data source or a subset of data with additional identifiers available (Fonseca et al., 2010; Monga and Patrick, 2001). Many linkage projects create a gold-standard dataset by taking a sample of comparison pairs and submitting the records to manual review (Newgard, 2006; Waien, 1997). The aim of the manual review is to determine the true match status of each pair (Belin and Rubin, 1995; Gill, 1997; Morris et al., 1997; Potz et al., 2010). Once a gold-standard dataset has been obtained, it is used to calculate sensitivity, specificity and other measures of linkage error by comparing the true match status of each comparison pair (in the gold-standard data) with the link status of each pair (Wiklund and Eklund, 1986; Zingmond et al., 2004). These estimates are assumed to apply to the entire linked dataset (the gold-standard data are assumed to be representative). This is a reasonable assumption if the gold-standard data were a random sample; otherwise, potential biases might be introduced.
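As a minimal sketch of how such measures are derived from a gold-standard sample (the variable names and the toy counts are assumptions for illustration):

```python
# Sketch: deriving linkage-quality measures from a gold-standard sample in
# which the true match status of every comparison pair is known.
def linkage_quality(pairs):
    """pairs: list of (is_true_match, is_link) booleans, one per comparison pair."""
    tp = sum(m and l for m, l in pairs)        # true matches correctly linked
    fn = sum(m and not l for m, l in pairs)    # missed matches
    fp = sum(l and not m for m, l in pairs)    # false matches
    tn = sum(not m and not l for m, l in pairs)
    return {
        "sensitivity": tp / (tp + fn),               # proportion of matches linked
        "specificity": tn / (tn + fp),               # proportion of non-matches not linked
        "positive_predictive_value": tp / (tp + fp), # proportion of links that are matches
        "proportion_linked": (tp + fp) / len(pairs),
    }

# Toy gold-standard sample: 100 true matches (10 of them missed) and
# 900 non-matches (5 of them falsely linked).
sample = ([(True, True)] * 90 + [(True, False)] * 10 +
          [(False, True)] * 5 + [(False, False)] * 895)
print(linkage_quality(sample))
```

If the reviewed sample over-represents linked pairs, the sensitivity estimate in particular will be unreliable, a point taken up below.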
Manual review is convenient but can take a substantial amount of time, particularly for large files (Qayad and Zhang, 2009; Sauleau, Paumier and Buemi, 2005). It also may not always be completely accurate. If samples are only taken from linked pairs – which is often the case due to the smaller number of links compared to non-links – the rate of missed matches would not be estimated. If the sample of pairs reviewed is not representative, estimates of linkage error may be biased.
Although a large body of literature exists on methods and applications of data linkage, there has been relatively little methodological research into the impact of linkage error on analysis of linked data. The issue is not a new one – Neter, Maynes and Ramanathan (1965) recognised that even relatively small errors could result in substantially biased estimates. The lack of comprehensive linkage evaluation seems to be due to a lack of awareness of the implications of linkage error, possibly resulting from a lack of communication between data linkers and data users. However, the relevance and reliability of research based on linked data are called into question in the presence of linkage error (Chambers, 2009).
Data custodians (organisations that hold personally identifiable data that could be used for research) have a responsibility to protect privacy by adhering to legislation and guidelines, avoiding unauthorised copies of data being made and distributed and ensuring data are used only for agreed purposes. For these reasons, data custodians can be unwilling or unable to release identifiable data for linkage. To overcome this issue, many linkage projects adhere to the ‘separation principle’. This means that the people performing the linkage (the data linkers – sometimes a trusted third party) do not have access to ‘payload’ data and people performing the analysis (the data users) do not have access to any personal identifiers. This protects confidentiality and means that linked datasets can be used for a range of purposes (Goeken et al., 2011). Approaches to privacy and security are discussed in detail in Chapter 3.
While the potential knowledge benefits from data linkage can be very great, these have to be balanced against the need to ensure the protection of individuals’ personal information. Working with data that is non-personal (i.e. truly anonymous) guarantees such protection but is rarely practicable in a data linkage context. Instead, what is required is the construction of a data linkage environment in which the process of re-identification is made so difficult that the data can be judged as practicably anonymous. This type of environment is created both through the governance processes operating across the environment and the data linkage and analysis models that structure the operational processes. Chapter 3 reviews the main models that are used and their governance processes. Some examples from across the world are presented as case studies.
The separation principle is recognised as good practice but means that researchers often lack the information needed to assess the impact of linkage error on results and are unable to report useful evaluations of linkage (Baldi et al., 2010; Harron et al., 2012; Herman et al., 1997; Kelman, Bass and Holman, 2002). Separation typically means that any uncertainty in linkage is not carried through to analysis. The examples of linkage evaluation that do appear in the literature are often extreme cases of bias due to linkage error. However, as there is a lack of consistent evaluation of linkage, it is difficult to identify the true extent of the problem.
Reported measures of linkage error are important, as they offer a simple representation of linkage quality. However, in isolation, measures of sensitivity and specificity cannot always provide interpretation of the validity of results (Leiss, 2007). Although it is useful to quantify linkage error, it is most important to understand the impact of these errors on results.
The impact of linkage error on analysis of linked data depends on the structure of the data, the distribution of error and the analysis to be performed (Fett, 1984; Krewski et al., 2005). In some studies, it may be important to capture all true matches, and so a more specific approach could be used. For example, if linkage was being used to detect fraud, it may be important that all possible links were captured. In other studies, it might be more important that linked records are true matches, and missed matches are less important. For example, if linked healthcare records were being used to obtain medical history, it might be more important to avoid false matches (German, 2000). For these reasons, it is important that the impact of linkage error is understood for a particular purpose, with linkage criteria ideally tailored to that purpose. Bias due to linkage error is explored in detail in Chapter 4.
Methods for data linkage have evolved over recent years to address the dynamic, error-prone, anonymised or incomplete nature of administrative data. However, as the size and complexity of these datasets increase, current techniques cannot eliminate linkage error entirely. Manual review is not feasible for the linkage of millions of records at a time. With human involvement in the creation of these data sources, recording errors will always be an issue and lead to uncertainty in linkage. Furthermore, as opportunities for linkage of data between organisations and across sectors arise, new challenges will emerge.
Chapter 8 looks at record linking approaches that are used to support censuses and population registers; there is increasing interest in this area with the growing availability of large-scale administrative datasets in health, social care, education, policing and other sectors. Chapter 9 looks at issues of data security and, like Chapter 3, addresses the balance between individual privacy protection and knowledge. It explores technical solutions that can be implemented, especially those that can operate on so-called ‘hashed’ data or data that is ‘pseudonymised at source’, where the linker only has access to linking variables in which the original information has been transformed (consistently but irreversibly) into non-disclosive pseudonyms.
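As a simple illustration of what ‘pseudonymised at source’ can mean in practice (the shared key, the field standardisation and the use of a keyed hash here are assumptions for the sketch; the Bloom-filter methods discussed in Chapter 9 are considerably more elaborate and can tolerate errors in identifiers):

```python
import hashlib
import hmac

# Sketch only: each data provider applies the same keyed hash to its linking
# variables before release, so the linker sees only non-disclosive pseudonyms.
# All providers must share the secret key and the same standardisation rules
# for the pseudonyms to agree across files; exact-match linkage is then
# possible, but any residual error in the identifiers breaks the agreement.
SECRET_KEY = b"shared-project-key"  # hypothetical key agreed between providers

def pseudonymise(identifier: str) -> str:
    """Irreversibly transform an identifier value into a pseudonym."""
    standardised = identifier.strip().upper()
    return hmac.new(SECRET_KEY, standardised.encode("utf-8"), hashlib.sha256).hexdigest()

print(pseudonymise("Smith"))
print(pseudonymise(" smith "))  # identical pseudonym after standardisation
```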
The second decade of the twenty-first century is an exciting and important era for data linkage, with increasing resources and a broad range of disciplinary expertise being applied. Our hope is that the present volume, by setting out some of these developments, will encourage further work and interest.
William E. Winkler
Center for Statistical Research and Methodology, U.S. Bureau of the Census, Suitland, MD, USA
This chapter comprehensively sets out the theory behind probabilistic linkage methods, giving a historical account of their development and describing the current scene. It looks at the various practical approaches that are used, with examples.
