ACSPRI Conferences, RC33 Eighth International Conference on Social Science Methodology

Unbiased Regression Estimation for Multi-Linked Data in the Presence of Correlated Linkage Errors

Gunky Kim

Building: Law Building
Room: Breakout 6 - Law Building, Room 022
Date: 2012-07-11 01:30 PM – 03:00 PM
Last modified: 2011-12-19


Probabilistic data linkage is widely used when direct measurement is impossible or extremely costly. One important application is where different data sets relating to the same individuals at different points in time are 'multi-linked' to provide a synthetic longitudinal data record for each individual. However, even with a unique identifier, linkage errors in the merged data can result in a longitudinal record that is actually made up of data items from different individuals. For example, a recent Australian Bureau of Statistics evaluation of different methods for linking census data reported a best-case linkage method with 87% correct linkage, with much lower correct linkage rates for more realistic linkage methods. These linkage errors lead to bias and loss of efficiency in regression modelling using the merged data set. Kim and Chambers (2011) describe methods for correcting the bias due to linkage errors when multiple data sets are probabilistically multi-linked, but these methods assume independent pairwise linkage errors. A more realistic scenario allows dependent pairwise linkage errors, in the sense that if the records corresponding to two different individuals in data sets A and B are incorrectly linked, then the records for the same two individuals in data sets A and C are also likely to be incorrectly linked. In this paper we show how the bias due to correlated linkage errors in the resulting merged data set can be corrected. Our methods are based on the inference framework described in Chambers (2009), and we focus on the situation where the merged data set is obtained by linking three separate data sources via two possibly dependent linkage operations. These data sources could represent different registers for the same population at different points in time, or they could arise where a survey sample is linked to two separate population registers, one contemporaneous with the survey and the other containing historical information.
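
To make the bias described above concrete, the following is a minimal simulation sketch, not the method of this paper. It covers only the simplest case of a single linkage between two data sets, and it assumes an exchangeable linkage error model in which each record is correctly linked with probability `lam` and is otherwise equally likely to be linked to any other record. The correction shown is a ratio-type adjustment under that assumed model; the function names, the simulated data and the default value `lam = 0.87` (borrowed from the best-case rate quoted above) are illustrative assumptions. The correlated, multi-linked case that the abstract addresses is not handled here.

```python
import numpy as np


def simulate_linked_data(n=20000, lam=0.87, beta=(1.0, 2.0), seed=0):
    """Simulate y = b0 + b1*x + noise, then mislink a share (1 - lam) of records."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = beta[0] + beta[1] * x + rng.normal(size=n)

    # Pick the records that will be incorrectly linked.
    n_wrong = int(round((1 - lam) * n))
    wrong = rng.choice(n, size=n_wrong, replace=False)

    # A cyclic shift within the chosen subset guarantees that every chosen
    # record receives the y value of a different record.
    y_star = y.copy()
    y_star[wrong] = y[np.roll(wrong, 1)]
    return x, y_star


def naive_ols(x, y_star):
    """Ordinary least squares that ignores linkage errors."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y_star, rcond=None)[0]


def corrected_ols(x, y_star, lam):
    """Ratio-type correction under the ASSUMED exchangeable error model.

    Under that model E[y_star | X] = A X beta, where A has lam on the
    diagonal and gamma = (1 - lam) / (n - 1) elsewhere, so we solve
    (X' A X) beta = X' y_star. A X is computed without forming A.
    """
    n = x.size
    X = np.column_stack([np.ones_like(x), x])
    gamma = (1 - lam) / (n - 1)
    AX = (lam - gamma) * X + gamma * X.sum(axis=0)
    return np.linalg.solve(X.T @ AX, X.T @ y_star)


if __name__ == "__main__":
    x, y_star = simulate_linked_data()
    print("naive     (intercept, slope):", naive_ols(x, y_star))
    print("corrected (intercept, slope):", corrected_ols(x, y_star, lam=0.87))
```

In this setup the naive slope should be attenuated towards zero by a factor of roughly `lam`, while the corrected slope should be close to the true value of 2. The sketch treats `lam` as known; in practice it would have to be estimated, for example from an audit sample of linked records.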