What's in a name? String Comparators in Probabilistic Data Linking
Paul David Campbell, Gokay Saher, Noel Hansen, Peter Rossiter
Building: Law Building
Room: Breakout 6 - Law Building, Room 022
Date: 2012-07-11 03:30 PM – 05:00 PM
Last modified: 2011-12-19
Abstract
Probabilistic data linking is applied when the two datasets to be linked share no unique record identifier. In place of a unique identifier, a number of fields that do not uniquely identify a person are used, such as name, sex, and date of birth. For each potential linked pair, a decision must be made about whether there is agreement on each linking field. For some fields, such as sex, agreement or disagreement is quite clear. But for fields such as name, minor coding errors or misspellings can lead to a given individual being recorded with slightly different name strings in the two datasets.
The Australian Bureau of Statistics has adopted the widely used Winkler comparator to evaluate names. The comparator takes two strings and uses the number of characters common to both, the lengths of the two strings, and the extent to which the common characters appear in the same order, to produce a score between 0 and 1 reflecting the similarity between the two strings. The comparator works well in most cases. However, testing suggests it gives inflated scores when one of the strings is very short. This presentation will outline the problem in more detail and introduce some alternative string comparators, which aim to penalise more heavily the case of a short string being compared with a longer one.
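To make the description above concrete, the following is a minimal sketch of a Winkler-style comparator. The abstract does not give the exact formula or parameters used, so this assumes the commonly published Jaro-Winkler form: a Jaro score built from common characters, string lengths and transpositions, plus a bonus for a shared prefix. The function names, the prefix scale of 0.1 and the prefix cap of 4 are assumptions for illustration only.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1], assuming the commonly published definition."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters count as "common" only if they lie within this distance.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Common characters that appear in a different order are transpositions.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2

    return (matches / len1
            + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str,
                 prefix_scale: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro score boosted for a shared prefix (assumed parameters)."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


if __name__ == "__main__":
    # A transposition between two strings of similar length.
    print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
    # A very short string against a much longer one (hypothetical names).
    print(round(jaro_winkler("JO", "JOHNATHAN"), 4))
```

Under these assumptions the hypothetical pair "JO" and "JOHNATHAN" scores about 0.79, even though the shorter string covers only two of the nine characters in the longer one. This is the kind of short-string case the abstract describes as producing inflated scores; the names are illustrative and are not taken from the testing it mentions.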