Finding a Needle in a Haystack: The Theoretical and Empirical Foundations of Assessing Disclosure Risk for Contextualized Microdata
Kristine Witkowski
Building: Law Building
Room: Breakout 1 - Law Building, Room 024
Date: 2012-07-10 01:30 PM – 03:00 PM
Last modified: 2011-12-21
Abstract
In their efforts to broadly release information of high scientific value, data producers may consider releasing the attributes of geographic areas instead of directly identifying the locations of respondents. To inform the design and production of such data files, this study describes the factors that matter when evaluating the disclosure risk of contextualized microdata, along with some of the empirical steps involved in assessing it. Using synthetic sets of survey respondents, I illustrate how different postulates shape the assessment of risk when considering: (1) the estimated probabilities that unidentified geographic areas are represented within a survey; (2) the number of people in the population who share the same personal and contextual identifiers as a respondent; and (3) the anticipated amount of coverage error in census population counts and in extant files that provide identifying information (i.e., name, address).
To inform the construction of anonymized research data files that contain the attributes of spatial units, I then conduct reidentification experiments on nearly 15,000 simulated datasets to assess likely patterns of disclosure risk under alternative database designs, particularly those relating to: (1) direct geographic identifiers, as determined by the known division, state, and MSA status of study locations; (2) the type of geographic entity, as determined by well-known spatial units used in the administration of surveys and governments; (3) the number of indirect geographic identifiers provided in a dataset, as determined by samples of geographic attribute sets; and (4) the coarseness of these contextual measures, as determined by global recoding schemes.
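As a rough illustration of two ideas named above, the count of population members who share a respondent's identifiers and the effect of global recoding on that count, the following minimal sketch may help. It is not the study's procedure: the record layout (age group, sex, area poverty rate), the toy population, and the helper names `recode` and `match_counts` are all hypothetical, chosen only to show how coarser contextual measures tend to enlarge the group a respondent can hide in.

```python
from collections import Counter

def recode(value, width):
    """Global recoding: collapse a numeric contextual measure into bins of the given width."""
    return int(value // width) * width

def match_counts(population, respondents, width):
    """For each respondent, count population members with the same personal and contextual key."""
    def key(rec):
        age, sex, poverty = rec
        return (age, sex, recode(poverty, width))
    counts = Counter(key(p) for p in population)
    return [counts[key(r)] for r in respondents]

# Hypothetical toy population: (age group, sex, area poverty rate in percent)
population = [
    ("30-39", "F", 12.4),
    ("30-39", "F", 13.9),
    ("30-39", "F", 18.2),
    ("30-39", "M", 12.4),
    ("40-49", "F", 27.5),
]
# Hypothetical survey respondents drawn from that population
respondents = [population[0], population[4]]

fine = match_counts(population, respondents, width=1)     # 1-point poverty bins
coarse = match_counts(population, respondents, width=10)  # 10-point poverty bins
print(fine, coarse)
```

In this toy case the first respondent is unique under 1-point bins but shares a key with two other people under 10-point bins, while the second respondent stays unique at both levels. Coarsening lowers risk only where other population members fall into the same recoded cell.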