Assessing the Quality of Survey Data and the Dirty Data Index
Jörg Blasius, Victor Thiessen
Building: Law Building
Room: Breakout 6 - Law Building, Room 022
Date: 2012-07-12 03:30 PM – 05:00 PM
Last modified: 2012-03-29
Abstract
Responses to a set of items in survey data are associated not only with socio-demographic characteristics such as age, gender, and educational level, but also with different kinds of response styles, such as acquiescence response style, extreme response style, and midpoint responding. In addition, misunderstandings of questions, arbitrary responses, fatigue, and other effects reduce the quality of the data. In general, when analyzing a battery of items, responses are related both to the substantive concept, which is our main interest, and to methodological effects. Applying categorical principal component analysis (CatPCA) to an item battery of survey data allows us to assess what part of the responses is due to substantive relationships and what part is attributable to methodological artifacts. In a first paper, Blasius and Thiessen (2009) demonstrated that the share of tied data in CatPCA can be used as a rough indicator of data quality. This idea has been developed further, so that we are now able to provide a coefficient, the Dirty Data Index (DDI), that describes the quality of responses in a given item set. Using different examples, we will show which part of the variation can be explained by the substantive concept and which part is methodologically induced.