Survey data cleaning - Daniel Soto

Our demand prediction research project in Indonesia includes a 1200-household survey as part of the data collection. UNCEN students collect the data on tablet computers using the OpenDataKit platform, and the responses are stored on the Ona platform for retrieval and analysis.

In any survey, some entries in the results need to be corrected because of errors or misunderstandings during the survey process. It is common to edit the survey data file directly as these errors are identified. That approach lacks traceability and, depending on the implementation, permanently overwrites the record of the original data.

For this survey, I created an approach where each change to the data is recorded as an entry in a text file that details the field to be changed and the reason. This text file is parsed and used to generate a file of "cleaned" data. The original data is never modified, so we always know exactly what was collected. We also have a record, under version control, of each change to the survey. Together, the original data and the change record produce a derived data file that we can use for analysis.

For the implementation, I have a YAML file in which each entry holds the fields necessary to identify the record, the field to be changed, the original and corrected values, the reason, and the date. A single entry looks like this:

_uuid: 0bc2426a-7b7b-49aa-b5ce-055a694c23b0
_id: 93086
field: data_field
original_value: old value
corrected_value: new value
reason: order of magnitude error
date: 2015-03-21
date: 2015-03-21
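
As a rough sketch of how an entry like this parses, the snippet below loads it with the PyYAML package. It assumes the file is written as a YAML list, so the leading "- " on the entry is my assumption here, and the names CORRECTIONS_YAML, REQUIRED_KEYS and load_corrections are hypothetical helpers for illustration only.

```python
import yaml  # PyYAML

# The entry from above, written as one item of a YAML list.
# The "- " list marker is an assumption about the file layout.
CORRECTIONS_YAML = """
- _uuid: 0bc2426a-7b7b-49aa-b5ce-055a694c23b0
  _id: 93086
  field: data_field
  original_value: old value
  corrected_value: new value
  reason: order of magnitude error
  date: 2015-03-21
"""

# Hypothetical set of keys every correction entry is expected to carry.
REQUIRED_KEYS = {
    "_uuid", "_id", "field",
    "original_value", "corrected_value",
    "reason", "date",
}


def load_corrections(text):
    """Parse the YAML text and check that each entry has all required keys."""
    entries = yaml.safe_load(text) or []
    for position, entry in enumerate(entries):
        missing = REQUIRED_KEYS - set(entry)
        if missing:
            raise ValueError(
                f"entry {position} is missing keys: {sorted(missing)}"
            )
    return entries


corrections = load_corrections(CORRECTIONS_YAML)
for entry in corrections:
    print(entry["_id"], entry["field"], type(entry["date"]).__name__)
```

One thing to be aware of: yaml.safe_load converts unquoted scalars to native types, so _id comes back as an integer and date as a datetime.date rather than as strings.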

I use the Python yaml package to load the file, which gives me a list of dictionaries. The survey data is in a pandas DataFrame. I can then iterate over this list, get the id of a record, check for consistency with the existing values, and change the field within that record. After iterating over everything, I save the corrected DataFrame to an Excel spreadsheet and a CSV file.
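
The loop described above might look roughly like the following. This is a minimal sketch rather than the project's actual script: the function name apply_corrections, the second survey row, and the choice to raise an error on any mismatch are all assumptions, and the corrections list is written out by hand in the shape the YAML loader would return.

```python
import pandas as pd


def apply_corrections(survey, corrections):
    """Return a corrected copy of the survey; the input frame is not modified."""
    cleaned = survey.copy()
    for entry in corrections:
        # Locate the record by its _id; exactly one row should match.
        mask = (cleaned["_id"] == entry["_id"]).to_numpy()
        matches = cleaned.index[mask]
        if len(matches) != 1:
            raise ValueError(
                f"_id {entry['_id']} matched {len(matches)} rows"
            )
        row = matches[0]

        # Consistency checks: the uuid and the current value must agree
        # with what the correction entry says it is replacing.
        if cleaned.at[row, "_uuid"] != entry["_uuid"]:
            raise ValueError(f"_uuid mismatch for _id {entry['_id']}")
        current = cleaned.at[row, entry["field"]]
        if current != entry["original_value"]:
            raise ValueError(
                f"_id {entry['_id']}: expected {entry['original_value']!r} "
                f"in {entry['field']}, found {current!r}"
            )

        cleaned.at[row, entry["field"]] = entry["corrected_value"]
    return cleaned


# Small stand-in for the survey data; the second row is hypothetical.
survey = pd.DataFrame({
    "_id": [93086, 93087],
    "_uuid": [
        "0bc2426a-7b7b-49aa-b5ce-055a694c23b0",
        "hypothetical-second-uuid",
    ],
    "data_field": ["old value", "unchanged value"],
})

corrections = [{
    "_uuid": "0bc2426a-7b7b-49aa-b5ce-055a694c23b0",
    "_id": 93086,
    "field": "data_field",
    "original_value": "old value",
    "corrected_value": "new value",
    "reason": "order of magnitude error",
    "date": "2015-03-21",
}]

cleaned = apply_corrections(survey, corrections)
print(cleaned)

# Saving the derived file (file names are placeholders):
# cleaned.to_csv("survey_cleaned.csv", index=False)
# cleaned.to_excel("survey_cleaned.xlsx", index=False)  # needs an Excel writer such as openpyxl
```

Working on a copy keeps the original DataFrame untouched, which mirrors the goal of never modifying the collected data. Failing loudly when the stored value differs from original_value also catches corrections that were already applied or that point at the wrong record.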