Secondary abstract: |
We researched the quality of survey responses, where we do not know whether the answers really reflect the opinions of the interviewees. We believe that inconsistent respondents can be detected with machine learning techniques. Our idea is to build a prediction model for every question of a survey. With these models we obtain a probability distribution over the possible answers to each question, and we use cross-validation to get such distributions for all instances. We evaluate the predictions with the Brier score, information score, predicted probabilities, classification accuracy, Brier score ranking, information score ranking, probability ranking, and classification accuracy ranking. We merge these scores into an inconsistency score for every instance (interviewee) of the survey, and visualize the inconsistent cases for better comprehension.
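As an illustration of the per-question modelling step, the following is a minimal R sketch using CORElearn's CoreModel/predict interface; the data frame survey (one factor column per question), the function name inconsistencyScore, and the fold count are hypothetical, and this is not the exact thesis code.

  library(CORElearn)

  # For each question, predict it from all other answers with a random
  # forest, estimate out-of-fold probabilities by cross-validation, and
  # score how unlikely each respondent's own answer is under the model.
  inconsistencyScore <- function(survey, folds = 10) {
    n <- nrow(survey)
    fold <- sample(rep(seq_len(folds), length.out = n))
    probOwn <- matrix(NA_real_, nrow = n, ncol = ncol(survey))
    for (q in seq_along(survey)) {
      frm <- as.formula(paste(names(survey)[q], "~ ."))
      for (f in seq_len(folds)) {
        test <- fold == f
        model <- CoreModel(frm, data = survey[!test, ], model = "rf")
        pred <- predict(model, survey[test, ])  # list: $class, $probabilities
        # probability assigned to the answer actually given; the columns of
        # $probabilities are assumed to follow the factor levels of the target
        probOwn[test, q] <- pred$probabilities[
          cbind(seq_len(sum(test)), as.integer(survey[test, q]))]
        destroyModels(model)                    # free the underlying C++ model
      }
    }
    rowMeans(1 - probOwn)  # high value = answers the models find unlikely
  }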
We developed the method in the statistical system R with the packages CORElearn [14], MASS [20], and rpart [16]. For visualization we used the package CORElearn and the data mining software Orange [5].
For testing purposes we used the data sets Monk, B2B, B2C, DPS, and hearing aid. As prediction models we mostly used random forests because of their superb accuracy. Missing values were imputed with k-nearest neighbors (kNN), the mode, or the mean, or the affected instance was simply removed from the data. We generated inconsistent data and tried to identify these cases. There was some variance in our inconsistency scores, so we reduced it by averaging the scores. For better comprehension and identification, we plotted the cases identified as inconsistent (see the sketch below). The results depend on the data set and the evaluation method: the Brier score, predicted probabilities, Brier score ranking, and probability ranking identified all inconsistent instances (interviewees) in most cases, while the other methods sometimes failed to identify them. The approach is computationally demanding for larger data sets.
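The averaging step might look like the following sketch, which assumes the hypothetical inconsistencyScore function above and that the variance is reduced by repeating the cross-validated scoring several times:

  set.seed(2024)                                       # reproducible folds
  runs  <- replicate(10, inconsistencyScore(survey))   # 10 repeated CV scorings
  score <- rowMeans(runs)                              # averaged inconsistency score
  head(order(score, decreasing = TRUE))                # most suspicious respondents first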