ePrints.FRI - University of Ljubljana, Faculty of Computer and Information Science

Analysis of survey consistency with machine learning

Dejan Ognjenović (2011) Analysis of survey consistency with machine learning. EngD thesis.


    We studied the quality of survey responses. It is not known whether answers really reflect the opinions of interviewees. We believe that inconsistent respondents can be detected with machine learning techniques. Our idea is to build a prediction model for every question of a survey. With these models, we get a probability distribution for every answer in the survey. We use cross-validation to get distributions for all instances. We evaluate them with Brier score, information score, probabilities, classification accuracy, Brier ranking, information score ranking, probability ranking and classification accuracy ranking. We merge these scores to get an inconsistency score for every instance (interviewee) of the survey. We visualize the inconsistent cases for better comprehension. We developed the method with the statistical system R and the packages CORElearn [14], MASS [20] and rpart [16]. For visualization we used the package CORElearn and the data mining software Orange [5]. For testing we used the data sets Monk, B2B, B2C, DPS and Hearing aid. As prediction models we mostly used random forests because of their superb accuracy. Missing values were imputed with k-nearest neighbors (kNN), the mode, or the mean, or the instance was simply removed from the data. We generated inconsistent data and tried to identify these cases. There was some variance in our inconsistency scores, so we reduced it by averaging the scores. For better comprehension and identification, we plotted the cases that were identified as inconsistent. The results depend on the data and the evaluation method. Brier score, probabilities, Brier ranking and probability ranking in most cases identified all inconsistent instances (interviewees). Other methods sometimes failed to identify inconsistent cases. The approach is computationally demanding for larger data sets.
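
The scoring step described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation (which was written in R with CORElearn). It assumes the standard multi-class Brier score, assumes that per-question scores are merged by a plain average, and uses made-up probability distributions in place of cross-validated random forest predictions. All function and variable names are hypothetical.

```python
from typing import List, Sequence


def brier_score(probs: Sequence[float], true_index: int) -> float:
    """Multi-class Brier score for one answer: squared distance between the
    predicted distribution and the one-hot encoding of the given answer.
    0 means the answer was fully expected, 2 is the worst possible value."""
    return sum(
        (p - (1.0 if k == true_index else 0.0)) ** 2
        for k, p in enumerate(probs)
    )


def inconsistency_scores(
    predictions: List[List[Sequence[float]]],
    answers: List[List[int]],
) -> List[float]:
    """predictions[q][i] is the predicted distribution for respondent i on
    question q (assumed to come from cross-validation); answers[q][i] is the
    index of the answer actually given. Returns the mean Brier score per
    respondent. Averaging is an assumption made for this sketch."""
    n_respondents = len(answers[0])
    totals = [0.0] * n_respondents
    for q_probs, q_answers in zip(predictions, answers):
        for i in range(n_respondents):
            totals[i] += brier_score(q_probs[i], q_answers[i])
    return [t / len(answers) for t in totals]


def rank_descending(scores: Sequence[float]) -> List[int]:
    """Rank 1 goes to the highest (most inconsistent) score."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks


# Toy data: 2 questions, 3 respondents, 2 possible answers per question.
predictions = [
    [[0.8, 0.2], [0.7, 0.3], [0.9, 0.1]],  # question 0
    [[0.1, 0.9], [0.2, 0.8], [0.2, 0.8]],  # question 1
]
answers = [
    [0, 0, 1],  # respondent 2 gives the unlikely answer
    [1, 1, 0],  # and does so again
]

scores = inconsistency_scores(predictions, answers)
ranks = rank_descending(scores)
print(scores, ranks)
```

In this toy example the third respondent answers against the predicted distribution on both questions and therefore receives the highest score and rank 1. The ranking variants named in the abstract would rank respondents on each measure in a similar way before merging.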

    Item Type: Thesis (EngD thesis)
    Keywords: survey, machine learning, Brier score, information score, rank, probabilities, R, CORElearn, MASS, rpart, Orange, Monk, B2B, B2C, DPS, Hearing aid, random forest
    Number of Pages: 56
    Language of Content: Slovenian
    Mentor / Comentors:
    Name and Surname                  ID     Function
    prof. dr. Marko Robnik Šikonja    276    Mentor
    Link to COBISS: http://www.cobiss.si/scripts/cobiss?command=search&base=50070&select=(ID=00008632660)
    Institution: University of Ljubljana
    Department: Faculty of Computer and Information Science
    Item ID: 1506
    Date Deposited: 16 Sep 2011 11:20
    Last Modified: 26 Sep 2011 20:07
    URI: http://eprints.fri.uni-lj.si/id/eprint/1506
