ePrints.FRI - University of Ljubljana, Faculty of Computer and Information Science

Directed generation of synthetic examples in machine learning based on the classification reliability estimates

Domen Košir (2010) Directed generation of synthetic examples in machine learning based on the classification reliability estimates. EngD thesis.

    Abstract

    The field of machine learning has made considerable progress in recent years. As it is applied more frequently to real-world problems, a new issue has emerged: studies have shown that imbalanced data can lead to poor performance by some classifiers. Imbalanced datasets are composed of many "normal" examples and few "interesting" ones (which correspond to the observed real-world phenomenon). Typical examples are credit card fraud detection, detection and diagnosis of diseases in tissue samples, and detection of suspicious behaviour in surveillance camera videos. The imbalance can be natural, or it can arise for economic or privacy reasons. When presented with highly imbalanced data, some standard classifiers can ignore the minority class, which leads to lower classification accuracy. Various solutions have been proposed to counter this problem. Some modify the classification algorithms, while others modify the data itself. In this thesis, we focus on the latter. Random undersampling, random oversampling and SMOTE (Synthetic Minority Oversampling TEchnique) have been implemented and tested. In addition, three new variations of the SMOTE algorithm are proposed. All three estimate the classification reliability (Kukar et al.) of minority examples and then use these estimates while generating synthetic examples. The data balancing algorithms were tested with 10-fold cross-validation using 10 datasets from the UCI Machine Learning Repository and four different classifiers (decision trees, naive Bayes, k-nearest neighbors and support vector machines). The results show that it is feasible to improve classifiers' performance by balancing the data with one of our versions of the SMOTE algorithm. The most significant improvements in classification accuracy were observed when we balanced small datasets with low shares of minority examples.
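
    As an illustration of the data-level approach described in the abstract, the following is a minimal sketch of basic SMOTE-style oversampling: a synthetic example is placed at a random point on the line segment between a minority example and one of its k nearest minority neighbours. The function name `smote` and the optional `weights` argument are assumptions made for this sketch. The weights are only a hypothetical stand-in for per-example scores such as reliability estimates; the abstract does not specify how the three proposed variants use the estimates, so this code does not reproduce them.

```python
import numpy as np


def smote(minority, n_synthetic, k=5, weights=None, seed=0):
    """Basic SMOTE-style interpolation between minority examples.

    `weights` is a hypothetical extension (not taken from the thesis):
    non-negative per-example scores that bias which examples are used
    as seeds. With weights=None every example is equally likely.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(minority, dtype=float)
    n = len(X)
    if n < 2:
        raise ValueError("need at least two minority examples")
    k = min(k, n - 1)

    # Pairwise Euclidean distances; an example is not its own neighbour.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Probability of choosing each minority example as a seed.
    if weights is None:
        p = np.full(n, 1.0 / n)
    else:
        w = np.asarray(weights, dtype=float)
        p = w / w.sum()

    seeds = rng.choice(n, size=n_synthetic, p=p)
    picks = neighbours[seeds, rng.integers(0, k, size=n_synthetic)]

    # Interpolate: seed + gap * (neighbour - seed), with gap in [0, 1).
    gaps = rng.random((n_synthetic, 1))
    return X[seeds] + gaps * (X[picks] - X[seeds])
```

    Calling `smote(X_minority, n_synthetic=100)` returns 100 new rows that can be appended to the training set with the minority label. Because every synthetic point lies between two existing minority examples, the result stays inside the region the minority class already occupies.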

    Item Type: Thesis (EngD thesis)
    Keywords: machine learning, imbalanced data, imbalanced datasets, random undersampling, random oversampling, SMOTE
    Number of Pages: 68
    Language of Content: Slovenian
    Mentor / Comentors:
    Name and Surname        ID      Function
    doc. dr. Zoran Bosnić   3826    Mentor
    Link to COBISS: http://www.cobiss.si/scripts/cobiss?command=search&base=50070&select=(ID=00007965268)
    Institution: University of Ljubljana
    Department: Faculty of Computer and Information Science
    Item ID: 1155
    Date Deposited: 08 Sep 2010 16:44
    Last Modified: 13 Aug 2011 00:37
    URI: http://eprints.fri.uni-lj.si/id/eprint/1155
