Directed generation of synthetic examples in machine learning based on the classification reliability estimates

Domen Košir (2010) Directed generation of synthetic examples in machine learning based on the classification reliability estimates. EngD thesis.

Abstract

The field of machine learning has made considerable progress in recent years. As it is applied more frequently to real-world problems, a new issue has emerged: studies have shown that imbalanced data can lead to poor performance by some classifiers. Imbalanced datasets are composed of many "normal" examples and few "interesting" ones (which correspond to the observed real-world phenomenon). Typical examples are credit card fraud detection, detection and diagnosis of diseases in tissue samples, and detection of suspicious behaviour in surveillance camera videos. The imbalance can be "natural", or it can arise for economic or privacy reasons. When presented with highly imbalanced data, some standard classifiers can ignore the minority class, which leads to lower classification accuracy. Various solutions have been proposed to counter this problem. Some modify the classification algorithms, while others modify the data itself. In this thesis, we focus on the latter. Random undersampling, random oversampling and SMOTE (Synthetic Minority Oversampling TEchnique) have been implemented and tested. In addition, three new variations of the SMOTE algorithm are proposed in this thesis. All three estimate the classification reliability (Kukar et al.) of minority examples and then use these estimates when generating synthetic examples. The data balancing algorithms were tested with 10-fold cross-validation on 10 datasets from the UCI Machine Learning Repository and four different classifiers (decision trees, naive Bayes, k-nearest neighbors and support vector machines). The results show that it is feasible to improve classifier performance by balancing the data with one of our versions of the SMOTE algorithm. The most significant improvements in classification accuracy were observed when we balanced small datasets with low shares of minority examples.
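
For readers unfamiliar with SMOTE, the following is a minimal sketch of the basic interpolation idea only: each synthetic example is placed at a random point on the line segment between a minority example and one of its k nearest minority neighbours. It is not the thesis implementation, and it does not reproduce the three reliability-directed variations, whose details the abstract does not give. The function name `smote`, its parameters and the toy data are assumptions made for illustration.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Illustrative basic SMOTE: interpolate between minority examples
    and their nearest minority neighbours (hypothetical helper)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority examples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other points
    synthetic = []
    for _ in range(n_synthetic):
        # Pick a random minority example as the base point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority examples by Euclidean distance to the base.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        # Choose one of the k nearest neighbours at random.
        neighbour = minority[rng.choice(others[:k])]
        # Place the new example at a random position between base and neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class with two numeric attributes (made-up values).
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote(minority, n_synthetic=6, k=2)
```

Random oversampling, by contrast, would simply duplicate existing minority examples, and random undersampling would discard majority examples. A reliability-directed variant would presumably change how base examples are chosen or how many synthetic examples each one receives, but that is a guess and not something the abstract states.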

Item Type:

Thesis (EngD thesis)

Keywords:

machine learning, imbalanced data, imbalanced datasets, random undersampling, random oversampling, SMOTE

Number of Pages:

Language of Content:

Slovenian

Mentor / Comentors:

Name and Surname	ID	Function
doc. dr. Zoran Bosnić	3826	Mentor

Link to COBISS:

http://www.cobiss.si/scripts/cobiss?command=search&base=50070&select=(ID=00007965268)

Institution:

University of Ljubljana

Department:

Faculty of Computer and Information Science

Item ID:

1155

Date Deposited:

08 Sep 2010 16:44

Last Modified:

13 Aug 2011 00:37

URI:

http://eprints.fri.uni-lj.si/id/eprint/1155
