Domen Košir (2010) Directed generation of synthetic examples in machine learning based on the classification reliability estimates. EngD thesis.
Abstract
The field of machine learning has made a lot of progress in recent years. As it is used more frequently in real-world problems, a new issue has emerged: studies have shown that imbalanced data can lead to poor performance by some classifiers. Imbalanced datasets are composed of many "normal" examples and few "interesting" ones (which correspond to the observed real-world phenomenon). Typical examples are credit card fraud detection, detection and diagnosis of diseases in tissue samples, and detection of suspicious behaviour in surveillance camera videos. The imbalance can be "natural", or it can arise for economic or privacy reasons. When presented with highly imbalanced data, some standard classifiers can ignore the minority class, which leads to lower classification accuracy. Various solutions have been proposed to counter this problem. Some modify the classification algorithms, while others modify the data itself. In this thesis, we focus on the latter. Random undersampling, random oversampling and SMOTE (Synthetic Minority Oversampling TEchnique) have been implemented and tested. In addition, three new variations of the SMOTE algorithm are proposed. All three estimate the classification reliability (Kukar et al.) of minority examples and then use these estimates when generating synthetic examples. The data balancing algorithms were tested with 10-fold cross-validation on 10 datasets from the UCI Machine Learning Repository, using four different classifiers (decision trees, naive Bayes, k-nearest neighbors and support vector machines). The results show that it is feasible to improve classifiers' performance by balancing the data with one of our versions of the SMOTE algorithm. The most significant improvements in classification accuracy were observed when we balanced small datasets with low shares of minority examples.
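For readers unfamiliar with SMOTE, the following is a minimal sketch of the standard interpolation step only: each synthetic example is placed at a random point on the segment between a minority example and one of its k nearest minority neighbours. It is not the thesis implementation, and it does not reproduce the three reliability-based variations, whose details the abstract does not give. The function name `smote` and the parameters `n_synthetic`, `k` and `seed` are assumptions chosen for illustration.

```python
import numpy as np

def smote(minority, n_synthetic, k=5, seed=0):
    """Sketch of plain SMOTE: interpolate between minority examples
    and their k nearest minority neighbours (Euclidean distance)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(minority, dtype=float)
    n = len(X)
    if n < 2:
        raise ValueError("need at least two minority examples")
    k = min(k, n - 1)  # cannot have more neighbours than other examples

    # Pairwise distances; a point is never its own neighbour.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)
    neighbours = np.argsort(dist, axis=1)[:, :k]

    # Pick a base example and one of its neighbours for each new point.
    base = rng.integers(0, n, size=n_synthetic)
    pick = rng.integers(0, k, size=n_synthetic)
    chosen = neighbours[base, pick]

    # Random position in [0, 1) along the segment base -> neighbour.
    gap = rng.random((n_synthetic, 1))
    return X[base] + gap * (X[chosen] - X[base])
```

Calling `smote(X_min, 100)` on an array of minority examples returns 100 synthetic rows that can be appended to the training set with the minority label. Drawing the base examples uniformly, as here, is where a reliability-guided variant would presumably differ, but that is an assumption rather than something stated in the abstract.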