Anže Starič (2010) Machine learning techniques for UCSD Data Mining Contest. EngD thesis.
Participating in machine learning competitions acquaints us with new problem domains and new types of problems. It forces us to look for and try out new techniques and to search for innovative problem-solving approaches. In the UCSD Data Mining Contest, the task was to rank a pool of consumers according to how likely each one is to become a customer of the retailer. In this thesis we develop a technique for predicting the probability that a consumer becomes a customer of the retailer. Standard machine learning algorithms were evaluated and attribute analysis was performed on the training dataset. To improve on the scores of the standard algorithms, we also reviewed methods that augment Naive Bayes for ranking and implemented the most promising one using the Orange framework. We also assessed the impact of data discretization on Naive Bayes and evaluated ensemble techniques that combine Naive Bayes classifiers. The results show that ranking potential customers is indeed a hard task for standard machine learning algorithms. Augmented Naive Bayes performed slightly better in terms of AUC (area under the ROC curve), but the best results were obtained by combining data discretization with the standard Naive Bayes classifier. The AUC scores achieved were relatively low compared with scores achieved on other machine learning problems. This suggests that more attributes should be added to the dataset before the method is used in a production environment.
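As a rough illustration of the best-performing combination named in the abstract (discretization followed by standard Naive Bayes, scored by AUC), the sketch below uses plain Python. It is not the thesis implementation and does not use the Orange API. Equal-frequency binning, Laplace smoothing with `alpha=1.0`, the names `equal_frequency_cuts`, `DiscreteNaiveBayes`, and `auc`, and the toy data are all assumptions made for the example; the abstract does not say which discretization method was used.

```python
import math
from collections import defaultdict


def equal_frequency_cuts(values, n_bins):
    """Cut points that split `values` into roughly equal-sized bins (assumed method)."""
    ordered = sorted(values)
    cuts = {ordered[k * len(ordered) // n_bins] for k in range(1, n_bins)}
    return sorted(cuts)


def to_bin(value, cuts):
    """Index of the bin that `value` falls into, given sorted cut points."""
    return sum(value >= c for c in cuts)


class DiscreteNaiveBayes:
    """Binary Naive Bayes over discretized attributes, used only to rank examples."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing constant (assumed value)

    def fit(self, rows, labels, n_bins=4):
        n_features = len(rows[0])
        self.cuts = [
            equal_frequency_cuts([row[j] for row in rows], n_bins)
            for j in range(n_features)
        ]
        self.class_counts = {0: 0, 1: 0}
        self.bin_counts = defaultdict(float)  # (class, feature, bin) -> count
        for row, y in zip(rows, labels):
            self.class_counts[y] += 1
            for j, value in enumerate(row):
                self.bin_counts[(y, j, to_bin(value, self.cuts[j]))] += 1
        return self

    def score(self, row):
        """Smoothed log-odds of class 1; a higher score means an earlier rank."""
        a = self.alpha
        n1, n0 = self.class_counts[1], self.class_counts[0]
        log_odds = math.log((n1 + a) / (n0 + a))
        for j, value in enumerate(row):
            b = to_bin(value, self.cuts[j])
            k = len(self.cuts[j]) + 1  # number of bins for this attribute
            p1 = (self.bin_counts[(1, j, b)] + a) / (n1 + a * k)
            p0 = (self.bin_counts[(0, j, b)] + a) / (n0 + a * k)
            log_odds += math.log(p1 / p0)
        return log_odds


def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data invented for this example; it is not contest data.
rows = [[1.0], [2.0], [3.0], [4.0], [10.0], [11.0], [12.0], [13.0]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

model = DiscreteNaiveBayes().fit(rows, labels, n_bins=2)
scores = [model.score(row) for row in rows]
print("training AUC:", auc(scores, labels))
```

Because AUC depends only on the ordering of the scores, the model returns log-odds rather than calibrated probabilities. In practice the AUC would be computed on held-out data, not on the training rows as in this toy run.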