Matej Pičulin (2011) Clustering-based discretization of numeric attributes. EngD thesis.
Abstract
We propose a new discretization method that uses clustering to determine candidate boundaries. We use two well-known clustering methods: k-means clustering and hierarchical clustering. Discretization is a well-known and difficult problem in machine learning and data mining, especially for strongly dependent attributes. Most existing methods do not take dependencies into account, so we develop an algorithm that finds dependencies implicitly with the help of clustering. First we present some known discretization methods and the classification algorithms that we use in the thesis. We then present the idea of clustering-based discretization and try to answer the following questions: which clustering method to use, how many clusters are needed, how clusters vote for boundaries, and how to choose the final boundaries from the candidates. We extensively test the approach on artificial domains with strong dependencies and on real domains. We test several variations of clustering-based discretization and show that the methods can solve some cases with strongly dependent attributes. Finally, we suggest possible improvements and extensions of the work.
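The abstract does not specify how clusters are turned into candidate boundaries, so the following is only a minimal illustrative sketch of the general idea, not the thesis's algorithm. It assumes plain k-means run on all attributes jointly (so that dependencies between attributes can shape the clusters) and an assumed rule in which each pair of neighbouring cluster centroids proposes the midpoint between them as a candidate boundary on the chosen attribute. The function names `kmeans` and `candidate_boundaries` are hypothetical.

```python
import random


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k distinct points as initial centroids
    for _ in range(iters):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            groups[nearest].append(p)
        # Recompute centroids; keep the old centroid if a cluster is empty.
        new_centroids = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids


def candidate_boundaries(centroids, attr):
    """Assumed rule: midpoints between adjacent centroid values on one attribute."""
    values = sorted({c[attr] for c in centroids})
    return [(lo + hi) / 2 for lo, hi in zip(values, values[1:])]


# Two well-separated groups in a two-attribute space (made-up toy data).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids = kmeans(points, k=2)
print(candidate_boundaries(centroids, attr=0))
```

Because the clustering is done in the full attribute space, the proposed cut on attribute 0 reflects where the data separate when both attributes are considered together. Choosing the number of clusters and selecting final boundaries from the candidates, both of which the abstract names as open questions, are left out of this sketch.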