Clustering-based discretization of numeric attributes

Matej Pičulin (2011) Clustering-based discretization of numeric attributes. EngD thesis.

Abstract

We propose a new discretization method that uses clustering to determine candidate boundaries. We use two well-known clustering methods: k-means clustering and hierarchical clustering. Discretization is a well-known and difficult problem in machine learning and data mining, especially for strongly dependent attributes. Most existing methods do not take dependencies between attributes into account, so we develop an algorithm that finds dependencies implicitly with the help of clustering. We first present some known discretization methods and the classification algorithms used in this work. We then present the idea of clustering-based discretization and try to answer the following questions: which clustering method to use, how many clusters are needed, how clusters vote for boundaries, and how to choose the final boundaries from the candidates. We extensively test the approach on artificial domains with strong dependencies and on real domains. We test several variations of clustering-based discretization and show that the methods can solve some cases with strongly dependent attributes. Finally, we suggest possible improvements and extensions of the work.
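The general idea of deriving boundaries from clusters can be illustrated with a minimal sketch. It is not the algorithm from the thesis: the voting rule (one vote for the midpoint wherever neighbouring values on an attribute belong to different clusters), the deterministic k-means initialization, and all function names (`kmeans`, `candidate_boundaries`, `discretize`) are assumptions made here for illustration only.

```python
from collections import Counter

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means. Assumption: the first k points are the
    initial centers, which keeps the sketch deterministic."""
    centers = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest center (squared Euclidean distance).
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Move each center to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

def candidate_boundaries(points, labels, attr):
    """Hypothetical voting rule: sort instances on one attribute and vote for
    the midpoint wherever two neighbouring values lie in different clusters."""
    order = sorted(range(len(points)), key=lambda i: points[i][attr])
    votes = Counter()
    for i, j in zip(order, order[1:]):
        lo, hi = points[i][attr], points[j][attr]
        if labels[i] != labels[j] and lo < hi:
            votes[(lo + hi) / 2] += 1
    return votes

def discretize(points, k, attr, n_bounds):
    """Cluster in the full attribute space, then keep the n_bounds
    most-voted candidate boundaries for the chosen attribute."""
    labels = kmeans(points, k)
    votes = candidate_boundaries(points, labels, attr)
    return sorted(b for b, _ in votes.most_common(n_bounds))

# Two well-separated groups in a two-attribute space.
points = [(0.0, 0.0), (10.0, 10.0), (0.5, 0.2),
          (1.0, 0.1), (10.5, 9.8), (11.0, 10.1)]
print(discretize(points, k=2, attr=0, n_bounds=1))  # [5.5]
```

Because the clusters are formed in the full attribute space, a boundary on one attribute can reflect structure that is only visible jointly with the other attributes, which is the sense in which dependencies are handled implicitly.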

Item Type:

Thesis (EngD thesis)

Keywords:

discretization, clustering, machine learning, numeric attributes

Number of Pages:

Language of Content:

Slovenian

Mentor / Comentors:

Name and Surname	ID	Function
prof. dr. Marko Robnik Šikonja	276	Mentor

Link to COBISS:

http://www.cobiss.si/scripts/cobiss?command=search&base=50070&select=(ID=00008291924)

Institution:

University of Ljubljana

Department:

Faculty of Computer and Information Science

Item ID:

1308

Date Deposited:

21 Mar 2011 12:18

Last Modified:

13 Aug 2011 00:38

URI:

http://eprints.fri.uni-lj.si/id/eprint/1308
