Nejc Ilc (2009) Comparison of the methods for pattern clustering. EngD thesis.
Abstract
Clustering, or cluster analysis, is a fundamental machine learning task that is, unfortunately, an ill-posed problem because of the large diversity of problem domains. Many different approaches have been used to solve it, which has resulted in a long list of clustering methods. Moreover, it is hard to determine whether one clustering of particular data is better than another, because there is no universal similarity metric that would be the most appropriate for all problems. In this thesis, four selected clustering methods are examined, each of which has interesting features of its own: KMC, ECMC, EM GMM and CSC. In addition, new criteria for evaluating the correctness of a clustering keep appearing, and these themselves need to be compared with one another. My intention was to carry out a comprehensive analysis of the chosen methods and to objectively evaluate their clustering results on individual typical problem domains. To achieve this, four internal and six external evaluation criteria, or indices, were used, and the final evaluation of the effectiveness of the methods is based on them. Several synthetic and real data sets were selected for clustering to reflect the typical problems in this field. The final results of the comparison show that applying knowledge from information theory, which the novel CSC method exploits, contributes to a better outcome, depending on the selected criteria and data sets. This also opens up considerable potential for further improvement of the method and motivates the use of alternative approaches to the clustering problem.