CLUSTERING-BASED ANALYSIS OF TEXT SIMILARITY

Borja Bovcon (2013) CLUSTERING-BASED ANALYSIS OF TEXT SIMILARITY. EngD thesis.

Abstract

This thesis compares approaches to analysing text-document similarity with clustering algorithms. We begin by defining the main problem and then describe the two most widely used text-document representation techniques, together with word-filtering methods and their importance, Porter's stemming algorithm, and the tf-idf term-weighting scheme. We then apply the described algorithms to selected data sets that vary in size and compactness. Following this, we group the documents into clusters using clustering algorithms and similarity/distance measures. In the final chapter we evaluate the obtained clusters and analyse the evaluation results. In conclusion, we select the best combination of the described methods for determining text-document similarity.

Item Type:

Thesis (EngD thesis)

Keywords:

clustering, text, documents, analysis, kmeans, kmeans++, kmedoids++, distance measures, similarity, dissimilarity, comparison

Number of Pages:

Language of Content:

Slovenian

Mentor / Comentors:

Name and Surname	ID	Function
Zoran Bosnić	3826	Mentor

Link to COBISS:

http://www.cobiss.si/scripts/cobiss?command=search&base=50070&select=(ID=10144340)

Institution:

University of Ljubljana

Department:

Faculty of Computer and Information Science

Item ID:

2200

Date Deposited:

21 Sep 2013 16:58

Last Modified:

30 Sep 2013 10:19

URI:

http://eprints.fri.uni-lj.si/id/eprint/2200
