Borja Bovcon (2013) CLUSTERING-BASED ANALYSIS OF TEXT SIMILARITY. EngD thesis.
Abstract
This thesis compares methods for analysing text-document similarity using clustering algorithms. We begin by defining the main problem and then describe the two most widely used text-document representation techniques, covering word-filtering methods and their importance, Porter's stemming algorithm, and the tf-idf term-weighting scheme. We then apply the described algorithms to selected data sets that vary in size and compactness. Following this, we group the documents into clusters using clustering algorithms and similarity/distance measures. In the final chapter we evaluate the obtained clusters and analyse the evaluation results. In conclusion, we select the best combination of the described methods for determining text-document similarity.
Item Type: | Thesis (EngD thesis) |
Keywords: | clustering, text, documents, analysis, kmeans, kmeans++, kmedoids++, distance measures, similarity, dissimilarity, comparison |
Number of Pages: | 69 |
Language of Content: | Slovenian |
Mentor / Co-mentors: | Zoran Bosnić (ID: 3826, Function: Mentor) |
Link to COBISS: | http://www.cobiss.si/scripts/cobiss?command=search&base=50070&select=(ID=10144340) |
Institution: | University of Ljubljana |
Department: | Faculty of Computer and Information Science |
Item ID: | 2200 |
Date Deposited: | 21 Sep 2013 16:58 |
Last Modified: | 30 Sep 2013 10:19 |
URI: | http://eprints.fri.uni-lj.si/id/eprint/2200 |