Borja Bovcon (2013) CLUSTERING-BASED ANALYSIS OF TEXT SIMILARITY. EngD thesis.
Abstract
This thesis compares approaches to analysing text-document similarity using clustering algorithms. We begin by defining the main problem and then describe the two most widely used text-document representation techniques, presenting word-filtering methods and their importance, Porter's algorithm, and the tf-idf term-weighting algorithm. We then apply all of the previously described algorithms to selected data sets, which vary in size and compactness. Following this, we assign documents to clusters using clustering algorithms and similarity/distance measures. In the final chapter we evaluate the obtained clusters and analyse the results of the evaluation. In conclusion, we select the best combination of the described methods for determining text-document similarity.
| Item Type: | Thesis (EngD thesis) |
| Keywords: | clustering, text, documents, analysis, kmeans, kmeans++, kmedoids++, distance measures, similarity, dissimilarity, comparison |
| Number of Pages: | 69 |
| Language of Content: | Slovenian |
| Mentor / Comentors: | Zoran Bosnić (ID 3826, Mentor) |
| Link to COBISS: | http://www.cobiss.si/scripts/cobiss?command=search&base=50070&select=(ID=10144340) |
| Institution: | University of Ljubljana |
| Department: | Faculty of Computer and Information Science |
| Item ID: | 2200 |
| Date Deposited: | 21 Sep 2013 16:58 |
| Last Modified: | 30 Sep 2013 10:19 |
| URI: | http://eprints.fri.uni-lj.si/id/eprint/2200 |