ePrints.FRI - University of Ljubljana, Faculty of Computer and Information Science

CLUSTERING-BASED ANALYSIS OF TEXT SIMILARITY

Borja Bovcon (2013) CLUSTERING-BASED ANALYSIS OF TEXT SIMILARITY. EngD thesis.


    Abstract

    This thesis compares approaches to analysing text-document similarity using clustering algorithms. We begin by defining the main problem and then describe the two most widely used text-document representation techniques, covering word-filtering methods and their importance, Porter's stemming algorithm, and the tf-idf term-weighting algorithm. We then apply these algorithms to selected data sets that vary in size and compactness. Following this, we group the documents into clusters using clustering algorithms and similarity/distance measures. In the final chapter we evaluate the obtained clusters and analyse the results of the evaluation. In conclusion, we select the best combination of the described methods for determining text-document similarity.
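
    As a rough illustration of the tf-idf weighting and similarity measures named in the abstract, the following is a minimal sketch and not the thesis's own implementation. It assumes one common tf-idf variant (term frequency normalised by document length, idf = ln(N / df)) and cosine similarity as the similarity measure. The function names and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy, made-up documents (already lower-cased and tokenized).
docs = [
    "clustering groups similar documents".split(),
    "clustering groups similar texts".split(),
    "porter stemming reduces words".split(),
]
vectors = tf_idf_vectors(docs)
sim_close = cosine_similarity(vectors[0], vectors[1])
sim_far = cosine_similarity(vectors[0], vectors[2])
```

    In this sketch the first two documents share three terms and so receive a positive similarity, while the first and third share none and score zero. A clustering algorithm such as k-means would then operate on vectors of this kind.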

    Item Type: Thesis (EngD thesis)
    Keywords: clustering, text, documents, analysis, kmeans, kmeans++, kmedoids++, distance measures, similarity, dissimilarity, comparison
    Number of Pages: 69
    Language of Content: Slovenian
    Mentor / Comentors:
    Name and Surname    ID      Function
    Zoran Bosnić        3826    Mentor
    Link to COBISS: http://www.cobiss.si/scripts/cobiss?command=search&base=50070&select=(ID=10144340)
    Institution: University of Ljubljana
    Department: Faculty of Computer and Information Science
    Item ID: 2200
    Date Deposited: 21 Sep 2013 16:58
    Last Modified: 30 Sep 2013 10:19
    URI: http://eprints.fri.uni-lj.si/id/eprint/2200
