ePrints.FRI - University of Ljubljana, Faculty of Computer and Information Science

TEXT MINING TOOLS FOR SLOVENE LANGUAGE

Maks Horvat (2013) TEXT MINING TOOLS FOR SLOVENE LANGUAGE. EngD thesis.

    Abstract

    This thesis presents several tools for processing Slovenian text and adapts them for use with the NLTK library. Part-of-speech tags are assigned automatically with NLTK algorithms: from the Gigafida corpus we build n-gram, Brill, naive Bayes, maximum entropy, and hidden Markov model taggers, and we measure their tagging accuracy and time complexity. We also integrate the Obeliks program for lemmatization and part-of-speech tagging. For parsing and named-entity recognition we use the dependencyParser and SLNER tools. Finally, we develop and test an information retrieval module based on an inverted index, search with Boolean operators, vector representation of documents, and cosine similarity.
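
The information retrieval components named in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: it uses only the Python standard library, plain whitespace tokenization, and raw term counts as the vector representation, and every function name and sample sentence below is a hypothetical chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase and split on whitespace. A real pipeline for
    # Slovenian would lemmatize first, since inflected forms differ.
    return text.lower().split()


def build_inverted_index(docs):
    # Map each term to the set of ids of the documents that contain it.
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def boolean_and(index, terms):
    # Ids of documents that contain every query term.
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


def boolean_or(index, terms):
    # Ids of documents that contain at least one query term.
    return set().union(*(index.get(term, set()) for term in terms))


def tf_vector(text):
    # Sparse vector representation: term -> raw count.
    return Counter(tokenize(text))


def cosine_similarity(a, b):
    # Cosine of the angle between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(docs, query):
    # Document ids ordered by decreasing similarity to the query;
    # documents with zero similarity are dropped.
    q = tf_vector(query)
    scored = [(cosine_similarity(q, tf_vector(d)), i) for i, d in enumerate(docs)]
    return [i for score, i in sorted(scored, key=lambda p: (-p[0], p[1])) if score > 0]


# Hypothetical sample documents, not taken from the thesis or from Gigafida.
docs = [
    "ljubljana je glavno mesto slovenije",
    "maribor je mesto ob dravi",
    "drava je reka",
]
index = build_inverted_index(docs)
print(boolean_and(index, ["je", "mesto"]))
print(boolean_or(index, ["reka", "ljubljana"]))
print(rank(docs, "mesto ob dravi"))
```

In this sketch the query form "dravi" does not match the third document, which contains "drava". That mismatch is the kind of case where lemmatization, such as the one Obeliks provides, would be expected to help retrieval.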

    Item Type: Thesis (EngD thesis)
    Keywords: text mining, natural language processing, Slovenian language, language tools
    Number of Pages: 56
    Language of Content: Slovenian
    Mentor / Comentors:
    Name and Surname | ID | Function
    izr. prof. dr. Marko Robnik Šikonja | 276 | Mentor
    Link to COBISS: http://www.cobiss.si/scripts/cobiss?command=search&base=50070&select=(ID=9903956)
    Institution: University of Ljubljana
    Department: Faculty of Computer and Information Science
    Item ID: 2040
    Date Deposited: 17 May 2013 09:27
    Last Modified: 11 Jun 2013 11:11
    URI: http://eprints.fri.uni-lj.si/id/eprint/2040
