ePrints.FRI - University of Ljubljana, Faculty of Computer and Information Science

Statistical analysis of Slovene language corpuses

Aleksander Ključevšek (2016) Statistical analysis of Slovene language corpuses. EngD thesis.


    Natural language processing is an important area of computational linguistics and artificial intelligence. Most of its existing applications are developed for, and based on, English texts. We developed an application for the statistical analysis of large text corpora that takes into account the characteristics of Slovene as a highly inflected language. Since modern text corpora consist of several billion words, we paid special attention to efficient parallel algorithms capable of processing such collections in a relatively short time. We analyzed the Gigafida corpus, which consists of 1.2 billion words, on multiple levels: string level, word level, n-gram level, prefix and suffix level, and the level of Slovene word formation processes.
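As a rough illustration of the kind of parallel, multi-level counting the abstract describes, the sketch below counts word n-grams and word-final suffixes over independent chunks of text and then merges the partial counts. It is not the thesis's implementation: the function names `count_chunk` and `count_corpus`, the whitespace tokenization, the fixed suffix length, and the chunking of the corpus into lists of lines are all assumptions made for this example.

```python
from collections import Counter
from concurrent.futures import Executor, ProcessPoolExecutor
from functools import partial
from typing import Iterable, Optional


def count_chunk(lines: list[str], n: int = 2, suffix_len: int = 2) -> tuple[Counter, Counter]:
    """Count word n-grams and word-final suffixes in one chunk of lines."""
    ngrams: Counter = Counter()
    suffixes: Counter = Counter()
    for line in lines:
        # Assumption: naive lowercasing and whitespace tokenization.
        words = line.lower().split()
        for i in range(len(words) - n + 1):
            ngrams[tuple(words[i:i + n])] += 1
        for word in words:
            # Only count suffixes of words longer than the suffix itself.
            if len(word) > suffix_len:
                suffixes[word[-suffix_len:]] += 1
    return ngrams, suffixes


def count_corpus(
    chunks: Iterable[list[str]],
    n: int = 2,
    suffix_len: int = 2,
    executor: Optional[Executor] = None,
) -> tuple[Counter, Counter]:
    """Map count_chunk over chunks in parallel, then merge the partial counts."""
    worker = partial(count_chunk, n=n, suffix_len=suffix_len)
    total_ngrams: Counter = Counter()
    total_suffixes: Counter = Counter()
    own_pool = executor is None
    pool = ProcessPoolExecutor() if own_pool else executor
    try:
        for ngrams, suffixes in pool.map(worker, chunks):
            # Counter.update adds counts rather than replacing them.
            total_ngrams.update(ngrams)
            total_suffixes.update(suffixes)
    finally:
        if own_pool:
            pool.shutdown()
    return total_ngrams, total_suffixes
```

Because each chunk is counted independently, the work splits cleanly across processes and only the comparatively small counters are merged at the end. When the default process pool is used on platforms that spawn worker processes, the call to `count_corpus` should sit under an `if __name__ == "__main__":` guard.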

    Item Type: Thesis (EngD thesis)
    Keywords: statistical language analysis, text corpus, Gigafida, parallel algorithms
    Number of Pages: 45
    Language of Content: Slovenian
    Mentor / Comentors:
    Name and Surname | ID | Function
    izr. prof. dr. Marko Robnik Šikonja | 276 | Mentor
    doc. dr. Simon Krek | | Comentor
    Link to COBISS: http://www.cobiss.si/scripts/cobiss?command=search&base=51012&select=(ID=1537132483)
    Institution: University of Ljubljana
    Department: Faculty of Computer and Information Science
    Item ID: 3570
    Date Deposited: 13 Sep 2016 14:01
    Last Modified: 22 Sep 2016 11:20
    URI: http://eprints.fri.uni-lj.si/id/eprint/3570
