ePrints.FRI - University of Ljubljana, Faculty of Computer and Information Science

Matjaž Juršič (2007). Prešeren awards for students.

    Abstract

    Lemmatization is the process of determining the canonical form of a word, called the lemma, from its inflectional variants. We have developed a language-independent system, LemmaGen, consisting of a set of tools for automatically learning lemmatizers from lexicons of pre-lemmatized words. The system consists of three modules that can be used independently or sequentially. The input to the first module is a lexicon of lemmatized words, from which it learns Ripple Down Rules (RDR) that best describe word lemmatization. The next module takes these rules, which are in the form of RDR trees, and produces an efficient structure for fast lemmatization: the actual lemmatizer. In the last step, we use the lemmatizer to transform the original input text into a set of lemmatized words. LemmaGen was applied to 14 different Multext and Multext-East lexicons and produced efficient lemmatizers for the corresponding languages. Its evaluation on the 14 lexicons shows that LemmaGen considerably outperforms the lemmatizers generated by the previously developed RDR learning algorithm, both in terms of accuracy and efficiency. We also used lemmatization as a step in the analysis of a corpus of press-agency news and show improved result interpretation, achieved by using LemmaGen in news preprocessing.
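
    The three-module pipeline described above can be illustrated with a small Python sketch. This is not LemmaGen's implementation or its RDR learning algorithm. It is a simplified stand-in that assumes rules are plain suffix transformations, picks the most frequent transformation per suffix, and lets the longest matching suffix act as the exception to shorter, more general ones. All function names and the toy English lexicon are hypothetical.

```python
from collections import Counter, defaultdict


def transformation(word, lemma):
    """Return (suffix_to_strip, suffix_to_add) that turns word into lemma."""
    i = 0
    while i < min(len(word), len(lemma)) and word[i] == lemma[i]:
        i += 1
    return word[i:], lemma[i:]


def learn_rules(lexicon):
    """Module 1 (sketch): learn one transformation per word-final suffix.

    Assumption: each suffix keeps its most frequent transformation. Only
    suffixes at least as long as the stripped part are considered, so every
    learned rule can be applied to any word that matches its suffix.
    """
    votes = defaultdict(Counter)
    for word, lemma in lexicon:
        strip, add = transformation(word, lemma)
        for k in range(len(strip), len(word) + 1):
            votes[word[len(word) - k:]][(strip, add)] += 1
    return {suffix: c.most_common(1)[0][0] for suffix, c in votes.items()}


def build_lemmatizer(rules):
    """Module 2 (sketch): wrap the rules in a fast lookup function.

    The longest matching suffix wins, which mimics an exception overriding
    a more general rule in a rule tree.
    """
    def lemmatize(word):
        for k in range(len(word), -1, -1):
            suffix = word[len(word) - k:]
            if suffix in rules:
                strip, add = rules[suffix]
                return word[:len(word) - len(strip)] + add
        return word  # no rule matched: leave the word unchanged
    return lemmatize


def lemmatize_text(text, lemmatize):
    """Module 3 (sketch): lemmatize every whitespace-separated token."""
    return [lemmatize(token) for token in text.lower().split()]


# Toy lexicon of (word, lemma) pairs; illustrative data only.
LEXICON = [
    ("walked", "walk"), ("talked", "talk"), ("jumped", "jump"),
    ("walking", "walk"), ("talking", "talk"), ("writing", "write"),
    ("cats", "cat"), ("dogs", "dog"),
]

lemmatize = build_lemmatizer(learn_rules(LEXICON))
print(lemmatize("asked"))    # ask
print(lemmatize("biting"))   # bite  (the "iting" rule overrides "ing")
print(lemmatize_text("Dogs walked", lemmatize))  # ['dog', 'walk']
```

    In this sketch the learned rules are kept in a flat dictionary. A tree-shaped representation, as the abstract describes for RDR, would instead store each more specific suffix as an exception under its more general parent.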

    Item Type: Thesis (Prešeren awards for students)
    Keywords: Lemmatization, RDR, Text preprocessing, Text mining, Knowledge discovery
    Number of Pages: 83
    Language of Content: Slovenian
    Mentor / Comentors:
    Name and Surname          ID     Function
    prof. dr. Blaž Zupan      106    Mentor
    prof. dr. Nada Lavrač            Comentor
    Link to COBISS: http://www.cobiss.si/scripts/cobiss?command=search&base=51012&select=(ID=6234964)
    Institution: University of Ljubljana
    Department: Faculty of Computer and Information Science
    Item ID: 3711
    Date Deposited: 03 Jan 2017 12:16
    Last Modified: 10 Feb 2017 09:24
    URI: http://eprints.fri.uni-lj.si/id/eprint/3711
