ePrints.FRI - University of Ljubljana, Faculty of Computer and Information Science

Domain specific adaptation of a statistical machine translation engine in Slovene language

Jože Kadivec (2016) Domain specific adaptation of a statistical machine translation engine in Slovene language. MSc thesis.


    Abstract

    Machine translation, especially statistical machine translation, has gained a lot of interest in recent years, mainly thanks to the growth of publicly available multilingual language resources. Most free machine translation systems give satisfactory results when the goal is a basic understanding of a text, but they are not accurate enough for domain-specific texts. For some other languages, research shows that translation quality improves when the system is trained on in-domain data. Such research has not yet been conducted for the Slovenian language, which motivates our work. A further motivation is the lack of a publicly available language model for the Slovenian language. This master's thesis focuses on adapting a statistical machine translation system to a specific domain for the Slovenian language. Various approaches to domain adaptation are described. We set up the Moses machine translation framework, then acquire and adapt existing general corpora for the Slovenian language as a basis for building a language model for comparison. The Slovenian corpus ccGigafida, in annotated and non-annotated form, is used to create a language model of the Slovenian language. For the pharmaceutical domain, existing English-Slovenian translations and other linguistic resources were found and adapted to serve as training data for the machine translation system. We evaluate the impact of the various linguistic resources on the quality of machine translation for the pharmaceutical domain. The evaluation is conducted automatically using the BLEU metric. In addition, some test translations are manually evaluated by experts and potential system users. The analysis shows that test translations produced with the domain model achieve better results than translations produced with the out-of-domain model. Surprisingly, the larger combined model does not achieve better results than the smaller domain model. The manual analysis of fluency and adequacy shows that translations with a high BLEU score can receive lower fluency or adequacy grades than translations with a lower BLEU score. The experiment in which a domain-specific dictionary is added to the in-domain translation model shows a gain of 1 BLEU point and ensures the use of the desired terminology.
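
    As a rough illustration of the automatic evaluation mentioned in the abstract, the following is a minimal sketch of a sentence-level BLEU computation with a single reference and no smoothing. It is not the scoring script used in the thesis; the function names `ngrams` and `bleu` and the example sentences are hypothetical, and the score is returned on a 0 to 1 scale (BLEU is commonly reported multiplied by 100).

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Simplified BLEU: one reference, whitespace tokens, no smoothing."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clipped matches: a candidate n-gram is credited at most as many
        # times as it occurs in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # without smoothing, any zero precision gives BLEU 0
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return bp * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    print(bleu("the cat sat on the mat", "the cat sat on a mat"))
```

    Corpus-level BLEU, as normally reported for machine translation systems, pools the n-gram counts over all test sentences before computing the precisions, so this sentence-level sketch is only meant to show the ingredients: clipped n-gram precision and the brevity penalty.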

    Item Type: Thesis (MSc thesis)
    Keywords: machine translation, statistical machine translation, statistical machine translation system adaptation for a specific domain, statistical machine translation system adaptation for the pharmaceutical domain, factor model, Moses phrase-based model, Cohen’s kappa, Fleiss’ kappa, inter-rater agreement
    Number of Pages: 118
    Language of Content: Slovenian
    Mentor / Comentors:
    Name and Surname | ID | Function
    izr. prof. dr. Marko Robnik Šikonja | 276 | Mentor
    Link to COBISS: http://www.cobiss.si/scripts/cobiss?command=search&base=51012&select=(ID=1537309379)
    Institution: University of Ljubljana
    Department: Faculty of Computer and Information Science
    Item ID: 3471
    Date Deposited: 31 Aug 2016 11:13
    Last Modified: 15 Dec 2016 10:26
    URI: http://eprints.fri.uni-lj.si/id/eprint/3471
