Domain specific adaptation of a statistical machine translation engine in Slovene language

Jože Kadivec (2016) Domain specific adaptation of a statistical machine translation engine in Slovene language. MSc thesis.

Abstract

Machine translation, and statistical machine translation in particular, has attracted considerable interest in recent years, mainly thanks to the growth of publicly available multilingual language resources. Most free machine translation systems give satisfactory results when the goal is a basic understanding of the text, but they are not accurate enough for domain-specific texts. For some other languages, research shows that translation quality improves when the system is trained on in-domain data. No such research has yet been conducted for Slovenian, which motivates our work. A further motivation is the lack of a publicly available language model for Slovenian. This master's thesis focuses on adapting a statistical machine translation system to a specific domain for the Slovenian language. We describe various approaches to domain adaptation. We set up the Moses machine translation framework, and acquire and adapt existing general corpora for Slovenian as a basis for building a comparative model. The ccGigafida corpus of Slovenian, in both annotated and non-annotated form, is used to build a language model of Slovenian. For the pharmaceutical domain, we collected and adapted existing English-Slovenian translations and other linguistic resources to serve as training data for the machine translation system. We evaluate the impact of the various linguistic resources on translation quality in the pharmaceutical domain. The evaluation is performed automatically using the BLEU metric; in addition, some test translations are evaluated manually by experts and potential users of the system. The analysis shows that test translations produced with the in-domain model score better than those produced with the out-of-domain model. Surprisingly, the larger, combined model does not achieve better results than the smaller in-domain model. The manual analysis of fluency and adequacy shows that translations with a high BLEU score can receive lower fluency or adequacy ratings than translations with a lower BLEU score. An experiment in which a domain dictionary is added to the in-domain translation model shows a gain of 1 BLEU point and ensures the use of the desired terminology.

Item Type:

Thesis (MSc thesis)

Keywords:

machine translation, statistical machine translation, statistical machine translation system adaptation for specific domain, statistical machine translation system adaptation for pharmaceutical domain, factor model, Moses phrase-based model, Cohen’s kappa, Fleiss’ kappa, agreement of raters

Number of Pages:

118

Language of Content:

Slovenian

Mentor / Co-mentors:

Name and Surname	ID	Function
Assoc. Prof. Dr. Marko Robnik Šikonja	276	Mentor

Link to COBISS:

http://www.cobiss.si/scripts/cobiss?command=search&base=51012&select=(ID=1537309379)

Institution:

University of Ljubljana

Department:

Faculty of Computer and Information Science

Item ID:

3471

Date Deposited:

31 Aug 2016 11:13

Last Modified:

15 Dec 2016 10:26

URI:

http://eprints.fri.uni-lj.si/id/eprint/3471
