ePrints.FRI - University of Ljubljana, Faculty of Computer and Information Science

Algorithm for the analysis of short sentences in Slovenian language

Robert Jakomin (2012) Algorithm for the analysis of short sentences in Slovenian language. EngD thesis.


    Abstract

    In recent years, computing has been successfully applied to problems in linguistics. This is especially true of syntax, where computer-processed collections of texts (corpora) allow linguistic assumptions to be searched for and verified instantly. For users unskilled in computing, the complicated user interfaces of text search engines and texts without morpho-syntactic annotation are often an obstacle. The problem is particularly acute for synthetic languages such as Slovenian. The aim of the thesis was to develop an efficient algorithm for the analysis of short sentences (up to three words) in the Slovenian language using its own morphological tagging, and to implement the algorithm in a programming language, with a graphical user interface for searching the database of syntactically and morphologically annotated sentences. The time complexity of the developed morphological tagger is of the order O(n), where n is the number of given word forms of a language. The time complexity of the developed sentence analysis algorithm is of the order k*O(n), where n is the number of words in a sentence and k is the constant number of iterations in the sentence analysis. The morphological tagger is also suitable for other inflective languages, while the sentence analyzer is suitable only for other Slavic languages, possibly with minor adjustments. The sentence analysis algorithm has certain limitations. It is entirely dependent on the effectiveness of the morphological word tagger. Because it focuses on short sentences, it cannot recognize enumerations and coordinate phrases. Because it does not delve into the context of the sentence, it cannot exactly distinguish between the subject and the predicate phrase (the first nominative noun element found is assumed to be the subject and the second the verbal phrase). It also finds only one of the often many possible analyses of each sentence. Moreover, due to its formal nature, the algorithm cannot effectively distinguish between objects and adverbial phrases. The analysis of sentences longer than three words is already possible at the current stage of development, but has not yet been tested.
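
The following is a minimal sketch, not the thesis implementation, of the kind of pipeline the abstract describes: a dictionary-based morphological lookup followed by a single left-to-right pass that applies the "first nominative is the subject" assumption. The lexicon entries, the names `LEXICON`, `Token`, `tag` and `analyse`, and the role labels are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical toy lexicon: word form -> (part of speech, case).
# These entries are illustrative and are not taken from the thesis.
LEXICON = {
    "janez": ("noun", "nominative"),
    "je": ("verb", None),
    "učitelj": ("noun", "nominative"),
    "bere": ("verb", None),
    "knjigo": ("noun", "accusative"),
}


@dataclass
class Token:
    form: str
    pos: str
    case: Optional[str]


def tag(sentence: str) -> List[Token]:
    """Tag each word with one dictionary lookup per word form."""
    tokens = []
    for raw in sentence.lower().split():
        form = raw.strip(".,!?")
        pos, case = LEXICON.get(form, ("unknown", None))
        tokens.append(Token(form, pos, case))
    return tokens


def analyse(tokens: List[Token]) -> List[Tuple[str, str]]:
    """Assign a role to each token in a single left-to-right pass."""
    roles = []
    subject_found = False
    for tok in tokens:
        is_nominative_noun = tok.pos == "noun" and tok.case == "nominative"
        if is_nominative_noun and not subject_found:
            # Assumption from the abstract: first nominative is the subject.
            roles.append((tok.form, "subject"))
            subject_found = True
        elif tok.pos == "verb" or is_nominative_noun:
            # A later nominative is assigned to the verbal phrase.
            roles.append((tok.form, "verbal phrase"))
        elif tok.pos == "noun":
            # Objects and adverbials are not told apart by form alone.
            roles.append((tok.form, "object or adverbial"))
        else:
            roles.append((tok.form, "unknown"))
    return roles


print(analyse(tag("Janez je učitelj.")))
print(analyse(tag("Janez bere knjigo.")))
```

Each word is visited once per pass, so a fixed number k of such passes over a sentence of n words stays linear in n, which is consistent with the k*O(n) bound stated in the abstract. The sketch also returns only one analysis per sentence and depends entirely on the lexicon, mirroring two of the limitations listed above.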

    Item Type: Thesis (EngD thesis)
    Keywords: Algorithm, morphological tagging, computer sentence analysis, text corpus
    Number of Pages: 33
    Language of Content: Slovenian
    Mentor / Comentors:
    Name and Surname                        ID     Function
    doc. dr. Matija Marolt                  271    Mentor
    strok. raz. sod. dr Domen Marinčič             Comentor
    Link to COBISS: http://www.cobiss.si/scripts/cobiss?command=search&base=50070&select=(ID=9608532)
    Institution: University of Ljubljana
    Department: Faculty of Computer and Information Science
    Item ID: 1960
    Date Deposited: 19 Dec 2012 15:38
    Last Modified: 22 Jan 2013 12:24
    URI: http://eprints.fri.uni-lj.si/id/eprint/1960
