Abstractive summarization for Slovene language

Andrej Jugovic (2016) Abstractive summarization for Slovene language. EngD thesis.

Abstract

The thesis addresses automatic summarization of Slovene documents. Large numbers of documents exist in digital form, and summaries make them more accessible to readers. Producing summaries manually is not feasible at this scale, so we automate the process. Our system uses a parser for the Slovene language to find triplets consisting of a subject, a predicate (or verb), and an object. We build a graph from the words in the triplets and weight its connections. We rank the nodes with the personalized PageRank (P-PR) algorithm, which assesses the importance of the words in the triplets. We weight the P-PR values of the words in the triplets with the TF-IDF, Okapi BM25, and word frequency measures. We choose the best triplets and use them to generate summaries. The generated summaries are evaluated with the ROUGE-N and ROUGE-S measures. Evaluation is performed on a corpus built from Wikipedia and also against manually created summaries. The results show that humans create significantly better summaries. The best computer-generated summaries are obtained when graph connections are weighted by the number of bigram occurrences and P-PR values are weighted by the frequency of word occurrence in the triplets.
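
The ranking steps described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names, the damping factor, the choice of seed words for the P-PR restart distribution, and the rule that bigrams are counted inside the triplet text are all assumptions made for illustration, and the toy triplets are in English rather than Slovene.

```python
from collections import Counter, defaultdict


def triplet_words(triplet):
    """Flatten a (subject, predicate, object) triplet into a list of words."""
    return [word for part in triplet for word in part.split()]


def build_graph(triplets):
    """Undirected word graph; an edge weight counts adjacent word pairs (bigrams).

    Assumption: bigrams are counted within the triplet text only.
    """
    graph = defaultdict(lambda: defaultdict(float))
    for triplet in triplets:
        words = triplet_words(triplet)
        for a, b in zip(words, words[1:]):
            if a != b:
                graph[a][b] += 1.0
                graph[b][a] += 1.0
    return graph


def personalized_pagerank(graph, seeds, damping=0.85, iterations=50):
    """Power iteration with the restart mass concentrated on the seed words.

    Assumption: seeds are hypothetical query words (for example title words).
    """
    nodes = list(graph)
    seeds = [s for s in seeds if s in graph] or nodes  # fall back to uniform
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) * restart[n] for n in nodes}
        for n in nodes:
            total = sum(graph[n].values())
            for neighbour, weight in graph[n].items():
                new_rank[neighbour] += damping * rank[n] * weight / total
        rank = new_rank
    return rank


def rank_triplets(triplets, seeds, top_k=2):
    """Score each triplet by the sum of P-PR value times word frequency."""
    rank = personalized_pagerank(build_graph(triplets), seeds)
    freq = Counter(w for t in triplets for w in triplet_words(t))

    def score(triplet):
        return sum(rank.get(w, 0.0) * freq[w] for w in triplet_words(triplet))

    return sorted(triplets, key=score, reverse=True)[:top_k]


triplets = [
    ("system", "uses", "parser"),
    ("parser", "finds", "triplets"),
    ("weather", "is", "nice"),
]
best = rank_triplets(triplets, seeds=["parser"], top_k=2)
```

In this toy input the third triplet shares no words with the seed, so its words receive no P-PR mass and it is not selected. Replacing the frequency factor in `score` with a TF-IDF or Okapi BM25 weight would give the other weighting variants named in the abstract.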

Item Type:

Thesis (EngD thesis)

Keywords:

natural language processing, document summarization, personalized PageRank algorithm, ROUGE measure, weighted links, automatic document summarization

Number of Pages:

Language of Content:

Slovenian

Mentor / Comentors:

Name and Surname	ID	Function
izr. prof. dr. Marko Robnik Šikonja	276	Comentor

Link to COBISS:

http://www.cobiss.si/scripts/cobiss?command=search&base=51012&select=(ID=1537214659)

Institution:

University of Ljubljana

Department:

Faculty of Computer and Information Science

Item ID:

3569

Date Deposited:

13 Sep 2016 13:54

Last Modified:

18 Oct 2016 10:55

URI:

http://eprints.fri.uni-lj.si/id/eprint/3569
