Andrej Jugovic (2016) Abstractive summarization for Slovene language. EngD thesis.
Abstract
The thesis addresses automatic summarization of Slovene documents. Large numbers of documents exist in digital form, and summarizing them makes them more accessible to human readers. Doing this manually is infeasible, so we automate the process. Our system uses a parser for the Slovene language to find triplets consisting of a subject, a predicate (or verb), and an object. We build a graph from the words in the triplets and weight its connections. We rank the nodes with the personalized PageRank (P-PR) algorithm, which assesses the importance of the words in the triplets. We weight the P-PR values of these words with the TF-IDF, Okapi BM25, and word frequency measures. We choose the best triplets and use them to generate summaries. The generated summaries are evaluated with the ROUGE-N and ROUGE-S measures. Evaluation is performed on a corpus built from Wikipedia and also with manually created summaries. The results show that humans create significantly better summaries. The best computer-generated summaries are obtained when graph connections are weighted by the number of bigram occurrences and P-PR values are weighted by the frequency of word occurrence in triplets.
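The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names (`build_graph`, `personalized_pagerank`, `rank_triplets`, `rouge_n`), the exact edge-weighting rule, the choice of restart (seed) words, the damping factor, and the toy English triplets are all assumptions made for illustration. The Slovene parser that extracts the triplets is not shown. `rouge_n` computes only the recall form of ROUGE-N.

```python
from collections import Counter, defaultdict


def build_graph(triplets, sentences):
    """Undirected word graph over triplet words (hypothetical construction)."""
    bigrams = Counter()
    for tokens in sentences:
        bigrams.update(zip(tokens, tokens[1:]))
    graph = defaultdict(lambda: defaultdict(float))
    for subj, pred, obj in triplets:
        for a, b in ((subj, pred), (pred, obj)):
            if a == b:
                continue
            # Assumed weight: 1 for sharing a triplet plus bigram occurrences.
            weight = 1.0 + bigrams[(a, b)] + bigrams[(b, a)]
            graph[a][b] += weight
            graph[b][a] += weight
    return graph


def personalized_pagerank(graph, seeds, damping=0.85, iterations=50):
    """Power iteration in which the random walk restarts only at seed words."""
    nodes = list(graph)
    # Fall back to a uniform restart if no seed word occurs in the graph.
    seeds = [s for s in set(seeds) if s in graph] or nodes
    restart = {n: 0.0 for n in nodes}
    for s in seeds:
        restart[s] = 1.0 / len(seeds)
    rank = dict(restart)
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) * restart[n] for n in nodes}
        for n in nodes:
            total = sum(graph[n].values())  # > 0: every node has an edge
            for m, w in graph[n].items():
                new_rank[m] += damping * rank[n] * w / total
        rank = new_rank
    return rank


def rank_triplets(triplets, rank):
    """Score a triplet by its words' P-PR values times their triplet frequency."""
    freq = Counter(w for t in triplets for w in t)

    def score(t):
        return sum(rank.get(w, 0.0) * freq[w] for w in t)

    return sorted(triplets, key=score, reverse=True)


def rouge_n(candidate, reference, n=2):
    """ROUGE-N recall: clipped n-gram overlap divided by reference n-grams."""
    def ngrams(tokens):
        return Counter(zip(*(tokens[i:] for i in range(n))))

    cand, ref = ngrams(candidate), ngrams(reference)
    total = sum(ref.values())
    overlap = sum(min(count, cand[g]) for g, count in ref.items())
    return overlap / total if total else 0.0


# Toy data standing in for parser output (hypothetical).
triplets = [("cat", "chases", "mouse"),
            ("mouse", "eats", "cheese"),
            ("dog", "sleeps", "outside")]
sentences = [list(t) for t in triplets]

graph = build_graph(triplets, sentences)
rank = personalized_pagerank(graph, seeds=["mouse"])
ranked = rank_triplets(triplets, rank)
print(ranked[:2])  # the two highest-scoring triplets form the summary
```

In this sketch the seed word concentrates rank on the part of the graph connected to it, so triplets unrelated to the seed receive a score of zero. Swapping the frequency term in `rank_triplets` for TF-IDF or Okapi BM25 weights would give the other variants named in the abstract.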
Item Type: | Thesis (EngD thesis) |
Keywords: | natural language processing, document summarization, personalized PageRank algorithm, ROUGE measure, weighted links, automatic document summarization |
Number of Pages: | 68 |
Language of Content: | Slovenian |
Mentor / Comentors: | izr. prof. dr. Marko Robnik Šikonja (ID: 276, Function: Comentor) |
Link to COBISS: | http://www.cobiss.si/scripts/cobiss?command=search&base=51012&select=(ID=1537214659) |
Institution: | University of Ljubljana |
Department: | Faculty of Computer and Information Science |
Item ID: | 3569 |
Date Deposited: | 13 Sep 2016 13:54 |
Last Modified: | 18 Oct 2016 10:55 |
URI: | http://eprints.fri.uni-lj.si/id/eprint/3569 |