Jurij Leskovec (2004). Prešeren awards for students.
Full text not available from this repository.
Abstract
Automatic document summarization refers to the task of creating document surrogates that are smaller in size but retain various characteristics of the original document, depending on the intended use. A good solution to this problem would have a great impact on today's society, which is overloaded with information. While abstracts created by trained professionals involve rewriting of text, automatic summarization of documents has focused on extracting sentences or keywords from text so that the overall summary satisfies various criteria: optimal reduction of text for use in text indexing, coverage of document themes, and similar. As document summaries, in the form of abstracts, have been generated by humans, it seems most natural to try to model human abstracting. However, evaluation of such a model is difficult because of the text generation aspects. We apply machine learning algorithms to capture characteristics of human-extracted summary sentences. In contrast to related studies, which typically rely on a minimal understanding of the semantic structure of documents, we start with deep syntactic analysis of the text. Then we perform named entity consolidation and pronominal anaphora resolution. We extract elementary syntactic structures from individual sentences in the form of logical form triples, i.e., subject-predicate-object triples, and use semantic properties of nodes in the triples to build semantic graphs for both documents and corresponding summaries. We expect that extracted summaries capture essential semantic relations within the document and thus that their structures can be found within the document semantic graphs. We reduce the problem of summarization to acquiring machine learning models for mapping between the document graph and the graph of a summary. This means we learn models for extracting sub-structures from document semantic graphs which are characteristic of human-selected summaries.
We use logical form triples as basic features and apply Support Vector Machines to learn the summarization model. For the machine learning part we use techniques similar to those from (4,5). Using those techniques we won KDD Cup 2003, a data mining competition held at the ACM SIGKDD conference (Knowledge Discovery and Data Mining). Our approach proved to be successful. The context of the document, captured by the semantic graph, helped identify the key concepts and relations for summarization. Using semantic graph attributes improved the performance of the learned model. Visualizations of semantic graphs can be used directly as maps of documents for intelligent document browsing. Our future work will involve exploring alternative semantic structures on additional data sets, including human-generated abstracts. We will also explore how to better evaluate the quality of our summaries.
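The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the toy triples, the function names `build_graph` and `triple_examples`, and the choice of node degree as the only graph attribute are all assumptions made for illustration. The sketch builds a semantic graph from subject-predicate-object triples and labels each document triple by whether it also occurs in the summary, producing examples that a classifier such as a Support Vector Machine could be trained on.

```python
from collections import defaultdict

# Hypothetical toy data: (subject, predicate, object) triples.
document_triples = [
    ("Tom", "found", "key"),
    ("Tom", "opened", "door"),
    ("door", "led_to", "garden"),
    ("cat", "slept", "sofa"),
]
summary_triples = {("Tom", "opened", "door")}


def build_graph(triples):
    """Return an adjacency map: node -> set of (predicate, neighbour) edges."""
    graph = defaultdict(set)
    for subj, pred, obj in triples:
        graph[subj].add((pred, obj))
        graph[obj].add((pred, subj))
    return graph


def triple_examples(doc_triples, summ_triples):
    """Pair each document triple with graph features and a +1/-1 label.

    The label is +1 when the triple also appears in the summary.
    Node degree stands in for the richer semantic graph attributes
    mentioned in the abstract (an assumption of this sketch).
    """
    graph = build_graph(doc_triples)
    examples = []
    for triple in doc_triples:
        subj, _, obj = triple
        features = {
            "subj_degree": len(graph[subj]),
            "obj_degree": len(graph[obj]),
        }
        label = 1 if triple in summ_triples else -1
        examples.append((triple, features, label))
    return examples


if __name__ == "__main__":
    for triple, features, label in triple_examples(document_triples, summary_triples):
        print(triple, features, label)
```

In this toy input the summary triple connects the two best-connected nodes, which hints at why graph attributes might help a learned model separate summary triples from the rest.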
| Item Type: | Thesis (Prešeren awards for students) |
|---|---|
| Keywords: | |
| Number of Pages: | 88 |
| Language of Content: | Slovenian |
| Mentor / Comentors: | |
| Link to COBISS: | http://www.cobiss.si/scripts/cobiss?command=search&base=51012&select=(ID=4488532) |
| Institution: | University of Ljubljana |
| Department: | Faculty of Computer and Information Science |
| Item ID: | 3720 |
| Date Deposited: | 05 Jan 2017 14:07 |
| Last Modified: | 13 Feb 2017 09:04 |
| URI: | http://eprints.fri.uni-lj.si/id/eprint/3720 |