ePrints.FRI - University of Ljubljana, Faculty of Computer and Information Science

Graph-based models for multi-document summarization

Canhasi Ercan (2014) Graph-based models for multi-document summarization. PhD thesis.


    This thesis is about automatic document summarization, with experimental results on generic, query-focused, update and comparative multi-document summarization (MDS). We describe prior work and our own improvements on some important aspects of a summarization system, including text modeling by means of a graph and sentence selection via archetypal analysis. The centerpiece of this work is a novel method for summarization that we call “Archetypal Analysis Summarization”. Archetypal Analysis (AA) is a promising unsupervised learning tool that combines the advantages of clustering with the flexibility of matrix factorization. We propose a novel AA-based summarization method that rests on the following observations. In generic document summarization, given a graph representation of a set of documents, positively and/or negatively salient sentences are values on the data set boundary. To compute these extreme values, the general or weighted archetypes, we use archetypal analysis and weighted archetypal analysis, respectively. While each sentence in a data set is estimated as a mixture of archetypal sentences, the archetypes themselves are restricted to being sparse mixtures, i.e. convex combinations, of the original sentences. Since AA in this way readily offers soft clustering and probabilistic ranking, we suggest considering it as a method for simultaneous sentence clustering and ranking. Another important argument in favour of using AA in MDS is that, in contrast to other factorization methods, which extract prototypical, characteristic, even basic sentences, AA selects distinct (archetypal) sentences and thus induces variability and diversity in the produced summaries. Our research also contributes new modeling approaches based on graph notation that facilitate the text summarization task.
We investigate the impact of using the content-graph and multi-element graph models for language- and domain-independent extractive multi-document generic and query-focused summarization. We also propose a novel version of AA, the weighted Hierarchical Archetypal Analysis, and consider its use for the four best-known summarization tasks: generic, query-focused, update, and comparative summarization. Experiments on summarization data sets (DUC04-07, TAC08) are conducted to demonstrate the efficiency and effectiveness of our framework for all four kinds of multi-document summarization task.
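
The abstract describes AA in words: every sentence is a mixture of archetypes, and every archetype is a convex combination of the original sentences. The sketch below illustrates that factorization only; it is not the thesis implementation. It assumes a sentence similarity matrix `X` as input, minimizes the squared reconstruction error with simple Frank-Wolfe updates (a solver chosen here for brevity), and uses made-up toy data. The function names, the step-size rule, and the final "pick the heaviest sentence per archetype" step are all illustrative assumptions.

```python
import random

def matmul(P, Q):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*Q))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in P]

def transpose(P):
    return [list(col) for col in zip(*P)]

def residual(A, Z, X):
    """Return A Z - X, the reconstruction error for every sentence."""
    return [[r - x for r, x in zip(rec, row)] for rec, row in zip(matmul(A, Z), X)]

def frank_wolfe_step(M, G, gamma):
    """Shift each row of M toward the simplex vertex with the smallest gradient.

    Rows stay non-negative and keep summing to one, so they remain convex weights.
    """
    for row, grad in zip(M, G):
        j = min(range(len(grad)), key=grad.__getitem__)
        for c in range(len(row)):
            row[c] *= 1.0 - gamma
        row[j] += gamma

def archetypal_analysis(X, k, iters=200, seed=0):
    """Approximate X by A (B X): A mixes archetypes, B mixes sentences."""
    rng = random.Random(seed)
    n = len(X)
    # B (k x n): each archetype starts as one randomly chosen sentence.
    B = [[0.0] * n for _ in range(k)]
    for r, i in enumerate(rng.sample(range(n), k)):
        B[r][i] = 1.0
    # A (n x k): each sentence starts as a uniform mixture of archetypes.
    A = [[1.0 / k] * k for _ in range(n)]
    Xt = transpose(X)
    for t in range(iters):
        gamma = 2.0 / (t + 3.0)  # diminishing step size (assumed schedule)
        Z = matmul(B, X)         # current archetypes, one per row
        # Gradient w.r.t. A (up to a constant factor): (A Z - X) Z^T
        frank_wolfe_step(A, matmul(residual(A, Z, X), transpose(Z)), gamma)
        # Gradient w.r.t. B (up to a constant factor): A^T (A Z - X) X^T
        grad_B = matmul(matmul(transpose(A), residual(A, Z, X)), Xt)
        frank_wolfe_step(B, grad_B, gamma)
    return A, B

# Toy sentence-similarity matrix for five sentences (made-up values).
X = [
    [1.0, 0.8, 0.1, 0.0, 0.4],
    [0.8, 1.0, 0.2, 0.1, 0.5],
    [0.1, 0.2, 1.0, 0.9, 0.4],
    [0.0, 0.1, 0.9, 1.0, 0.5],
    [0.4, 0.5, 0.4, 0.5, 1.0],
]
A, B = archetypal_analysis(X, k=2)
# Illustrative selection: the sentence carrying the most weight in each archetype.
picked = [max(range(len(X)), key=row.__getitem__) for row in B]
```

Each row of `A` can be read as a soft cluster assignment of one sentence, and each row of `B` as the recipe for one archetype, which is what makes joint clustering and ranking possible in this kind of model.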

    Item Type: Thesis (PhD thesis)
    Keywords: multi-document summarization, archetypal analysis, weighted archetypal analysis, weighted hierarchical archetypal analysis, matrix decomposition, content graph joint model, multi-element graph, query-focused summarization, update summarization, comparative summarization.
    Language of Content: English
    Mentor / Comentors:
    Name and Surname | ID | Function
    prof. dr. Igor Kononenko | 237 | Mentor
    Link to COBISS: http://www.cobiss.si/scripts/cobiss?command=search&base=50070&select=(ID=10571092)
    Institution: University of Ljubljana
    Department: Faculty of Computer and Information Science
    Item ID: 2528
    Date Deposited: 17 Apr 2014 16:44
    Last Modified: 15 May 2014 10:50
    URI: http://eprints.fri.uni-lj.si/id/eprint/2528
