ePrints.FRI - University of Ljubljana, Faculty of Computer and Information Science

TOPIC DETECTION AND TRACKING IN A STREAM OF DOCUMENTS

Blaž Novak (2008) TOPIC DETECTION AND TRACKING IN A STREAM OF DOCUMENTS. EngD thesis.


    Abstract

    Recent developments in information technology have left people facing an overwhelming amount of available information, with blogs being the newest and most abundant source. In this thesis, I approach the problem by organizing newly created information into sensible groups. The first part of the thesis surveys the state of the art in the areas relevant to the problem and analyzes the shortcomings of the different methods. The main contribution is a new algorithm that combines several of the ideas presented in the first part. It is an online hierarchical clustering algorithm capable of incremental model updates that support both the addition and the removal of documents. After each step, the structure of the model is adapted to better reflect the structure of the currently observed world. The model can also be optimized while waiting for new events. Experiments testing the properties of the new algorithm were performed on simulated data streams created from the Reuters Corpus Volume 1 dataset. I found that the basic assumptions about time complexity and the ability to adapt the model hold, and that the algorithm performs surprisingly well for a range of different inputs.
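    The incremental add and remove operations mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis's algorithm: it is a flat, single-level simplification (no hierarchy, no restructuring or background optimization), and the names `Cluster`, `StreamClusterer`, `threshold`, and the sparse-dictionary document representation are assumptions made for illustration only.

```python
import math
from collections import Counter


class Cluster:
    """Hypothetical cluster that keeps a running sum of sparse term vectors."""

    def __init__(self):
        self.total = Counter()  # element-wise sum of member vectors
        self.size = 0

    def add(self, vec):
        for term, weight in vec.items():
            self.total[term] += weight
        self.size += 1

    def remove(self, vec):
        # Subtract the document's contribution; drop terms that reach zero.
        for term, weight in vec.items():
            self.total[term] -= weight
            if abs(self.total[term]) < 1e-12:
                del self.total[term]
        self.size -= 1


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class StreamClusterer:
    """Flat online clustering: assumed simplification, not the thesis's method."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold  # assumed similarity cut-off
        self.clusters = []
        self.assignment = {}  # doc_id -> (cluster, vector)

    def add_document(self, doc_id, vec):
        # Cosine is scale-invariant, so the running sum stands in for the centroid.
        best, best_sim = None, self.threshold
        for cluster in self.clusters:
            sim = cosine(vec, cluster.total)
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            best = Cluster()
            self.clusters.append(best)
        best.add(vec)
        self.assignment[doc_id] = (best, vec)
        return best

    def remove_document(self, doc_id):
        cluster, vec = self.assignment.pop(doc_id)
        cluster.remove(vec)
        if cluster.size == 0:
            self.clusters.remove(cluster)
```

    Because each cluster stores only a running sum, adding or removing one document costs time proportional to the number of terms in that document, which is what makes removing old documents from a stream inexpensive in this simplified setting.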

    Item Type: Thesis (EngD thesis)
    Keywords: clustering, stream mining
    Number of Pages: 63
    Language of Content: Slovenian
    Mentor / Comentors:
    Name and Surname         ID    Function
    prof. dr. Ivan Bratko    77    Mentor
    Dunja Mladenić                 Comentor
    Link to COBISS: http://www.cobiss.si/scripts/cobiss?command=search&base=50070&select=(ID=6723668)
    Institution: University of Ljubljana
    Department: Faculty of Computer and Information Science
    Item ID: 302
    Date Deposited: 29 Oct 2008 10:41
    Last Modified: 13 Aug 2011 00:33
    URI: http://eprints.fri.uni-lj.si/id/eprint/302
