ePrints.FRI - University of Ljubljana, Faculty of Computer and Information Science

Text Mining Using Data Compression Models

Andrej Bratko (2012) Text Mining Using Data Compression Models. PhD thesis.


    Abstract

    The idea of using data compression algorithms for machine learning has been reinvented many times. Intuitively, compact representations of data are possible only if statistical regularities exist in the data. Compression algorithms identify such patterns and build statistical models to describe them. This ability to learn patterns from data makes compression methods immediately attractive for machine learning. In this thesis, we propose several novel text mining applications of data compression algorithms. We introduce a compression-based method for instance selection, capable of extracting a diverse subset of documents that is representative of a larger document collection. The quality of the sample is measured by how well a compression model trained on the subset predicts held-out reference data. The method is useful for initializing k-means clustering and as a pool-based active learning strategy for supervised training of text classifiers. When using compression models for classification, we propose that trained models be adapted sequentially while evaluating the probability of the document being classified. We justify this approach in terms of the minimum description length principle and show that adaptation improves performance in online filtering of email spam. Our research contributes to the state of the art in applied machine learning in two significant application domains. We propose the use of compression models for spam filtering, and show that compression-based filters are superior to traditional tokenization-based filters and competitive with the best known methods for this task. We also consider the use of compression models for lexical stress assignment, a problem in Slovenian speech synthesis, and demonstrate that compression models perform well on this task while requiring fewer resources than competing methods. Although the topic of this thesis is text mining, most of the proposed methods are more general and are designed for learning with arbitrary discrete sequences.
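
    The core idea sketched in the abstract, classifying a document by how many extra bits a class-adapted compressor needs to encode it, with the compressor continuing to adapt while encoding, can be illustrated with a minimal Python example. This is not the thesis implementation (which builds on PPM-style character-level models); it uses the standard zlib module as a stand-in compressor, and the toy corpora are hypothetical.

        import zlib

        def extra_bytes(training_text: str, document: str) -> int:
            """Approximate description length of `document` under a model trained
            on `training_text`: the growth in compressed size when the document is
            appended to the class corpus. Because the compressor keeps adapting
            while it encodes the appended document, this loosely mirrors the
            sequential adaptation described in the abstract."""
            base = len(zlib.compress(training_text.encode("utf-8"), 9))
            combined = len(zlib.compress((training_text + document).encode("utf-8"), 9))
            return combined - base

        def classify(document: str, class_corpora: dict) -> str:
            """Assign the class whose adapted compressor needs the fewest extra
            bytes to encode the document (a minimum description length criterion)."""
            return min(class_corpora, key=lambda c: extra_bytes(class_corpora[c], document))

        # Hypothetical toy corpora for illustration only.
        corpora = {
            "spam": "win money now free offer click here limited prize winner",
            "ham":  "meeting agenda attached please review the quarterly report",
        }
        print(classify("win a free prize now click here", corpora))    # expected: spam
        print(classify("please see the attached agenda", corpora))     # expected: ham

    The same description-length measure can also score a candidate training subset for instance selection: compress held-out reference data with a model trained on the subset, and prefer subsets that yield shorter codes.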

    Item Type: Thesis (PhD thesis)
    Keywords: text mining, compression, instance selection, active learning, k-means initialization, spam filtering, lexical stress assignment
    Number of Pages: 196
    Language of Content: English
    Mentor: prof. dr. Blaž Zupan
    Co-mentor: assoc. prof. dr. Bogdan Filipič
    Link to COBISS: http://www.cobiss.si/scripts/cobiss?command=search&base=50070&select=(ID=00009535572)
    Institution: University of Ljubljana
    Department: Faculty of Computer and Information Science
    Item ID: 1925
    Date Deposited: 09 Nov 2012 14:26
    Last Modified: 27 Nov 2012 13:10
    URI: http://eprints.fri.uni-lj.si/id/eprint/1925
