ePrints.FRI - University of Ljubljana, Faculty of Computer and Information Science

Predicting categories of news articles using meta-data from the Web

Žiga Vučko (2015) Predicting categories of news articles using meta-data from the Web. EngD thesis.

    Text mining, a field of machine learning concerned with discovering knowledge from text, is evolving rapidly. The Artificial Intelligence Laboratory of the Jožef Stefan Institute has recognized this and is developing a system called Event Registry, which collects news articles from the Web in real time, detects the events they describe and extracts relevant information. The component of the system that classifies articles into categories has not yet been fully developed. In response, in our diploma thesis we set out to improve on a reference model. The results were positive: we improved the predictive accuracy of classifying arbitrary news articles into one of the categories of our predefined taxonomy. During the learning phase, we examined the impact of various forms of meta-data on the predictive accuracy of the model, focusing mainly on meta-data obtained from the Never-Ending Language Learner developed at Carnegie Mellon University. We found that this meta-data has a positive effect on the performance of the model when it is used in combination with other meta-data. For learning, we used several algorithms: logistic regression, support vector machines, random forests and k-nearest neighbors. The first two proved to be the most appropriate for building the predictive model. We also tested several approaches to active learning, which can simplify and speed up the manual labeling of new articles. All of them produced a positive result, and the approach that combines the uncertainty of a prediction with the correlation between learning instances proved to be the best.
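
    The best-performing active learning approach mentioned in the abstract combines prediction uncertainty with correlation between learning instances. The following is a minimal Python sketch of one way such a combination could be scored. The abstract does not give the exact formula, so the score used here (prediction entropy multiplied by mean cosine similarity to the other unlabeled articles), the function names and the `beta` parameter are all assumptions made for illustration, not the implementation from the thesis.

```python
import math


def entropy(probs):
    """Shannon entropy of a predicted class distribution (higher = less certain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def cosine(u, v):
    """Cosine similarity between two dense feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_for_labeling(features, class_probs, beta=1.0):
    """Return the index of the unlabeled article to label next.

    Assumed scoring rule (not taken from the thesis):
        score(i) = entropy(i) * (mean similarity of i to the other articles) ** beta
    Features are assumed non-negative (e.g. term weights), so similarities
    stay in [0, 1].
    """
    n = len(features)
    best_index, best_score = None, -1.0
    for i in range(n):
        sims = [cosine(features[i], features[j]) for j in range(n) if j != i]
        density = sum(sims) / len(sims) if sims else 1.0
        score = entropy(class_probs[i]) * density ** beta
        if score > best_score:
            best_index, best_score = i, score
    return best_index


# Hypothetical toy data: three unlabeled articles, two categories.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
class_probs = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
chosen = select_for_labeling(features, class_probs)
```

    In this toy example all three predictions are equally uncertain, so the choice is decided by the similarity term alone: the second article is the most similar on average to the rest of the pool. Weighting uncertainty by similarity in this way steers manual labeling away from isolated outliers, whose labels say little about other articles.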

    Item Type: Thesis (EngD thesis)
    Keywords: machine learning, text mining, classification, Event Registry, Never-Ending Language Learner, active learning
    Number of Pages: 82
    Language of Content: Slovenian
    Mentor / Comentors: doc. dr. Lovro Šubelj (Mentor)
    Link to COBISS: http://www.cobiss.si/scripts/cobiss?command=search&base=51012&select=(ID=1536529859)
    Institution: University of Ljubljana
    Department: Faculty of Computer and Information Science
    Item ID: 3111
    Date Deposited: 15 Sep 2015 14:03
    Last Modified: 01 Oct 2015 12:27
    URI: http://eprints.fri.uni-lj.si/id/eprint/3111
