ePrints.FRI - University of Ljubljana, Faculty of Computer and Information Science

Machine learning on large class hierarchies by transformation into multiple binary problems

Janez Brank (2012) Machine learning on large class hierarchies by transformation into multiple binary problems. PhD thesis.

    An ontology is a shared conceptualization of some problem domain, usually consisting of concepts, instances, an is-a hierarchy between concepts, and possibly other relations and attributes. This thesis deals with several problems on ontologies from a machine-learning perspective, in which a simple ontology can be seen as a hierarchy of classes: ontology population (classification), ontology evaluation, ontology evolution (predicting structural change) and extraction of ontological data from a corpus of documents. We particularly focus on the problem of classification into a hierarchy of classes, which can be seen as one way to populate an ontology.

One approach to dealing with multi-class problems such as this one is to convert them into several binary (two-class) problems and use a voting scheme to combine the predictions of the resulting ensemble of binary classifiers. The relationship between the classes of the original multi-class problem and the new binary problems can be concisely described by a coding matrix. A particularly interesting question is whether good classification performance can be achieved with a small number of binary classifiers. The number of different coding matrices (and thus of the ensembles defined by them) is exponentially large in the number of classes of the original problem. Although this space of coding matrices is intractably large, a substantial part of it can be explored if the number of classes in the original problem is small. We present extensive experiments on one such small dataset and investigate the distribution of classification performance scores as a function of the number of binary classifiers in the ensemble.
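As an illustration of the coding-matrix idea, the following is a minimal sketch and not the implementation used in the thesis. It assumes a hypothetical 4-class problem encoded with 3 binary classifiers, entries of +1/-1, and Hamming-distance decoding as the voting rule (one common choice; the thesis may combine votes differently). The names CODING_MATRIX, binary_labels, decode and row_separation are invented for this example.

```python
from itertools import combinations
from typing import List, Sequence

# Hypothetical coding matrix: one row per class, one column per binary problem.
# An entry of +1 / -1 says which side of the binary problem the class falls on.
CODING_MATRIX: List[List[int]] = [
    [+1, +1, +1],  # class 0
    [+1, -1, -1],  # class 1
    [-1, +1, -1],  # class 2
    [-1, -1, +1],  # class 3
]


def binary_labels(matrix: Sequence[Sequence[int]], column: int,
                  class_labels: Sequence[int]) -> List[int]:
    """Relabel multi-class training labels for the binary problem in `column`."""
    return [matrix[c][column] for c in class_labels]


def hamming_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions in which two equally long vectors differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def decode(matrix: Sequence[Sequence[int]], predictions: Sequence[int]) -> int:
    """Return the class whose row is closest to the ensemble's predictions.

    Ties are broken in favour of the lowest class index.
    """
    distances = [hamming_distance(row, predictions) for row in matrix]
    return min(range(len(matrix)), key=distances.__getitem__)


def row_separation(matrix: Sequence[Sequence[int]]) -> float:
    """Average Hamming distance over all pairs of rows of the matrix."""
    pairs = list(combinations(matrix, 2))
    return sum(hamming_distance(a, b) for a, b in pairs) / len(pairs)
```

With this matrix, an ensemble output of [+1, -1, -1] matches the row of class 1 exactly, so decode returns 1. Each column defines one binary training set via binary_labels, so the number of columns is the number of binary classifiers in the ensemble.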
We demonstrate that good classification performance can be achieved with a small ensemble, but such an ensemble might be hard to find; on the other hand, allowing a larger ensemble makes it easier to achieve good performance but does not lead to further increases in the maximal performance (over all ensembles of that size). We also investigate the well-known claim that high row and column separation (average Hamming distance between rows/columns of the matrix) are important properties of the coding matrix, and show that while matrices with high separation do indeed tend to perform well, maximizing row/column separation does not lead to the best-performing matrices. We present a greedy algorithm that constructs the coding matrix one column at a time, based on the idea that the binary classifier defined by the new column should focus on separating those pairs of classes which are most frequently confused by the existing ensemble of classifiers. An empirical evaluation shows that this algorithm achieves comparable performance with a smaller number of classifiers than a baseline random-matrix approach. We also present an analysis which demonstrates that the impact of adding a single new classifier to the ensemble is necessarily quite limited, even if weighted voting is taken into account.

We also deal with the topic of ontology evolution, in particular with predicting structural changes in an ontology. We studied the evolution of the Open Directory Project (ODP) ontology over several years, identified several common types of structural changes, and developed a heuristic approach to recognize them and quantify their frequency. The most common structural change turned out to be the addition of new categories, and we present a machine-learning approach to predicting where a new subcategory might be added by taking a few documents from an existing category.

Ontology evaluation consists of various approaches and techniques for evaluating and comparing ontologies.
We present a survey of such techniques and classify them by the level or aspect of ontologies they focus on, as well as by their general approach (gold-standard based, application based, data-driven, and manual evaluation). We introduce an ontology evaluation measure for scenarios where the ontology is a hierarchy of classes and is to be compared to a “gold standard” ontology built over the same set of instances. We investigate how this measure responds to various kinds of structural changes in the ontology.

We also discuss another approach to ontology population, aimed at “general knowledge” ontologies rather than document hierarchies. In this approach the goal is to extract useful triples of the form (concept1, relation, concept2) from a corpus of natural-language documents. In addition to triples that directly occur in the corpus, we also consider more abstract triples that can be obtained by replacing one or more components by a hypernym. We developed an efficient algorithm that can process a large corpus of documents and extract triples that satisfy a minimum-support threshold at any level of abstraction. However, high support by itself is not a sufficient condition for a triple to be interesting for inclusion in the ontology, since many triples with high support are irrelevant or too abstract to be interesting. We show several heuristics that can be used to identify interesting triples by comparing their support to that of their ancestors or neighbors in the triple space. We evaluated these heuristics experimentally by comparing their results to manually assigned relevance labels.

    Item Type: Thesis (PhD thesis)
    Keywords: machine learning, classification, coding matrices, ontologies, ontology evolution, ontology evaluation, information extraction, text mining
    Number of Pages: 143
    Language of Content: Slovenian
    Mentor / Comentors:
    Name and Surname              ID    Function
    akad. prof. dr. Ivan Bratko   77    Mentor
    prof. dr. Dunja Mladenić            Comentor
    Link to COBISS: http://www.cobiss.si/scripts/cobiss?command=search&base=50070&select=(ID=00009388628)
    Institution: University of Ljubljana
    Department: Faculty of Computer and Information Science
    Item ID: 1785
    Date Deposited: 03 Sep 2012 17:01
    Last Modified: 09 Nov 2012 11:04
    URI: http://eprints.fri.uni-lj.si/id/eprint/1785
