Marko Toplak (2016) Induction of prediction models using domain knowledge about related features. PhD thesis.
Abstract
Domain knowledge can help us build more accurate prediction models. Molecular biology is one of the fields where induction of prediction models is relatively hard because a typical data set contains few learning instances, yet vast domain knowledge exists. Basic entities of the field (genes, proteins, and metabolic products) are described and categorized in various freely accessible databases. This thesis focuses on methods that transform data from the space of features into the space of feature groups, which can be assembled from existing databases and represent prior knowledge. The features in the molecular biology data sets used in the thesis represent genes. Methods working with gene groups assume that gene expression profiles belonging to the same group are similar. We show that gene expressions of gene pairs from groups in the databases KEGG and BioGRID are more similar than gene expressions of random gene pairs, but the differences are small. The differences do not change with the database version. We propose a technique for transformation of data into a space of feature groups with collective matrix factorization, which simultaneously factorizes matrices representing data and feature groups into a product of latent factors with ranks smaller than the ranks of the original matrices. The models induced from the transformed data can be as accurate as models on the non-transformed data. In contrast to existing approaches, the proposed approach can also use features that are not in predefined groups of features but are similar to features in a group. Techniques that transform data into a space of feature groups require estimation of transformation parameters such as, for example, feature weights. Techniques that use values of the target variable for parameter estimation produce values for the feature groups that are at least partially fitted to the target variable.
The induced models could therefore overestimate the importance of class-overfitted features, which can decrease their accuracy on novel data. We propose a solution that uses stacking. The proposed solution can work with any transformation technique and, for some data sets, boosts accuracy substantially. In the thesis we thoroughly study the transformation of data into predefined feature groups. We show, in the largest study so far, that, on average, models induced from data sets transformed with feature groups do not obtain better prediction accuracies than models induced on non-transformed data sets. As the accuracies on transformed and non-transformed data sets are similar, the transformed data may still be preferred, as models on feature groups are easier to interpret.
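To make the idea of moving data from the space of features into the space of feature groups concrete, here is a minimal sketch. It assumes the simplest possible transformation, an unweighted mean over the group members, which is only one illustrative choice and not the collective matrix factorization approach proposed in the thesis. The function name `to_group_space`, the gene names, and the pathway names are all hypothetical.

```python
import numpy as np

def to_group_space(X, feature_names, groups):
    """Map a samples-by-features matrix to a samples-by-groups matrix.

    Assumption: each group value is the unweighted mean of its member
    features. Group members missing from the data are ignored, and groups
    with no members in the data are dropped.
    """
    index = {name: j for j, name in enumerate(feature_names)}
    group_names, columns = [], []
    for group_name, members in groups.items():
        cols = [index[m] for m in members if m in index]
        if not cols:
            continue  # no member of this group was measured
        group_names.append(group_name)
        columns.append(X[:, cols].mean(axis=1))
    return np.column_stack(columns), group_names

# Toy expression matrix: 2 samples, 4 genes (made-up values).
X = np.array([[1.0, 3.0, 5.0, 7.0],
              [2.0, 4.0, 6.0, 8.0]])
genes = ["g1", "g2", "g3", "g4"]
groups = {
    "pathway_A": ["g1", "g2"],
    "pathway_B": ["g2", "g3", "g4"],
    "pathway_C": ["g9"],  # not present in the data, so it is dropped
}

Z, names = to_group_space(X, genes, groups)
print(names)
print(Z)
```

A model induced on `Z` then works with one value per group instead of one value per gene. Note that this unweighted mean does not use the target variable, so it avoids the class-overfitting issue described above; transformations that estimate feature weights from the target variable would need the stacking safeguard instead.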