Learning by Fusing Heterogeneous Data

Marinka Žitnik (2015) Learning by Fusing Heterogeneous Data. PhD thesis.

Preview

Abstract

It has become increasingly common in science and technology to gather data about systems at different levels of granularity or from different perspectives. This often gives rise to data that are represented in totally different input spaces. A basic premise behind the study of learning from heterogeneous data is that in many such cases, there exists some correspondence among certain input dimensions of different input spaces. In our work we found that a key bottleneck that prevents us from better understanding and truly fusing heterogeneous data at large scales is identifying the kind of knowledge that can be transferred between related data views, entities and tasks. We develop interesting and accurate data fusion methods for predictive modeling, which reduce or entirely eliminate some of the basic feature engineering steps that were needed in the past when inferring prediction models from disparate data. In addition, our work has a wide range of applications of which we focus on those from molecular and systems biology: it can help us predict gene functions, forecast pharmacological actions of small chemicals, prioritize genes for further studies, mine disease associations, detect drug toxicity and regress cancer patient survival data. Another important aspect of our research is the study of latent factor models. We aim to design latent models with factorized parameters that simultaneously tackle multiple types of data heterogeneity, where data diversity spans across heterogeneous input spaces, multiple types of features, and a variety of related prediction tasks. Our algorithms are capable of retaining the relational structure of a data system during model inference, which turns out to be vital for good performance of data fusion in certain applications. Our recent work included the study of network inference from many potentially nonidentical data distributions and its application to cancer genomic data. We also model the epistasis, an important concept from genetics, and propose algorithms to efficiently find the ordering of genes in cellular pathways. A central topic of our Thesis is also the analysis of large data compendia as predictions about certain phenomena, such as associations between diseases and involvement of genes in a certain phenotype, are only possible when dealing with lots of data. Among others, we analyze 30 heterogeneous data sets to assess drug toxicity and over 40 human gene association data collections, the largest number of data sets considered by a collective latent factor model up to date. We also make interesting observations about deciding which data should be considered for fusion and develop a generic approach that can estimate the sensitivities between different data sets.

Item Type:

Thesis (PhD thesis)

Keywords:

gene function prediction, gene prioritization, network inference, collective matrix factorization, matrix completion, factor models, data fusion, bioinformatics, statistical relational learning, machine learning

Number of Pages:

337

Language of Content:

English

Mentor / Comentors:

Name and Surname	ID	Function
prof. dr. Blaž Zupan	106	Mentor

Link to COBISS:

http://www.cobiss.si/scripts/cobiss?command=search&base=50070&select=(ID=)

Institution:

University of Ljubljana

Department:

Faculty of Computer and Information Science

Item ID:

3202

Date Deposited:

16 Oct 2015 14:51

Last Modified:

23 Oct 2015 00:08

URI:

http://eprints.fri.uni-lj.si/id/eprint/3202

Actions (login required)

View Item