ePrints.FRI - University of Ljubljana, Faculty of Computer and Information Science

Assessment of machine learning reliability methods for quantifying the applicability domain of QSAR regression models

Marko Toplak and Rok Mocnik and Matija Polajnar and Zoran Bosnic and Lars Carlsson and C Hasselgren and Janez Demsar and S Boyer and Blaz Zupan and Jonna Stalring (2014) Assessment of machine learning reliability methods for quantifying the applicability domain of QSAR regression models. J Chem Inf Model, 54 (2). pp. 431-441.


    Abstract

    The vastness of chemical space and the relatively small coverage by experimental data recording molecular properties require us to identify subspaces, or domains, for which we can confidently apply QSAR models. The prediction of QSAR models in these domains is reliable, and potential subsequent investigations of such compounds would find that the predictions closely match the experimental values. Standard approaches in QSAR assume that predictions are more reliable for compounds that are "similar" to those in subspaces with denser experimental data. Here, we report on a study of an alternative set of techniques recently proposed in the machine learning community. These methods quantify prediction confidence through estimation of the prediction error at the point of interest. Our study includes 20 public QSAR data sets with continuous response and assesses the quality of 10 reliability scoring methods by observing their correlation with prediction error. We show that these new alternative approaches can outperform standard reliability scores that rely only on similarity to compounds in the training set. The results also indicate that the quality of reliability scoring methods is sensitive to data set characteristics and to the regression method used in QSAR. We demonstrate that at the cost of increased computational complexity these dependencies can be leveraged by integration of scores from various reliability estimation approaches. The reliability estimation techniques described in this paper have been implemented in an open source add-on package (https://bitbucket.org/biolab/orange-reliability) to the Orange data mining suite.
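
The evaluation idea in the abstract, scoring each prediction for reliability and then checking how well the score tracks the actual prediction error, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper or from the orange-reliability package: the one-dimensional toy "descriptor", the quadratic response, the k-nearest-neighbour regressor, the use of mean neighbour distance as a similarity-style reliability score, and the choice of Pearson correlation (the abstract does not say which correlation measure was used).

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def knn_predict(train, x, k=3):
    """Return (prediction, reliability score) for a query point x.

    The score is the mean distance to the k nearest training points,
    a hypothetical stand-in for a similarity-based reliability score:
    larger values mean the query lies further from the training data.
    """
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    prediction = sum(y for _, y in nearest) / k
    mean_distance = sum(abs(px - x) for px, _ in nearest) / k
    return prediction, mean_distance


def evaluate(seed=0):
    """Correlate the reliability score with absolute prediction error."""
    rng = random.Random(seed)

    def response(x):
        # Hypothetical "true" property; not a real QSAR endpoint.
        return x * x

    # Training data covers only the interval [0, 3] of the toy descriptor.
    train = []
    for _ in range(200):
        x = rng.uniform(0.0, 3.0)
        train.append((x, response(x) + rng.gauss(0.0, 0.1)))

    # Test points span [0, 6], so about half fall outside the training domain.
    scores, errors = [], []
    for _ in range(200):
        x = rng.uniform(0.0, 6.0)
        prediction, score = knn_predict(train, x)
        scores.append(score)
        errors.append(abs(prediction - response(x)))

    return pearson(scores, errors)


if __name__ == "__main__":
    print("correlation between score and |error|:", round(evaluate(), 3))
```

In this toy setting a positive correlation indicates that the score is informative: predictions far from the training data receive a high score and also tend to have a large error. The error-estimation methods studied in the paper would take the place of the distance-based score in `knn_predict`.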

    Item Type: Article
    Keywords: reliability; machine learning; QSAR; regression
    Institution: University of Ljubljana
    Department: Faculty of Computer and Information Science
    Divisions: Faculty of Computer and Information Science > Bioinformatics Laboratory
    Item ID: 2420
    Date Deposited: 15 Mar 2014 15:13
    Last Modified: 25 Mar 2014 21:06
    URI: http://eprints.fri.uni-lj.si/id/eprint/2420
