Miha Štajdohar (2012) Visualization and analysis of the space of prediction models. PhD thesis.
Abstract
Data mining, a search for interesting patterns in the data, typically creates a number of different models. These models need to be simple enough for a human expert to grasp. A tree consisting of hundreds of nodes, or a scatter plot projecting dozens of variables onto a two-dimensional plane, may offer great classification accuracy or class separation, yet be impossible to interpret. Simple models, however, cannot describe complex data, which may contain many interesting relations. The solution is to create a large number of simple models, each of which offers insight into a small part of the problem domain, while together they present a complete picture. The problem, however, is the presentation of the big picture. While a computer can infer thousands of models, the human expert is incapable of reviewing them without assistance. We argue that a set of partial views can represent complex data. Views may include linear projections, each involving at most a couple of variables and showing a single, particular, and simplified perspective or relation. Similarly, good predictive models can be built using ensembles of simple models, such as random forests of small trees, each covering a part of the problem domain. These approaches lack techniques for manual exploration. Is it possible to select a limited number of visualizations which will provide a complete picture? Can we navigate through the random forest and observe the common properties of models in each region? We propose a method for creating maps of classification models, which are presented to the user for interactive exploration of the model space. Besides organizing predictive models, the proposed technique can rank some types of visualizations. We extend existing methods that rank projections by quality so that they also consider projection diversity, and show that our method yields more information when viewing the same number of projections.
We describe a model-map-based technique for the visualization and exploration of a random forest, a prediction model assembled from random prediction trees. The trees are normally small (up to 10 vertices), so each covers only a small part of the space of the decision problem. The random forest predicts remarkably well; however, compared to a single prediction tree, it is prohibitively difficult to explain or visualize. Our method assists an expert with random forest analysis, for example by clustering similar trees together to emphasize the diverse ones. In addition to supporting interactive exploration of a random forest, the model map can explain predictions of different classes.
The final ingredient of the technique is a versatile interactive tool for the exploration of maps, together with a new network layout technique that is particularly suitable for visualizing model map networks. Most existing tools for network analysis are limited to drawing networks and computing their basic general characteristics. With these tools, it is impossible for the user to interactively and graphically manipulate the networks, select and explore sub-graphs using other statistical and data mining techniques, add and plot various other data within the graph, and so on. We developed tools that address these challenges: widgets and modules for the exploration of networks and model maps within the general component-based environment Orange.
We propose FragViz, a network layout optimization algorithm designed to visualize fragmented networks. Networks of prediction models are usually fragmented. They consist of unconnected components which popular network alignment algorithms place arbitrarily with respect to the rest of the network. This can lead to misinterpretations due to the proximity of otherwise unrelated elements. FragViz incorporates additional information on relations between unconnected network components. It uses a two-step approach, first arranging the nodes within each of the components and then placing the components so that their proximity in the network corresponds to their relatedness. In the experimental study we demonstrate that FragViz can obtain network layouts which are more interpretable and hold additional information that could not be exposed using classical network layout optimization algorithms.
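The two-step idea described in the abstract can be illustrated with a minimal sketch. This is not the FragViz implementation. It assumes a plain circle layout as a stand-in for the within-component arrangement, a simple gradient descent on stress as a stand-in for the component placement, and a hypothetical `component_distance` callable that supplies the "additional information" on how related two components are. All function names are illustrative.

```python
import math
import random
from collections import deque


def connected_components(nodes, edges):
    """Return the connected components, each as a sorted list of nodes."""
    adjacency = {n: set() for n in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue, members = deque([start]), []
        while queue:
            node = queue.popleft()
            members.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(sorted(members))
    return components


def local_layout(members, radius=1.0):
    """Step 1 (stand-in): place a component's nodes on a circle around the origin."""
    if len(members) == 1:
        return {members[0]: (0.0, 0.0)}
    step = 2 * math.pi / len(members)
    return {n: (radius * math.cos(i * step), radius * math.sin(i * step))
            for i, n in enumerate(members)}


def place_centres(target, iterations=2000, rate=0.05, seed=0):
    """Step 2 (stand-in): move component centres so that their pairwise
    distances approach target[i][j], by gradient descent on squared error."""
    rng = random.Random(seed)
    k = len(target)
    pos = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(k)]
    for _ in range(iterations):
        for i in range(k):
            gx = gy = 0.0
            for j in range(k):
                if i == j:
                    continue
                dx, dy = pos[i][0] - pos[j][0], pos[i][1] - pos[j][1]
                dist = math.hypot(dx, dy) or 1e-9
                error = dist - target[i][j]
                gx += error * dx / dist
                gy += error * dy / dist
            pos[i][0] -= rate * gx
            pos[i][1] -= rate * gy
    return pos


def fragmented_layout(nodes, edges, component_distance):
    """Combine both steps: local layouts translated to the placed centres."""
    components = connected_components(nodes, edges)
    target = [[component_distance(a, b) for b in components] for a in components]
    centres = place_centres(target)
    layout = {}
    for (cx, cy), members in zip(centres, components):
        for node, (x, y) in local_layout(members).items():
            layout[node] = (x + cx, y + cy)
    return components, centres, layout


# Toy example: three components; the first two are assumed to be closely related.
nodes = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "b"), ("b", "c"), ("d", "e")]
relatedness = {("a", "d"): 3.0, ("a", "f"): 8.0, ("d", "f"): 8.0}  # made-up distances


def component_distance(first, second):
    # Hypothetical lookup keyed by the smallest node name of each component.
    return relatedness.get(tuple(sorted((first[0], second[0]))), 0.0)


components, centres, layout = fragmented_layout(nodes, edges, component_distance)
```

In this toy example the components `{a, b, c}` and `{d, e}` are given a small target distance and should therefore end up closer to each other than either is to the isolated node `f`. A layout that ignored the relatedness information would have no reason to place them this way.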
Item Type: Thesis (PhD thesis)
Keywords: information visualization, exploratory data analysis, visual data mining, machine learning, data mining, meta learning, ensemble learning, network layout optimization, network analysis, community detection in graphs
Number of Pages: 160
Language of Content: English
Mentor / Co-mentors: Assoc. Prof. Dr. Janez Demšar (ID: 257, Function: Mentor)
Link to COBISS: http://www.cobiss.si/scripts/cobiss?command=search&base=50070&select=(ID=00009536852)
Institution: University of Ljubljana
Department: Faculty of Computer and Information Science
Item ID: 1923
Date Deposited: 08 Nov 2012 12:38
Last Modified: 28 Nov 2012 10:05
URI: http://eprints.fri.uni-lj.si/id/eprint/1923