# Machine Learning Based on Attribute Interactions

Aleks Jakulin (2005) Machine Learning Based on Attribute Interactions. PhD thesis.

Two attributes $A$ and $B$ are said to interact when it helps to observe the attribute values of both attributes together. This is an example of a $2$-way interaction. In general, a group of attributes ${\cal X}$ is involved in a $k$-way interaction when we cannot reconstruct their relationship merely with $\ell$-way interactions, $\ell < k$. These two definitions formalize the notion of an interaction in a nutshell. An additional notion is that of context. We interpret context as just another attribute. There are two ways in which we can consider context. Context can be something that specifies our focus: we may examine interactions only in a given context, only for the instances that are in the context. Alternatively, context can be something that we are interested in: if we seek to predict weather, only the interactions involving the weather will be interesting to us. This is especially relevant for classification: we only want to examine the interactions involving the labelled class attribute and other attributes (unless there are missing or uncertain attribute values). But the definitions are not complete. We need to specify the model that assumes the interaction: how do we represent the pattern of co-appearance of several attributes? We also need to specify a model that does not assume the interaction: how do we reconstruct the pattern of co-appearance of several attributes without actually observing them all simultaneously? We need to specify a loss function that measures how good a particular model is, with respect to another model or with respect to the data. We need an algorithm that builds both models from the data. Finally, we need the data in order to assess whether it supports the hypothesis of interaction. The present work shows that mutual information, information gain, correlation, attribute importance, association and many other concepts are all merely special cases of the above principle.
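As an illustration of this principle for a $2$-way interaction, the following minimal Python sketch treats mutual information as the Kullback-Leibler divergence between a model that assumes the interaction (the empirical joint distribution of $A$ and $B$) and one that does not (the product of their marginals). The function name `mutual_information` and the toy data are assumptions made for this example and are not taken from the thesis; the estimate uses plain relative frequencies.

```python
from collections import Counter
from math import log2


def mutual_information(pairs):
    """Estimate I(A;B) in bits from a list of (a, b) observations.

    Hypothetical helper for illustration: it computes the KL divergence
    D(P(A,B) || P(A)P(B)) using relative frequencies.
    """
    n = len(pairs)
    joint = Counter(pairs)              # model that assumes the interaction
    pa = Counter(a for a, _ in pairs)   # marginal of A
    pb = Counter(b for _, b in pairs)   # marginal of B
    total = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        p_indep = (pa[a] / n) * (pb[b] / n)  # model without the interaction
        total += p_ab * log2(p_ab / p_indep)
    return total


# Toy data: B copies A in the first sample, B is unrelated to A in the second.
dependent = [(0, 0), (1, 1)] * 50
independent = [(0, 0), (0, 1), (1, 0), (1, 1)] * 25

print(mutual_information(dependent))
print(mutual_information(independent))
```

In the first sample the loss of ignoring the interaction is one bit; in the second the two models coincide and the loss is zero.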
Furthermore, the analysis of interactions generalizes the notions of analysis of variance, variable clustering, structure learning of Bayesian networks, and several other problems. There is an intriguing history of reinvention in the area of information theory on the topic of interactions. In our work, we focus on models founded on probability theory, and employ entropy and Kullback-Leibler divergence as our loss functions. Generally, whether an interaction exists or not, and to what extent, depends on what kind of models we are working with. The concept of McGill's interaction information in information theory, for example, is based upon Kullback-Leibler divergence as the loss function, and non-normalized Kirkwood superposition approximation models. Pearson's correlation coefficient is based on the proportion of explained standard deviation as the loss function, and on the multivariate Gaussian model. Most applications of mutual information are based on Kullback-Leibler divergence and the multinomial model. When there is a limited amount of data, it becomes unclear what model can be used to interpret it. Even if we fix the family of models, we remain uncertain about what would be the best choice of a model in the family. In all, uncertainty pervades the choice of the model. The underlying idea of Bayesian statistics is that the uncertainty about the model is to be handled in the same way as the uncertainty about the correct prediction in nondeterministic domains. The uncertainty, however, implies that we know neither whether there is an interaction with complete certainty, nor how important the interaction is. We propose a Bayesian approach to performing significance tests: an interaction is significant if it is very unlikely that a model assuming the interaction would suffer a greater loss than a model not assuming it, even if the interaction truly exists, among all the foreseeable posterior models.
We also propose Bayesian confidence intervals to assess the probability distribution of the expected loss of assuming that an interaction does not exist. We compare significance tests based on permutations, bootstrapping, cross-validation, Bayesian statistics and asymptotic theory, and find that they often disagree. It is important, therefore, to understand the assumptions that underlie the tests. Interactions are a natural way of understanding the regularities in the data. We propose interaction analysis, a methodology for analyzing the data. It has a long history, but our novel contribution is a series of diagrams that illustrate the discovered interactions in data. The diagrams include information graphs, interaction graphs and dendrograms. We use interactions to identify concept drift and ignorability of missing data. We use interactions to cluster attribute values and build taxonomies automatically. When we say that there is an interaction, we still need to explain what it looks like. Generally, the interaction can be explained by inferring a higher-order construction. For that purpose, we provide visualizations for several models that allow for interactions. We also provide a probabilistic account of rule inference: a rule can be interpreted as a constructed attribute. We also describe interactions involving individual attribute values with other attributes: this can help us break complex attributes down into simpler components. We also provide an approach to handling the curse of dimensionality: we dynamically maintain a structure of attributes as individual attributes enter our model one by one. We conclude this work by presenting two practical algorithms: an efficient heuristic for selecting attributes within the naive Bayesian classifier, and a complete approach to prediction with interaction models, the Kikuchi-Bayes model. Kikuchi-Bayes combines Bayesian model averaging, a parsimonious prior, and a search for interactions that determine the model.
Kikuchi-Bayes outperforms most popular machine learning methods, such as classification trees, logistic regression, the naive Bayesian classifier, and sometimes even support vector machines. At the same time, Kikuchi-Bayes models are highly interpretable and can be easily visualized as interaction graphs.
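
To make the notion of a $3$-way interaction mentioned above more concrete, here is a minimal Python sketch of interaction information computed from empirical entropies, $I(A;B;C) = H(AB) + H(AC) + H(BC) - H(A) - H(B) - H(C) - H(ABC)$. The sign convention (positive values indicating that the three attributes are informative only jointly) is an assumption of this example, as are the helper names `entropy` and `interaction_information` and the toy data; none of them is taken from the thesis.

```python
from collections import Counter
from math import log2


def entropy(rows, idx):
    """Empirical entropy (bits) of the attributes at positions idx."""
    n = len(rows)
    counts = Counter(tuple(row[i] for i in idx) for row in rows)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def interaction_information(rows):
    """3-way interaction information of rows of the form (a, b, c).

    Hypothetical helper for illustration; it assumes the convention
    I(A;B;C) = I(A;B|C) - I(A;B), expanded into entropies.
    """
    def h(*idx):
        return entropy(rows, idx)

    return (h(0, 1) + h(0, 2) + h(1, 2)
            - h(0) - h(1) - h(2)
            - h(0, 1, 2))


# XOR: no pair of attributes is informative, but all three together are.
xor_rows = [(a, b, a ^ b) for a in (0, 1) for b in (0, 1)] * 25

# Copies: every attribute repeats the same information.
copy_rows = [(v, v, v) for v in (0, 1)] * 50

print(interaction_information(xor_rows))
print(interaction_information(copy_rows))
```

Under this convention the XOR data yield a positive value, because the relationship cannot be reconstructed from the $2$-way marginals, while the copied attributes yield a negative value, because the pairwise relationships overlap.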