Branko Kavšek (2004) Using rule learning for subgroup discovery. PhD thesis.
This dissertation investigates how to adapt standard classification rule learning approaches to subgroup discovery. The goal of subgroup discovery is to find rules describing subsets of a selected population that are sufficiently large and statistically unusual in terms of class distribution. The dissertation presents a subgroup discovery algorithm, CN2-SD, developed by modifying parts of the CN2 classification rule learner: its covering algorithm, search heuristic, probabilistic classification of instances, and evaluation measures. Experimental evaluation of CN2-SD on selected data sets shows a substantial reduction in the number of induced rules, increased rule coverage, rule significance and overall coverage of the target concept, as well as slight improvements in the area under the ROC curve, when compared with the rule learning algorithms CN2 and RIPPER. An application of CN2-SD to a large traffic accident data set confirms these findings. The dissertation also presents the subgroup discovery algorithm APRIORI-SD, developed by adapting association rule learning to subgroup discovery. This was achieved by building a classification rule learner, APRIORI-C, enhanced with a novel post-processing mechanism, a new quality measure for induced rules (weighted relative accuracy) and probabilistic classification of instances. Experimental results show that APRIORI-SD behaves similarly to the subgroup discovery algorithm CN2-SD, i.e. a substantial reduction in the number of induced rules, increased rule coverage, rule significance and overall coverage of the target concept, as well as slight improvements in the area under the ROC curve, when compared with the rule learning algorithms CN2, RIPPER and APRIORI-C. A new optimization approach to subgroup discovery based on ROC analysis is also presented and implemented as an adaptation of the APRIORI-SD algorithm.
The implications of the “number-of-rules–unusualness–coverage” trade-off for subgroup discovery are investigated through an experimental evaluation of the adapted APRIORI-SD algorithm on selected data sets. The results are presented as 2D graphs depicting the dependencies between the number of induced rules, unusualness, accuracy and overall coverage of the target concept. The original APRIORI-SD subgroup discovery algorithm is then discussed within this new optimization framework. Finally, the dissertation compares the new algorithms with existing state-of-the-art subgroup discovery algorithms and applies CN2-SD and APRIORI-SD to a real-life problem: a database describing traffic accidents in Great Britain.
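The weighted relative accuracy measure named in the abstract is commonly defined in the subgroup discovery literature as WRAcc(Cond → Class) = p(Cond) · (p(Class | Cond) − p(Class)), which trades rule coverage against unusualness of the class distribution. The following is a minimal sketch under that assumed definition. The function name `wracc`, its signature, and the toy data are hypothetical illustrations, not code or data from the dissertation.

```python
from typing import Callable, Hashable, Sequence, TypeVar

Example = TypeVar("Example")


def wracc(
    examples: Sequence[Example],
    labels: Sequence[Hashable],
    condition: Callable[[Example], bool],
    target: Hashable,
) -> float:
    """Weighted relative accuracy of the rule `condition -> target`.

    Assumed definition: p(Cond) * (p(Class | Cond) - p(Class)).
    Returns 0.0 when the rule covers no examples.
    """
    if len(examples) != len(labels):
        raise ValueError("examples and labels must have the same length")
    n = len(examples)
    if n == 0:
        raise ValueError("at least one example is required")

    # Labels of the examples covered by the rule condition.
    covered = [lab for ex, lab in zip(examples, labels) if condition(ex)]
    if not covered:
        return 0.0

    p_cond = len(covered) / n                      # coverage of the rule
    p_class = sum(lab == target for lab in labels) / n  # class prior
    p_class_given_cond = sum(lab == target for lab in covered) / len(covered)
    return p_cond * (p_class_given_cond - p_class)


# Hypothetical toy data: eight people described by age, labelled yes/no.
AGES = [25, 30, 45, 50, 55, 35, 60, 20]
LABELS = ["no", "no", "yes", "yes", "no", "no", "yes", "yes"]

# Rule "age > 40 -> yes": covers 4 of 8 examples, 3 of them positive.
score = wracc(AGES, LABELS, lambda age: age > 40, "yes")
print(score)  # 0.5 * (0.75 - 0.5) = 0.125
```

A positive score means the target class is more frequent inside the subgroup than in the whole population. A rule that covers nothing, or whose class distribution matches the prior, scores zero.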