Publication Details

This article was originally published as: Wheway, V., "Using Boosting to Simplify Classification Models", Proceedings of the 2001 IEEE International Conference on Data Mining, 26 November-2 December 2001, pp. 558-565. Copyright IEEE 2001.


Abstract

Ensemble classification techniques such as bagging, boosting and arcing algorithms have been shown to reduce classification error on unseen cases and appear immune to the problem of overfitting. Several explanations for the reduction in generalisation error have been presented, with recent authors defining and applying diagnostics such as "edge" and "margin". These measures provide insight into the behaviour of ensemble classifiers, but can they be exploited further? In this paper, a four-stage classification procedure is introduced, based on an extension of edge and margin analysis. This new procedure allows inverse sub-contexts and difficult border regions to be detected using properties of the edge distribution. It is widely known that ensemble classifiers 'balance' the margin as the number of iterations increases. By exploiting this balancing property and flagging observations whose edges (and margins) are not 'balanced', a data set can often be partitioned into sub-contexts, and the classification made more robust as confounding within the data set is removed. In the majority of cases, the sub-contexts detected are inverse to each other or, quite possibly, the smaller sub-context contains mis-labelled observations. Most classification techniques have not been adapted to detect contexts within a data set, and the generalisation error reported in studies to date, being based on the entire data set, can be improved by partitioning the data set in question. The aim of this study is to move towards interpretability, and it is shown that training on a subset of the original training data yields both simpler models and reduced generalisation error.
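The abstract does not spell out the four-stage procedure itself, but the core idea of flagging observations whose margins fail to 'balance' can be sketched. The following is a minimal illustration, not the paper's method: a tiny AdaBoost over decision stumps is fitted to toy data containing one deliberately flipped label, normalised margins are computed, and observations with negative margins are flagged as candidates for a separate sub-context. All names, the stump learner, the data, and the flagging threshold are assumptions for illustration only.

```python
import numpy as np

def train_stumps(X, y, n_rounds=20):
    """Minimal AdaBoost over decision stumps; labels y must be in {-1, +1}."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)           # observation weights
    stumps = []                       # (feature, threshold, polarity, alpha)
    for _ in range(n_rounds):
        best = None
        for j in range(d):
            for thr in np.unique(X[:, j]):
                for pol in (1, -1):
                    pred = pol * np.where(X[:, j] <= thr, 1, -1)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, thr, pol, pred)
        err, j, thr, pol, pred = best
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)   # vote weight for this stump
        w *= np.exp(-alpha * y * pred)          # upweight misclassified points
        w /= w.sum()
        stumps.append((j, thr, pol, alpha))
    return stumps

def margins(X, y, stumps):
    """Normalised margin per observation: weighted vote for the true label, in [-1, 1]."""
    total = sum(alpha for *_, alpha in stumps)
    score = np.zeros(len(X))
    for j, thr, pol, alpha in stumps:
        score += alpha * pol * np.where(X[:, j] <= thr, 1, -1)
    return y * score / total

# Toy data: one deliberately flipped label plays the role of a
# confounding observation whose margin fails to 'balance'.
X = np.arange(8, dtype=float).reshape(-1, 1)
y = np.where(np.arange(8) >= 4, 1, -1)
y[0] = 1                                  # mis-labelled observation
m = margins(X, y, train_stumps(X, y))
flagged = np.where(m < 0)[0]              # candidates for a separate sub-context
```

In this sketch, persistently negative margins mark observations the ensemble cannot 'balance'; under the paper's framing such points would be split into a smaller (possibly inverse or mis-labelled) sub-context before retraining on the remainder.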