Year

2001

Degree Name

Doctor of Philosophy

Department

School of Mathematics and Applied Statistics; School of Information Technology and Computer Science

Abstract

Noisy data is inherent in many real-life and industrial modelling situations. If prior knowledge of such data were available, it would be a simple process to remove or account for noise and improve model robustness. Unfortunately, in the majority of learning situations, the presence of underlying noise is suspected but difficult to detect.

Ensemble classification techniques such as bagging [4], boosting [20] and arcing algorithms [5] have received much attention in the recent literature. Such techniques have been shown to lead to reduced classification error on unseen cases, and this thesis demonstrates that they may also be employed as noise detectors. Recently defined diagnostics such as edge and margin [6, 20, 41] have been used to explain the improvements made in generalisation error when ensemble classifiers are built. The distributions of these measures are key in the noise detection process introduced in this research.

This thesis presents empirical and theoretical results on edge distributions that confirm existing theories on boosting's tendency to 'balance' error rates. The results are then extended to introduce a methodology whereby boosting may be used to identify noise in training data by examining the changes in edge and margin distributions as boosting proceeds.
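As a rough illustration of the idea, the sketch below tracks per-observation edges over a few boosting rounds. It is not the procedure developed in the thesis. It assumes the edge of an observation is the vote-weighted fraction of ensemble members that misclassify it, as in the arcing literature, so that in the two-class case margin = 1 - 2 * edge. The AdaBoost variant, the one-dimensional threshold stumps, the toy data and the names `stump_predict`, `train_stump` and `boosting_edges` are all assumptions made for the example.

```python
import math

def stump_predict(x, threshold, polarity):
    # A 1-D decision stump: predicts `polarity` above the threshold, -polarity otherwise.
    return polarity if x > threshold else -polarity

def train_stump(xs, ys, weights):
    # Exhaustively pick the stump with the lowest weighted training error.
    points = sorted(set(xs))
    thresholds = [points[0] - 1.0] + [(a + b) / 2 for a, b in zip(points, points[1:])]
    best = None
    for t in thresholds:
        for pol in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, t, pol) != y)
            if best is None or err < best[0]:
                best = (err, t, pol)
    return best

def boosting_edges(xs, ys, rounds):
    # Run AdaBoost for `rounds` rounds and return the edge of each training case:
    # the share of total vote weight held by stumps that misclassify that case.
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []  # list of (alpha, threshold, polarity)
    for _ in range(rounds):
        err, t, pol = train_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, t, pol))
        # Up-weight misclassified cases, down-weight correct ones, then renormalise.
        weights = [w * math.exp(-alpha * y * stump_predict(x, t, pol))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    total_alpha = sum(a for a, _, _ in ensemble)
    return [sum(a for a, t, pol in ensemble if stump_predict(x, t, pol) != y) / total_alpha
            for x, y in zip(xs, ys)]

# Toy data (hypothetical): class -1 below 5, class +1 from 5 upwards,
# with the label at x = 2 deliberately flipped to act as noise.
xs = [float(i) for i in range(10)]
ys = [-1 if x < 5 else 1 for x in xs]
ys[2] = 1

for r in (1, 2, 3):
    edges = boosting_edges(xs, ys, rounds=r)
    print(r, [round(e, 3) for e in edges])
```

In this toy run the flipped observation keeps the largest edge over the first few rounds. Because boosting tends to balance edges among hard cases as rounds accumulate, what matters for noise detection is how the whole distribution changes over rounds rather than a single snapshot.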

Further enrichment can be made by detecting clusters of observations behaving differently to the main 'core' of data, as opposed to detecting single, unique noisy observations. These clusters may form boundaries, and in extreme cases form inverse models, which, if undetected, lead to overestimated generalisation errors. Using edge distributions to perform this task is trivial in comparison to the significant effort required to do so in large multi-dimensional datasets using visual methods or sorting algorithms. Partitioning datasets according to clusters leads to significant improvements in generalisation error for each partition, along with the benefit of simpler classifiers on each partition.

However, this process requires a new technique for estimating generalisation error, because classifiers are trained on subsets of the original dataset and the generalisation error estimated on a subset is not representative of the whole. A classifier trained and tested using a biased subset of the data will give an overly optimistic estimate and no indication of the proportion of data to which this error estimate applies.

This notion leads to the key result of an analyst being able to partition generalisation error into components pertaining to incoming data noise and model noise. Depending on the magnitude of each component, analysts can directly concentrate on the process step having the majority contribution to the generalisation error.

Unless otherwise indicated, the views expressed in this thesis are those of the author and do not necessarily represent the views of the University of Wollongong.