Doctor of Philosophy
School of Electrical, Computer and Telecommunications Engineering
Le, Giang Hoang Nguyen, Machine learning with informative samples for large and imbalanced datasets, Doctor of Philosophy thesis, School of Electrical, Computer and Telecommunications Engineering, University of Wollongong, 2011. http://ro.uow.edu.au/theses/3385
With the emergence of computing and rapid progress in information technologies, new challenges are posed for machine learning techniques that learn from observed data. Supervised learning architectures, such as neural networks and support vector machines, are not very efficient on large-scale problems because of their computational complexity and memory storage requirements. Furthermore, data with imbalanced class distributions have a significant impact on the performance of most standard learning algorithms. This dissertation addresses these challenges through the development of efficient learning algorithms for pattern classification.
In this thesis, we propose a new efficient machine learning approach that combines supervised and unsupervised learning to address the problems of learning from large-scale and imbalanced datasets. The proposed learning approach involves two major stages. In the first stage, a distributed clustering algorithm is used to cluster large datasets. The training samples from each class are clustered separately, and each cluster is represented by its centroid and a weight. In the second stage, the weighted cluster centroids are used as training samples in supervised learning. The novelty of the proposed learning approach is that it employs a reduced but informative set of training samples, in which the original training samples are replaced with weighted cluster centroids. A theoretical framework is also derived, which establishes the link between the proposed learning approach and the minimization of the expected risk functional.
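The two stages can be illustrated with a minimal sketch. It is not the thesis's algorithm: plain k-means stands in for the distributed clustering algorithm, the weight of a cluster is assumed to be the number of samples it contains, and a weighted logistic regression stands in for the neural network and support vector machine learners. All function names and the synthetic data are hypothetical.

```python
import math
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centroids):
    """Group points by their nearest centroid."""
    groups = [[] for _ in centroids]
    for p in points:
        j = min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
        groups[j].append(p)
    return groups

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means (a stand-in for the thesis's distributed clustering).
    Returns (centroid, cluster_size) pairs for the non-empty clusters."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        groups = assign(points, centroids)
        centroids = [[sum(col) / len(g) for col in zip(*g)] if g else c
                     for g, c in zip(groups, centroids)]
    groups = assign(points, centroids)
    return [(c, len(g)) for c, g in zip(centroids, groups) if g]

def reduce_by_class(X, y, k):
    """Stage 1: cluster each class separately and keep
    (centroid, label, weight), with weight assumed to be the cluster size."""
    reduced = []
    for label in sorted(set(y)):
        pts = [x for x, t in zip(X, y) if t == label]
        for centroid, size in kmeans(pts, min(k, len(pts))):
            reduced.append((centroid, label, size))
    return reduced

def train_weighted_logistic(samples, lr=0.1, epochs=500):
    """Stage 2: binary logistic regression (labels 0/1) in which each
    centroid's loss term is scaled by its weight."""
    dim = len(samples[0][0])
    w, b = [0.0] * dim, 0.0
    total = sum(n for _, _, n in samples)
    for _ in range(epochs):
        gw, gb = [0.0] * dim, 0.0
        for x, t, n in samples:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            err = n * (p - t) / total  # weighted gradient contribution
            gw = [g + err * xi for g, xi in zip(gw, x)]
            gb += err
        w = [wi - lr * g for wi, g in zip(w, gw)]
        b -= lr * gb
    return w, b

def predict(model, x):
    w, b = model
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Synthetic, imbalanced two-class data (200 vs. 40 samples).
rng = random.Random(1)
X = ([[rng.gauss(0, 0.3), rng.gauss(0, 0.3)] for _ in range(200)]
     + [[rng.gauss(4, 0.3), rng.gauss(4, 0.3)] for _ in range(40)])
y = [0] * 200 + [1] * 40

reduced = reduce_by_class(X, y, k=3)
model = train_weighted_logistic(reduced)
print(len(reduced), "weighted centroids stand in for", len(X), "samples")
print(predict(model, [0.0, 0.0]), predict(model, [4.0, 4.0]))
```

Because every cluster contributes to the loss in proportion to its weight, the supervised stage here trains on at most six samples instead of 240 while still reflecting how the original samples are distributed across clusters.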
Several training algorithms based on the proposed approach are developed for different learning architectures: multilayer perceptron neural networks, convolutional neural networks, and support vector machines. The new algorithms are applied to several benchmark datasets, and their performance is analyzed and compared with that of standard learning algorithms. Experimental results show that the developed learning algorithms not only learn from large datasets more efficiently but also support optimal decision making when dealing with imbalanced datasets.