Degree Name

Doctor of Philosophy


School of Electrical, Computer, and Telecommunications Engineering


This work addresses audio-visual video recognition using machine learning. A general audio-visual video recognition system first extracts auditory and visual feature descriptors, then represents the extracted bimodal features using feature encoding techniques, and finally performs recognition using a machine learning classifier. This work adopts a similar pipeline and contributes to its first two major components: visual feature extraction and global feature representation.
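The three-stage pipeline described above (extract, encode, classify) can be illustrated with a minimal sketch. The abstract does not say which encoding or classifier is used, so the choices here are assumptions made only for illustration: a hard-assignment bag-of-words histogram as the encoder, simple concatenation to fuse the two modalities, and a nearest-prototype rule as the classifier. The function names `encode_bow`, `fuse`, and `classify` are hypothetical.

```python
from math import dist


def encode_bow(descriptors, codebook):
    """Encode local descriptors as an L1-normalized bag-of-words histogram.

    Assumption: hard assignment of each descriptor to its nearest codeword.
    """
    hist = [0.0] * len(codebook)
    for d in descriptors:
        nearest = min(range(len(codebook)), key=lambda k: dist(d, codebook[k]))
        hist[nearest] += 1.0
    total = sum(hist)
    # An empty descriptor set yields an all-zero histogram.
    return [h / total for h in hist] if total else hist


def fuse(audio_hist, visual_hist):
    """Fuse the two modalities by concatenating their global representations."""
    return list(audio_hist) + list(visual_hist)


def classify(feature, class_prototypes):
    """Return the label whose prototype vector is closest to the feature.

    Assumption: a nearest-prototype rule stands in for the classifier.
    """
    return min(class_prototypes, key=lambda label: dist(feature, class_prototypes[label]))
```

In such a setup the codebooks would typically be learned separately for audio and visual descriptors, and the prototypes replaced by a trained classifier; both are outside the scope of this sketch.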

Visual feature extraction is a vital step in video recognition. In general, it starts by detecting spatio-temporal interest points, the locations in a video where features are most discriminative. Existing spatio-temporal interest point detectors suffer from two problems. First, the points they detect are either too sparse, which leads to loss of information, or too dense, which adds noise and complexity. Second, when the background is dynamic or the camera is moving, the detectors may extract irrelevant interest points that do not belong to an actual motion. To address these problems, a spatio-temporal interest point detector is designed to extract salient interest points within a region of interest where there is motion. In addition, video stabilization is integrated into the detector to handle camera motion and dynamic backgrounds.
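The idea of keeping only interest points that fall inside a moving region can be sketched as follows. This is not the detector proposed in this work, whose details are not given here. As a stand-in, it assumes plain frame differencing on grayscale frames (lists of pixel rows) to find motion, takes the bounding box of the moving pixels as the region of interest, and discards candidate points outside it. The function names and the `threshold` parameter are hypothetical, and video stabilization is not covered.

```python
def motion_mask(prev_frame, next_frame, threshold=20):
    """Mark pixels whose intensity changes by more than `threshold` between frames."""
    return [
        [abs(b - a) > threshold for a, b in zip(row_prev, row_next)]
        for row_prev, row_next in zip(prev_frame, next_frame)
    ]


def bounding_box(mask):
    """Return (x_min, y_min, x_max, y_max) of the moving pixels, or None if nothing moves."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for row in mask for x, moving in enumerate(row) if moving]
    if not rows:
        return None
    return (min(cols), min(rows), max(cols), max(rows))


def filter_points(points, box):
    """Keep only the candidate (x, y) interest points inside the region of interest."""
    if box is None:
        return []
    x0, y0, x1, y1 = box
    return [(x, y) for (x, y) in points if x0 <= x <= x1 and y0 <= y <= y1]
```

With a moving camera, frame differencing alone would mark almost every pixel as moving, which is why compensating for camera motion before this step matters.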