Degree Name

Doctor of Philosophy


School of Electrical, Computer and Telecommunications Engineering


Video sequence classification is the task of assigning semantic labels to video sequences. It has a wide range of applications, including human action recognition, video genre classification, and abnormal behavior detection. The essential cues for video classification are the spatial structures in each frame and their changes along the time axis.

This thesis presents two methods for pre-processing the input video sequence. First, an efficient optical flow estimation method is proposed to estimate pixel-level velocity from two adjacent frames. This method applies max pooling and min pooling to construct a hierarchical feature structure for coarse-to-fine patch matching. Optical flow is then estimated from the matching set via an interpolation algorithm. Second, a saliency map estimation method is proposed to detect salient spatio-temporal regions. Sampling local features from the salient regions reduces noise and redundancy in the feature set.
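The hierarchical pooling structure described above can be sketched as follows. This is a minimal illustration of the idea only, not the thesis's actual implementation: the function names, the 2x2 pooling factor, and the number of pyramid levels are all assumptions made for the example.

```python
import numpy as np

def pool2x2(img, mode="max"):
    """Downsample a 2-D array by taking the max (or min) of each 2x2 block."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    blocks = img[:h, :w].reshape(h // 2, 2, w // 2, 2)
    op = np.max if mode == "max" else np.min
    return op(blocks, axis=(1, 3))

def build_pyramid(img, levels=3):
    """Hierarchical feature structure: each level holds a max-pooled
    and a min-pooled map of the level below it."""
    pyramid = [(img, img)]
    for _ in range(levels - 1):
        mx, mn = pyramid[-1]
        pyramid.append((pool2x2(mx, "max"), pool2x2(mn, "min")))
    return pyramid
```

Coarse-to-fine patch matching would start at the coarsest level (`pyramid[-1]`), where candidate matches are cheap to find, and propagate them down to finer levels for refinement.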

This thesis then focuses on spatio-temporal feature extraction in two settings: a trajectory-based method and a ConvNet-based method. In the trajectory-based method, we extend the HOG and SIFT descriptors to create two new spatio-temporal descriptors. We then propose a novel feature encoding method to generate video representations. This encoding method uses the co-occurrence matrix of local features to compute the conditional probability distribution of feature-word pairs. A tensor decomposition algorithm is then applied to the conditional probability distribution matrix to produce a more compact and discriminative video representation. Experiments are conducted on three datasets: KTH, UCF11, and Hollywood2. The results indicate that this encoding method significantly improves the classification rate compared to the bag-of-words (BoW) model.
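The encoding step above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the thesis's algorithm: the co-occurrence counting is reduced to explicit word-index pairs, and a truncated SVD stands in for the tensor decomposition that the thesis applies.

```python
import numpy as np

def cooccurrence_conditional(pairs, vocab_size):
    """Build the co-occurrence matrix of feature-word pairs and normalize
    each row into a conditional distribution P(w_j | w_i)."""
    C = np.zeros((vocab_size, vocab_size))
    for i, j in pairs:
        C[i, j] += 1
    row_sums = C.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0  # leave empty rows as all-zero distributions
    return C / row_sums

def compact_representation(P, rank=2):
    """Stand-in for the thesis's tensor decomposition: a truncated SVD of
    the conditional-probability matrix gives a compact video descriptor."""
    U, s, Vt = np.linalg.svd(P, full_matrices=False)
    return (U[:, :rank] * s[:rank]).ravel()  # low-rank factor as a vector
```

Compared to a plain BoW histogram, this representation retains pairwise statistics between visual words, which is what allows it to be more discriminative.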

In the ConvNet-based method, a two-stream cross-ResNet is proposed. This network consists of two ResNets, each accepting one stream. The two streams exchange information through two extra residual connections at each convolution stage. The motivation for this ConvNet model is to learn local spatio-temporal features in the convolution layers. In our approach, an LSTM follows the cross-ResNet to extract global spatio-temporal features for video classification. Experiments are conducted on two datasets: UCF101 and HMDB51. The results indicate that the proposed two-stream cross-ResNet achieves classification accuracy comparable to other two-stream ConvNet models, while requiring only appearance sequences as input, thereby yielding an end-to-end network.
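The cross-residual information exchange can be written schematically as below. This sketch only shows the connectivity pattern; the actual placement, weighting, and dimensionality of the cross connections in the thesis's cross-ResNet may differ, and `f1`/`f2` stand in for each stream's convolutional residual blocks.

```python
import numpy as np

def cross_residual_stage(x1, x2, f1, f2):
    """One convolution stage of a two-stream cross-ResNet (schematic):
    each stream keeps its own residual path (f(x) + x) and additionally
    receives the other stream's activation through a cross connection."""
    y1 = f1(x1) + x1 + x2  # stream 1: residual block + identity + cross
    y2 = f2(x2) + x2 + x1  # stream 2: residual block + identity + cross
    return y1, y2
```

With this wiring, features learned in one stream are available to the other at every stage, which is how the two streams can jointly capture local spatio-temporal structure.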



Unless otherwise indicated, the views expressed in this thesis are those of the author and do not necessarily represent the views of the University of Wollongong.