Degree Name

Doctor of Philosophy (Integrated)


School of Computing and Information Technology


Human action recognition is one of the most active research topics in computer vision and machine learning, and many methods have been developed over the past decades. However, most of them do not take the semantics of actions into consideration, which results in the so-called semantic gap.

This thesis aims to bridge the semantic gap and to develop algorithms for semantic action recognition. To this end, a semantic representation of actions, referred to as a pose lexicon, is proposed. A pose lexicon is composed of a set of semantic poses that are defined in, and can be extracted from, the textual instructions or descriptions of actions; a set of visual poses defined by visual features; and a probabilistic mapping between the visual and semantic poses. Given a pose lexicon, semantic action recognition is formulated as the problem of finding the most likely sequence of semantic poses given a sequence of visual frames. This formulation offers several advantages over existing methods: it is capable of zero-shot action recognition, it is insensitive to intra-class variation and to distribution variation among different datasets, and it bridges semantic description and visual observation.
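The recognition problem stated above can be written compactly. The notation here is assumed for illustration and is not taken from the thesis: $X = (x_1, \dots, x_T)$ denotes a sequence of visual frames, $V$ a sequence of visual poses, and $S$ a sequence of semantic poses. The factorization through $V$ is likewise only one plausible reading of "a probabilistic mapping between the visual and semantic poses".

```latex
% Assumed notation: X = visual frames, V = visual poses, S = semantic poses.
\begin{align}
  \hat{S} &= \arg\max_{S} \; P(S \mid X)
           = \arg\max_{S} \; P(X \mid S)\, P(S), \\
  P(X \mid S) &= \sum_{V} P(X \mid V)\, P(V \mid S).
\end{align}
```

Under this reading, $P(V \mid S)$ plays the role of the pose lexicon, and $P(X \mid V)$ describes how visual poses generate the observed frames.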

Specifically, the thesis presents a systematic study of how to learn a pose lexicon from training samples and develops four learning methods under different assumptions. The first method is a visual pose to semantic pose translation model. In this method, visual poses are constructed through clustering, and sequences of visual frames are first converted into sequences of visual poses. The probabilistic mapping between the visual poses and the semantic poses extracted from textual instructions is then learned with a translation model. The second method models the visual poses with a Gaussian mixture model, which is learned from training samples to characterize the likelihood of an observed visual frame being generated by a visual pose. A pose lexicon model is subsequently learned through an extended hidden Markov alignment model to encode the correspondence between hidden visual pose sequences and semantic pose sequences. Unlike the first method, this method does not generate visual pose sequences before learning a pose lexicon, which makes the learning robust against noisy visual features. The third method treats both the visual poses and the correspondence as hidden and uses a two-level hidden Markov model to learn the visual poses and the probabilistic mapping between the visual and semantic poses with an Expectation-Maximization strategy. The fourth method extends the concept of a pose lexicon from the whole body to body parts for body-part-based semantic action recognition. This method explicitly models the synchronization among body parts and simultaneously learns a set of visual poses and a pose lexicon for each body part. The body-part-based lexicons provide a detailed semantic description of actions in both the temporal and spatial domains.
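To make the idea behind the first method more concrete, the sketch below learns a mapping from semantic poses to visual poses with a simple word-translation-style Expectation-Maximization loop (in the spirit of IBM Model 1). This is an illustrative assumption, not the thesis's actual translation model: the function name `learn_lexicon`, the toy pose labels, and the use of integer cluster IDs as visual poses are all hypothetical, and the clustering step that would produce the visual pose sequences is taken as already done.

```python
from collections import defaultdict


def learn_lexicon(pairs, iterations=20):
    """Estimate t[s][v] ~ P(visual pose v | semantic pose s) by EM.

    pairs: list of (visual_pose_sequence, semantic_pose_sequence) tuples,
           one per training sample. Hypothetical input format.
    """
    pairs = [(vs, ss) for vs, ss in pairs if vs and ss]  # skip empty samples
    visual_vocab = {v for vs, _ in pairs for v in vs}
    semantic_vocab = {s for _, ss in pairs for s in ss}

    # Start from a uniform mapping.
    uniform = 1.0 / len(visual_vocab)
    t = {s: {v: uniform for v in visual_vocab} for s in semantic_vocab}

    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: softly assign each visual pose to the semantic poses
        # that appear in the same training sample.
        for vs, ss in pairs:
            for v in vs:
                norm = sum(t[s][v] for s in ss)
                if norm == 0.0:
                    continue
                for s in ss:
                    frac = t[s][v] / norm
                    count[(s, v)] += frac
                    total[s] += frac
        # M-step: renormalize the expected counts per semantic pose.
        for s in semantic_vocab:
            if total[s] > 0.0:
                for v in visual_vocab:
                    t[s][v] = count[(s, v)] / total[s]
    return t


# Toy data: visual poses are cluster IDs, semantic poses are words.
toy_pairs = [
    ([0, 0, 1, 1], ["raise", "lower"]),
    ([0, 0], ["raise"]),
    ([1, 1, 2, 2], ["lower", "wave"]),
    ([2, 2], ["wave"]),
]
lexicon = learn_lexicon(toy_pairs)
```

On the toy data, samples that contain a single semantic pose disambiguate the mixed samples, so each semantic pose ends up concentrated on one visual pose. The second and third methods described above differ in that they keep the visual poses hidden rather than fixing them by clustering beforehand.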

All of the proposed methods were evaluated on five skeleton-based action datasets using cross-subject, cross-dataset, and zero-shot protocols, and the results demonstrate the efficacy of the methods.

This thesis is unavailable until Friday, July 12, 2019