University of Wollongong

Improving Skeleton-based Action Recognition

thesis
posted on 2025-10-31, 03:48 authored by Shanaka Gunasekara
Skeleton-based human action recognition is an active research area in computer vision; however, existing approaches often struggle to capture the complex spatiotemporal dynamics and discriminative features needed for robust recognition. This thesis introduces four novel methods that improve action recognition performance in both supervised and self-supervised paradigms.

First, a temporal pooling algorithm, termed Asynchronous Joint-Based Temporal Pooling (AJTP), is proposed, which selectively aggregates informative cross-joint and cross-time dynamics. Conventional pooling methods treat all joints and frames uniformly, which often leads to the aggregation of less discriminative features. AJTP, by contrast, is a learnable pooling algorithm that adaptively focuses on the most discriminative features for action recognition. It integrates with both Graph Convolutional Network (GCN) and Transformer-based models, significantly improving feature aggregation in supervised settings. Second, Spatial-Temporal Joint Density (STJD) is proposed as a novel metric to overcome a limitation of existing self-supervised approaches, which often assume that only moving body joints and parts are informative. STJD dynamically quantifies the interactions between moving and static joints, enabling the identification of a subset of discriminative joints. It can be used in both contrastive and reconstruction-based unsupervised frameworks to guide the learning of discriminative representations. Third, a Unified Generative and Discriminative Learning (UGDL) framework is introduced. Traditional self-supervised methods tend to yield representations that either capture overly coarse sequence-level features or focus excessively on low-level details. In contrast, UGDL jointly optimizes explicit objective functions that balance representational capacity with discriminative ability, thereby producing more robust representations for downstream tasks. Lastly, DiffMotion is proposed to explicitly model uncertainty in skeletal motion dynamics, addressing a key limitation of current self-supervised learning approaches. By integrating denoising diffusion probabilistic techniques into the reconstruction framework, DiffMotion models the motion distribution of masked regions conditioned on visible joints, leading to robust action recognition.

Together, these contributions advance the state of the art in skeleton-based human action recognition, both theoretically and practically. They provide significant improvements in both supervised and self-supervised learning frameworks, establishing a new benchmark for future research.
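
As a rough illustration of the contrast the abstract draws between uniform pooling and learnable pooling, the sketch below compares a plain average over all frames and joints with a softmax-weighted average. It is not the thesis's AJTP algorithm, whose details are not given here. The function names, the single scoring vector `score_weights`, and the toy input are all assumptions made for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a flat list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def uniform_pool(features):
    """Average features[t][j] (a list of C floats) over all frames and joints."""
    cells = [cell for frame in features for cell in frame]
    channels = len(cells[0])
    return [sum(c[k] for c in cells) / len(cells) for k in range(channels)]


def weighted_pool(features, score_weights):
    """Pool with one softmax weight per (frame, joint) cell.

    score_weights is a hypothetical stand-in for learned parameters: a cell's
    score is its dot product with this vector, so higher-scoring cells
    contribute more to the pooled feature.
    """
    cells = [cell for frame in features for cell in frame]
    scores = [sum(w * x for w, x in zip(score_weights, cell)) for cell in cells]
    weights = softmax(scores)
    channels = len(cells[0])
    return [sum(a * c[k] for a, c in zip(weights, cells)) for k in range(channels)]


# Toy input (assumed): 2 frames, 2 joints, 2 feature channels.
features = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 4.0]],
]

print(uniform_pool(features))
# All-zero scores give every cell the same weight, matching uniform pooling.
print(weighted_pool(features, [0.0, 0.0]))
# Scoring on the second channel shifts weight toward the one cell where it is large.
print(weighted_pool(features, [0.0, 2.0]))
```

In a trained model the scoring parameters would be learned jointly with the backbone rather than set by hand, and the scoring function would likely be richer than a single dot product.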

Year

2025

Thesis type

  • Doctoral thesis

Faculty/School

School of Computing and Information Technology

Language

English

Disclaimer

Unless otherwise indicated, the views expressed in this thesis are those of the author and do not necessarily represent the views of the University of Wollongong.
