Trear: Transformer-Based RGB-D Egocentric Action Recognition

Publication Name

IEEE Transactions on Cognitive and Developmental Systems

Abstract

In this article, we propose a transformer-based RGB-D egocentric action recognition framework, called Trear. It consists of two modules: 1) an interframe attention encoder and 2) a mutual-attentional fusion block. Instead of using optical flow or recurrent units, we adopt a self-attention mechanism to model the temporal structure of the data from different modalities. Input frames are cropped randomly to mitigate the effect of data redundancy. Features from each modality interact through the proposed fusion block and are combined through a simple yet effective fusion operation to produce a joint RGB-D representation. Empirical experiments on two large egocentric RGB-D data sets, 1) THU-READ and 2) first-person hand action, and one small data set, wearable computer vision systems, show that the proposed method outperforms the state of the art by a large margin.
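
The abstract describes a self-attention encoder over per-frame features and a mutual-attention block that exchanges information between the RGB and depth streams before a simple fusion. The following is a minimal PyTorch sketch of those two ideas, assuming per-frame features have already been extracted by a backbone; the layer choices, dimensions, and the concrete fusion operation are illustrative assumptions, not the authors' exact architecture.

```python
# Minimal sketch of the two modules outlined in the abstract (assumed shapes
# and layer choices; not the paper's exact design).
import torch
import torch.nn as nn


class InterFrameAttentionEncoder(nn.Module):
    """Self-attention over per-frame features to model temporal structure."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                       # x: (batch, frames, dim)
        out, _ = self.attn(x, x, x)             # each frame attends to all frames
        return self.norm(x + out)               # residual connection + layer norm


class MutualAttentionFusion(nn.Module):
    """Cross-attention between RGB and depth streams, then a simple fusion."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.rgb_to_depth = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.depth_to_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, rgb, depth):              # both: (batch, frames, dim)
        rgb_att, _ = self.rgb_to_depth(rgb, depth, depth)   # RGB queries depth
        depth_att, _ = self.depth_to_rgb(depth, rgb, rgb)   # depth queries RGB
        joint = rgb_att + depth_att             # simple fusion into a joint RGB-D feature
        return joint.mean(dim=1)                # pool over frames for classification


if __name__ == "__main__":
    rgb = torch.randn(2, 8, 512)                # 2 clips, 8 sampled frames, 512-d features
    depth = torch.randn(2, 8, 512)
    encoder = InterFrameAttentionEncoder()
    fusion = MutualAttentionFusion()
    joint = fusion(encoder(rgb), encoder(depth))
    print(joint.shape)                          # torch.Size([2, 512])
```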

Open Access Status

This publication may be available as open access

Volume

14

Issue

1

First Page

246

Last Page

252

Link to publisher version (DOI)

http://dx.doi.org/10.1109/TCDS.2020.3048883