ExACT: Language-guided Conceptual Reasoning and Uncertainty Estimation for Event-based Action Recognition and More
CVPR 2024 Highlight

Abstract

Event cameras have recently been shown to be beneficial for practical vision tasks, such as action recognition, thanks to their high temporal resolution, power efficiency, and reduced privacy concerns. However, current research is hindered by 1) the difficulty of processing events, owing to their prolonged duration and dynamic actions with complex and ambiguous semantics, and 2) the redundant action depiction of event frame representations with fixed stacks. We find that language naturally conveys abundant semantic information, rendering it remarkably effective at reducing semantic uncertainty. In light of this, we propose ExACT, a novel approach that, for the first time, tackles event-based action recognition from a cross-modal conceptualizing perspective. ExACT makes two technical contributions. First, we propose an adaptive fine-grained event (AFE) representation that adaptively filters out repeated events from stationary objects while preserving dynamic ones. This subtly enhances the performance of ExACT without extra computational cost. Second, we propose a conceptual reasoning-based uncertainty estimation module, which simulates the recognition process to enrich the semantic representation. In particular, conceptual reasoning builds the temporal relation based on the action semantics, and uncertainty estimation tackles the semantic uncertainty of actions based on the distributional representation. Experiments show that ExACT achieves superior recognition accuracy of 94.83% (+2.23%), 90.10% (+37.47%), and 67.24% on the PAF, HARDVS, and our SeAct datasets, respectively.
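The AFE representation itself is detailed in the paper; as a rough intuition for what "filtering out repeated events from stationary objects" means, the sketch below suppresses events that re-fire at the same pixel within a short refractory window. The (x, y, t, p) event layout and the min_dt threshold are illustrative assumptions, not the paper's actual algorithm.

import numpy as np

def filter_repeated_events(events, width, height, min_dt=1e-2):
    """Drop events that re-fire at the same pixel within min_dt seconds.

    A minimal sketch of repeated-event suppression; NOT the paper's AFE
    algorithm. `events` is an (N, 4) array of (x, y, t, p) rows sorted by
    timestamp t, and `min_dt` is a hypothetical refractory window.
    """
    last_t = np.full((height, width), -np.inf)  # last event time per pixel
    keep = np.zeros(len(events), dtype=bool)
    for i, (x, y, t, _p) in enumerate(events):  # polarity unused here
        x, y = int(x), int(y)
        if t - last_t[y, x] >= min_dt:  # pixel quiet long enough: keep event
            keep[i] = True
        last_t[y, x] = t                # update regardless, so a steadily
                                        # flickering pixel stays suppressed
    return events[keep]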

Quick Review

The overall framework of our ExACT

Our ExACT consists of four components. First, the AFE representation recursively eliminates repeated events and generates event frames depicting dynamic actions. Then, the event encoder and the text encoder produce the event and text embeddings, respectively. Finally, the CRUE module simulates the action recognition process to establish the complex semantic relations among sub-actions and reduce the semantic uncertainty.
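As a reading aid, here is a minimal sketch of how these four components could compose into a CLIP-style recognition pipeline. The encoder and CRUE arguments are hypothetical stand-ins; only the overall data flow (event frames and class texts in, similarity logits out) mirrors the description above.

import torch.nn as nn
import torch.nn.functional as F

class ExACTSketch(nn.Module):
    """Illustrative pipeline only: the real encoders and the CRUE module
    are defined in the paper/repo; the modules passed in here are assumed
    to share one embedding size."""

    def __init__(self, event_encoder, text_encoder, crue):
        super().__init__()
        self.event_encoder = event_encoder  # AFE event frames -> (B, dim)
        self.text_encoder = text_encoder    # class token ids  -> (C, dim)
        self.crue = crue                    # conceptual reasoning-based
                                            # uncertainty estimation (assumed
                                            # to refine both embeddings)

    def forward(self, event_frames, class_tokens):
        e = self.event_encoder(event_frames)   # (B, dim) event embeddings
        t = self.text_encoder(class_tokens)    # (C, dim) text embeddings
        e, t = self.crue(e, t)                 # enrich semantic representation
        e = F.normalize(e, dim=-1)
        t = F.normalize(t, dim=-1)
        return e @ t.T                         # (B, C) cosine-similarity logits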


SeAct Dataset: Event action dataset with caption-level labels

We propose the semantically abundant SeAct dataset for event-text action recognition, in which each action comes with a detailed caption-level label. SeAct is collected with a DAVIS346 event camera with a resolution of 346 × 260. It contains 58 actions under four themes, as presented in the images below. Each action is accompanied by a caption of fewer than 30 words, generated by GPT-4 to enrich the semantic space of the original action labels. We split each category 80%/20% for training and testing (validation), respectively.
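A minimal sketch of the per-category 80/20 split described above; the (path, label) sample format and the fixed seed are assumptions for illustration.

import random
from collections import defaultdict

def split_per_category(samples, train_ratio=0.8, seed=0):
    """Split (path, label) pairs 80/20 within each action category,
    mirroring the per-category protocol described above. Purely
    illustrative; the file layout and label format are assumptions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for path, label in samples:
        by_label[label].append(path)
    train, test = [], []
    for label, paths in by_label.items():
        rng.shuffle(paths)
        cut = int(len(paths) * train_ratio)  # 80% of this category
        train += [(p, label) for p in paths[:cut]]
        test += [(p, label) for p in paths[cut:]]
    return train, test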


BibTeX

@inproceedings{ExACT,
  title={ExACT: Language-guided Conceptual Reasoning and Uncertainty Estimation for Event-based Action Recognition and More},
  author={Zhou, Jiazhou and Zheng, Xu and Lyu, Yuanhuiyi and Wang, Lin},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2024}
}