8 months ago

Abstract

Existing action recognition methods are typically actor-specific due to the intrinsic topological and apparent differences among the actors. This requires actor-specific pose estimation (e.g., humans vs. animals), leading to cumbersome model design complexity and high maintenance costs. Moreover, they often focus on learning the visual modality alone and on single-label classification, whilst neglecting other available information sources (e.g., class name text) and the concurrent occurrence of multiple actions. To overcome these limitations, we propose a new approach called 'actor-agnostic multi-modal multi-label action recognition,' which offers a unified solution for various types of actors, including humans and animals. We further formulate a novel Multi-modal Semantic Query Network (MSQNet) model in a transformer-based object detection framework (e.g., DETR), characterized by leveraging visual and textual modalities to represent the action classes better. The elimination of actor-specific model designs is a key advantage, as it removes the need for actor pose estimation altogether. Extensive experiments on five publicly available benchmarks show that our MSQNet consistently outperforms prior actor-specific alternatives on human and animal single- and multi-label action recognition tasks by up to 50%. Code is made available at https://github.com/mondalanindya/MSQNet.
