
Rethinking the Two-Stage Framework for Grounded Situation Recognition

Wei Meng ; Chen Long ; Ji Wei ; Yue Xiaoyu ; Chua Tat-Seng

Abstract

Grounded Situation Recognition (GSR), i.e., recognizing the salient activity (or verb) category in an image (e.g., buying) and detecting all corresponding semantic roles (e.g., agent and goods), is an essential step towards "human-like" event understanding. Since each verb is associated with a specific set of semantic roles, all existing GSR methods resort to a two-stage framework: predicting the verb in the first stage and detecting the semantic roles in the second stage. However, there are obvious drawbacks in both stages: 1) The widely-used cross-entropy (XE) loss for object recognition is insufficient in verb classification due to the large intra-class variation and high inter-class similarity among daily activities. 2) All semantic roles are detected in an autoregressive manner, which fails to model the complex semantic relations between different roles. To this end, we propose a novel SituFormer for GSR which consists of a Coarse-to-Fine Verb Model (CFVM) and a Transformer-based Noun Model (TNM). CFVM is a two-step verb prediction model: a coarse-grained model trained with XE loss first proposes a set of verb candidates, and then a fine-grained model trained with triplet loss re-ranks these candidates with enhanced verb features (not only separable but also discriminative). TNM is a transformer-based semantic role detection model, which detects all roles in parallel. Owing to the global relation modeling ability and flexibility of the transformer decoder, TNM can fully explore the statistical dependency of the roles. Extensive validations on the challenging SWiG benchmark show that SituFormer achieves a new state-of-the-art performance with significant gains under various metrics. Code is available at https://github.com/kellyiss/SituFormer.
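
The coarse-to-fine idea behind CFVM can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of Euclidean distance, the per-verb embeddings, and the nearest-embedding re-ranking rule are all assumptions made for illustration. The abstract only states that a coarse model proposes verb candidates and a fine model trained with triplet loss re-ranks them.

```python
import math


def top_k(scores, k):
    """Return the indices of the k highest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: max(0, d(a, p) - d(a, n) + margin).

    Minimizing it pulls the anchor towards the positive and pushes it
    at least `margin` further away from the negative.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def coarse_to_fine(coarse_scores, image_embedding, verb_embeddings, k=3):
    """Hypothetical two-step verb prediction.

    Step 1 (coarse): keep the k verbs with the highest classifier scores.
    Step 2 (fine): re-rank those candidates by the distance between the
    image embedding and each candidate's verb embedding (assumed rule).
    """
    candidates = top_k(coarse_scores, k)
    return sorted(candidates, key=lambda v: euclidean(image_embedding, verb_embeddings[v]))


# Toy data: four verbs with 2-D embeddings (values are made up).
coarse_scores = [2.0, 1.5, 0.1, 1.8]
verb_embeddings = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [3.0, 3.0]]
image_embedding = [0.9, 1.1]

reranked = coarse_to_fine(coarse_scores, image_embedding, verb_embeddings, k=3)
print(reranked)
```

In this toy example the coarse scores rank verb 0 first, but verb 1 lies closest to the image in the embedding space, so the fine step moves it to the top. That is the kind of correction the re-ranking stage is meant to provide when visually similar verbs receive close classifier scores.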

Code Repositories

kellyiss/situformer

Official

pytorch

Benchmarks

Benchmark	Methodology	Metrics
grounded-situation-recognition-on-swig	SituFormer	Top-1 Verb: 44.2; Top-1 Verb & Grounded-Value: 29.22; Top-1 Verb & Value: 35.24; Top-5 Verbs: 71.21; Top-5 Verbs & Grounded-Value: 46; Top-5 Verbs & Value: 55.75
situation-recognition-on-imsitu	SituFormer	Top-1 Verb: 44.2; Top-1 Verb & Value: 35.24; Top-5 Verbs: 71.21; Top-5 Verbs & Value: 55.75
