Rethinking the Two-Stage Framework for Grounded Situation Recognition

Wei Meng; Chen Long; Ji Wei; Yue Xiaoyu; Chua Tat-Seng

Abstract

Grounded Situation Recognition (GSR), i.e., recognizing the salient activity (or verb) category in an image (e.g., buying) and detecting all corresponding semantic roles (e.g., agent and goods), is an essential step towards "human-like" event understanding. Since each verb is associated with a specific set of semantic roles, all existing GSR methods resort to a two-stage framework: predicting the verb in the first stage and detecting the semantic roles in the second stage. However, there are obvious drawbacks in both stages: 1) The widely-used cross-entropy (XE) loss for object recognition is insufficient in verb classification due to the large intra-class variation and high inter-class similarity among daily activities. 2) All semantic roles are detected in an autoregressive manner, which fails to model the complex semantic relations between different roles. To this end, we propose a novel SituFormer for GSR which consists of a Coarse-to-Fine Verb Model (CFVM) and a Transformer-based Noun Model (TNM). CFVM is a two-step verb prediction model: a coarse-grained model trained with XE loss first proposes a set of verb candidates, and then a fine-grained model trained with triplet loss re-ranks these candidates with enhanced verb features (not only separable but also discriminative). TNM is a transformer-based semantic role detection model, which detects all roles in parallel. Owing to the global relation modeling ability and flexibility of the transformer decoder, TNM can fully explore the statistical dependency of the roles. Extensive validations on the challenging SWiG benchmark show that SituFormer achieves a new state-of-the-art performance with significant gains under various metrics. Code is available at https://github.com/kellyiss/SituFormer.
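
The two-step verb prediction described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of Euclidean distance, and the idea of comparing an image embedding against one prototype vector per verb are all assumptions made for illustration. The sketch only shows the general pattern of proposing top-k candidates from classifier logits and then re-ranking them in an embedding space of the kind a triplet loss is meant to shape.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: pull the anchor toward the positive,
    push it at least `margin` further away from the negative."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def coarse_candidates(logits, k):
    """Coarse step (assumed form): keep the k most probable verb indices."""
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return order[:k]


def fine_rerank(image_embedding, candidates, verb_prototypes):
    """Fine step (assumed form): re-rank candidates by distance between the
    image embedding and a hypothetical per-verb prototype embedding."""
    return sorted(candidates, key=lambda v: euclidean(image_embedding, verb_prototypes[v]))


# Toy data, invented for illustration only.
verbs = ["buying", "selling", "paying"]
coarse_logits = [2.0, 2.1, 0.1]          # coarse model slightly prefers "selling"
prototypes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
embedding = [0.9, 0.1]                    # fine-grained feature lies near "buying"

candidates = coarse_candidates(coarse_logits, k=2)
reranked = fine_rerank(embedding, candidates, prototypes)
print([verbs[i] for i in candidates], "->", [verbs[i] for i in reranked])
```

In this toy case the coarse ranking puts a visually similar verb first, and the distance-based re-ranking reverses the order, which is the kind of correction the fine-grained step is described as providing.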

Code Repositories

kellyiss/situformer
Official
pytorch

Benchmarks

Benchmark: grounded-situation-recognition-on-swig
Methodology: SituFormer
Metrics:
Top-1 Verb: 44.2
Top-1 Verb & Grounded-Value: 29.22
Top-1 Verb & Value: 35.24
Top-5 Verbs: 71.21
Top-5 Verbs & Grounded-Value: 46
Top-5 Verbs & Value: 55.75

Benchmark: situation-recognition-on-imsitu
Methodology: SituFormer
Metrics:
Top-1 Verb: 44.2
Top-1 Verb & Value: 35.24
Top-5 Verbs: 71.21
Top-5 Verbs & Value: 55.75
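
The Top-1 Verb and Top-5 Verbs scores listed above can be read as top-k accuracy, expressed as a percentage. The sketch below shows that general computation only; it assumes each image has a single gold verb and a ranked list of predicted verbs, and it does not reproduce the benchmark's official evaluation script or the Value and Grounded-Value metrics. The function name and the toy data are hypothetical.

```python
def top_k_verb_accuracy(ranked_predictions, gold_verbs, k):
    """Percentage of images whose gold verb appears among the first k
    predicted verbs. Assumes one gold verb per image."""
    if len(ranked_predictions) != len(gold_verbs):
        raise ValueError("predictions and gold labels must have the same length")
    hits = sum(
        1 for ranked, gold in zip(ranked_predictions, gold_verbs)
        if gold in ranked[:k]
    )
    return 100.0 * hits / len(gold_verbs)


# Toy predictions, invented for illustration only.
predictions = [
    ["buying", "selling", "paying"],
    ["jogging", "running", "walking"],
    ["cooking", "eating", "stirring"],
    ["reading", "writing", "studying"],
]
gold = ["buying", "running", "drinking", "reading"]

print(top_k_verb_accuracy(predictions, gold, k=1))
print(top_k_verb_accuracy(predictions, gold, k=3))
```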
