ViTGaze: Gaze Following with Interaction Features in Vision Transformers

Yuehao Song, Xinggang Wang, Jingfeng Yao, Wenyu Liu, Jinglin Zhang, Xiangmin Xu


Abstract

Gaze following aims to interpret human-scene interactions by predicting the person's focal point of gaze. Prevailing approaches often adopt a two-stage framework in which multi-modality information is extracted in the first stage for gaze target prediction, so their efficacy depends heavily on the precision of the preceding modality extraction. Other methods use a single-modality approach with complex decoders, which increases the computational load of the network. Inspired by the remarkable success of pre-trained plain vision transformers (ViTs), we introduce a novel single-modality gaze following framework called ViTGaze. In contrast to previous methods, it builds a gaze following framework based mainly on powerful encoders (decoder parameters account for less than 1% of the total). Our principal insight is that the inter-token interactions within self-attention can be transferred to interactions between humans and scenes. Leveraging this presumption, we formulate a framework consisting of a 4D interaction encoder and a 2D spatial guidance module to extract human-scene interaction information from self-attention maps. Furthermore, our investigation reveals that ViTs with self-supervised pre-training have an enhanced ability to extract correlation information. Extensive experiments demonstrate the performance of the proposed method. Our method achieves state-of-the-art (SOTA) performance among all single-modality methods (a 3.4% improvement in the area under curve (AUC) score and a 5.1% improvement in average precision (AP)) and performance comparable to multi-modality methods with 59% fewer parameters.
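
As a rough, unofficial sketch of the core idea described above, the code below shows how per-head self-attention maps from a plain ViT block could be reused as human-scene interaction features: the attention rows of the patch tokens covered by the person's head (a simple form of 2D spatial guidance) are averaged into per-head 2D maps and upsampled into a gaze heatmap. All module names, tensor shapes, and the toy decoder are illustrative assumptions, not the authors' ViTGaze implementation.

```python
# Hedged sketch: extracting per-head self-attention maps from a plain ViT
# block and turning them into human-scene "interaction" features, guided by
# a 2D map of the person's head position. Shapes and module names are
# assumptions for illustration, not the official ViTGaze code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionWithMaps(nn.Module):
    """Standard multi-head self-attention that also returns the attention maps."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, heads, N, N)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out), attn


def interaction_features(attn: torch.Tensor, head_mask: torch.Tensor, grid: int):
    """Average the attention rows of tokens inside the head region (2D spatial
    guidance) and reshape them into per-head 2D interaction maps.

    attn:      (B, heads, N, N) self-attention maps over N = grid * grid patches
    head_mask: (B, N) binary mask marking patch tokens that overlap the head box
    returns:   (B, heads, grid, grid) attention of head tokens to the scene
    """
    B, H, N, _ = attn.shape
    weights = head_mask / head_mask.sum(dim=1, keepdim=True).clamp(min=1.0)  # (B, N)
    maps = torch.einsum("bhqk,bq->bhk", attn, weights)                       # (B, heads, N)
    return maps.reshape(B, H, grid, grid)


if __name__ == "__main__":
    B, grid, dim, heads = 2, 14, 384, 6
    N = grid * grid
    tokens = torch.randn(B, N, dim)      # patch tokens from some ViT block
    head_mask = torch.zeros(B, N)
    head_mask[:, :4] = 1.0               # pretend the head box covers 4 patches

    block = AttentionWithMaps(dim, heads)
    _, attn = block(tokens)
    feats = interaction_features(attn, head_mask, grid)
    heatmap = F.interpolate(feats.mean(1, keepdim=True), size=(64, 64),
                            mode="bilinear", align_corners=False)
    print(heatmap.shape)                 # torch.Size([2, 1, 64, 64])
```

The point of the sketch is that the heavy lifting happens inside the encoder's self-attention; the "decoder" here is only a weighted average followed by a bilinear upsample, which mirrors the paper's claim that decoder parameters make up less than 1% of the model.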

Code Repositories

hustvl/vitgaze (official implementation, PyTorch)

Benchmarks

Benchmark                            | Methodology | Metrics
gaze-target-estimation-on           | ViTGaze     | AP: 0.905, AUC: 0.938, Average Distance: 0.102
gaze-target-estimation-on-gazefollow | ViTGaze     | AUC: 0.949, Average Distance: 0.105
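
For readers unfamiliar with these metrics, the snippet below sketches how "Average Distance" is typically reported in gaze-following benchmarks: the L2 distance between the predicted gaze point (the argmax of the predicted heatmap) and the annotated gaze point, with both expressed in image coordinates normalized to [0, 1]. This is a generic illustration of the common protocol, not the evaluation code behind this leaderboard.

```python
# Hedged sketch of the standard "average distance" metric in gaze following:
# L2 distance between the heatmap argmax and the ground-truth gaze point,
# with coordinates normalized to [0, 1]. Not this leaderboard's own code.
import numpy as np


def average_distance(pred_heatmaps: np.ndarray, gt_points: np.ndarray) -> float:
    """pred_heatmaps: (B, H, W) predicted gaze heatmaps
    gt_points:     (B, 2) ground-truth gaze points as (x, y) in [0, 1]"""
    B, H, W = pred_heatmaps.shape
    flat_idx = pred_heatmaps.reshape(B, -1).argmax(axis=1)
    ys, xs = np.unravel_index(flat_idx, (H, W))
    pred = np.stack([xs / (W - 1), ys / (H - 1)], axis=1)  # normalized (x, y)
    return float(np.linalg.norm(pred - gt_points, axis=1).mean())


# Example: two dummy heatmaps whose peaks coincide with the annotations.
heatmaps = np.zeros((2, 64, 64))
heatmaps[0, 32, 48] = 1.0
heatmaps[1, 10, 10] = 1.0
gts = np.array([[48 / 63, 32 / 63], [10 / 63, 10 / 63]])
print(average_distance(heatmaps, gts))  # 0.0
```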
