ViTGaze: Gaze Following with Interaction Features in Vision Transformers

Yuehao Song, Xinggang Wang, Jingfeng Yao, Wenyu Liu, Jinglin Zhang, Xiangmin Xu


Abstract

Gaze following aims to interpret human-scene interactions by predicting the person's focal point of gaze. Prevailing approaches often adopt a two-stage framework in which multi-modality information is extracted in the first stage for gaze target prediction, so their efficacy depends heavily on the precision of the preceding modality extraction. Other methods use a single-modality approach with complex decoders, which increases the computational load of the network. Inspired by the remarkable success of pre-trained plain vision transformers (ViTs), we introduce a novel single-modality gaze following framework called ViTGaze. In contrast to previous methods, it builds a gaze following framework based mainly on powerful encoders (decoder parameters account for less than 1% of the total). Our principal insight is that the inter-token interactions within self-attention can be transferred to interactions between humans and scenes. Leveraging this presumption, we formulate a framework consisting of a 4D interaction encoder and a 2D spatial guidance module to extract human-scene interaction information from self-attention maps. Furthermore, our investigation reveals that ViTs with self-supervised pre-training have an enhanced ability to extract correlation information. Extensive experiments demonstrate the performance of the proposed method. Our method achieves state-of-the-art (SOTA) performance among all single-modality methods (a 3.4% improvement in the area under curve (AUC) score and a 5.1% improvement in average precision (AP)) and performance comparable to multi-modality methods with 59% fewer parameters.
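
As a rough, unofficial sketch of the core idea described above, the code below shows how per-head self-attention maps from a plain ViT block could be reused as human-scene interaction features: the attention rows of the patch tokens covered by the person's head (a simple form of 2D spatial guidance) are averaged into per-head 2D maps and upsampled into a gaze heatmap. All module names, tensor shapes, and the toy decoder are illustrative assumptions, not the authors' ViTGaze implementation.

```python
# Hedged sketch: extracting per-head self-attention maps from a plain ViT
# block and turning them into human-scene "interaction" features, guided by
# a 2D map of the person's head position. Shapes and module names are
# assumptions for illustration, not the official ViTGaze code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionWithMaps(nn.Module):
    """Standard multi-head self-attention that also returns the attention maps."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, heads, N, N)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out), attn


def interaction_features(attn: torch.Tensor, head_mask: torch.Tensor, grid: int):
    """Average the attention rows of tokens inside the head region (2D spatial
    guidance) and reshape them into per-head 2D interaction maps.

    attn:      (B, heads, N, N) self-attention maps over N = grid * grid patches
    head_mask: (B, N) binary mask marking patch tokens that overlap the head box
    returns:   (B, heads, grid, grid) attention of head tokens to the scene
    """
    B, H, N, _ = attn.shape
    weights = head_mask / head_mask.sum(dim=1, keepdim=True).clamp(min=1.0)  # (B, N)
    maps = torch.einsum("bhqk,bq->bhk", attn, weights)                       # (B, heads, N)
    return maps.reshape(B, H, grid, grid)


if __name__ == "__main__":
    B, grid, dim, heads = 2, 14, 384, 6
    N = grid * grid
    tokens = torch.randn(B, N, dim)      # patch tokens from some ViT block
    head_mask = torch.zeros(B, N)
    head_mask[:, :4] = 1.0               # pretend the head box covers 4 patches

    block = AttentionWithMaps(dim, heads)
    _, attn = block(tokens)
    feats = interaction_features(attn, head_mask, grid)
    heatmap = F.interpolate(feats.mean(1, keepdim=True), size=(64, 64),
                            mode="bilinear", align_corners=False)
    print(heatmap.shape)                 # torch.Size([2, 1, 64, 64])
```

The point of the sketch is that the heavy lifting happens inside the encoder's self-attention; the "decoder" here is only a weighted average followed by a bilinear upsample, which mirrors the paper's claim that decoder parameters make up less than 1% of the model.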

Code Repositories

hustvl/vitgaze (official implementation, PyTorch)

Benchmarks

Benchmark                            | Methodology | Metrics
gaze-target-estimation-on           | ViTGaze     | AP: 0.905, AUC: 0.938, Average Distance: 0.102
gaze-target-estimation-on-gazefollow | ViTGaze     | AUC: 0.949, Average Distance: 0.105
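
For readers unfamiliar with these metrics, the snippet below sketches how "Average Distance" is typically reported in gaze-following benchmarks: the L2 distance between the predicted gaze point (the argmax of the predicted heatmap) and the annotated gaze point, with both expressed in image coordinates normalized to [0, 1]. This is a generic illustration of the common protocol, not the evaluation code behind this leaderboard.

```python
# Hedged sketch of the standard "average distance" metric in gaze following:
# L2 distance between the heatmap argmax and the ground-truth gaze point,
# with coordinates normalized to [0, 1]. Not this leaderboard's own code.
import numpy as np


def average_distance(pred_heatmaps: np.ndarray, gt_points: np.ndarray) -> float:
    """pred_heatmaps: (B, H, W) predicted gaze heatmaps
    gt_points:     (B, 2) ground-truth gaze points as (x, y) in [0, 1]"""
    B, H, W = pred_heatmaps.shape
    flat_idx = pred_heatmaps.reshape(B, -1).argmax(axis=1)
    ys, xs = np.unravel_index(flat_idx, (H, W))
    pred = np.stack([xs / (W - 1), ys / (H - 1)], axis=1)  # normalized (x, y)
    return float(np.linalg.norm(pred - gt_points, axis=1).mean())


# Example: two dummy heatmaps whose peaks coincide with the annotations.
heatmaps = np.zeros((2, 64, 64))
heatmaps[0, 32, 48] = 1.0
heatmaps[1, 10, 10] = 1.0
gts = np.array([[48 / 63, 32 / 63], [10 / 63, 10 / 63]])
print(average_distance(heatmaps, gts))  # 0.0
```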
