MixFormer: End-to-End Tracking with Iterative Mixed Attention

Yutao Cui, Cheng Jiang, Limin Wang, Gangshan Wu


Abstract

Tracking often uses a multi-stage pipeline of feature extraction, target information integration, and bounding box estimation. To simplify this pipeline and unify the processes of feature extraction and target information integration, we present a compact tracking framework, termed MixFormer, built upon transformers. Our core design is to utilize the flexibility of attention operations and propose a Mixed Attention Module (MAM) for simultaneous feature extraction and target information integration. This synchronous modeling scheme allows us to extract target-specific discriminative features and to perform extensive communication between the target and the search area. Based on MAM, we build our MixFormer tracking framework simply by stacking multiple MAMs with progressive patch embedding and placing a localization head on top. In addition, to handle multiple target templates during online tracking, we devise an asymmetric attention scheme in MAM to reduce computational cost and propose an effective score prediction module to select high-quality templates. Our MixFormer sets a new state-of-the-art performance on five tracking benchmarks, including LaSOT, TrackingNet, VOT2020, GOT-10k, and UAV123. In particular, our MixFormer-L achieves an NP score of 79.9% on LaSOT, 88.9% on TrackingNet, and an EAO of 0.555 on VOT2020. We also perform in-depth ablation studies to demonstrate the effectiveness of simultaneous feature extraction and information integration. Code and trained models are publicly available at https://github.com/MCG-NJU/MixFormer.
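To make the mixed-attention idea concrete, the following is a minimal PyTorch sketch of a single mixed attention step in which template and search tokens share one attention operation, plus the asymmetric variant where template queries attend only to template keys. The class name, tensor shapes, and the use of nn.MultiheadAttention are illustrative assumptions for this sketch; the official implementation in the MCG-NJU/MixFormer repository uses convolutional projections and a full staged backbone and differs in detail.

```python
import torch
import torch.nn as nn


class MixedAttention(nn.Module):
    """Conceptual sketch (not the official MixFormer code): one mixed attention
    step over concatenated template and search tokens, so feature extraction and
    target-search communication happen in the same operation."""

    def __init__(self, dim: int, num_heads: int = 8, asymmetric: bool = True):
        super().__init__()
        self.asymmetric = asymmetric
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, template: torch.Tensor, search: torch.Tensor):
        # template: (B, N_t, C) target-template tokens; search: (B, N_s, C) search-area tokens
        mixed = torch.cat([template, search], dim=1)
        if self.asymmetric:
            # Asymmetric scheme: template queries attend only to template keys,
            # so template features can be cached across frames; search queries
            # still attend to both template and search keys.
            t_out, _ = self.attn(self.norm(template), self.norm(template), self.norm(template))
            s_out, _ = self.attn(self.norm(search), self.norm(mixed), self.norm(mixed))
        else:
            # Full mixed attention: every token attends to all template + search tokens.
            out, _ = self.attn(self.norm(mixed), self.norm(mixed), self.norm(mixed))
            t_out, s_out = out.split([template.size(1), search.size(1)], dim=1)
        return template + t_out, search + s_out


# Toy usage (hypothetical sizes): 64 template tokens, 256 search tokens, 192-dim embeddings.
if __name__ == "__main__":
    mam = MixedAttention(dim=192, num_heads=6)
    z = torch.randn(2, 64, 192)   # template tokens
    x = torch.randn(2, 256, 192)  # search tokens
    z, x = mam(z, x)
    print(z.shape, x.shape)  # torch.Size([2, 64, 192]) torch.Size([2, 256, 192])
```

In the paper's design, blocks of this kind are stacked with progressive patch embedding and topped with a localization head; the sketch above only illustrates why the asymmetric scheme reduces per-frame cost when multiple templates are kept online.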

Code Repositories

MCG-NJU/MixFormer (official, PyTorch)

Benchmarks

Benchmark | Methodology | Metrics
semi-supervised-video-object-segmentation-on-15 | MixFormer-L | EAO: 0.555
video-object-tracking-on-nv-vot211 | Mixformer(ConvMAE) | AUC: 39.23; Precision: 54.20
visual-object-tracking-on-avist | MixFormerL-22k | Success Rate: 56.0
visual-object-tracking-on-got-10k | MixFormer-1k | Average Overlap: 71.2; Success Rate 0.5: 79.9; Success Rate 0.75: 65.8
visual-object-tracking-on-got-10k | MixFormer | Average Overlap: 70.7; Success Rate 0.5: 80.0; Success Rate 0.75: 67.8
visual-object-tracking-on-got-10k | MixFormer-L | Average Overlap: 75.6; Success Rate 0.5: 85.73; Success Rate 0.75: 72.8
visual-object-tracking-on-lasot | MixFormer-L | AUC: 70.1; Normalized Precision: 79.9; Precision: 76.3
visual-object-tracking-on-trackingnet | MixFormer-L | Accuracy: 83.9; Normalized Precision: 88.9; Precision: 83.1
visual-object-tracking-on-uav123 | MixFormer | AUC: 0.704; Precision: 0.918
