Learning Spatio-Temporal Transformer for Visual Tracking
Bin Yan; Houwen Peng; Jianlong Fu; Dong Wang; Huchuan Lu

Abstract
In this paper, we present a new tracking architecture with an encoder-decoder transformer as the key component. The encoder models the global spatio-temporal feature dependencies between target objects and search regions, while the decoder learns a query embedding to predict the spatial positions of the target objects. Our method casts object tracking as a direct bounding-box prediction problem, without using any proposals or predefined anchors. With the encoder-decoder transformer, object prediction uses just a simple fully-convolutional network that estimates the corners of the object directly. The whole method is end-to-end and does not need any post-processing steps such as cosine windowing or bounding-box smoothing, thus largely simplifying existing tracking pipelines. The proposed tracker achieves state-of-the-art performance on five challenging short-term and long-term benchmarks while running at real-time speed, 6x faster than Siam R-CNN. Code and models are open-sourced at https://github.com/researchmm/Stark.
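To make the architecture above concrete, the following is a minimal PyTorch sketch, not the official STARK implementation: flattened template and search features pass through a transformer encoder-decoder with a single learned target query, and a fully-convolutional head predicts top-left/bottom-right corner probability maps whose soft-argmax yields the box directly, with no anchors, proposals, or post-processing. The class name `CornerTracker`, the feature shapes, and the query-modulation step are illustrative assumptions.

```python
# Minimal sketch (not the official STARK code): transformer encoder-decoder
# over flattened template + search features, one learned target query, and a
# fully-convolutional head predicting top-left / bottom-right corner maps.
import torch
import torch.nn as nn

class CornerTracker(nn.Module):
    def __init__(self, dim=256, heads=8, layers=6, search_size=20):
        super().__init__()
        self.search_size = search_size                      # spatial size of the search feature map
        self.transformer = nn.Transformer(
            d_model=dim, nhead=heads,
            num_encoder_layers=layers, num_decoder_layers=layers,
            batch_first=True,
        )
        self.query = nn.Embedding(1, dim)                   # single learned target query
        # Fully-convolutional corner head: two score maps (TL and BR corners).
        self.corner_head = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(dim, 2, 1),
        )

    def forward(self, template_feat, search_feat):
        # template_feat: (B, N_t, dim); search_feat: (B, H*W, dim) -- already
        # flattened backbone features with positional encodings added.
        B = search_feat.size(0)
        # Encoder jointly models template/search dependencies; the decoder
        # attends to the encoded memory through the target query.
        memory_in = torch.cat([template_feat, search_feat], dim=1)
        query = self.query.weight.unsqueeze(0).expand(B, -1, -1)
        decoded = self.transformer(memory_in, query)        # (B, 1, dim)
        # Modulate the search features with the decoded query, then predict corners.
        H = W = self.search_size
        search_map = search_feat.transpose(1, 2).reshape(B, -1, H, W)
        modulated = search_map * decoded.transpose(1, 2).unsqueeze(-1)
        corner_logits = self.corner_head(modulated)         # (B, 2, H, W)
        probs = corner_logits.flatten(2).softmax(-1).reshape(B, 2, H, W)
        # Soft-argmax: expected corner coordinates, no anchors or proposals.
        ys = torch.linspace(0, 1, H, device=probs.device).view(1, 1, H, 1)
        xs = torch.linspace(0, 1, W, device=probs.device).view(1, 1, 1, W)
        x = (probs * xs).sum(dim=(2, 3))                    # (B, 2): x_tl, x_br
        y = (probs * ys).sum(dim=(2, 3))                    # (B, 2): y_tl, y_br
        return torch.stack([x[:, 0], y[:, 0], x[:, 1], y[:, 1]], dim=-1)
```

As a usage sketch, `CornerTracker()(torch.randn(2, 64, 256), torch.randn(2, 400, 256))` returns a `(2, 4)` tensor of normalized `(x_tl, y_tl, x_br, y_br)` boxes, i.e. the direct bounding-box prediction the abstract describes.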
Code Repositories
https://github.com/researchmm/Stark (official)
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| video-object-tracking-on-nv-vot211 | STARK | AUC: 38.26; Precision: 51.37 |
| visual-object-tracking-on-avist | STARK-ST-101 | Success Rate: 50.50 |
| visual-object-tracking-on-got-10k | STARK | Average Overlap: 68.8; Success Rate (0.5): 78.1 |
| visual-object-tracking-on-lasot | STARK | AUC: 67.1; Normalized Precision: 77.0 |
| visual-object-tracking-on-trackingnet | STARK | Accuracy: 82.0; Normalized Precision: 86.9; Precision: 79.1 |