Seokju Cho, Jiahui Huang, Jisu Nam, Honggyu An, Seungryong Kim, Joon-Young Lee

Abstract
We introduce LocoTrack, a highly accurate and efficient model designed for the task of tracking any point (TAP) across video sequences. Previous approaches in this task often rely on local 2D correlation maps to establish correspondences from a point in the query image to a local region in the target image, which often struggle with homogeneous regions or repetitive features, leading to matching ambiguities. LocoTrack overcomes this challenge with a novel approach that utilizes all-pair correspondences across regions, i.e., local 4D correlation, to establish precise correspondences, with bidirectional correspondence and matching smoothness significantly enhancing robustness against ambiguities. We also incorporate a lightweight correlation encoder to enhance computational efficiency, and a compact Transformer architecture to integrate long-term temporal information. LocoTrack achieves unmatched accuracy on all TAP-Vid benchmarks and operates at a speed almost 6 times faster than the current state-of-the-art.
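To make the idea of a local 4D correlation concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes dense feature maps of shape `(H, W, C)` and integer point coordinates; the function name `local_4d_correlation`, the `radius` parameter, and the zero-padding at image borders are illustrative assumptions.

```python
import numpy as np

def local_4d_correlation(query_feat, target_feat, query_xy, target_xy, radius=3):
    """Correlate every position in a patch around the query point with every
    position in a patch around the current target estimate.

    query_feat, target_feat: (H, W, C) feature maps (hypothetical inputs).
    query_xy, target_xy: integer (x, y) patch centers.
    Returns an array of shape (P, P, P, P) with P = 2 * radius + 1.
    """
    size = 2 * radius + 1

    def crop(feat, xy):
        x, y = xy
        # Zero-pad (an assumption) so patches near the border keep a fixed size.
        padded = np.pad(feat, ((radius, radius), (radius, radius), (0, 0)))
        return padded[y:y + size, x:x + size]

    q = crop(query_feat, query_xy)    # (P, P, C) patch around the query point
    t = crop(target_feat, target_xy)  # (P, P, C) patch around the target estimate
    # All-pairs dot products between the two patches.
    return np.einsum("ijc,klc->ijkl", q, t)
```

In this sketch, the slice `corr[radius, radius, :, :]` is the familiar local 2D correlation of the query point against the target region, and `corr[:, :, radius, radius]` is the reverse direction, from the target center back to the query region. The full 4D volume contains both, plus the correlations of neighboring positions.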
Benchmarks
| Benchmark | Model | Average Jaccard | Average PCK | Occlusion Accuracy |
|---|---|---|---|---|
| point-tracking-on-tap-vid-davis | LocoTrack-B | 69.4 | 81.3 | 88.6 |
| point-tracking-on-tap-vid-davis-first | LocoTrack-B | 64.8 | 77.4 | 86.2 |
| point-tracking-on-tap-vid-kinetics | LocoTrack-B | 59.1 | 72.5 | 85.7 |
| point-tracking-on-tap-vid-kinetics-first | LocoTrack-B | 52.3 | 66.4 | 82.1 |
| point-tracking-on-tap-vid-rgb-stacking | LocoTrack-B | 70.8 | 83.2 | 84.1 |
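For readers unfamiliar with the three reported metrics, here is a rough sketch of how they can be computed for a set of tracked points. It follows the TAP-Vid metric definitions as commonly described, with pixel thresholds of 1, 2, 4, 8 and 16, and is an approximation rather than the official evaluation code. It skips details such as rescaling to the evaluation resolution and excluding query frames, returns fractions rather than percentages, and assumes at least one visible ground-truth point. The function name `tap_metrics` and its argument layout are hypothetical.

```python
import numpy as np

def tap_metrics(gt_xy, gt_occluded, pred_xy, pred_occluded,
                thresholds=(1, 2, 4, 8, 16)):
    """gt_xy, pred_xy: (..., 2) positions; *_occluded: boolean arrays (...)."""
    gt_visible = ~gt_occluded
    pred_visible = ~pred_occluded

    # Occlusion Accuracy: fraction of points with a correct occlusion flag.
    occlusion_accuracy = np.mean(pred_occluded == gt_occluded)

    dist2 = np.sum((pred_xy - gt_xy) ** 2, axis=-1)
    pck, jaccard = [], []
    for t in thresholds:
        within = dist2 < t ** 2
        # PCK: visible ground-truth points localized within the threshold.
        pck.append(np.sum(within & gt_visible) / np.sum(gt_visible))
        # Jaccard: true positives over ground-truth positives plus false positives.
        tp = np.sum(within & gt_visible & pred_visible)
        fp = np.sum(pred_visible & (~gt_visible | ~within))
        jaccard.append(tp / (np.sum(gt_visible) + fp))

    return {
        "average_jaccard": float(np.mean(jaccard)),
        "average_pck": float(np.mean(pck)),
        "occlusion_accuracy": float(occlusion_accuracy),
    }
```

Average Jaccard is the strictest of the three because it penalizes both localization errors and wrong visibility predictions, which is consistent with it being the lowest number in every row of the table.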