Ho Kei Cheng; Seoung Wug Oh; Brian Price; Joon-Young Lee; Alexander Schwing

Abstract
We present Cutie, a video object segmentation (VOS) network with object-level memory reading, which puts the object representation from memory back into the video object segmentation result. Recent VOS methods rely on bottom-up pixel-level memory reading, which struggles with matching noise, especially in the presence of distractors, and therefore performs worse on more challenging data. In contrast, Cutie performs top-down object-level memory reading by adapting a small set of object queries. Through these queries, it interacts iteratively with the bottom-up pixel features using a query-based object transformer (qt, hence Cutie). The object queries act as a high-level summary of the target object, while high-resolution feature maps are retained for accurate segmentation. Together with foreground-background masked attention, Cutie cleanly separates the semantics of the foreground object from the background. On the challenging MOSE dataset, Cutie improves by 8.7 J&F over XMem with a similar running time and improves by 4.2 J&F over DeAOT while being three times faster. Code is available at: https://hkchengrex.github.io/Cutie
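To make the idea of foreground-background masked attention more concrete, here is a minimal NumPy sketch of one cross-attention step between object queries and pixel features. It is not the authors' implementation: the function name `masked_cross_attention`, the single-head formulation without learned projections, and the choice to let the first half of the queries see only foreground pixels and the second half only background pixels are all assumptions made for illustration.

```python
import numpy as np


def masked_cross_attention(queries, pixel_feats, fg_mask):
    """One illustrative masked cross-attention step (hypothetical helper).

    queries:     (N, C) object queries
    pixel_feats: (HW, C) flattened pixel features
    fg_mask:     (HW,) boolean, True where the pixel is believed to be foreground

    Assumption: the first N // 2 queries may attend only to foreground pixels,
    the remaining queries only to background pixels.
    """
    n, c = queries.shape
    logits = queries @ pixel_feats.T / np.sqrt(c)  # (N, HW) scaled dot products

    # Build the per-query attention mask.
    allowed = np.empty((n, fg_mask.shape[0]), dtype=bool)
    allowed[: n // 2] = fg_mask
    allowed[n // 2:] = ~fg_mask

    # If a query has no allowed pixel at all, let it attend everywhere
    # so the softmax below stays well defined.
    allowed[~allowed.any(axis=1)] = True

    # Masked softmax over pixels.
    logits = np.where(allowed, logits, -np.inf)
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)

    # Residual update of the queries with the attended pixel features.
    updated = queries + weights @ pixel_feats
    return updated, weights
```

In a full model this step would be repeated over several transformer blocks with learned projections, and the pixel features would in turn be updated from the queries; the sketch only shows how the mask keeps foreground and background queries from mixing.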
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| semi-supervised-video-object-segmentation-on-1 | Cutie (base, MEGA) | F-measure (Mean): 89.9 FPS: 36.4 J&F: 86.1 Jaccard (Mean): 82.4 |
| semi-supervised-video-object-segmentation-on-1 | Cutie+ (base) | F-measure (Mean): 89.2 FPS: 17.9 J&F: 85.9 Jaccard (Mean): 82.6 |
| semi-supervised-video-object-segmentation-on-1 | Cutie+ (base, MEGA) | F-measure (Mean): 91.4 FPS: 17.9 J&F: 88.1 Jaccard (Mean): 84.7 |
| semi-supervised-video-object-segmentation-on-18 | Cutie+ (base, MEGA) | F-Measure (Seen): 90.6 F-Measure (Unseen): 90.5 FPS: 17.9 Jaccard (Seen): 86.3 Jaccard (Unseen): 82.7 Overall: 87.5 |
| semi-supervised-video-object-segmentation-on-21 | Cutie (small, MEGA) | F: 72.9 FPS: 45.5 J: 64.3 J&F: 68.6 |
| semi-supervised-video-object-segmentation-on-21 | Cutie+ (base, MEGA) | F: 75.8 FPS: 17.9 J: 67.6 J&F: 71.7 |
| semi-supervised-video-object-segmentation-on-21 | Cutie (base) | F: 67.9 FPS: 36.4 J: 60.0 J&F: 64.0 |
| semi-supervised-video-object-segmentation-on-21 | Cutie+ (small, MEGA) | F: 74.5 FPS: 20.6 J: 66.0 J&F: 70.3 |
| semi-supervised-video-object-segmentation-on-21 | Cutie (small) | F: 66.2 FPS: 45.5 J: 58.2 J&F: 62.2 |
| semi-supervised-video-object-segmentation-on-21 | Cutie (base, with mose) | F: 72.3 FPS: 36.4 J: 64.2 J&F: 68.3 |
| semi-supervised-video-object-segmentation-on-21 | Cutie (base, MEGA) | F: 74.1 FPS: 36.4 J: 65.8 J&F: 69.9 |
| semi-supervised-video-object-segmentation-on-21 | Cutie (small, with mose) | F: 71.7 FPS: 45.5 J: 63.1 J&F: 67.4 |
| semi-supervised-video-object-segmentation-on-22 | Cutie (base, with mose, 600 pixels) | HOTA (all): 58.4 HOTA (common): 61.8 HOTA (uncommon): 57.5 |
| semi-supervised-video-object-segmentation-on-22 | Cutie (base, MEGA, 600 pixels) | HOTA (all): 61.2 HOTA (common): 65.0 HOTA (uncommon): 60.3 |
| semi-supervised-video-object-segmentation-on-23 | Cutie (base, MEGA, 600 pixels) | HOTA (all): 66.0 HOTA (common): 66.5 HOTA (uncommon): 65.9 |
| semi-supervised-video-object-segmentation-on-23 | Cutie (base, with mose, 600 pixels) | HOTA (all): 62.6 HOTA (common): 63.8 HOTA (uncommon): 62.3 |
| video-object-segmentation-on-mose | Cutie | J&F: 68.3 |
| video-object-segmentation-on-youtube-vos | Cutie+ (base, MEGA) | F-Measure (Seen): 91.0 F-Measure (Unseen): 90.1 Jaccard (Seen): 86.6 Jaccard (Unseen): 82.2 Overall: 87.5 Speed (FPS): 17.9 |
| visual-object-tracking-on-davis-2017 | Cutie+ (base, MEGA) | F-measure (Mean): 90.8 J&F: 88.1 Jaccard (Mean): 85.5 Speed (FPS): 17.9 |
| visual-object-tracking-on-davis-2017 | Cutie (base) | F-measure (Mean): 91.1 J&F: 87.9 Jaccard (Mean): 84.6 Speed (FPS): 36.4 |
| visual-object-tracking-on-davis-2017 | Cutie+ (base) | F-measure (Mean): 93.4 J&F: 90.5 Jaccard (Mean): 87.5 Speed (FPS): 17.9 |
| visual-object-tracking-on-didi | Cutie | Tracking quality: 0.575 |
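
As a reading aid for the table, the short sketch below checks how the combined score relates to its two components. It assumes the usual DAVIS-style convention that the combined score is the arithmetic mean of the region similarity J (Jaccard) and the contour accuracy F; the helper name `j_and_f` is hypothetical, and the sample values are copied from the semi-supervised-video-object-segmentation-on-21 rows above.

```python
def j_and_f(jaccard: float, f_measure: float) -> float:
    """Combined score as the mean of J and F (assumed DAVIS-style convention)."""
    return (jaccard + f_measure) / 2.0


# (method, J, F, reported combined score), taken from the table rows above.
rows = [
    ("Cutie (small, MEGA)", 64.3, 72.9, 68.6),
    ("Cutie+ (base, MEGA)", 67.6, 75.8, 71.7),
    ("Cutie (small)", 58.2, 66.2, 62.2),
    ("Cutie (small, with mose)", 63.1, 71.7, 67.4),
]

for method, j, f, reported in rows:
    computed = j_and_f(j, f)
    print(f"{method}: computed {computed:.2f}, reported {reported}")
```

Reported scores are typically computed from unrounded per-sequence values, so a recomputed mean can differ from the listed number by about 0.1 in some rows.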