General Object Foundation Model for Images and Videos at Scale
Junfeng Wu; Yi Jiang; Qihao Liu; Zehuan Yuan; Xiang Bai; Song Bai

Abstract
We present GLEE in this work, an object-level foundation model for locating and identifying objects in images and videos. Through a unified framework, GLEE accomplishes detection, segmentation, tracking, grounding, and identification of arbitrary objects in the open-world scenario for various object perception tasks. Adopting a cohesive learning strategy, GLEE acquires knowledge from diverse data sources with varying supervision levels to formulate general object representations, excelling in zero-shot transfer to new data and tasks. Specifically, we employ an image encoder, text encoder, and visual prompter to handle multi-modal inputs, enabling it to simultaneously solve various object-centric downstream tasks while maintaining state-of-the-art performance. Demonstrated through extensive training on over five million images from diverse benchmarks, GLEE exhibits remarkable versatility and improved generalization performance, efficiently tackling downstream tasks without the need for task-specific adaptation. By integrating large volumes of automatically labeled data, we further enhance its zero-shot generalization capabilities. Additionally, GLEE is capable of being integrated into Large Language Models, serving as a foundational model to provide universal object-level information for multi-modal tasks. We hope that the versatility and universality of our method will mark a significant step in the development of efficient visual foundation models for AGI systems. The model and code will be released at https://glee-vision.github.io .
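The abstract describes a unified framework in which an image encoder, a text encoder, and a visual prompter feed shared object-level heads. The sketch below is a minimal, hypothetical PyTorch illustration of that wiring only, not the released GLEE architecture: the class name `ToyGLEE`, the stand-in modules, and all dimensions are assumptions made for exposition.

```python
# Illustrative sketch only -- not the released GLEE code. It shows, under assumed
# module names and dimensions, how an image encoder, a text encoder, and a visual
# prompter could feed a shared object decoder that emits per-object embeddings,
# which are then scored against text embeddings for open-vocabulary recognition.
import torch
import torch.nn as nn


class ToyGLEE(nn.Module):
    def __init__(self, dim: int = 256, num_queries: int = 100, vocab: int = 1000):
        super().__init__()
        # Image encoder: stand-in for a ViT/ResNet backbone producing visual tokens.
        self.image_encoder = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        # Text encoder: stand-in for a CLIP-style encoder over category/expression tokens.
        self.text_encoder = nn.Embedding(vocab, dim)
        # Visual prompter: embeds box prompts (x1, y1, x2, y2) into the shared space.
        self.visual_prompter = nn.Linear(4, dim)
        # Object decoder: learned queries attend to visual tokens (Transformer stand-in).
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.box_head = nn.Linear(dim, 4)

    def forward(self, image, text_ids=None, boxes=None):
        b = image.shape[0]
        tokens = self.image_encoder(image).flatten(2).transpose(1, 2)   # (B, N, dim)
        queries = self.queries.unsqueeze(0).expand(b, -1, -1)
        if boxes is not None:
            # Prepend prompt embeddings so visual prompts can steer the decoding.
            queries = torch.cat([self.visual_prompter(boxes), queries], dim=1)
        obj = self.decoder(queries, tokens)                              # object embeddings
        out = {"boxes": self.box_head(obj).sigmoid()}
        if text_ids is not None:
            # Open-vocabulary scores: similarity between object and text embeddings.
            txt = self.text_encoder(text_ids).mean(dim=1)                # (B, dim)
            out["scores"] = torch.einsum("bqd,bd->bq", obj, txt)
        return out


if __name__ == "__main__":
    model = ToyGLEE()
    img = torch.randn(1, 3, 224, 224)
    names = torch.randint(0, 1000, (1, 5))            # tokenized category names (assumed)
    prompt = torch.tensor([[[0.1, 0.1, 0.5, 0.5]]])   # one normalized box prompt
    out = model(img, text_ids=names, boxes=prompt)
    print(out["boxes"].shape, out["scores"].shape)    # (1, 101, 4) (1, 101)
```

Because every task-specific output is read off the same object embeddings, such a design can, in principle, serve detection, segmentation, grounding, and tracking heads without per-task adaptation, which is the property the abstract emphasizes.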
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| instance-segmentation-on-coco | GLEE-Lite | mask AP: 48.3 |
| instance-segmentation-on-coco | GLEE-Plus | mask AP: 53.3 |
| instance-segmentation-on-coco | GLEE-Pro | mask AP: 54.5 |
| instance-segmentation-on-coco-minival | GLEE-Pro | mask AP: 54.2 |
| instance-segmentation-on-coco-minival | GLEE-Plus | mask AP: 53.0 |
| instance-segmentation-on-coco-minival | GLEE-Lite | mask AP: 48.4 |
| instance-segmentation-on-lvis-v1-0-val | GLEE-Pro | mask AP: 49.9 |
| long-tail-video-object-segmentation-on-burst | GLEE-Lite | HOTA (all): 22.6, HOTA (com): 36.4, HOTA (unc): 19.1, mAP (all): 12.6, mAP (com): 18.9, mAP (unc): 11.0 |
| long-tail-video-object-segmentation-on-burst-1 | GLEE-Lite | HOTA (all): 22.6, HOTA (com): 36.4, HOTA (unc): 19.1, mAP (all): 12.6, mAP (com): 18.9, mAP (unc): 11.0 |
| long-tail-video-object-segmentation-on-burst-1 | GLEE-Pro | HOTA (all): 31.2, HOTA (com): 48.7, HOTA (unc): 26.9, mAP (all): 19.2, mAP (com): 24.8, mAP (unc): 17.7 |
| long-tail-video-object-segmentation-on-burst-1 | GLEE-Plus | HOTA (all): 26.9, HOTA (com): 38.8, HOTA (unc): 23.9, mAP (all): 17.2, mAP (com): 23.7, mAP (unc): 15.5 |
| multi-object-tracking-on-tao | GLEE-Lite | AssocA: 39.9, ClsA: 24.1, LocA: 56.3, TETA: 40.1 |
| multi-object-tracking-on-tao | GLEE-Plus | AssocA: 40.9, ClsA: 30.8, LocA: 52.9, TETA: 41.5 |
| multi-object-tracking-on-tao | GLEE-Pro | AssocA: 46.2, ClsA: 29.1, LocA: 66.2, TETA: 47.2 |
| object-detection-on-coco | GLEE-Lite | box mAP: 54.7 |
| object-detection-on-coco | GLEE-Pro | box mAP: 62.3 |
| object-detection-on-coco | GLEE-Plus | box mAP: 60.6 |
| object-detection-on-coco-minival | GLEE-Pro | box AP: 62.0 |
| object-detection-on-coco-minival | GLEE-Lite | box AP: 55.0 |
| object-detection-on-coco-minival | GLEE-Plus | box AP: 60.4 |
| object-detection-on-lvis-v1-0-val | GLEE-Pro | box AP: 55.7 |
| open-world-instance-segmentation-on-uvo | GLEE-Pro | ARmask: 72.6 |
| referring-expression-segmentation-on-refcoco | GLEE-Pro | Overall IoU: 80.0 |
| referring-expression-segmentation-on-refcoco-3 | GLEE-Pro | Overall IoU: 69.6 |
| referring-expression-segmentation-on-refcoco-6 | GLEE-Pro | IoU: 80.0 |
| referring-expression-segmentation-on-refcocog | GLEE-Pro | Overall IoU: 72.9 |
| referring-expression-segmentation-on-refer-1 | GLEE-Pro | F: 72.9, J: 68.2, J&F: 70.6 |
| referring-video-object-segmentation-on-refer | GLEE-Plus | F: 69.7, J: 65.6, J&F: 67.7 |
| referring-video-object-segmentation-on-refer | GLEE-Pro | F: 72.9, J: 68.2, J&F: 70.6 |
| video-instance-segmentation-on-ovis-1 | GLEE-Pro | AP75: 55.5, mask AP: 50.4 |