General Object Foundation Model for Images and Videos at Scale
Junfeng Wu; Yi Jiang; Qihao Liu; Zehuan Yuan; Xiang Bai; Song Bai

Abstract
We present GLEE in this work, an object-level foundation model for locating and identifying objects in images and videos. Through a unified framework, GLEE accomplishes detection, segmentation, tracking, grounding, and identification of arbitrary objects in the open-world scenario for various object perception tasks. Adopting a cohesive learning strategy, GLEE acquires knowledge from diverse data sources with varying supervision levels to formulate general object representations, excelling in zero-shot transfer to new data and tasks. Specifically, we employ an image encoder, text encoder, and visual prompter to handle multi-modal inputs, enabling it to simultaneously solve various object-centric downstream tasks while maintaining state-of-the-art performance. Demonstrated through extensive training on over five million images from diverse benchmarks, GLEE exhibits remarkable versatility and improved generalization performance, efficiently tackling downstream tasks without the need for task-specific adaptation. By integrating large volumes of automatically labeled data, we further enhance its zero-shot generalization capabilities. Additionally, GLEE is capable of being integrated into Large Language Models, serving as a foundational model to provide universal object-level information for multi-modal tasks. We hope that the versatility and universality of our method will mark a significant step in the development of efficient visual foundation models for AGI systems. The model and code will be released at https://glee-vision.github.io .
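The abstract describes a unified framework in which an image encoder, a text encoder, and a visual prompter feed shared object-level heads. The sketch below is a minimal, hypothetical PyTorch illustration of that wiring only, not the released GLEE architecture: the class name `ToyGLEE`, the stand-in modules, and all dimensions are assumptions made for exposition.

```python
# Illustrative sketch only -- not the released GLEE code. It shows, under assumed
# module names and dimensions, how an image encoder, a text encoder, and a visual
# prompter could feed a shared object decoder that emits per-object embeddings,
# which are then scored against text embeddings for open-vocabulary recognition.
import torch
import torch.nn as nn


class ToyGLEE(nn.Module):
    def __init__(self, dim: int = 256, num_queries: int = 100, vocab: int = 1000):
        super().__init__()
        # Image encoder: stand-in for a ViT/ResNet backbone producing visual tokens.
        self.image_encoder = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        # Text encoder: stand-in for a CLIP-style encoder over category/expression tokens.
        self.text_encoder = nn.Embedding(vocab, dim)
        # Visual prompter: embeds box prompts (x1, y1, x2, y2) into the shared space.
        self.visual_prompter = nn.Linear(4, dim)
        # Object decoder: learned queries attend to visual tokens (Transformer stand-in).
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.box_head = nn.Linear(dim, 4)

    def forward(self, image, text_ids=None, boxes=None):
        b = image.shape[0]
        tokens = self.image_encoder(image).flatten(2).transpose(1, 2)   # (B, N, dim)
        queries = self.queries.unsqueeze(0).expand(b, -1, -1)
        if boxes is not None:
            # Prepend prompt embeddings so visual prompts can steer the decoding.
            queries = torch.cat([self.visual_prompter(boxes), queries], dim=1)
        obj = self.decoder(queries, tokens)                              # object embeddings
        out = {"boxes": self.box_head(obj).sigmoid()}
        if text_ids is not None:
            # Open-vocabulary scores: similarity between object and text embeddings.
            txt = self.text_encoder(text_ids).mean(dim=1)                # (B, dim)
            out["scores"] = torch.einsum("bqd,bd->bq", obj, txt)
        return out


if __name__ == "__main__":
    model = ToyGLEE()
    img = torch.randn(1, 3, 224, 224)
    names = torch.randint(0, 1000, (1, 5))            # tokenized category names (assumed)
    prompt = torch.tensor([[[0.1, 0.1, 0.5, 0.5]]])   # one normalized box prompt
    out = model(img, text_ids=names, boxes=prompt)
    print(out["boxes"].shape, out["scores"].shape)    # (1, 101, 4) (1, 101)
```

Because every task-specific output is read off the same object embeddings, such a design can, in principle, serve detection, segmentation, grounding, and tracking heads without per-task adaptation, which is the property the abstract emphasizes.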
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| instance-segmentation-on-coco | GLEE-Lite | mask AP: 48.3 |
| instance-segmentation-on-coco | GLEE-Plus | mask AP: 53.3 |
| instance-segmentation-on-coco | GLEE-Pro | mask AP: 54.5 |
| instance-segmentation-on-coco-minival | GLEE-Pro | mask AP: 54.2 |
| instance-segmentation-on-coco-minival | GLEE-Plus | mask AP: 53.0 |
| instance-segmentation-on-coco-minival | GLEE-Lite | mask AP: 48.4 |
| instance-segmentation-on-lvis-v1-0-val | GLEE-Pro | mask AP: 49.9 |
| long-tail-video-object-segmentation-on-burst | GLEE-Lite | HOTA (all): 22.6, HOTA (com): 36.4, HOTA (unc): 19.1, mAP (all): 12.6, mAP (com): 18.9, mAP (unc): 11.0 |
| long-tail-video-object-segmentation-on-burst-1 | GLEE-Lite | HOTA (all): 22.6, HOTA (com): 36.4, HOTA (unc): 19.1, mAP (all): 12.6, mAP (com): 18.9, mAP (unc): 11.0 |
| long-tail-video-object-segmentation-on-burst-1 | GLEE-Pro | HOTA (all): 31.2, HOTA (com): 48.7, HOTA (unc): 26.9, mAP (all): 19.2, mAP (com): 24.8, mAP (unc): 17.7 |
| long-tail-video-object-segmentation-on-burst-1 | GLEE-Plus | HOTA (all): 26.9, HOTA (com): 38.8, HOTA (unc): 23.9, mAP (all): 17.2, mAP (com): 23.7, mAP (unc): 15.5 |
| multi-object-tracking-on-tao | GLEE-Lite | AssocA: 39.9, ClsA: 24.1, LocA: 56.3, TETA: 40.1 |
| multi-object-tracking-on-tao | GLEE-Plus | AssocA: 40.9, ClsA: 30.8, LocA: 52.9, TETA: 41.5 |
| multi-object-tracking-on-tao | GLEE-Pro | AssocA: 46.2, ClsA: 29.1, LocA: 66.2, TETA: 47.2 |
| object-detection-on-coco | GLEE-Lite | box mAP: 54.7 |
| object-detection-on-coco | GLEE-Pro | box mAP: 62.3 |
| object-detection-on-coco | GLEE-Plus | box mAP: 60.6 |
| object-detection-on-coco-minival | GLEE-Pro | box AP: 62.0 |
| object-detection-on-coco-minival | GLEE-Lite | box AP: 55.0 |
| object-detection-on-coco-minival | GLEE-Plus | box AP: 60.4 |
| object-detection-on-lvis-v1-0-val | GLEE-Pro | box AP: 55.7 |
| open-world-instance-segmentation-on-uvo | GLEE-Pro | ARmask: 72.6 |
| referring-expression-segmentation-on-refcoco | GLEE-Pro | Overall IoU: 80.0 |
| referring-expression-segmentation-on-refcoco-3 | GLEE-Pro | Overall IoU: 69.6 |
| referring-expression-segmentation-on-refcoco-6 | GLEE-Pro | IoU: 80.0 |
| referring-expression-segmentation-on-refcocog | GLEE-Pro | Overall IoU: 72.9 |
| referring-expression-segmentation-on-refer-1 | GLEE-Pro | F: 72.9, J: 68.2, J&F: 70.6 |
| referring-video-object-segmentation-on-refer | GLEE-Plus | F: 69.7, J: 65.6, J&F: 67.7 |
| referring-video-object-segmentation-on-refer | GLEE-Pro | F: 72.9, J: 68.2, J&F: 70.6 |
| video-instance-segmentation-on-ovis-1 | GLEE-Pro | AP75: 55.5, mask AP: 50.4 |