General Object Foundation Model for Images and Videos at Scale

Junfeng Wu, Yi Jiang, Qihao Liu, Zehuan Yuan, Xiang Bai, Song Bai


Abstract

We present GLEE in this work, an object-level foundation model for locating and identifying objects in images and videos. Through a unified framework, GLEE accomplishes detection, segmentation, tracking, grounding, and identification of arbitrary objects in the open-world scenario for various object perception tasks. Adopting a cohesive learning strategy, GLEE acquires knowledge from diverse data sources with varying supervision levels to formulate general object representations, excelling in zero-shot transfer to new data and tasks. Specifically, we employ an image encoder, text encoder, and visual prompter to handle multi-modal inputs, enabling it to simultaneously solve various object-centric downstream tasks while maintaining state-of-the-art performance. Demonstrated through extensive training on over five million images from diverse benchmarks, GLEE exhibits remarkable versatility and improved generalization performance, efficiently tackling downstream tasks without the need for task-specific adaptation. By integrating large volumes of automatically labeled data, we further enhance its zero-shot generalization capabilities. Additionally, GLEE is capable of being integrated into Large Language Models, serving as a foundational model to provide universal object-level information for multi-modal tasks. We hope that the versatility and universality of our method will mark a significant step in the development of efficient visual foundation models for AGI systems. The model and code will be released at https://glee-vision.github.io.
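The abstract describes a single model that routes image, text, and visual-prompt inputs through one framework to serve many object-centric tasks. The following is a minimal illustrative sketch of such a unified interface; all class and method names here are hypothetical stand-ins, not the official GLEE API, and the "encoders" are placeholders rather than real networks.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ObjectPrediction:
    """One object-level output: box, confidence, label, optional track ID."""
    box: tuple                      # (x1, y1, x2, y2)
    score: float
    label: str
    track_id: Optional[int] = None

class UnifiedObjectModel:
    """Toy stand-in for an object-level foundation model: one entry point
    dispatches detection (no prompt) and grounding (text prompt)."""

    def __init__(self):
        # Stand-in vocabulary; a real model would use an open vocabulary.
        self.known_labels = ["person", "car", "dog"]

    def _image_encoder(self, image):
        # Placeholder: a real encoder would produce dense visual features.
        return {"h": len(image), "w": len(image[0])}

    def _text_encoder(self, text):
        # Placeholder: a real encoder would embed category names or phrases.
        return text.lower().split()

    def perceive(self, image, text_prompt=None):
        """Single forward pass covering detection and grounding."""
        feats = self._image_encoder(image)
        labels = (self._text_encoder(text_prompt)
                  if text_prompt else self.known_labels)
        # Toy "decoder": one full-image prediction per requested label.
        return [ObjectPrediction(box=(0, 0, feats["w"], feats["h"]),
                                 score=0.9, label=lab) for lab in labels]

image = [[0] * 4 for _ in range(3)]   # 3x4 dummy image
preds = UnifiedObjectModel().perceive(image, text_prompt="person car")
```

The point of the sketch is the interface shape: task selection happens through the prompt, not through task-specific heads, which mirrors how the paper frames its unified framework.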

Code Repositories

FoundationVision/GLEE (official, PyTorch)

Benchmarks

| Benchmark | Method | Metrics |
| --- | --- | --- |
| instance-segmentation-on-coco | GLEE-Lite | mask AP: 48.3 |
| instance-segmentation-on-coco | GLEE-Plus | mask AP: 53.3 |
| instance-segmentation-on-coco | GLEE-Pro | mask AP: 54.5 |
| instance-segmentation-on-coco-minival | GLEE-Pro | mask AP: 54.2 |
| instance-segmentation-on-coco-minival | GLEE-Plus | mask AP: 53.0 |
| instance-segmentation-on-coco-minival | GLEE-Lite | mask AP: 48.4 |
| instance-segmentation-on-lvis-v1-0-val | GLEE-Pro | mask AP: 49.9 |
| long-tail-video-object-segmentation-on-burst | GLEE-Lite | HOTA (all): 22.6, HOTA (com): 36.4, HOTA (unc): 19.1, mAP (all): 12.6, mAP (com): 18.9, mAP (unc): 11.0 |
| long-tail-video-object-segmentation-on-burst-1 | GLEE-Lite | HOTA (all): 22.6, HOTA (com): 36.4, HOTA (unc): 19.1, mAP (all): 12.6, mAP (com): 18.9, mAP (unc): 11.0 |
| long-tail-video-object-segmentation-on-burst-1 | GLEE-Pro | HOTA (all): 31.2, HOTA (com): 48.7, HOTA (unc): 26.9, mAP (all): 19.2, mAP (com): 24.8, mAP (unc): 17.7 |
| long-tail-video-object-segmentation-on-burst-1 | GLEE-Plus | HOTA (all): 26.9, HOTA (com): 38.8, HOTA (unc): 23.9, mAP (all): 17.2, mAP (com): 23.7, mAP (unc): 15.5 |
| multi-object-tracking-on-tao | GLEE-Lite | AssocA: 39.9, ClsA: 24.1, LocA: 56.3, TETA: 40.1 |
| multi-object-tracking-on-tao | GLEE-Plus | AssocA: 40.9, ClsA: 30.8, LocA: 52.9, TETA: 41.5 |
| multi-object-tracking-on-tao | GLEE-Pro | AssocA: 46.2, ClsA: 29.1, LocA: 66.2, TETA: 47.2 |
| object-detection-on-coco | GLEE-Lite | box mAP: 54.7 |
| object-detection-on-coco | GLEE-Pro | box mAP: 62.3 |
| object-detection-on-coco | GLEE-Plus | box mAP: 60.6 |
| object-detection-on-coco-minival | GLEE-Pro | box AP: 62.0 |
| object-detection-on-coco-minival | GLEE-Lite | box AP: 55.0 |
| object-detection-on-coco-minival | GLEE-Plus | box AP: 60.4 |
| object-detection-on-lvis-v1-0-val | GLEE-Pro | box AP: 55.7 |
| open-world-instance-segmentation-on-uvo | GLEE-Pro | ARmask: 72.6 |
| referring-expression-segmentation-on-refcoco | GLEE-Pro | Overall IoU: 80.0 |
| referring-expression-segmentation-on-refcoco-3 | GLEE-Pro | Overall IoU: 69.6 |
| referring-expression-segmentation-on-refcoco-6 | GLEE-Pro | IoU: 80.0 |
| referring-expression-segmentation-on-refcocog | GLEE-Pro | Overall IoU: 72.9 |
| referring-expression-segmentation-on-refer-1 | GLEE-Pro | F: 72.9, J: 68.2, J&F: 70.6 |
| referring-video-object-segmentation-on-refer | GLEE-Plus | F: 69.7, J: 65.6, J&F: 67.7 |
| referring-video-object-segmentation-on-refer | GLEE-Pro | F: 72.9, J: 68.2, J&F: 70.6 |
| video-instance-segmentation-on-ovis-1 | GLEE-Pro | AP75: 55.5, mask AP: 50.4 |
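Several of the referring-segmentation rows above report "Overall IoU" (oIoU), which pools intersections and unions across all samples rather than averaging per-sample IoU. A minimal sketch of both quantities on binary masks, assuming masks are given as nested lists of 0/1:

```python
def mask_iou(pred, gt):
    """Per-sample intersection-over-union between two binary masks."""
    inter = sum(p & g for rp, rg in zip(pred, gt) for p, g in zip(rp, rg))
    union = sum(p | g for rp, rg in zip(pred, gt) for p, g in zip(rp, rg))
    return inter / union if union else 0.0

def overall_iou(pairs):
    """Overall IoU (oIoU): accumulate intersection and union over the
    whole dataset first, then take a single ratio. This weights large
    objects more heavily than a mean of per-sample IoUs."""
    inter = union = 0
    for pred, gt in pairs:
        for rp, rg in zip(pred, gt):
            for p, g in zip(rp, rg):
                inter += p & g
                union += p | g
    return inter / union if union else 0.0

pred = [[1, 1], [0, 0]]
gt = [[1, 0], [0, 0]]
```

Here `mask_iou(pred, gt)` is 1/2 (intersection 1, union 2), while `overall_iou` over a dataset of several masks gives the pooled ratio used by the RefCOCO-style benchmarks.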
