
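Abstract
In this work, we present GLEE, an object-level foundation model for locating and identifying objects in images and videos. Through a unified framework, GLEE handles a range of object perception tasks in open-world scenarios: detection, segmentation, tracking, grounding, and identification of arbitrary objects. Using a cohesive learning strategy, GLEE acquires knowledge from diverse data sources with varying levels of supervision to form general object representations, and it excels in zero-shot transfer to new data and tasks. Specifically, we employ an image encoder, a text encoder, and a visual prompter to handle multi-modal inputs, which lets the model solve various object-centric downstream tasks while maintaining state-of-the-art performance. Trained extensively on more than five million images from multiple benchmarks, GLEE shows remarkable versatility and improved generalization, tackling downstream tasks efficiently without task-specific adaptation. By integrating large volumes of automatically labeled data, we further enhance its zero-shot generalization. In addition, GLEE can be integrated into large language models, serving as a foundation model that provides universal object-level information for multi-modal tasks. We hope that the versatility and universality of our method will mark a significant step toward efficient visual foundation models for AGI systems. The model and code will be released at https://glee-vision.github.io .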
Code Repositories
FoundationVision/GLEE
Official
pytorch
Mentioned in GitHub
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| instance-segmentation-on-coco | GLEE-Lite | mask AP: 48.3 |
| instance-segmentation-on-coco | GLEE-Plus | mask AP: 53.3 |
| instance-segmentation-on-coco | GLEE-Pro | mask AP: 54.5 |
| instance-segmentation-on-coco-minival | GLEE-Pro | mask AP: 54.2 |
| instance-segmentation-on-coco-minival | GLEE-Plus | mask AP: 53.0 |
| instance-segmentation-on-coco-minival | GLEE-Lite | mask AP: 48.4 |
| instance-segmentation-on-lvis-v1-0-val | GLEE-Pro | mask AP: 49.9 |
| long-tail-video-object-segmentation-on-burst | GLEE-Lite | HOTA (all): 22.6, HOTA (com): 36.4, HOTA (unc): 19.1, mAP (all): 12.6, mAP (com): 18.9, mAP (unc): 11.0 |
| long-tail-video-object-segmentation-on-burst-1 | GLEE-Lite | HOTA (all): 22.6, HOTA (com): 36.4, HOTA (unc): 19.1, mAP (all): 12.6, mAP (com): 18.9, mAP (unc): 11.0 |
| long-tail-video-object-segmentation-on-burst-1 | GLEE-Pro | HOTA (all): 31.2, HOTA (com): 48.7, HOTA (unc): 26.9, mAP (all): 19.2, mAP (com): 24.8, mAP (unc): 17.7 |
| long-tail-video-object-segmentation-on-burst-1 | GLEE-Plus | HOTA (all): 26.9, HOTA (com): 38.8, HOTA (unc): 23.9, mAP (all): 17.2, mAP (com): 23.7, mAP (unc): 15.5 |
| multi-object-tracking-on-tao | GLEE-Lite | AssocA: 39.9, ClsA: 24.1, LocA: 56.3, TETA: 40.1 |
| multi-object-tracking-on-tao | GLEE-Plus | AssocA: 40.9, ClsA: 30.8, LocA: 52.9, TETA: 41.5 |
| multi-object-tracking-on-tao | GLEE-Pro | AssocA: 46.2, ClsA: 29.1, LocA: 66.2, TETA: 47.2 |
| object-detection-on-coco | GLEE-Lite | box mAP: 54.7 |
| object-detection-on-coco | GLEE-Pro | box mAP: 62.3 |
| object-detection-on-coco | GLEE-Plus | box mAP: 60.6 |
| object-detection-on-coco-minival | GLEE-Pro | box AP: 62.0 |
| object-detection-on-coco-minival | GLEE-Lite | box AP: 55.0 |
| object-detection-on-coco-minival | GLEE-Plus | box AP: 60.4 |
| object-detection-on-lvis-v1-0-val | GLEE-Pro | box AP: 55.7 |
| open-world-instance-segmentation-on-uvo | GLEE-Pro | ARmask: 72.6 |
| referring-expression-segmentation-on-refcoco | GLEE-Pro | Overall IoU: 80.0 |
| referring-expression-segmentation-on-refcoco-3 | GLEE-Pro | Overall IoU: 69.6 |
| referring-expression-segmentation-on-refcoco-6 | GLEE-Pro | IoU: 80.0 |
| referring-expression-segmentation-on-refcocog | GLEE-Pro | Overall IoU: 72.9 |
| referring-expression-segmentation-on-refer-1 | GLEE-Pro | F: 72.9, J: 68.2, J&F: 70.6 |
| referring-video-object-segmentation-on-refer | GLEE-Plus | F: 69.7, J: 65.6, J&F: 67.7 |
| referring-video-object-segmentation-on-refer | GLEE-Pro | F: 72.9, J: 68.2, J&F: 70.6 |
| video-instance-segmentation-on-ovis-1 | GLEE-Pro | AP75: 55.5, mask AP: 50.4 |
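
Two of the metrics in the table are composites of the others in the same row: TETA is the arithmetic mean of LocA, AssocA, and ClsA, and J&F is the mean of J and F. The short sketch below checks that the reported numbers are consistent with those definitions. The values are copied from the table; the dictionary layout, the helper names, and the 0.1 tolerance (chosen here to absorb one-decimal rounding of both the inputs and the composite) are assumptions made for illustration and are not part of the GLEE code base.

```python
# Reported values copied from the benchmark table (TAO tracking rows).
TAO = {
    "GLEE-Lite": {"AssocA": 39.9, "ClsA": 24.1, "LocA": 56.3, "TETA": 40.1},
    "GLEE-Plus": {"AssocA": 40.9, "ClsA": 30.8, "LocA": 52.9, "TETA": 41.5},
    "GLEE-Pro": {"AssocA": 46.2, "ClsA": 29.1, "LocA": 66.2, "TETA": 47.2},
}

# Reported values copied from the referring video object segmentation rows.
REFER = {
    "GLEE-Plus": {"J": 65.6, "F": 69.7, "J&F": 67.7},
    "GLEE-Pro": {"J": 68.2, "F": 72.9, "J&F": 70.6},
}

# Assumed tolerance: every number is rounded to one decimal place, so the
# recomputed mean can differ from the reported composite by up to about 0.1.
TOLERANCE = 0.1


def mean(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    values = list(values)
    return sum(values) / len(values)


def check_composites():
    """Return {(metric, model): bool} for each composite metric in the table."""
    results = {}
    for model, row in TAO.items():
        recomputed = mean([row["LocA"], row["AssocA"], row["ClsA"]])
        results[("TETA", model)] = abs(recomputed - row["TETA"]) <= TOLERANCE
    for model, row in REFER.items():
        recomputed = mean([row["J"], row["F"]])
        results[("J&F", model)] = abs(recomputed - row["J&F"]) <= TOLERANCE
    return results


if __name__ == "__main__":
    for (metric, model), ok in check_composites().items():
        print(f"{metric:5s} {model:10s} consistent={ok}")
```

A check like this is mainly useful for catching transcription errors when results are copied between a paper and a leaderboard page.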