
Abstract
This paper aims to address universal segmentation for image and video perception by leveraging the strong reasoning abilities of Visual Large Language Models (VLLMs). Despite significant progress in current unified segmentation methods, their limited adaptation to both image and video scenarios and to complex reasoning segmentation makes it difficult for them to handle various challenging instructions and to accurately understand fine-grained vision-language correlations. To this end, we propose HyperSeg, the first VLLM-based universal segmentation model for pixel-level image and video perception, covering generic segmentation tasks as well as more complex perception tasks that require powerful reasoning abilities and world knowledge. Moreover, to fully exploit the recognition capabilities of VLLMs and fine-grained visual information, HyperSeg integrates a hybrid entity recognition module and a fine-grained visual perceiver module to tackle diverse segmentation tasks. Combined with a temporal adapter, HyperSeg achieves a comprehensive understanding of temporal information. Experimental results validate the effectiveness of our insights in solving universal image and video segmentation tasks, including the more complex reasoning perception tasks. Our code is publicly available.
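The abstract names three architectural pieces (a hybrid entity recognition module, a fine-grained visual perceiver, and a temporal adapter) without detailing how they connect. The PyTorch sketch below is a minimal, hypothetical composition of those pieces for illustration only: the class name `HyperSegSketch`, all layer choices, dimensions, and the query-based mask head are assumptions and do not reflect the released congvvc/HyperSeg code.

```python
import torch
import torch.nn as nn


class HyperSegSketch(nn.Module):
    """Hypothetical wiring of the components named in the abstract.

    Assumptions: the VLLM/vision backbone is stood in by a patch-embedding
    conv, and all module designs are illustrative, not the actual HyperSeg.
    """

    def __init__(self, dim=256, num_queries=100):
        super().__init__()
        # Stand-in visual backbone: 16x16 patch embedding.
        self.vision_encoder = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        # "Fine-grained visual perceiver": refines per-frame visual tokens.
        self.fine_grained_perceiver = nn.TransformerEncoderLayer(
            d_model=dim, nhead=8, batch_first=True)
        # "Temporal adapter": fuses per-frame global features across time.
        self.temporal_adapter = nn.GRU(dim, dim, batch_first=True)
        # "Hybrid entity recognition": learnable entity queries + decoder.
        self.entity_queries = nn.Embedding(num_queries, dim)
        self.entity_decoder = nn.TransformerDecoderLayer(
            d_model=dim, nhead=8, batch_first=True)
        self.mask_head = nn.Linear(dim, dim)

    def forward(self, frames):
        # frames: (B, T, 3, H, W) video clip (T=1 for images).
        b, t = frames.shape[:2]
        feats = self.vision_encoder(frames.flatten(0, 1))      # (B*T, D, h, w)
        tokens = feats.flatten(2).transpose(1, 2)               # (B*T, N, D)
        tokens = self.fine_grained_perceiver(tokens)            # fine-grained tokens

        # Temporal fusion of per-frame global descriptors.
        frame_feat = tokens.mean(dim=1).view(b, t, -1)          # (B, T, D)
        frame_feat, _ = self.temporal_adapter(frame_feat)

        # Entity queries conditioned on the temporally fused context.
        queries = self.entity_queries.weight.unsqueeze(0).expand(b * t, -1, -1)
        queries = queries + frame_feat.reshape(b * t, 1, -1)
        entities = self.entity_decoder(queries, tokens)          # (B*T, Q, D)

        # Per-query mask logits via dot product with visual tokens.
        masks = torch.einsum("bqd,bnd->bqn", self.mask_head(entities), tokens)
        return masks.view(b, t, masks.shape[1], masks.shape[2])  # (B, T, Q, N)


if __name__ == "__main__":
    # Toy usage: a 2-frame clip yields per-frame, per-query mask logits.
    out = HyperSegSketch()(torch.randn(1, 2, 3, 224, 224))
    print(out.shape)  # torch.Size([1, 2, 100, 196])
```

The query-to-mask decoding at the end mirrors a common pattern in segmentation heads (queries attend to visual tokens, then dot-products produce mask logits); whether HyperSeg decodes masks this way is an assumption here.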
Code Repositories
congvvc/HyperSeg
Official
pytorch
Mentioned in GitHub
Benchmarks
| Benchmark | Method | Metric |
|---|---|---|
| open-vocabulary-semantic-segmentation-on-1 | HyperSeg | mIoU: 64.6 |
| open-vocabulary-semantic-segmentation-on-5 | HyperSeg | mIoU: 92.1 |
| panoptic-segmentation-on-coco-minival | HyperSeg (Swin-B) | PQ: 61.2 |
| referring-expression-segmentation-on-davis | HyperSeg | J&F 1st frame: 71.2 |
| referring-expression-segmentation-on-refcoco | HyperSeg | Overall IoU: 84.8 |
| referring-expression-segmentation-on-refcoco-3 | HyperSeg | Overall IoU: 79.0 |
| referring-expression-segmentation-on-refcoco-4 | HyperSeg | Overall IoU: 83.5 |
| referring-expression-segmentation-on-refcoco-5 | HyperSeg | Overall IoU: 75.2 |
| referring-expression-segmentation-on-refcoco-8 | HyperSeg | Overall IoU: 85.7 |
| referring-expression-segmentation-on-refcoco-9 | HyperSeg | Overall IoU: 83.4 |
| referring-expression-segmentation-on-refcocog | HyperSeg | Overall IoU: 79.4 |
| referring-expression-segmentation-on-refcocog-1 | HyperSeg | Overall IoU: 78.9 |
| referring-video-object-segmentation-on-refer | HyperSeg | J&F: 68.5 |
| semantic-segmentation-on-coco-1 | HyperSeg | mIoU: 77.2 |