8 months ago

Abstract

Promptable segmentation typically requires instance-specific manual prompts to guide the segmentation of each desired object. To minimize this need, task-generic promptable segmentation has been introduced, which employs a single task-generic prompt to segment various images of different objects in the same task. Current methods use Multimodal Large Language Models (MLLMs) to reason detailed instance-specific prompts from a task-generic prompt in order to improve segmentation accuracy. The effectiveness of this segmentation depends heavily on the precision of these derived prompts. However, MLLMs often suffer from hallucinations during reasoning, resulting in inaccurate prompting. While existing methods focus on eliminating hallucinations to improve a model, we argue that MLLM hallucinations can reveal valuable contextual insights when leveraged correctly, as they represent pre-trained large-scale knowledge beyond individual images. In this paper, we utilize hallucinations to mine task-related information from images and verify its accuracy to enhance the precision of the generated prompts. Specifically, we introduce an iterative Prompt-Mask Cycle generation framework (ProMaC) with a prompt generator and a mask generator. The prompt generator uses multi-scale chain-of-thought prompting, initially exploring hallucinations to extract extended contextual knowledge on a test image. These hallucinations are then reduced to formulate precise instance-specific prompts, directing the mask generator to produce masks that are consistent with task semantics through mask semantic alignment. The generated masks iteratively induce the prompt generator to focus more on task-relevant image areas and reduce irrelevant hallucinations, jointly resulting in better prompts and masks. Experiments on 5 benchmarks demonstrate the effectiveness of ProMaC. Code is available at https://lwpyh.github.io/ProMaC/.
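
The alternation between the prompt generator and the mask generator described above can be illustrated with a minimal sketch. This is not the ProMaC implementation: the function names, the point-style prompt, the fixed iteration count, and both toy generators are assumptions made purely for illustration. In the real framework the prompt generator is an MLLM and the mask generator is a segmentation model; here they are replaced by simple stand-ins so that the control flow can be run.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical types for this sketch only: a grayscale image as a 2D grid,
# a binary mask of the same shape, and a point prompt (row, col).
Image = List[List[float]]
Mask = List[List[int]]
Prompt = Tuple[int, int]


def prompt_mask_cycle(
    image: Image,
    task_prompt: str,
    prompt_generator: Callable[[Image, str, Optional[Mask]], Prompt],
    mask_generator: Callable[[Image, Prompt], Mask],
    num_iters: int = 3,
) -> Tuple[Optional[Prompt], Optional[Mask]]:
    """Alternate prompt and mask generation; each mask guides the next prompt."""
    prompt: Optional[Prompt] = None
    mask: Optional[Mask] = None
    for _ in range(num_iters):
        # The previous mask (None on the first pass) tells the prompt
        # generator which image areas to focus on.
        prompt = prompt_generator(image, task_prompt, mask)
        mask = mask_generator(image, prompt)
    return prompt, mask


def toy_prompt_generator(image: Image, task_prompt: str,
                         prev_mask: Optional[Mask]) -> Prompt:
    """Stand-in for the MLLM: pick the brightest pixel inside the focus area."""
    best_value, best_point = None, (0, 0)
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            if prev_mask is not None and not prev_mask[r][c]:
                continue  # ignore areas the previous mask ruled out
            if best_value is None or value > best_value:
                best_value, best_point = value, (r, c)
    return best_point


def toy_mask_generator(image: Image, prompt: Prompt) -> Mask:
    """Stand-in for the segmenter: keep pixels at least half as bright as the prompt pixel."""
    r0, c0 = prompt
    threshold = 0.5 * image[r0][c0]
    return [[1 if value >= threshold else 0 for value in row] for row in image]


if __name__ == "__main__":
    image = [
        [0.1, 0.2, 0.1],
        [0.2, 0.9, 0.8],
        [0.1, 0.7, 0.1],
    ]
    prompt, mask = prompt_mask_cycle(
        image, "bright object", toy_prompt_generator, toy_mask_generator
    )
    print(prompt, mask)
```

The loop passes `None` as the mask on the first iteration, so the prompt generator initially sees the whole image; later iterations restrict its attention to the area selected by the previous mask, which mirrors the mutual refinement of prompts and masks that the abstract describes.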
