Abstract

The emergence of attention-based transformer models has led to their extensive use in various tasks, due to their superior generalization and transfer properties. Recent research has demonstrated that such models, when prompted appropriately, are excellent for few-shot inference. However, such techniques are under-explored for dense prediction tasks like semantic segmentation. In this work, we examine the effectiveness of prompting a transformer decoder with learned visual prompts for the generalized few-shot segmentation (GFSS) task. Our goal is not only to achieve strong performance on novel categories with limited examples, but also to retain performance on base categories. We propose an approach to learn visual prompts with limited examples. These learned visual prompts are used to prompt a multiscale transformer decoder to facilitate accurate dense predictions. Additionally, we introduce a unidirectional causal attention mechanism between the novel prompts, learned with limited examples, and the base prompts, learned with abundant data. This mechanism enriches the novel prompts without deteriorating the base class performance. Overall, this form of prompting helps us achieve state-of-the-art performance for GFSS on two different benchmark datasets, COCO-$20^i$ and Pascal-$5^i$, without the need for test-time optimization (or transduction). Furthermore, test-time optimization leveraging unlabelled test data can be used to improve the prompts, which we refer to as transductive prompt tuning.
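
As an illustration of the unidirectional attention idea only, the following minimal sketch lets novel prompts attend to base prompts while the base prompts are returned unchanged. The abstract does not specify the actual formulation, so every detail here is an assumption: the function name `unidirectional_prompt_attention` is hypothetical, learned query/key/value projections are omitted, and letting novel prompts also attend to each other is one possible choice.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def unidirectional_prompt_attention(base_prompts, novel_prompts):
    """Hypothetical sketch: novel prompts read from base prompts, never the reverse.

    Assumptions (not taken from the paper): no learned projections,
    scaled dot-product scores, and novel prompts may attend to each other.
    """
    dim = len(base_prompts[0])
    scale = 1.0 / math.sqrt(dim)
    # Keys and values visible to a novel prompt: all base and all novel prompts.
    context = base_prompts + novel_prompts
    enriched = []
    for query in novel_prompts:
        weights = softmax([dot(query, key) * scale for key in context])
        enriched.append(
            [sum(w * value[j] for w, value in zip(weights, context)) for j in range(dim)]
        )
    # Base prompts are copied through untouched, so nothing flows novel -> base.
    return [list(p) for p in base_prompts], enriched


# Toy usage with two base prompts and one novel prompt of dimension 2.
base = [[1.0, 0.0], [0.0, 1.0]]
novel = [[1.0, 1.0]]
new_base, new_novel = unidirectional_prompt_attention(base, novel)
```

In this sketch the one-way constraint is enforced structurally, by never computing an update for the base prompts. An equivalent design would run full self-attention over all prompts with a mask that blocks base queries from seeing novel keys.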