VLCounter: Text-aware Visual Representation for Zero-Shot Object Counting
Kang Seunggu ; Moon WonJun ; Kim Euiyeon ; Heo Jae-Pil

Abstract
Zero-Shot Object Counting (ZSOC) aims to count referred instances of arbitrary classes in a query image without human-annotated exemplars. To deal with ZSOC, preceding studies proposed a two-stage pipeline: discovering exemplars and counting. However, the sequentially designed two-stage process remains vulnerable to error propagation. In this work, a one-stage baseline, Visual-Language Baseline (VLBase), which explores the implicit association of the semantic-patch embeddings of CLIP, is proposed. Subsequently, VLBase is extended to Visual-Language Counter (VLCounter) by incorporating three modules devised to tailor VLBase for object counting. First, Semantic-conditioned Prompt Tuning (SPT) is introduced within the image encoder to acquire target-highlighted representations. Second, Learnable Affine Transformation (LAT) is employed to translate the semantic-patch similarity map into a form appropriate for the counting task. Lastly, the layer-wise encoded features are transferred to the decoder through Segment-aware Skip Connection (SaSC) to keep the generalization capability for unseen classes. Through extensive experiments on FSC147, CARPK, and PUCPR+, the benefits of the end-to-end framework, VLCounter, are demonstrated.
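The semantic-patch similarity map and the affine transformation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`cosine_similarity`, `similarity_map`, `affine_transform`), the use of cosine similarity, and the single scalar `scale` and `shift` parameters are all assumptions made for illustration, and the embeddings are toy 2-D vectors rather than CLIP outputs. In a trained model the affine parameters would be learned.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_map(text_embedding, patch_embeddings):
    """Compare one text (semantic) embedding against a grid of patch embeddings.

    patch_embeddings is a 2-D grid (list of rows) of vectors; the result is a
    grid of scalars with the same spatial layout.
    """
    return [
        [cosine_similarity(text_embedding, patch) for patch in row]
        for row in patch_embeddings
    ]

def affine_transform(sim_map, scale, shift):
    """Apply s -> scale * s + shift to every entry of the map.

    Assumption: scale and shift stand in for learnable parameters.
    """
    return [[scale * s + shift for s in row] for row in sim_map]

# Toy example: a 2x2 grid of 2-D patch embeddings (hypothetical values).
text = [1.0, 0.0]
patches = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[-1.0, 0.0], [1.0, 1.0]],
]
sim = similarity_map(text, patches)
adjusted = affine_transform(sim, scale=2.0, shift=1.0)
```

Patches aligned with the text embedding receive a similarity near 1, orthogonal patches near 0, and the affine step rescales those values before any downstream decoding.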
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| object-counting-on-carpk | VLCounter | MAE: 6.46 RMSE: 8.68 |
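For readers unfamiliar with the two metrics in the table, the following sketch shows how MAE and RMSE are conventionally computed over per-image predicted and ground-truth counts. The helper names and the sample counts are hypothetical and are not taken from the paper or its evaluation code.

```python
import math

def mae(predicted, ground_truth):
    """Mean absolute error between predicted and true counts."""
    errors = [abs(p - g) for p, g in zip(predicted, ground_truth)]
    return sum(errors) / len(errors)

def rmse(predicted, ground_truth):
    """Root mean squared error between predicted and true counts."""
    squared = [(p - g) ** 2 for p, g in zip(predicted, ground_truth)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical per-image counts for three images.
predicted_counts = [10, 20, 33]
true_counts = [12, 20, 30]

mae_value = mae(predicted_counts, true_counts)
rmse_value = rmse(predicted_counts, true_counts)
```

RMSE is never smaller than MAE on the same data and penalizes large per-image errors more heavily, which is why both are usually reported together.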