
Abstract

This paper presents a language-powered paradigm for ordinal regression. Existing methods usually treat each rank as a category and employ a set of weights to learn these concepts. These methods are prone to overfitting and usually attain unsatisfactory performance because the learned concepts are derived mainly from the training set. Recent large pre-trained vision-language models like CLIP have shown impressive performance on various visual tasks. In this paper, we propose to learn the rank concepts from the semantically rich CLIP latent space. Specifically, we reformulate this task as an image-language matching problem with a contrastive objective, which regards labels as text and obtains a language prototype from a text encoder for each rank. Because prompt engineering for CLIP is extremely time-consuming, we propose OrdinalCLIP, a differentiable prompting method for adapting CLIP to ordinal regression. OrdinalCLIP consists of learnable context tokens and learnable rank embeddings; the learnable rank embeddings are constructed by explicitly modeling numerical continuity, resulting in well-ordered, compact language prototypes in the CLIP space. Once learned, we need only save the language prototypes and can discard the huge language model, resulting in zero additional computational overhead compared with the linear head counterpart. Experimental results show that our paradigm achieves competitive performance in general ordinal regression tasks and gains improvements in few-shot and distribution shift settings for age estimation. The code is available at https://github.com/xk-huang/OrdinalCLIP.
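
The matching formulation described above can be illustrated with a small NumPy sketch. It is not the released implementation. It assumes that numerical continuity is modeled by linearly interpolating rank embeddings from a few base embeddings, which the abstract does not specify. It also skips the CLIP text encoder and the learnable context tokens, treating the interpolated embeddings directly as stand-in language prototypes. All function names, shapes, and the temperature value are hypothetical.

```python
import numpy as np

def interpolate_rank_embeddings(base, num_ranks):
    """Build `num_ranks` embeddings by linear interpolation over `base` ([B, D]).

    Assumption for illustration: neighbouring ranks share base embeddings,
    so nearby ranks receive nearby embeddings.
    """
    num_base = base.shape[0]
    # Position of each rank along the base-embedding axis, in [0, num_base - 1].
    pos = np.linspace(0.0, num_base - 1, num_ranks)
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, num_base - 1)
    w = (pos - lo)[:, None]
    return (1.0 - w) * base[lo] + w * base[hi]

def match_logits(image_feats, prototypes, temperature=0.07):
    """Cosine similarity between image features and rank prototypes, scaled."""
    img = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    txt = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return img @ txt.T / temperature

def cross_entropy(logits, labels):
    """Mean cross-entropy over ranks, used here as the matching objective."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

rng = np.random.default_rng(0)
base = rng.normal(size=(3, 8))         # 3 hypothetical base rank embeddings
prototypes = interpolate_rank_embeddings(base, num_ranks=5)
image_feats = rng.normal(size=(4, 8))  # stand-in for image-encoder outputs
labels = np.array([0, 2, 4, 1])        # ground-truth rank indices

logits = match_logits(image_feats, prototypes)
loss = cross_entropy(logits, labels)
pred = logits.argmax(axis=1)           # predicted rank per image
```

At inference time only the `prototypes` matrix is needed, so in this sketch prediction costs one matrix product, the same as a linear head.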
