
Abstract

Social relation reasoning aims to identify relation categories such as friends, spouses, and colleagues from images. While current methods adopt the paradigm of training a dedicated network end-to-end using labeled image data, they are limited in terms of generalizability and interpretability. To address these issues, we first present a simple yet well-crafted framework named SocialGPT, which combines the perception capability of Vision Foundation Models (VFMs) and the reasoning capability of Large Language Models (LLMs) within a modular framework, providing a strong baseline for social relation recognition. Specifically, we instruct VFMs to translate image content into a textual social story, and then utilize LLMs for text-based reasoning. SocialGPT introduces systematic design principles to adapt VFMs and LLMs separately and bridge their gaps. Without additional model training, it achieves competitive zero-shot results on two databases while offering interpretable answers, as LLMs can generate language-based explanations for the decisions. The manual prompt design process for LLMs at the reasoning phase is tedious, and an automated prompt optimization method is desired. As we essentially convert a visual classification task into a generative task of LLMs, automatic prompt optimization encounters a unique long prompt optimization issue. To address this issue, we further propose the Greedy Segment Prompt Optimization (GSPO), which performs a greedy search by utilizing gradient information at the segment level. Experimental results show that GSPO significantly improves performance, and our method also generalizes to different image styles. The code is available at https://github.com/Mengzibin/SocialGPT.
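
To make the idea of a segment-level greedy search concrete, the following is a minimal Python sketch, not the authors' implementation. The abstract states that GSPO uses gradient information at the segment level; this sketch swaps that for a black-box scoring function, so it illustrates only the greedy, segment-by-segment structure. The names `greedy_segment_search`, `toy_score`, the candidate lists, and the keyword-based score are all hypothetical and chosen purely for illustration.

```python
from typing import Callable, Dict, List, Sequence


def greedy_segment_search(
    segments: Sequence[str],
    candidates: Dict[int, List[str]],
    score: Callable[[str], float],
    sep: str = " ",
) -> List[str]:
    """Greedily improve a long prompt one segment at a time.

    Assumption: `score` is a black-box stand-in for the gradient-based
    signal mentioned in the abstract; higher is better.
    """
    best = list(segments)
    best_score = score(sep.join(best))
    for idx in range(len(best)):
        # Try each hypothetical replacement for this segment only,
        # keeping all other segments fixed at their current best.
        for cand in candidates.get(idx, []):
            trial = best[:idx] + [cand] + best[idx + 1:]
            trial_score = score(sep.join(trial))
            if trial_score > best_score:
                best, best_score = trial, trial_score
    return best


def toy_score(prompt: str) -> float:
    """Toy objective: count how many illustrative keywords the prompt mentions."""
    keywords = ("age", "gender", "scene")
    return float(sum(k in prompt for k in keywords))


# Hypothetical prompt split into three segments.
segments = [
    "Describe the people.",
    "Describe the place.",
    "Answer with one relation.",
]
# Hypothetical replacement candidates, keyed by segment index.
candidates = {
    0: ["Describe each person's age and gender."],
    1: ["Describe the scene and activity."],
}

optimized = greedy_segment_search(segments, candidates, toy_score)
print(" ".join(optimized))
```

Searching one segment at a time keeps the number of evaluated prompts linear in the number of candidates, which is one plausible reason a segment-level search suits long prompts; the actual candidate selection in GSPO is described in the paper and the linked repository.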
