Language Models Can Learn from Verbal Feedback Without Scalar Rewards

Renjie Luo, Zichen Liu, Xiangyan Liu, Chao Du, Min Lin, Wenhu Chen, Wei Lu, Tianyu Pang

Abstract

LLMs are often trained with RL from human or AI feedback, yet such methods typically compress nuanced feedback into scalar rewards, discarding much of their richness and inducing scale imbalance. We propose treating verbal feedback as a conditioning signal. Inspired by language priors in text-to-image generation, which enable novel outputs from unseen prompts, we introduce the feedback-conditional policy (FCP). FCP learns directly from response-feedback pairs, approximating the feedback-conditional posterior through maximum likelihood training on offline data. We further develop an online bootstrapping stage where the policy generates under positive conditions and receives fresh feedback to refine itself. This reframes feedback-driven learning as conditional generation rather than reward optimization, offering a more expressive way for LLMs to directly learn from verbal feedback. Our code is available at https://github.com/sail-sg/feedback-conditional-policy.

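To make the idea concrete, below is a minimal sketch of feedback-conditional maximum-likelihood training followed by positive-conditioned generation, assuming a Hugging Face causal LM. The feedback template, dataset fields, and base model here are illustrative assumptions, not the authors' released implementation; see the linked repository for the actual code.

```python
# Sketch of a feedback-conditional policy (FCP): train with maximum likelihood on
# (prompt, response, verbal feedback) triples, conditioning on the feedback text,
# then generate under a positive feedback condition. All names below are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder base model (assumption)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Offline data: response-feedback pairs for a prompt (toy example).
data = [
    {"prompt": "Explain overfitting.",
     "response": "Overfitting is when a model memorizes noise in the training data.",
     "feedback": "Clear and accurate, but could mention regularization."},
]

def build_example(prompt, feedback, response):
    # Prepend the verbal feedback so MLE approximates p(response | prompt, feedback).
    conditioning = f"Feedback: {feedback}\nPrompt: {prompt}\nResponse:"
    full = conditioning + " " + response + tokenizer.eos_token
    enc = tokenizer(full, return_tensors="pt", truncation=True, max_length=512)
    labels = enc["input_ids"].clone()
    # Mask the conditioning prefix so only response tokens contribute to the loss
    # (prefix length is approximate; good enough for this sketch).
    prefix_len = tokenizer(conditioning, return_tensors="pt")["input_ids"].shape[1]
    labels[:, :prefix_len] = -100
    return enc["input_ids"], enc["attention_mask"], labels

model.train()
for ex in data:
    input_ids, attention_mask, labels = build_example(ex["prompt"], ex["feedback"], ex["response"])
    loss = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Inference (and the online bootstrapping stage): condition on a positive feedback
# string, sample fresh responses, and send them out for new verbal feedback.
positive = "Feedback: Excellent, thorough, and correct.\nPrompt: Explain overfitting.\nResponse:"
inputs = tokenizer(positive, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64, do_sample=True,
                     pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

In this sketch the conditioning is a plain text prefix; the key design choice is that feedback enters as a generation condition rather than being compressed into a scalar reward.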