HyperAIHyperAI

Command Palette

Search for a command to run...

25 days ago

Token-Aware Editing of Internal Activations for Large Language Model Alignment

Tianbo Wang Yuqing Ma Kewei Liao Chengzhao Yang Zhange Zhang Jiakai Wang Xianglong Liu

Token-Aware Editing of Internal Activations for Large Language Model Alignment

Abstract

Intervening the internal activations of large language models (LLMs) provides an effective inference-time alignment approach to mitigate undesirable behaviors, such as generating erroneous or harmful content, thereby ensuring safe and reliable applications of LLMs. However, previous methods neglect the misalignment discrepancy among varied tokens, resulting in deviant alignment direction and inflexible editing strength. To address these issues, we propose a token-aware editing (TAE) approach to fully utilize token-level alignment information in the activation space, therefore realizing superior post-intervention performance. Specifically, a Mutual Information-guided Graph Aggregation (MIG) module first develops an MI-guided graph to exploit the tokens' informative interaction for activation enrichment, thus improving alignment probing and facilitating intervention. Subsequently, Misalignment-aware Adaptive Intervention (MAI) comprehensively perceives the token-level misalignment degree from token representation and prediction to guide the adaptive adjustment of editing strength, thereby enhancing final alignment performance. Extensive experiments on three alignment capabilities demonstrate the efficacy of TAE, notably surpassing baseline by 25.8% on the primary metric of truthfulness with minimal cost.

Build AI with AI

From idea to launch — accelerate your AI development with free AI co-coding, out-of-the-box environment and best price of GPUs.

AI Co-coding
Ready-to-use GPUs
Best Pricing
Get Started

Hyper Newsletters

Subscribe to our latest updates
We will deliver the latest updates of the week to your inbox at nine o'clock every Monday morning
Powered by MailChimp
Token-Aware Editing of Internal Activations for Large Language Model Alignment | Papers | HyperAI