5 months ago

Effective Red-Teaming of Policy-Adherent Agents

Itay Nakash, George Kour, Koren Lazar, Matan Vetzler, Guy Uziel, Ateret Anaby-Tavor

Abstract

Task-oriented LLM-based agents are increasingly used in domains with strict policies, such as refund eligibility or cancellation rules. The challenge lies in ensuring that the agent consistently adheres to these rules and policies, appropriately refusing any request that would violate them, while still maintaining a helpful and natural interaction. This calls for the development of tailored design and evaluation methodologies to ensure agent resilience against malicious user behavior. We propose a novel threat model that focuses on adversarial users aiming to exploit policy-adherent agents for personal benefit. To address this, we present CRAFT, a multi-agent red-teaming system that leverages policy-aware persuasive strategies to undermine a policy-adherent agent in a customer-service scenario, outperforming conventional jailbreak methods such as DAN prompts, emotional manipulation, and coercion. Building upon the existing tau-bench benchmark, we introduce tau-break, a complementary benchmark designed to rigorously assess the agent's robustness against manipulative user behavior. Finally, we evaluate several straightforward yet effective defense strategies. While these measures provide some protection, they fall short, highlighting the need for stronger, research-driven safeguards to protect policy-adherent agents from adversarial attacks.
