3 months ago

Representation Engineering: A Top-Down Approach to AI Transparency

Abstract

In this paper, we identify and characterize the emerging area of representation engineering (RepE), an approach to enhancing the transparency of AI systems that draws on insights from cognitive neuroscience. RepE places population-level representations, rather than neurons or circuits, at the center of analysis, equipping us with novel methods for monitoring and manipulating high-level cognitive phenomena in deep neural networks (DNNs). We provide baselines and an initial analysis of RepE techniques, showing that they offer simple yet effective solutions for improving our understanding and control of large language models. We showcase how these methods can provide traction on a wide range of safety-relevant problems, including honesty, harmlessness, power-seeking, and more, demonstrating the promise of top-down transparency research. We hope that this work catalyzes further exploration of RepE and fosters advancements in the transparency and safety of AI systems.
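To make the idea of monitoring and manipulating a population-level representation concrete, here is a minimal sketch. It is a simplified difference-of-means illustration, not the paper's procedure: the hidden states are synthetic stand-ins generated at random, and every name in it (`reading_direction`, `score`, `steer`, `fake_state`, `concept`, `alpha`) is hypothetical. With a real model, the vectors would be hidden states collected from contrasting prompts.

```python
import random


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def reading_direction(pos_states, neg_states):
    """Unit vector pointing from the mean of neg_states to the mean of pos_states."""
    diff = [p - q for p, q in zip(mean_vector(pos_states), mean_vector(neg_states))]
    norm = dot(diff, diff) ** 0.5
    return [d / norm for d in diff]


def score(state, direction):
    """Monitoring: project a hidden state onto the direction."""
    return dot(state, direction)


def steer(state, direction, alpha):
    """Control: shift a hidden state along the direction by alpha."""
    return [s + alpha * d for s, d in zip(state, direction)]


# Synthetic stand-ins for hidden states (assumption: no real model is involved).
rng = random.Random(0)
dim = 8
concept = [1.0] + [0.0] * (dim - 1)  # hypothetical axis encoding the concept


def fake_state(sign):
    """A noisy vector lying on the positive or negative side of the concept axis."""
    noise = [rng.gauss(0.0, 0.1) for _ in range(dim)]
    return [n + sign * c for n, c in zip(noise, concept)]


pos = [fake_state(+1) for _ in range(50)]
neg = [fake_state(-1) for _ in range(50)]
direction = reading_direction(pos, neg)

probe = fake_state(-1)
before = score(probe, direction)
after = score(steer(probe, direction, alpha=2.0), direction)
print(f"projection before steering: {before:.3f}, after: {after:.3f}")
```

Because `direction` has unit length, steering by `alpha` raises the projection by exactly `alpha`. In this toy setting, the same vector therefore serves both as a monitor (via `score`) and as a control handle (via `steer`).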

Code Repositories

sunblaze-ucb/political_leaning_RepE (pytorch, mentioned in GitHub)
kaiyuhe998/rulearn_idea (mentioned in GitHub)
andyzoujm/representation-engineering (official, pytorch, mentioned in GitHub)
cma1114/activation_steering (pytorch, mentioned in GitHub)
steering-vectors/steering-vectors (pytorch, mentioned in GitHub)

Benchmarks

Benchmark | Methodology | Metrics
question-answering-on-truthfulqa | LLaMA-2-Chat-13B + Representation Control (Contrast Vector) | MC1: 0.54
question-answering-on-truthfulqa | LLaMA-2-Chat-7B + Representation Control (Contrast Vector) | MC1: 0.48
