
Abstract
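We introduce EQ-Bench, a new benchmark designed to evaluate emotional intelligence in large language models (LLMs). It assesses a model's ability to understand complex emotions and social interactions by asking it to predict the intensity of the emotional states of characters in a dialogue. EQ-Bench discriminates effectively between models across a wide range of capability. We find that EQ-Bench correlates strongly with comprehensive multi-domain benchmarks such as MMLU (Hendrycks et al., 2020), with a correlation coefficient of r = 0.97, which suggests that it may be capturing some core aspects of broad intelligence. The benchmark is built on a set of 60 English-language questions and produces highly repeatable results. We also release open-source code for an automated benchmarking pipeline at https://github.com/EQ-bench/EQ-Bench and provide an online leaderboard at https://eqbench.com.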
Code repositories
eq-bench/eq-bench
Official
pytorch
Mentioned in GitHub
Benchmarks
| Benchmark | Method | Metric |
|---|---|---|
| emotional-intelligence-on-emotional | OpenAI gpt-3.5-0613 | EQ-Bench Score: 49.17 |
| emotional-intelligence-on-emotional | lmsys/vicuna-33b-v1.3 | EQ-Bench Score: 36.52 |
| emotional-intelligence-on-emotional | lmsys/vicuna-13b-v1.1 | EQ-Bench Score: 32.85 |
| emotional-intelligence-on-emotional | OpenAI text-davinci-002 | EQ-Bench Score: 39.44 |
| emotional-intelligence-on-emotional | OpenAI text-davinci-003 | EQ-Bench Score: 43.73 |
| emotional-intelligence-on-emotional | meta-llama/Llama-2-70b-chat-hf | EQ-Bench Score: 51.56 |
| emotional-intelligence-on-emotional | OpenAI ADA | EQ-Bench Score: 2.25 |
| emotional-intelligence-on-emotional | meta-llama/Llama-2-7b-chat-hf | EQ-Bench Score: 25.43 |
| emotional-intelligence-on-emotional | OpenAI gpt-3.5-turbo-0301 | EQ-Bench Score: 47.61 |
| emotional-intelligence-on-emotional | Intel/neural-chat-7b-v3-1 | EQ-Bench Score: 43.61 |
| emotional-intelligence-on-emotional | Qwen/Qwen-72B-Chat | EQ-Bench Score: 52.44 |
| emotional-intelligence-on-emotional | openchat/openchat 3.5 | EQ-Bench Score: 37.08 |
| emotional-intelligence-on-emotional | migtissera/SynthIA-70B-v1.5 | EQ-Bench Score: 54.83 |
| emotional-intelligence-on-emotional | Open-Orca/Mistral-7B-OpenOrca | EQ-Bench Score: 44.40 |
| emotional-intelligence-on-emotional | OpenAI gpt-4-0613 | EQ-Bench Score: 62.52 |
| emotional-intelligence-on-emotional | OpenAI gpt-4-0314 | EQ-Bench Score: 53.39 |
| emotional-intelligence-on-emotional | Qwen/Qwen-14B-Chat | EQ-Bench Score: 43.76 |
| emotional-intelligence-on-emotional | Koala 13B | EQ-Bench Score: 24.92 |
| emotional-intelligence-on-emotional | meta-llama/Llama-2-13b-chat-hf | EQ-Bench Score: 33.02 |
| emotional-intelligence-on-emotional | Anthropic Claude2 | EQ-Bench Score: 52.14 |
| emotional-intelligence-on-emotional | 01-ai/Yi-34B-Chat | EQ-Bench Score: 51.03 |
| emotional-intelligence-on-emotional | lmsys/vicuna-7b-v1.1 | EQ-Bench Score: 22.24 |
| emotional-intelligence-on-emotional | OpenAI text-davinci-001 | EQ-Bench Score: 15.19 |
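
The rows above are listed in no particular order, so a ranked view can be easier to read. The following is a minimal Python sketch of parsing rows in this table's format and sorting them by score. It is not part of the EQ-Bench pipeline. The names `TABLE_ROWS`, `ROW_PATTERN`, `parse_rows`, and `rank` are hypothetical, and only four rows are copied from the table for brevity.

```python
import re

# A few rows copied verbatim from the table above (not the full list).
TABLE_ROWS = """
| emotional-intelligence-on-emotional | meta-llama/Llama-2-70b-chat-hf | EQ-Bench Score: 51.56 |
| emotional-intelligence-on-emotional | OpenAI ADA | EQ-Bench Score: 2.25 |
| emotional-intelligence-on-emotional | migtissera/SynthIA-70B-v1.5 | EQ-Bench Score: 54.83 |
| emotional-intelligence-on-emotional | OpenAI gpt-4-0613 | EQ-Bench Score: 62.52 |
"""

# Hypothetical pattern for one row: benchmark | method | "EQ-Bench Score: <number>".
ROW_PATTERN = re.compile(
    r"^\|\s*(?P<benchmark>[^|]+?)\s*\|\s*(?P<method>[^|]+?)\s*\|"
    r"\s*EQ-Bench Score:\s*(?P<score>[0-9.]+)\s*\|$"
)


def parse_rows(text):
    """Map each method name to its score; a repeated method keeps one entry."""
    scores = {}
    for line in text.strip().splitlines():
        match = ROW_PATTERN.match(line.strip())
        if match:
            scores[match.group("method")] = float(match.group("score"))
    return scores


def rank(scores):
    """Return (method, score) pairs sorted from highest to lowest score."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


leaderboard = rank(parse_rows(TABLE_ROWS))
for position, (method, score) in enumerate(leaderboard, start=1):
    print(f"{position:>2}. {method:<35} {score:6.2f}")
```

Pasting the full table into `TABLE_ROWS` gives the complete ranking. Because the scores are keyed by method name, any row that appears twice collapses into a single entry.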