Question Answering on TriviaQA

Evaluation Metrics

EM (Exact Match)
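
EM (exact match) is usually computed by normalizing the predicted answer and each accepted gold answer, then checking for string equality. The sketch below illustrates that idea. The normalization steps (lowercasing, stripping punctuation and English articles, collapsing whitespace) follow the common SQuAD-style convention and are an assumption here; the leaderboard does not state which normalization each entry used, and the function names are hypothetical.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: SQuAD-style normalization; individual papers may differ.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """Return True if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(gold) for gold in gold_answers)


def em_score(predictions: list[str], references: list[list[str]]) -> float:
    """Percentage of questions whose prediction exactly matches a gold answer."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(exact_match(p, golds) for p, golds in zip(predictions, references))
    return 100.0 * hits / len(predictions)
```

Each question can have several accepted answers (aliases), which is why `references` is a list of lists. Scores in the table are percentages on this 0 to 100 scale.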

Evaluation Results

Performance of each model on this benchmark.

| Model | EM | Paper Title | Repository |
|---|---|---|---|
| Claude 2 (few-shot, k=5) | 87.5 | Model Card and Evaluations for Claude Models | - |
| GPT-4-0613 | 87 | - | - |
| Claude 1.3 (few-shot, k=5) | 86.7 | Model Card and Evaluations for Claude Models | - |
| RankRAG-llama3-70b (Zero-Shot, KILT) | 86.5 | RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs | - |
| PaLM 2-L (one-shot) | 86.1 | PaLM 2 Technical Report | |
| ChatQA-1.5-llama3-70b (Zero-Shot, KILT) | 85.6 | ChatQA: Surpassing GPT-4 on Conversational QA and RAG | - |
| LLaMA 2 70B (one-shot) | 85 | Llama 2: Open Foundation and Fine-Tuned Chat Models | |
| GPT-4-0613 (Zero-shot) | 84.8 | GPT-4 Technical Report | |
| RankRAG-llama3-8b (Zero-Shot, KILT) | 82.9 | RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs | - |
| PaLM 2-M (one-shot) | 81.7 | PaLM 2 Technical Report | |
| PaLM-540B (One-Shot) | 81.4 | PaLM: Scaling Language Modeling with Pathways | |
| PaLM-540B (Few-Shot) | 81.4 | PaLM: Scaling Language Modeling with Pathways | |
| ChatQA-1.5-llama3-8B (Zero-Shot, KILT) | 81.0 | ChatQA: Surpassing GPT-4 on Conversational QA and RAG | - |
| GaC (Qwen2-72B-Instruct + Llama-3-70B-Instruct) | 79.29 | Breaking the Ceiling of the LLM Community by Treating Token Generation as a Classification for Ensembling | |
| Claude Instant 1.1 (few-shot, k=5) | 78.9 | Model Card and Evaluations for Claude Models | - |
| code-davinci-002 175B + REPLUG LSR (Few-Shot) | 77.3 | REPLUG: Retrieval-Augmented Black-Box Language Models | |
| PaLM-540B (Zero-Shot) | 76.9 | PaLM: Scaling Language Modeling with Pathways | |
| code-davinci-002 175B + REPLUG (Few-Shot) | 76.8 | REPLUG: Retrieval-Augmented Black-Box Language Models | |
| GLaM 62B/64E (Few-shot) | 75.8 | GLaM: Efficient Scaling of Language Models with Mixture-of-Experts | - |
| GLaM 62B/64E (One-shot) | 75.8 | GLaM: Efficient Scaling of Language Models with Mixture-of-Experts | - |