| Model | Score | Paper | Code |
| --- | --- | --- | --- |
| Claude 2 (few-shot, k=5) | 87.5 | Model Card and Evaluations for Claude Models | - |
| Claude 1.3 (few-shot, k=5) | 86.7 | Model Card and Evaluations for Claude Models | - |
| RankRAG-llama3-70b (Zero-Shot, KILT) | 86.5 | RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs | - |
| ChatQA-1.5-llama3-70b (Zero-Shot, KILT) | 85.6 | ChatQA: Surpassing GPT-4 on Conversational QA and RAG | - |
| RankRAG-llama3-8b (Zero-Shot, KILT) | 82.9 | RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs | - |
| ChatQA-1.5-llama3-8b (Zero-Shot, KILT) | 81.0 | ChatQA: Surpassing GPT-4 on Conversational QA and RAG | - |
| Claude Instant 1.1 (few-shot, k=5) | 78.9 | Model Card and Evaluations for Claude Models | - |
| code-davinci-002 175B + REPLUG LSR (Few-Shot) | 77.3 | REPLUG: Retrieval-Augmented Black-Box Language Models | - |
| code-davinci-002 175B + REPLUG (Few-Shot) | 76.8 | REPLUG: Retrieval-Augmented Black-Box Language Models | - |