| Model | Bits per byte (BPB) ↓ | Paper | |
| --- | --- | --- | --- |
| GPT-2 Small 124M (pre-trained) | 1.2253 | The Pile: An 800GB Dataset of Diverse Text for Language Modeling | |
| GPT-2 Medium 355M (pre-trained) | 1.0928 | The Pile: An 800GB Dataset of Diverse Text for Language Modeling | |
| GPT-2 Large 774M (pre-trained) | 1.0828 | The Pile: An 800GB Dataset of Diverse Text for Language Modeling | |
| GPT-2 XL 1.5B (pre-trained) | 1.0468 | The Pile: An 800GB Dataset of Diverse Text for Language Modeling | |
| GPT-3 Ada 350M (pre-trained) | 0.9631 | The Pile: An 800GB Dataset of Diverse Text for Language Modeling | |
| GPT-3 Babbage 1.3B (pre-trained) | 0.8718 | The Pile: An 800GB Dataset of Diverse Text for Language Modeling | |
| Test-Time Fine-Tuning with SIFT + GPT-2 (124M) | 0.862 | Efficiently Learning at Test-Time: Active Fine-Tuning of LLMs | |
| GPT-2 Large 774M (test-time training on nearest neighbors) | 0.85 | Test-Time Training on Nearest Neighbors for Large Language Models | |
| GPT-3 Curie 6.7B (pre-trained) | 0.7980 | The Pile: An 800GB Dataset of Diverse Text for Language Modeling | |
| Test-Time Fine-Tuning with SIFT + GPT-2 (774M) | 0.762 | Efficiently Learning at Test-Time: Active Fine-Tuning of LLMs | |
| GPT-3 Davinci 175B (pre-trained) | 0.7177 | The Pile: An 800GB Dataset of Diverse Text for Language Modeling | |
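
The metric in the table is bits per byte (BPB) on The Pile test set, which normalizes out tokenizer differences so models with different vocabularies can be compared directly. As a minimal sketch of the conversion (following the definition in The Pile paper, and assuming the model's average cross-entropy loss is measured in nats per token), BPB can be computed from the token loss and the token-to-byte ratio of the evaluated text; the function name and the example figures below are hypothetical, for illustration only:

```python
import math

def bits_per_byte(avg_loss_nats_per_token: float,
                  num_tokens: int,
                  num_utf8_bytes: int) -> float:
    """Convert average cross-entropy loss (nats/token) into bits per byte.

    BPB = (num_tokens / num_utf8_bytes) * loss / ln(2)

    Dividing by the byte count rather than the token count removes the
    effect of the tokenizer, so models with different vocabularies are
    comparable on the same text.
    """
    return (num_tokens / num_utf8_bytes) * avg_loss_nats_per_token / math.log(2)

# Hypothetical example: a document of 1,000 tokens spanning 4,200 UTF-8 bytes,
# scored at an average loss of 3.0 nats/token.
print(round(bits_per_byte(3.0, 1_000, 4_200), 4))  # 1.0305
```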