Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models
Devichand Budagam, Sankalp KJ, Ashutosh Kumar, Vinija Jain, Aman Chadha

Abstract
Assessing the effectiveness of large language models (LLMs) in addressing diverse tasks is essential for comprehending their strengths and weaknesses. Conventional evaluation techniques typically apply a single prompting strategy uniformly across datasets, not considering the varying degrees of task complexity. We introduce the Hierarchical Prompting Taxonomy (HPT), a taxonomy that employs a Hierarchical Prompt Framework (HPF) composed of five unique prompting strategies, arranged from the simplest to the most complex, to assess LLMs more precisely and to offer a clearer perspective. This taxonomy assigns a score, called the Hierarchical Prompting Score (HP-Score), to datasets as well as LLMs based on the rules of the taxonomy, providing a nuanced understanding of their ability to solve diverse tasks and offering a universal measure of task complexity. Additionally, we introduce the Adaptive Hierarchical Prompt framework, which automates the selection of appropriate prompting strategies for each task. This study compares manual and adaptive hierarchical prompt frameworks using four instruction-tuned LLMs, namely Llama 3 8B, Phi 3 3.8B, Mistral 7B, and Gemma 7B, across four datasets: BoolQ, CommonSenseQA (CSQA), IWSLT-2017 en-fr (IWSLT), and SamSum. Experiments demonstrate the effectiveness of HPT, providing a reliable way to compare different tasks and LLM capabilities. This paper leads to the development of a universal evaluation metric that can be used to evaluate both the complexity of datasets and the capabilities of LLMs. The implementation of both manual HPF and adaptive HPF is publicly available.
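The core idea of the framework can be sketched in a few lines: try prompting strategies in order of increasing complexity, record the lowest level at which the model solves each task, and average those levels into a score. This is a minimal illustrative sketch, not the paper's implementation; the strategy names, the `llm`/`judge` callables, and the scoring rule here are assumptions for illustration (see the authors' public repository for the actual framework).

```python
# Illustrative sketch of a hierarchical prompt loop in the spirit of HPF.
# Strategy names below are placeholders, NOT necessarily the paper's five.
STRATEGIES = [  # ordered from simplest (level 1) to most complex (level 5)
    "role prompting",
    "zero-shot chain-of-thought",
    "few-shot chain-of-thought",
    "least-to-most prompting",
    "generated-knowledge prompting",
]

def solve_with_hpf(task, llm, judge):
    """Return the lowest strategy level at which the model solves the task.

    `llm(task, strategy)` and `judge(task, answer)` are hypothetical
    callables standing in for model inference and answer checking.
    """
    for level, strategy in enumerate(STRATEGIES, start=1):
        answer = llm(task, strategy)
        if judge(task, answer):
            return level  # lower level -> simpler task for this model
    return len(STRATEGIES) + 1  # penalty level when no strategy succeeds

def hp_score(tasks, llm, judge):
    """Average level needed across a dataset (illustrative HP-Score)."""
    levels = [solve_with_hpf(t, llm, judge) for t in tasks]
    return sum(levels) / len(levels)
```

Under this reading, a low average level means the model handles the dataset with simple prompts, while a high average marks either a hard dataset or a weak model; the adaptive variant in the paper replaces the fixed escalation order with automated strategy selection.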
Benchmarks
| Benchmark | Model | Metric |
|---|---|---|
| arithmetic-reasoning-on-gsm8k | Claude 3.5 Sonnet (HPT) | Accuracy: 97.72 |
| code-generation-on-humaneval | Llama-3 8B (HPT) | Pass@1: 100 |
| code-generation-on-humaneval | Claude 3.5 Sonnet (HPT) | Pass@1: 100 |
| common-sense-reasoning-on-commonsenseqa | GPT-4o (HPT) | Accuracy: 92.54 |
| machine-translation-on-iwslt-2017 | GPT-4o (HPT) | BLEU: 32 |
| question-answering-on-boolq | Mistral-Nemo 12B (HPT) | Accuracy: 99.87 |
| question-answering-on-boolq | Gemma-7B | Accuracy: 99.419 |
| text-summarization-on-samsum-corpus | GPT-4o (HPT) | ROUGE-L: 30 |
| translation-on-iwslt-2017 | Llama 3 8B | BLEU: 0.23539 |