Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models
Devichand Budagam, Sankalp KJ, Ashutosh Kumar, Vinija Jain, Aman Chadha

Abstract
Assessing the effectiveness of large language models (LLMs) in addressing diverse tasks is essential for comprehending their strengths and weaknesses. Conventional evaluation techniques typically apply a single prompting strategy uniformly across datasets, not considering the varying degrees of task complexity. We introduce the Hierarchical Prompting Taxonomy (HPT), a taxonomy that employs a Hierarchical Prompt Framework (HPF) composed of five unique prompting strategies, arranged from the simplest to the most complex, to assess LLMs more precisely and to offer a clearer perspective. This taxonomy assigns a score, called the Hierarchical Prompting Score (HP-Score), to datasets as well as LLMs based on the rules of the taxonomy, providing a nuanced understanding of their ability to solve diverse tasks and offering a universal measure of task complexity. Additionally, we introduce the Adaptive Hierarchical Prompt framework, which automates the selection of appropriate prompting strategies for each task. This study compares manual and adaptive hierarchical prompt frameworks using four instruction-tuned LLMs, namely Llama 3 8B, Phi 3 3.8B, Mistral 7B, and Gemma 7B, across four datasets: BoolQ, CommonSenseQA (CSQA), IWSLT-2017 en-fr (IWSLT), and SamSum. Experiments demonstrate the effectiveness of HPT, providing a reliable way to compare different tasks and LLM capabilities. This paper leads to the development of a universal evaluation metric that can be used to evaluate both the complexity of datasets and the capabilities of LLMs. The implementation of both manual HPF and adaptive HPF is publicly available.
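The core idea of the framework can be sketched in a few lines: try prompting strategies in order of increasing complexity, record the lowest level at which the model solves each task, and average those levels into a score. This is a minimal illustrative sketch, not the paper's implementation; the strategy names, the `llm`/`judge` callables, and the scoring rule here are assumptions for illustration (see the authors' public repository for the actual framework).

```python
# Illustrative sketch of a hierarchical prompt loop in the spirit of HPF.
# Strategy names below are placeholders, NOT necessarily the paper's five.
STRATEGIES = [  # ordered from simplest (level 1) to most complex (level 5)
    "role prompting",
    "zero-shot chain-of-thought",
    "few-shot chain-of-thought",
    "least-to-most prompting",
    "generated-knowledge prompting",
]

def solve_with_hpf(task, llm, judge):
    """Return the lowest strategy level at which the model solves the task.

    `llm(task, strategy)` and `judge(task, answer)` are hypothetical
    callables standing in for model inference and answer checking.
    """
    for level, strategy in enumerate(STRATEGIES, start=1):
        answer = llm(task, strategy)
        if judge(task, answer):
            return level  # lower level -> simpler task for this model
    return len(STRATEGIES) + 1  # penalty level when no strategy succeeds

def hp_score(tasks, llm, judge):
    """Average level needed across a dataset (illustrative HP-Score)."""
    levels = [solve_with_hpf(t, llm, judge) for t in tasks]
    return sum(levels) / len(levels)
```

Under this reading, a low average level means the model handles the dataset with simple prompts, while a high average marks either a hard dataset or a weak model; the adaptive variant in the paper replaces the fixed escalation order with automated strategy selection.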
Benchmarks
| Benchmark | Model | Metric |
|---|---|---|
| arithmetic-reasoning-on-gsm8k | Claude 3.5 Sonnet (HPT) | Accuracy: 97.72 |
| code-generation-on-humaneval | Llama-3 8B (HPT) | Pass@1: 100 |
| code-generation-on-humaneval | Claude 3.5 Sonnet (HPT) | Pass@1: 100 |
| common-sense-reasoning-on-commonsenseqa | GPT-4o (HPT) | Accuracy: 92.54 |
| machine-translation-on-iwslt-2017 | GPT-4o (HPT) | BLEU: 32 |
| question-answering-on-boolq | Mistral-Nemo 12B (HPT) | Accuracy: 99.87 |
| question-answering-on-boolq | Gemma-7B | Accuracy: 99.419 |
| text-summarization-on-samsum-corpus | GPT-4o (HPT) | ROUGE-L: 30 |
| translation-on-iwslt-2017 | Llama 3 8B | BLEU: 0.23539 |