Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models

Devichand Budagam, Sankalp KJ, Ashutosh Kumar, Vinija Jain, Aman Chadha

Abstract

Assessing the effectiveness of large language models (LLMs) in addressing diverse tasks is essential for comprehending their strengths and weaknesses. Conventional evaluation techniques typically apply a single prompting strategy uniformly across datasets, without considering the varying degrees of task complexity. We introduce the Hierarchical Prompting Taxonomy (HPT), a taxonomy that employs a Hierarchical Prompt Framework (HPF) composed of five unique prompting strategies, arranged from the simplest to the most complex, to assess LLMs more precisely and to offer a clearer perspective. This taxonomy assigns a score, called the Hierarchical Prompting Score (HP-Score), to datasets as well as LLMs based on the rules of the taxonomy, providing a nuanced understanding of their ability to solve diverse tasks and offering a universal measure of task complexity. Additionally, we introduce the Adaptive Hierarchical Prompt framework, which automates the selection of appropriate prompting strategies for each task. This study compares manual and adaptive hierarchical prompt frameworks using four instruction-tuned LLMs, namely Llama 3 8B, Phi 3 3.8B, Mistral 7B, and Gemma 7B, across four datasets: BoolQ, CommonSenseQA (CSQA), IWSLT-2017 en-fr (IWSLT), and SamSum. Experiments demonstrate the effectiveness of HPT, providing a reliable way to compare different tasks and LLM capabilities. This paper leads to the development of a universal evaluation metric that can be used to evaluate both the complexity of the datasets and the capabilities of LLMs. The implementation of both manual HPF and adaptive HPF is publicly available.
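
To make the taxonomy concrete, here is a minimal sketch of how the manual and adaptive frameworks could operate, assuming the framework escalates through five strategy levels until the model solves a task and that per-task levels are averaged into the HP-Score. The function names (manual_hpf, adaptive_hpf, hp_score) and helpers (query_llm, is_correct, select_level), as well as the specific strategy labels, are illustrative assumptions rather than the API of the devichand579/HPT repository.

from typing import Callable

# Five prompting strategies, ordered simplest (level 1) to most complex
# (level 5). The labels here are illustrative; the paper defines its own set.
STRATEGIES: dict[int, str] = {
    1: "role_prompting",       # e.g. "You are an expert translator ..."
    2: "zero_shot_cot",        # "Let's think step by step."
    3: "few_shot_cot",         # worked examples with reasoning chains
    4: "least_to_most",        # decompose the task into subproblems
    5: "generated_knowledge",  # generate relevant facts before answering
}

def manual_hpf(task: str,
               query_llm: Callable[[str, str], str],
               is_correct: Callable[[str], bool]) -> int:
    """Escalate from the simplest strategy upward; return the first level
    at which the model solves the task (the per-task basis of the HP-Score)."""
    for level, strategy in STRATEGIES.items():
        if is_correct(query_llm(strategy, task)):
            return level
    return len(STRATEGIES) + 1  # penalty level when no strategy succeeds

def adaptive_hpf(task: str,
                 select_level: Callable[[str], int],
                 query_llm: Callable[[str, str], str]) -> str:
    """Adaptive variant: a selector (e.g. another LLM call) estimates the
    task's complexity and picks the strategy directly."""
    level = max(1, min(len(STRATEGIES), select_level(task)))
    return query_llm(STRATEGIES[level], task)

def hp_score(levels: list[int]) -> float:
    """Average the per-task levels over a dataset: lower scores indicate an
    easier dataset (or, for a fixed dataset, a more capable model)."""
    return sum(levels) / len(levels)

Read this way, the same number can serve as a universal measure on both axes: scoring one model across datasets ranks task complexity, while scoring many models on one dataset ranks model capability.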

Code Repositories

devichand579/HPT (Official)

Benchmarks

Benchmark | Methodology | Metrics
arithmetic-reasoning-on-gsm8k | Claude 3.5 Sonnet (HPT) | Accuracy: 97.72
code-generation-on-humaneval | Llama-3 8B (HPT) | Pass@1: 100
code-generation-on-humaneval | Claude 3.5 Sonnet (HPT) | Pass@1: 100
common-sense-reasoning-on-commonsenseqa | GPT-4o (HPT) | Accuracy: 92.54
machine-translation-on-iwslt-2017 | GPT-4o (HPT) | BLEU score: 32
question-answering-on-boolq | Mistral-Nemo 12B (HPT) | Accuracy: 99.87
question-answering-on-boolq | Gemma-7B | Accuracy: 99.419
text-summarization-on-samsum-corpus | GPT-4o (HPT) | ROUGE-L: 30
translation-on-iwslt-2017 | Llama 3 8B | BLEU: 0.23539
