QuanBench+: A Unified Multi-Framework Benchmark for LLM-Based Quantum Code Generation

Ali Slim, Haydar Hamieh, Jawad Kotaich, Yehya Ghosn, Mahdi Chehimi, Ammar Mohanna, Hasan Abed Al Kader Hammoud, Bernard Ghanem

Abstract

Large Language Models (LLMs) are increasingly used for code generation, yet quantum code generation is still evaluated mostly within single frameworks, making it difficult to separate quantum reasoning from framework familiarity. We introduce QuanBench+, a unified benchmark spanning Qiskit, PennyLane, and Cirq, with 42 aligned tasks covering quantum algorithms, gate decomposition, and state preparation. We evaluate models with executable functional tests, report Pass@1 and Pass@5, and use KL-divergence-based acceptance for probabilistic outputs. We additionally study Pass@1 after feedback-based repair, where a model may revise code after a runtime error or wrong answer. Across frameworks, the strongest one-shot scores reach 59.5% in Qiskit, 54.8% in Cirq, and 42.9% in PennyLane; with feedback-based repair, the best scores rise to 83.3%, 76.2%, and 66.7%, respectively. These results show clear progress, but also that reliable multi-framework quantum code generation remains unsolved and still depends strongly on framework-specific knowledge.

One-sentence Summary

To evaluate LLM-based quantum code generation, the authors introduce QuanBench+, a unified benchmark spanning Qiskit, PennyLane, and Cirq that utilizes 42 aligned tasks and executable functional tests to demonstrate that while feedback-based repair improves Pass@1 and Pass@5 scores, reliable multi-framework quantum reasoning remains an unsolved challenge.

Key Contributions

  • This work introduces QuanBench+, a unified benchmark that evaluates quantum code generation across three distinct frameworks: Qiskit, PennyLane, and Cirq. The benchmark consists of 42 aligned tasks covering quantum algorithms, gate decomposition, and state preparation to differentiate between portable quantum reasoning and framework-specific knowledge.
  • The researchers implement an executable functional evaluation method that defines correctness based on task success through measurement statistics. This approach utilizes Pass@k metrics and KL-divergence-based acceptance for probabilistic outputs to ensure that functionally equivalent but syntactically different circuits are correctly identified as valid.
  • The study provides a comprehensive analysis of model performance through both one-shot generation and feedback-based repair. Results demonstrate that while one-shot scores reach up to 59.5% in Qiskit, the application of feedback-based repair significantly improves performance, with the highest scores rising to 83.3% in Qiskit, 76.2% in Cirq, and 66.7% in PennyLane.
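The paper reports Pass@1 and Pass@5 but does not spell out how they are computed. A common choice is the standard unbiased Pass@k estimator from the HumanEval work (Chen et al., 2021); the sketch below assumes that estimator is what is meant here:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of
    k samples drawn (without replacement) from n generations, c of
    which are correct, passes the tests."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k
        # must contain at least one correct sample.
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed as a stable running product.
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))
```

For example, with 2 generations of which 1 is correct, `pass_at_k(2, 1, 1)` gives 0.5, the expected single-sample success rate.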

Introduction

As quantum computing moves toward practical software applications, Large Language Models (LLMs) are increasingly used to automate code generation across various ecosystems like Qiskit, PennyLane, and Cirq. Current benchmarks typically focus on a single framework, which makes it difficult to determine if a model's failure stems from poor quantum reasoning or a simple lack of familiarity with a specific API. The authors introduce QuanBench+, a unified multi-framework benchmark that holds task intent constant across 42 aligned tasks to isolate these two failure modes. By utilizing executable functional tests and KL-divergence based acceptance for probabilistic outputs, the authors provide a standardized way to evaluate whether models possess portable quantum reasoning or merely framework-specific knowledge.

Dataset

Dataset overview

The authors utilize QuanBench+, a dataset derived from the original QuanBench task set. The dataset is structured as follows:

  • Composition and Categories: The benchmark consists of 42 tasks organized into three distinct categories: Quantum Algorithms, Gate Decomposition, and State Preparation.
  • Sources and Adaptation: The authors adapted the original tasks to three specific quantum computing frameworks: Qiskit, PennyLane, and Cirq. This adaptation involved modifying prompts to align with framework-specific APIs and library conventions.
  • Filtering and Refinement: To ensure reliable cross-framework grading, the authors removed two tasks from the original benchmark that did not meet the necessary criteria for consistent evaluation.
  • Prompt Engineering and Processing:
    • All prompts were modified to ensure the correct libraries are imported for each respective framework.
    • A strict constraint was added to the beginning of each prompt requiring models to return code only. This removes accompanying explanations to improve execution efficiency and grading consistency.
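The prompt adaptation described above can be sketched as a small builder that prepends the code-only constraint and the framework-specific import. The exact constraint wording and import lines below are assumptions for illustration, not taken from the benchmark itself:

```python
# Hypothetical sketch of per-framework prompt construction.
# The header text and import choices are illustrative assumptions.
FRAMEWORK_IMPORTS = {
    "qiskit": "from qiskit import QuantumCircuit",
    "pennylane": "import pennylane as qml",
    "cirq": "import cirq",
}

def build_prompt(task_description: str, framework: str) -> str:
    """Assemble a framework-specific prompt: a strict code-only
    constraint, the required import, then the task itself."""
    header = "Return code only, with no explanations."
    imports = FRAMEWORK_IMPORTS[framework]
    return f"{header}\n\nUse the following import:\n{imports}\n\n{task_description}"
```

Keeping the task description identical across the three calls is what lets the benchmark hold task intent constant while varying only the framework surface.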

Method

The authors leverage a modular pipeline for evaluating quantum code generation, structured around a framework that integrates multiple quantum computing platforms. The process begins with the selection of a framework to test (Qiskit, PennyLane, or Cirq), each accessed via a unified API interface managed by OpenRouter. This setup allows consistent interaction with different quantum development environments while abstracting away low-level implementation differences.

Framework diagram showing the workflow from framework selection to validation

As shown in the figure above, the selected framework is used to generate quantum code, which is then sent as API requests to the backend system. The responses are parsed to extract the generated code, which is subsequently executed in an isolated sandbox environment to ensure safety and reproducibility. The output is validated against canonical solutions to assess correctness. For deterministic tasks, validation involves checking whether the generated program satisfies a fixed correctness criterion under a predefined harness. For probabilistic tasks, correctness is determined by the agreement of measurement outcome distributions with a reference distribution.

To handle probabilistic tasks, the authors calibrate a global acceptance threshold based on the inherent variability of canonical circuit executions. For each task, the reference distribution is computed as the normalized mean of 1000 repeated executions of the canonical circuit. The within-canonical variability is quantified using the Kullback-Leibler (KL) divergence between the empirical distributions and the reference distribution. A global threshold is then derived from the 99.7th percentile of the pooled KL divergence values across all tasks, resulting in a threshold of τ = 0.05, which is used consistently across the evaluation. This calibration ensures that the acceptance criterion accounts for natural shot noise and variability in quantum measurements.
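The acceptance test for probabilistic tasks can be sketched as follows. The threshold τ = 0.05 comes from the paper; the function names and the smoothing constant are illustrative assumptions:

```python
import numpy as np

TAU = 0.05  # global acceptance threshold calibrated in the paper

def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """KL(p || q) between two discrete distributions, with additive
    smoothing so zero-probability outcomes do not produce log(0)."""
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

def accept(candidate_counts, reference_dist) -> bool:
    """Accept a candidate circuit if its empirical measurement
    distribution is within TAU KL divergence of the canonical
    reference distribution."""
    p = np.asarray(candidate_counts, dtype=float)
    p = p / p.sum()
    q = np.asarray(reference_dist, dtype=float)
    return kl_divergence(p, q) <= TAU
```

Because the criterion compares distributions rather than source text, two syntactically different circuits that produce the same measurement statistics are both accepted.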

Experiment

The QuanBench+ benchmark evaluates a diverse set of frontier and open-weight large language models on their ability to generate functional quantum code across three frameworks: Qiskit, Cirq, and PennyLane. The evaluation utilizes Pass@k metrics and compares one-shot generation against settings involving prompt prefilling and iterative feedback-based repair. Results reveal a significant asymmetry in framework difficulty, with Qiskit being the easiest and PennyLane the most challenging, suggesting that model performance is heavily tied to framework-specific API familiarity. While feedback loops effectively recover many surface-level implementation and interface errors, they do not fully close the performance gap or resolve the deeper semantic and reasoning mistakes that persist across all frameworks.
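The feedback-based repair setting described above can be sketched as a simple generate-execute-revise loop. All helper names here (`generate`, `run_in_sandbox`, `passes`) and the round limit are hypothetical, since the paper does not publish its loop implementation:

```python
def repair_loop(prompt, generate, run_in_sandbox, passes, max_repairs=2):
    """Sketch of feedback-based repair: generate code, execute it in a
    sandbox, and on a runtime error or wrong answer feed the failure
    back to the model for a revised attempt."""
    code = generate(prompt)
    result = run_in_sandbox(code)
    for _ in range(max_repairs):
        if passes(result):
            break
        # Append the observed failure so the model can revise its code.
        prompt += f"\n\nYour previous code failed with:\n{result}\nReturn corrected code only."
        code = generate(prompt)
        result = run_in_sandbox(code)
    return code, passes(result)
```

This kind of loop recovers many interface and boilerplate mistakes, which is consistent with the large Pass@1 gains after repair, but it cannot fix a circuit whose underlying quantum logic is wrong.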

The bar chart shows Pass@1 scores across models and frameworks: Qiskit achieves the highest performance and PennyLane the lowest. Feedback-based repair substantially improves scores across all frameworks and reduces the gap between models, yet the relative ranking of frameworks and the performance gap between them remain consistent.

Pass@1 scores by framework

The authors compare model performance across the three quantum computing frameworks and find consistent differences in difficulty: Qiskit yields the highest scores across models and PennyLane the lowest. Model rankings shift across frameworks, with no single model dominating all environments, and the performance gaps between frameworks persist even after feedback repair, indicating framework-specific challenges.

Model performance across frameworks

The heatmap visualizes the one-shot correctness of the models across individual tasks. Qiskit generally yields higher accuracy than PennyLane and Cirq, and performance varies widely by task: stronger models demonstrate broad, consistent success, while weaker ones show more scattered results.

Pass@1 accuracy heatmap

The bar chart compares the Pass@1 accuracy of each model with the additional correctness achieved at Pass@5. Most models gain substantially from generating multiple samples: GPT 5.1 achieves the highest Pass@1 and Pass@5 scores overall, while Qwen 2.5 7B shows the smallest improvement from Pass@5.

Pass@1 and Pass@5 accuracy comparison

The authors compare Pass@1 performance between no-prefill and prefill settings across models. Prefill consistently improves Pass@1, with the largest gains for weaker models and for frameworks whose boilerplate setup is easy to get wrong. The strongest models benefit least, suggesting that prefill primarily reduces surface-level setup errors rather than addressing core reasoning challenges.

Pass@1 comparison with prefill

These experiments evaluate model performance across different quantum computing frameworks, repair strategies, and prompting settings to identify key drivers of coding accuracy. The results demonstrate that Qiskit is consistently the most accessible framework while PennyLane presents the greatest difficulty, with performance gaps persisting regardless of feedback-based repairs. Additionally, while multiple sampling and prefill techniques significantly enhance accuracy by mitigating boilerplate errors and setup challenges, the most capable models show less sensitivity to these improvements.
