
GENIUS: Generative Fluid Intelligence Evaluation Suite

Abstract

Unified Multimodal Models (UMMs) have shown remarkable progress in visual generation, yet existing benchmarks predominantly assess Crystallized Intelligence, which relies on recalling accumulated knowledge and learned schemas. This focus overlooks Generative Fluid Intelligence (GFI): the capacity to induce patterns, reason through constraints, and adapt to novel scenarios on the fly. To rigorously assess this capability, we introduce GENIUS (Generative Fluid Intelligence Evaluation Suite). We formalize GFI as a synthesis of three primitives: Inducing Implicit Patterns (e.g., inferring personalized visual preferences), Executing Ad-hoc Constraints (e.g., visualizing abstract metaphors), and Adapting to Contextual Knowledge (e.g., simulating counter-intuitive physics). Collectively, these primitives challenge models to solve problems grounded entirely in the immediate context. Our systematic evaluation of 12 representative models reveals significant performance deficits on these tasks. Crucially, our diagnostic analysis disentangles the failure modes, demonstrating that the deficits stem from limited context comprehension rather than insufficient intrinsic generative capability. To bridge this gap, we propose a training-free attention intervention strategy. Ultimately, GENIUS establishes a rigorous standard for GFI, guiding the field beyond knowledge utilization toward dynamic, general-purpose reasoning. Our dataset and code will be released at https://github.com/arctanxarc/GENIUS.

One-sentence Summary

Researchers from Tsinghua University and collaborators propose GENIUS, a benchmark evaluating Generative Fluid Intelligence in multimodal models via pattern induction, constraint execution, and contextual adaptation; they introduce a training-free attention intervention to address context comprehension gaps, advancing dynamic reasoning beyond static knowledge recall.

Key Contributions

  • We introduce GENIUS, the first benchmark suite designed to evaluate Generative Fluid Intelligence (GFI) in multimodal models, formalizing GFI through three primitives: inducing implicit patterns, executing ad-hoc constraints, and adapting to contextual knowledge, with tasks decoupled from static knowledge to isolate dynamic reasoning.
  • Our evaluation of 12 state-of-the-art models reveals consistent deficits in GFI tasks, with diagnostic analysis showing failures stem from poor context comprehension rather than weak generative capacity, highlighting a critical gap in current UMMs’ ability to reason dynamically.
  • To address this, we propose a training-free attention intervention strategy that improves model performance across GENIUS tasks by enhancing focus on contextual rules, validating our theoretical insight that imbalanced attention undermines implicit in-context learning.

Introduction

The authors leverage the Cattell-Horn-Carroll theory to define Generative Fluid Intelligence (GFI) as the ability to induce patterns, execute ad-hoc constraints, and adapt to novel contextual knowledge—capabilities critical for true general intelligence in visual generation. Prior benchmarks largely assess crystallized intelligence (knowledge recall), ignoring GFI, and lack formal definitions, fine-grained tasks, or diagnostic analysis of failure modes. Their main contribution is GENIUS, the first benchmark dedicated to evaluating GFI through 510 expert-curated samples across three dimensions, revealing that even top models fail due to poor context comprehension rather than weak generative capacity—and they propose a training-free attention intervention that boosts performance across tasks.

Dataset

The authors use GENIUS, a multimodal benchmark designed to evaluate Generative Fluid Intelligence (GFI), composed of three core dimensions: Implicit Pattern Induction, Ad-hoc Constraint Execution, and Contextual Knowledge Adaptation. Each dimension includes novel, expert-curated tasks requiring tight integration of visual and textual modalities; removing either modality renders the task unsolvable.

  • Implicit Pattern Induction includes Implicit Pattern Generation: models must infer unstated stylistic preferences from interleaved image-text inputs and apply them in generation. Relying on only one modality leads to failure — images alone cause feature conflation, text alone leaves preferences undefined.

  • Ad-hoc Constraint Execution features two tasks: Visual Constraint Generation and Symbolic Constraint Generation. Models must reason under novel, context-defined rules (e.g., a blue square means “remove an object,” or a function f means “melt an object”). These rules deliberately use semantically neutral elements to test abstract reasoning; missing either modality breaks rule establishment.

  • Contextual Knowledge Adaptation comprises Prior-Conflicting Generation (e.g., “weight is determined by color”) and Multi-Semantic Generation (e.g., interpreting “green hand” as novice vs. skin tone). Models must override pretrained knowledge or resolve ambiguity based on context — failure occurs if either modality is absent.

GENIUS contains 5 tasks across 3 dimensions, totaling 20 sub-tasks. The dataset is structured to test dynamic reasoning, adaptation, and cross-modal integration — no training split or mixture ratios are specified, as it is a pure evaluation benchmark. No cropping or metadata construction is mentioned; the focus is on carefully designed, modality-dependent test cases.
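
For concreteness, a single GENIUS test case can be pictured roughly as below; the field names and structure are illustrative assumptions for this summary, not the released schema:

```python
# Hypothetical sketch of one GENIUS sample (field names are assumptions, not the
# released format). Each sample interleaves images and text that jointly define the
# task; removing either modality should make the task unsolvable.
sample = {
    "dimension": "Ad-hoc Constraint Execution",
    "task": "Symbolic Constraint Generation",
    "context": [
        {"type": "image", "path": "context/rule_marker.png"},
        {"type": "text",  "content": "In this task, a blue square marks an object to be removed."},
        {"type": "image", "path": "context/target_scene.png"},
    ],
    "instruction": "Apply the rule defined above to the target scene and generate the result.",
    "evaluation_hints": [
        "The marked object must be absent from the output.",
        "All other scene content should remain visually consistent.",
    ],
}
```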

Method

The authors leverage a theoretical framework rooted in In-Context Learning (ICL) as Implicit Fine-Tuning to analyze and enhance the generative capabilities of the Bagel model, which employs a Mixture-of-Experts (MoE) Transformer architecture. Their core insight is that the ICL process during multimodal generation can be mathematically formalized as an implicit gradient descent over specific model parameters, namely the Up projection layer and the bias term within the decoder blocks. This theoretical grounding, derived from the model's forward pass, reveals that context tokens induce parameter updates that steer the generation trajectory. The authors formalize this relationship through Theorem 4.1, which establishes that a perturbation in the context input $u$ can be compensated by a corresponding perturbation in the Up and bias parameters, such that the output remains invariant. This equivalence is expressed as $\mathcal{L}_{\mathrm{Up} + \Delta \mathrm{Up},\, b + \Delta b}(u', g) = \mathcal{L}_{\mathrm{Up},\, b}(u, g)$, where the perturbations $\Delta \mathrm{Up}$ and $\Delta b$ are explicitly defined in terms of the normalized attention difference $\delta \mathcal{A}$ and the attention function $\mathcal{A}(u, g)$.
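
For intuition, the "ICL as implicit fine-tuning" view can be restated in the simplified linear-attention form used in the ICL-as-gradient-descent literature; the sketch below is that generic simplification, not the paper's exact derivation for Bagel's Up projection and bias:

```latex
% Schematic linear-attention decomposition (generic simplification, not the paper's notation).
% W_V, W_K: value/key projections; X_gen: generation tokens; X_ctx: context tokens; q: a query.
\begin{aligned}
\mathrm{Attn}(q) &\approx W_V X \,(W_K X)^{\top} q,
  \qquad X = [\,X_{\mathrm{gen}};\, X_{\mathrm{ctx}}\,] \\
 &= \underbrace{W_V X_{\mathrm{gen}} (W_K X_{\mathrm{gen}})^{\top}\, q}_{\text{zero-shot term (frozen weights)}}
  \;+\; \underbrace{W_V X_{\mathrm{ctx}} (W_K X_{\mathrm{ctx}})^{\top}\, q}_{\Delta W \text{ induced by the context}}
\end{aligned}
```

In this view, the context contributes an additive update $\Delta W$ on top of the frozen weights, which is the sense in which Theorems 4.1 and 4.2 treat the Up projection and bias as being implicitly fine-tuned by the context.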

Building on this, Theorem 4.2 further refines the analysis by deriving gradient descent update rules for these parameters across iterative context token processing. The authors show that the Up and bias parameters evolve according to $\mathrm{Up}_{i+1} = \mathrm{Up}_{i} - h\,\nabla_{\mathrm{Up}} L_{i}(\mathrm{Up}_{i})$ and $b_{i+1} = b_{i} - \nabla_{b}\big(\operatorname{tr}(\delta_{i}^{\top} b_{i})\big)$, where the learning rate $h$ and loss function $L_i$ are derived from the attention mechanism's output. This analysis identifies a critical deficit underlying poor Generative Fluid Intelligence (GFI): an imbalanced attention distribution over context tokens leads to noisy, stochastic gradient updates that fail to overcome pre-trained priors.
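
Schematically, and in our own notation rather than the paper's, the deficit can be written as a split of the accumulated implicit update over signal tokens $\mathcal{S}$ (those carrying the contextual rule) and noise tokens $\mathcal{N}$ (irrelevant context):

```latex
% Our schematic restatement of the imbalanced-attention argument (not the paper's notation).
\Delta \mathrm{Up}_{\mathrm{total}}
  \;\propto\;
  \underbrace{\sum_{i \in \mathcal{S}} a_i \,\nabla_{\mathrm{Up}} L_i}_{\text{signal: encodes the ad-hoc rule}}
  \;+\;
  \underbrace{\sum_{j \in \mathcal{N}} a_j \,\nabla_{\mathrm{Up}} L_j}_{\text{noise: irrelevant context tokens}}
```

When the attention weights $a_j$ assigned to irrelevant tokens are comparable to those on signal tokens, the noise term dominates the direction and norm of the update, so the implicit fine-tuning fails to pull the generation away from pre-trained priors.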

To address this, the authors propose a training-free Attention Adjustment Mechanism, designed to recalibrate the implicit gradient direction by suppressing the influence of irrelevant "noise" tokens. The mechanism operates as a three-stage pipeline. First, in the Keyword Distillation phase, the model is prompted to extract task-critical visual cues from the context images as a set of region-specific keywords $\mathcal{K}$. This step is guided by a structured prompt template, which instructs the model to parse multimodal instructions and map each image to its specific role, whether as a target canvas, a source of features, or an irrelevant reference. The prompt enforces a strict JSON output format to ensure precise, machine-readable keyword generation.
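
The exact prompt template and JSON schema are not reproduced in this summary; the sketch below is a hypothetical approximation of what such a keyword-distillation prompt could look like:

```python
# Hypothetical keyword-distillation prompt (an approximation for illustration; the
# authors' actual template and JSON schema may differ). The idea is to map each
# context image to a role and a set of task-critical keywords K.
KEYWORD_DISTILLATION_PROMPT = """
You are given a multimodal instruction with several context images.
For each image, decide its role: "target_canvas", "feature_source", or "irrelevant_reference".
Then list the task-critical visual keywords (objects, regions, attributes) that the
generation must attend to. Respond with JSON only, no extra text, for example:

{
  "images": [
    {"index": 0, "role": "feature_source", "keywords": ["blue square marker", "sofa"]},
    {"index": 1, "role": "target_canvas", "keywords": ["living room", "floor lamp"]}
  ]
}
"""
```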

Second, during Relevance Mapping, the model computes a semantic relevance map $\mathbf{S}$ by evaluating the alignment between the distilled keywords and the visual context tokens. This map serves as a proxy for each token's contribution to the effective gradient signal. Finally, in the Bias Injection stage, the authors inject a spatial bias $\mathcal{F}(\mathbf{S})$ directly into the attention logits of selected decoder layers and generation steps. The modulated attention logits $\hat{\mathbf{A}}^{l,h}$ are computed as $\hat{\mathbf{A}}^{l,h}(i, j) = \mathbf{A}^{l,h}(i, j) + \lambda \cdot \mathcal{F}(\mathbf{S}_{i})$, where the function $\mathcal{F}(\cdot)$ normalizes the relevance scores to a bipolar distribution, effectively amplifying the attention weights of signal tokens and suppressing those of noise tokens. The final attention weights are then computed via the standard softmax operation, ensuring that the gradient norm contribution from noise is exponentially dampened.
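
A minimal sketch of the bias-injection step, assuming per-token relevance scores $\mathbf{S}$ in $[0, 1]$ and a bipolar mapping $\mathcal{F}(\mathbf{S}) = 2\mathbf{S} - 1$ (the paper specifies only that $\mathcal{F}$ normalizes scores to a bipolar distribution, so this particular choice is an assumption):

```python
import torch

def inject_attention_bias(attn_logits, relevance, lam=2.0, context_slice=None):
    """Add a bipolar spatial bias to attention logits before the softmax.

    attn_logits:   [batch, heads, q_len, k_len] raw attention scores A^{l,h}
    relevance:     [k_ctx] relevance scores S_i in [0, 1] for the context tokens
    lam:           strength lambda of the injected bias
    context_slice: positions of the context tokens along the key axis
    """
    # F(S): map relevance in [0, 1] to a bipolar distribution in [-1, 1]
    # (illustrative choice; any mapping that amplifies signal tokens and
    # suppresses noise tokens serves the same purpose).
    bipolar = 2.0 * relevance - 1.0

    if context_slice is None:
        context_slice = slice(0, relevance.shape[0])

    biased = attn_logits.clone()
    # Broadcast the bias over batch, heads, and query positions, touching only context keys.
    biased[..., context_slice] = biased[..., context_slice] + lam * bipolar

    # Standard softmax over keys; noise-token contributions are exponentially dampened.
    return torch.softmax(biased, dim=-1)
```

Applied only at selected decoder layers and generation steps, this leaves the model weights untouched and acts purely at inference time.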

This intervention transforms the implicit gradient from a noisy, signal-plus-noise composition into a clean, signal-dominated update, as illustrated in the conceptual diagram. The authors demonstrate that this method deterministically steers the optimization trajectory, enabling the model to overcome pre-trained priors and achieve more accurate, instruction-following generation. The entire process is implemented without modifying the model’s weights, making it a lightweight, post-hoc enhancement to existing multimodal architectures.

Experiment

  • Evaluated 12 models using a hybrid LMM-based framework (Gemini-3-Pro) with three metrics: Rule Compliance, Visual Consistency, and Aesthetic Quality, grounded in human-curated hints to ensure rigor (a hypothetical judging sketch follows this list).
  • State-of-the-art models, including Nano Banana Pro, scored below passing thresholds, revealing a fundamental gap in fluid intelligence despite strong aesthetic output.
  • Models consistently fail to override pre-trained priors when faced with conflicting or novel rules, showing cognitive inertia rather than adaptive reasoning.
  • Aesthetic quality masks deeper failures: models generate visually plausible images but violate logical or rule-based constraints, exposing a bias toward surface realism over contextual fidelity.
  • Inference-time strategies like pre-planning or post-reflection yield minimal gains, indicating architectural limitations in leveraging reasoning for generation.
  • Human-curated hints significantly improve performance, especially for stronger models, confirming that context comprehension is critical—but insufficient without robust generative capability.
  • Reformulating tasks as VQA reveals models often understand instructions but fail to execute them visually, pointing to a “know-but-cannot-draw” gap in decoder efficiency.
  • LMM-as-judge validation shows high correlation with human ratings (r > 0.96), and cross-judge consistency (Qwen2.5-VL-72B) confirms results reflect intrinsic model gaps, not evaluator bias.
  • Attention visualization reveals noisy, unfocused context processing in baseline models; a proposed intervention sharpens attention on key tokens, boosting performance without parameter updates.
  • Context ablation confirms its necessity: removing contextual input causes severe performance drops, especially in tasks requiring inductive reasoning or overriding priors.
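
A hypothetical sketch of the judging step referenced in the first bullet is shown below; the client API and prompt wording are assumptions for illustration, not the authors' exact protocol:

```python
# Hypothetical LMM-as-judge call (client API and prompt wording are assumptions).
JUDGE_PROMPT = """
You are scoring a generated image against a GENIUS test case.
Task context: {context_description}
Human-curated hints: {hints}
Rate the image from 0 to 100 on each axis and return JSON only:
{{"rule_compliance": ..., "visual_consistency": ..., "aesthetic_quality": ...}}
"""

def judge_sample(lmm_client, generated_image, context_description, hints):
    """Ask a large multimodal judge to score one generation on the three GENIUS metrics."""
    prompt = JUDGE_PROMPT.format(
        context_description=context_description,
        hints="; ".join(hints),
    )
    # `lmm_client.generate` is a placeholder for whatever multimodal API is used.
    return lmm_client.generate(prompt=prompt, images=[generated_image])
```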

In summary, the authors use a large multimodal model as judge to assess 12 image generation models on Generative Fluid Intelligence across rule compliance, visual consistency, and aesthetic quality. Even top proprietary models such as Nano Banana Pro score below 60 overall, with open-source models lagging further behind, and the failures concentrate in adapting to novel or conflicting rules rather than in rendering quality. The results expose a consistent gap between comprehending instructions and generating visually compliant outputs: aesthetic quality often masks deeper logical failures, and current architectures struggle to override pre-trained priors when faced with ad-hoc or conflicting instructions. GENIUS's multimodal interleaved context and hybrid evaluation protocol are what surface these limitations, and interventions that sharpen attention on the relevant context yield measurable gains, indicating that better context processing, rather than raw generative capacity, is the key lever for advancing generative adaptability.

