UI-Level Evaluation of ALLaM 34B: Measuring an Arabic-Centric LLM via HUMAIN Chat
Omer Nacar

Abstract
Large language models (LLMs) trained primarily on English corpora often struggle to capture the linguistic and cultural nuances of Arabic. To address this gap, the Saudi Data and AI Authority (SDAIA) introduced the ALLaM family of Arabic-focused models. The most capable of these available to the public, ALLaM-34B, was subsequently adopted by HUMAIN, which developed and deployed HUMAIN Chat, a closed conversational web service built on this model. This paper presents an expanded and refined UI-level evaluation of ALLaM-34B. Using a prompt pack spanning Modern Standard Arabic, five regional dialects, code-switching, factual knowledge, arithmetic and temporal reasoning, creative generation, and adversarial safety, we collected 115 outputs (23 prompts × 5 runs) and scored each with three frontier LLM judges (GPT-5, Gemini 2.5 Pro, Claude Sonnet-4). We compute category-level means with 95% confidence intervals, analyze score distributions, and visualize dialect-wise metric heat maps. The updated analysis reveals consistently high performance on generation and code-switching tasks (both averaging 4.92/5), alongside strong results in MSA handling (4.74/5), solid reasoning ability (4.64/5), and improved dialect fidelity (4.21/5). Safety-related prompts show stable, reliable performance (4.54/5). Taken together, these results position ALLaM-34B as a robust and culturally grounded Arabic LLM, demonstrating both technical strength and practical readiness for real-world deployment.
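As a minimal sketch of the aggregation described above, the snippet below shows one way per-category means and 95% confidence intervals could be computed from judge scores. The category names and score values are placeholders, not data from the paper, and the normal-approximation interval is an assumption about the statistical method.

```python
# Hypothetical sketch (not the paper's code): per-category means with
# 95% confidence intervals over judge ratings on a 1-5 scale.
import math
import statistics

# scores[category] -> list of judge ratings; values below are illustrative only
scores = {
    "generation": [5, 5, 5, 4, 5],
    "code-switching": [5, 5, 4, 5, 5],
    "msa": [5, 4, 5, 5, 4],
}

def mean_with_ci(values, z=1.96):
    """Return (mean, half-width of a normal-approximation 95% CI)."""
    m = statistics.mean(values)
    if len(values) < 2:
        return m, 0.0
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return m, z * sem

for category, vals in scores.items():
    m, half = mean_with_ci(vals)
    print(f"{category}: {m:.2f} ± {half:.2f} (95% CI)")
```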