4 months ago

SciArena: An Open Evaluation Platform for Foundation Models in Scientific Literature Tasks

Abstract

We present SciArena, an open and collaborative platform for evaluating foundation models on scientific literature tasks. Unlike traditional benchmarks for scientific literature understanding and synthesis, SciArena engages the research community directly, following the Chatbot Arena evaluation approach of community voting on model comparisons. By leveraging collective intelligence, SciArena offers a community-driven evaluation of model performance on open-ended scientific tasks that demand literature-grounded, long-form responses. The platform currently supports 23 open-source and proprietary foundation models and has collected over 13,000 votes from trusted researchers across diverse scientific domains. We analyze the data collected so far and confirm that the submitted questions are diverse, aligned with real-world literature needs, and that participating researchers demonstrate strong self-consistency and inter-annotator agreement in their evaluations. We discuss the results and insights based on the model ranking leaderboard. To further promote research in building model-based automated evaluation systems for literature tasks, we release SciArena-Eval, a meta-evaluation benchmark based on our collected preference data. The benchmark measures the accuracy of models in judging answer quality by comparing their pairwise assessments with human votes. Our experiments highlight the benchmark's challenges and emphasize the need for more reliable automated evaluation methods.
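
The accuracy measure described for SciArena-Eval, agreement between a model's pairwise judgments and human votes, can be illustrated with a minimal sketch. The function name, the "A"/"B" vote encoding, and the toy data below are assumptions made for illustration; the abstract does not specify the benchmark's data format or scoring code.

```python
from typing import Sequence


def pairwise_judge_accuracy(
    human_votes: Sequence[str], judge_votes: Sequence[str]
) -> float:
    """Fraction of comparisons where the judge prefers the same response as the human."""
    if len(human_votes) != len(judge_votes):
        raise ValueError("vote lists must have the same length")
    if not human_votes:
        raise ValueError("at least one comparison is required")
    # Count comparisons where the model judge agrees with the human vote.
    matches = sum(h == j for h, j in zip(human_votes, judge_votes))
    return matches / len(human_votes)


# Hypothetical toy data: "A" or "B" marks the preferred response in each pair.
human = ["A", "B", "A", "A"]
judge = ["A", "B", "B", "A"]
accuracy = pairwise_judge_accuracy(human, judge)
print(accuracy)  # 0.75
```

Here the judge agrees with the human voter on three of four comparisons, giving an accuracy of 0.75. A real evaluation would also need to decide how to treat ties or skipped votes, which this sketch leaves out.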
