4 months ago

o3-mini vs DeepSeek-R1: Which One is Safer?

Aitor Arrieta, Miriam Ugarte, Pablo Valle, José Antonio Parejo, Sergio Segura

Abstract

The irruption of DeepSeek-R1 constitutes a turning point for the AI industry in general and for LLMs in particular. The model has demonstrated outstanding performance in several tasks, including creative thinking, code generation, maths and automated program repair, at an apparently lower execution cost. However, LLMs must adhere to an important qualitative property, i.e., their alignment with safety and human values. A clear competitor of DeepSeek-R1 is its American counterpart, OpenAI's o3-mini model, which is expected to set high standards in terms of performance, safety and cost. In this paper we conduct a systematic assessment of the safety level of both DeepSeek-R1 (70B version) and OpenAI's o3-mini (beta version). To this end, we make use of our recently released automated safety testing tool, named ASTRAL. By leveraging this tool, we automatically and systematically generate and execute a total of 1260 unsafe test inputs on both models. After conducting a semi-automated assessment of the outcomes provided by both LLMs, the results indicate that DeepSeek-R1 is highly unsafe compared to OpenAI's o3-mini. Based on our evaluation, DeepSeek-R1 responded unsafely to 11.98% of the executed prompts, whereas o3-mini did so for only 1.19%.
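To put the two reported rates on a common footing, the arithmetic can be sketched as follows. This is only an illustration: it assumes that the 1260 test inputs were executed in full on each model, which the abstract suggests but does not state outright, and the resulting counts are inferred from the percentages rather than reported by the authors. The names `TOTAL_PROMPTS`, `reported_unsafe_pct` and `implied_unsafe_count` are hypothetical.

```python
# Figures taken from the abstract.
TOTAL_PROMPTS = 1260  # assumption: each model received all 1260 prompts
reported_unsafe_pct = {"DeepSeek-R1": 11.98, "o3-mini": 1.19}


def implied_unsafe_count(pct: float, total: int = TOTAL_PROMPTS) -> int:
    """Nearest whole number of unsafe responses implied by a percentage."""
    return round(pct / 100 * total)


# Inferred counts; these are not numbers reported in the abstract.
implied = {
    model: implied_unsafe_count(pct)
    for model, pct in reported_unsafe_pct.items()
}

# How many times more often DeepSeek-R1 responded unsafely than o3-mini.
rate_ratio = reported_unsafe_pct["DeepSeek-R1"] / reported_unsafe_pct["o3-mini"]

for model, count in implied.items():
    print(f"{model}: about {count} of {TOTAL_PROMPTS} responses unsafe")
print(f"Unsafe-rate ratio: about {rate_ratio:.1f}x")
```

Under that assumption the percentages correspond to roughly 151 and 15 unsafe responses respectively, so DeepSeek-R1 responded unsafely about ten times as often as o3-mini.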

Benchmarks

Benchmark: question-answering-on-newsqa
Methodology: OpenAI/o3-mini-2025-01-31-high
Metrics: EM: 96.52, F1: 92.13
