o3-mini vs DeepSeek-R1: Which One is Safer?

Aitor Arrieta, Miriam Ugarte, Pablo Valle, José Antonio Parejo, Sergio Segura

Abstract

The irruption of DeepSeek-R1 constitutes a turning point for the AI industry in general and the LLMs in particular. Its capabilities have demonstrated outstanding performance in several tasks, including creative thinking, code generation, maths and automated program repair, at apparently lower execution cost. However, LLMs must adhere to an important qualitative property, i.e., their alignment with safety and human values. A clear competitor of DeepSeek-R1 is its American counterpart, OpenAI's o3-mini model, which is expected to set high standards in terms of performance, safety and cost. In this paper we conduct a systematic assessment of the safety level of both DeepSeek-R1 (70b version) and OpenAI's o3-mini (beta version). To this end, we make use of our recently released automated safety testing tool, named ASTRAL. By leveraging this tool, we automatically and systematically generate and execute a total of 1260 unsafe test inputs on both models. After conducting a semi-automated assessment of the outcomes provided by both LLMs, the results indicate that DeepSeek-R1 is highly unsafe as compared to OpenAI's o3-mini. Based on our evaluation, DeepSeek-R1 answered unsafely to 11.98% of the executed prompts whereas o3-mini only to 1.19%.
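
The reported percentages can be turned back into approximate counts with a short calculation. The sketch below assumes that the 1260 test inputs were executed on each model separately (the abstract does not state this explicitly), and the names `TOTAL_PROMPTS`, `REPORTED_UNSAFE_RATE` and `implied_unsafe_count` are illustrative, not part of ASTRAL.

```python
# Figures taken from the abstract; everything else here is an illustrative assumption.
TOTAL_PROMPTS = 1260  # assumed to be the number of prompts executed per model

REPORTED_UNSAFE_RATE = {  # percentage of prompts answered unsafely
    "DeepSeek-R1": 11.98,
    "o3-mini": 1.19,
}


def implied_unsafe_count(rate_percent: float, total: int = TOTAL_PROMPTS) -> int:
    """Return the whole number of unsafe answers implied by a rounded percentage."""
    return round(rate_percent / 100 * total)


# Approximate number of unsafe answers per model under the assumption above.
counts = {model: implied_unsafe_count(rate) for model, rate in REPORTED_UNSAFE_RATE.items()}

# How many times more often DeepSeek-R1 answered unsafely than o3-mini.
ratio = REPORTED_UNSAFE_RATE["DeepSeek-R1"] / REPORTED_UNSAFE_RATE["o3-mini"]

if __name__ == "__main__":
    for model, count in counts.items():
        print(f"{model}: about {count} unsafe answers out of {TOTAL_PROMPTS}")
    print(f"Unsafe-rate ratio: about {ratio:.1f}x")
```

Under that assumption the percentages correspond to roughly 151 and 15 unsafe answers respectively, i.e. about a tenfold difference in unsafe-response rate.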

Code Repositories

trust4ai/astral

Official

Benchmarks

Benchmark	Methodology	Metrics
question-answering-on-newsqa	OpenAI/o3-mini-2025-01-31-high	EM: 96.52 F1: 92.13
