20 days ago

Detect Anything via Next Point Prediction

Qing Jiang Junan Huo Xingyu Chen Yuda Xiong Zhaoyang Zeng Yihao Chen Tianhe Ren Junzhi Yu Lei Zhang

Abstract

Object detection has long been dominated by traditional coordinate regression-based models, such as YOLO, DETR, and Grounding DINO. Although recent efforts have attempted to leverage MLLMs to tackle this task, they face challenges like low recall rate, duplicate predictions, coordinate misalignment, etc. In this work, we bridge this gap and propose Rex-Omni, a 3B-scale MLLM that achieves state-of-the-art object perception performance. On benchmarks like COCO and LVIS, Rex-Omni attains performance comparable to or exceeding regression-based models (e.g., DINO, Grounding DINO) in a zero-shot setting. This is enabled by three key designs: 1) Task Formulation: we use special tokens to represent quantized coordinates from 0 to 999, reducing the model's learning difficulty and improving token efficiency for coordinate prediction; 2) Data Engines: we construct multiple data engines to generate high-quality grounding, referring, and pointing data, providing semantically rich supervision for training; 3) Training Pipelines: we employ a two-stage training process, combining supervised fine-tuning on 22 million data with GRPO-based reinforcement post-training. This RL post-training leverages geometry-aware rewards to effectively bridge the discrete-to-continuous coordinate prediction gap, improve box accuracy, and mitigate undesirable behaviors like duplicate predictions that stem from the teacher-guided nature of the initial SFT stage. Beyond conventional detection, Rex-Omni's inherent language understanding enables versatile capabilities such as object referring, pointing, visual prompting, GUI grounding, spatial referring, OCR and key-pointing, all systematically evaluated on dedicated benchmarks. We believe that Rex-Omni paves the way for more versatile and language-aware visual perception systems.
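
To make the coordinate formulation concrete, the following is a minimal Python sketch of quantizing box coordinates into 1000 bins (0 to 999) and mapping them back to pixels. It is an illustration under stated assumptions, not the Rex-Omni implementation: the token spelling `<156>`, the bin-centre decoding, and the helper names `quantize`, `dequantize`, `box_to_tokens`, `tokens_to_box`, and `iou` are all hypothetical. The IoU helper is included only to show how a geometry-aware score can measure the small gap that quantization leaves between discrete tokens and continuous boxes; the abstract does not specify the actual reward.

```python
from typing import List, Tuple

NUM_BINS = 1000  # the abstract states coordinates are quantized to 0..999

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels


def quantize(value: float, size: float) -> int:
    """Map a pixel coordinate in [0, size] to a bin index in [0, 999]."""
    bin_index = int(value / size * NUM_BINS)
    # Clamp so that value == size lands in the last bin instead of bin 1000.
    return min(max(bin_index, 0), NUM_BINS - 1)


def dequantize(bin_index: int, size: float) -> float:
    """Map a bin index back to a pixel coordinate (bin centre; an assumption)."""
    return (bin_index + 0.5) / NUM_BINS * size


def box_to_tokens(box: Box, width: float, height: float) -> List[str]:
    """Encode a box as four coordinate tokens (hypothetical '<n>' spelling)."""
    x0, y0, x1, y1 = box
    bins = [
        quantize(x0, width),
        quantize(y0, height),
        quantize(x1, width),
        quantize(y1, height),
    ]
    return [f"<{b}>" for b in bins]


def tokens_to_box(tokens: List[str], width: float, height: float) -> Box:
    """Decode four coordinate tokens back into a pixel-space box."""
    bins = [int(token[1:-1]) for token in tokens]  # strip '<' and '>'
    return (
        dequantize(bins[0], width),
        dequantize(bins[1], height),
        dequantize(bins[2], width),
        dequantize(bins[3], height),
    )


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

With this encoding a box costs four tokens regardless of image size, which is one way to read the abstract's token-efficiency claim. Round-tripping a box through `box_to_tokens` and `tokens_to_box` shifts each edge by at most half a bin width, so the IoU against the original box is slightly below 1; a score of this kind is the sort of signal a geometry-aware reward could use.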
