HyperAI

Main

GPU

Console
Studio
Docs
Pricing

Pulse

News

Resources

Papers
Notebooks
Datasets
Wiki

Benchmarks

SOTA
LLM Models
GPU Leaderboard

Community

Events

Utility

About Terms of Service Privacy Policy
English

Papers

Daily updated cutting-edge AI research papers to help you keep up with the latest AI trends

Build the Future of Artificial Intelligence

About

About Us Support Dataset Help

Products

News Papers Notebooks Datasets Wiki

Links

© HyperAI

GitHub Discord X (formerly Twitter)

Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens

Chengshuai Zhao, Zhen Tan, Pingchuan Ma, et al.

VeriGUI: Verifiable Long-Chain GUI Dataset

Shunyu Liu, Minghao Liu, Huichi Zhou, et al.

Qwen2.5-VL Technical Report

Document Understanding

Video Understanding

Shuai Bai, Keqin Chen, Xuejing Liu, et al.

The GAN is dead; long live the GAN! A Modern GAN Baseline

Computer Vision

Yiwen Huang, Aaron Gokaslan, Volodymyr Kuleshov, et al.

MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval

Junjie Zhou, Zheng Liu, Ze Liu, et al.

Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition

Zhisheng Zhong, Chengyao Wang, Yuqi Liu, et al.

NVILA: Efficient Frontier Visual Language Models

Video Understanding

Zhijian Liu, Ligeng Zhu, Baifeng Shi, et al.

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Zhe Chen, Weiyun Wang, Yue Cao, et al.

VisionZip: Longer is Better but Not Necessary in Vision Language Models

Multimodal Representation

Senqiao Yang, Yukang Chen, Zhuotao Tian, et al.

Baichuan-Omni Technical Report

Yadong Li, Haoze Sun, Mingan Lin, et al.

MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

Haotian Zhang, Mingfei Gao, Zhe Gan, et al.

Emu3: Next-Token Prediction is All You Need

Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, et al.

CogVLM2: Visual Language Models for Image and Video Understanding

Image Understanding

Video Understanding

Wenyi Hong, Weihan Wang, Ming Ding, et al.

Qwen2 Technical Report

Code Generation

An Yang, Baosong Yang, Binyuan Hui, et al.

An Image is Worth 32 Tokens for Reconstruction and Generation

Image Generation

Qihang Yu, Mark Weber, Xueqing Deng, et al.

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

Image Generation

Peize Sun, Yi Jiang, Shoufa Chen, et al.

Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models

Visual Question Answering

Byung-Kwan Lee, Chae Won Kim, Beomchan Park, et al.

FIFO-Diffusion: Generating Infinite Videos from Text without Training

Diffusion Model

Video Generation

Jihwan Kim, Junoh Kang, Jinyoung Choi, et al.

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

Visual Question Answering

Document Understanding

Zhe Chen, Weiyun Wang, Hao Tian, et al.

Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing

Ye Tian, Baolin Peng, Linfeng Song, et al.

OmniFusion Technical Report

Visual Question Answering

Elizaveta Goncharova, Anton Razzhigaev, Matvey Mikhalchuk, et al.

Machine learning prediction errors better than DFT accuracy

Molecular Network

Felix A. Faber, Luke Hutchison, Bing Huang, et al.

SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience

Zeyi Sun, Ziyu Liu, Yuhang Zang, et al.

AMix-1: A Pathway to Test-Time Scalable Protein Foundation Model

Changze Lv, Jiang Zhou, Siyu Long, et al.

CRINN: Contrastive Reinforcement Learning for Approximate Nearest Neighbor Search

Retrieval-Augmented Generation

Xiaoya Li, Xiaofei Sun, Albert Wang, et al.

Representation Shift: Unifying Token Compression with FlashAttention

Video Processing

Joonmyung Choi, Sanghyeok Lee, Byungoh Ko, et al.

CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward

Shudong Liu, Hongwei Liu, Junnan Liu, et al.

LongVie: Multimodal-Guided Controllable Ultra-Long Video Generation

Video Generation

Jianxiong Gao, Zhaoxi Chen, Xian Liu, et al.

Skywork UniPic: Unified Autoregressive Modeling for Visual Understanding and Generation

Image Understanding

Peiyu Wang, Yi Peng, Yimeng Gan, et al.

Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference

Diffusion Model

Yuxuan Song, Zheng Zhang, Cheng Luo, et al.

Agent Lightning: Train ANY AI Agents with Reinforcement Learning

Reinforcement Learning

Xufang Luo, Yuge Zhang, Zhiyuan He, et al.

Automated Algorithmic Discovery for Gravitational-Wave Detection Guided by LLM-Informed Evolutionary Monte Carlo Tree Search

Machine Learning

He Wang, Liang Zeng
