Papers
Cutting-edge AI research papers, updated daily, to help you keep up with the latest AI trends

Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation

ScribblePrompt: Fast and Flexible Interactive Segmentation for Any Biomedical Image

Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy
UI-TARS: Pioneering Automated GUI Interaction with Native Agents
HunyuanVideo: A Systematic Framework for Large Video Generative Models
MathNet: A Global Multimodal Benchmark for Mathematical Reasoning and Retrieval
Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering
Active Context Compression: Autonomous Memory Management in LLM Agents
Cut Your Losses! Learning to Prune Paths Early for Efficient Parallel Reasoning
Qwen3.5-Omni Technical Report
Web Retrieval-Aware Chunking (W-RAC) for Efficient and Cost-Effective Retrieval-Augmented Generation Systems
PersonaVLM: Long-Term Personalized Multimodal LLMs
Maximal Brain Damage Without Data or Optimization: Disrupting Neural Networks via Sign-Bit Flips
Elucidating the SNR-t Bias of Diffusion Probabilistic Models
Multimodal OCR: Parse Anything from Documents
Granite-speech: open-source speech-aware LLMs with strong English ASR capabilities
Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis
Video Object and Interaction Deletion
VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning
OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models
Where Vision Becomes Text: Locating the OCR Routing Bottleneck in Vision-Language Models
OCR or Not? Rethinking Document Information Extraction in the MLLMs Era with Real-World Large-Scale Datasets
dnaHNet: A Scalable and Hierarchical Foundation Model for Genomic Sequence Learning
Neural Computers
ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack
GlobalSplat: Efficient Feed-Forward 3D Gaussian Splatting via Global Scene Tokens
How to Fine-Tune a Reasoning Model? A Teacher-Student Cooperation Framework to Synthesize Student-Consistent SFT Data
RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework
DR3-Eval: Towards Realistic and Reproducible Deep Research Evaluation
HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds
pi0.7: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities
GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training