SmallThinker: A Family of Efficient Large Language Models Natively Trained for Local Deployment


Abstract

While frontier large language models (LLMs) continue to push capability boundaries, their deployment remains confined to GPU-powered cloud infrastructure. We challenge this paradigm with SmallThinker, a family of LLMs natively designed - not adapted - for the unique constraints of local devices: weak computational power, limited memory, and slow storage. Unlike traditional approaches that mainly compress existing models built for clouds, we architect SmallThinker from the ground up to thrive within these limitations. Our innovation lies in a deployment-aware architecture that transforms constraints into design principles. First, we introduce a two-level sparse structure combining fine-grained Mixture-of-Experts (MoE) with sparse feed-forward networks, drastically reducing computational demands without sacrificing model capacity. Second, to conquer the I/O bottleneck of slow storage, we design a pre-attention router that enables our co-designed inference engine to prefetch expert parameters from storage while computing attention, effectively hiding the storage latency that would otherwise cripple on-device inference. Third, for memory efficiency, we employ a NoPE-RoPE hybrid sparse attention mechanism to slash KV cache requirements. We release SmallThinker-4B-A0.6B and SmallThinker-21B-A3B, which achieve state-of-the-art performance scores and even outperform larger LLMs. Remarkably, our co-designed system mostly eliminates the need for expensive GPU hardware: with Q4_0 quantization, both models exceed 20 tokens/s on ordinary consumer CPUs while consuming only 1 GB and 8 GB of memory, respectively. SmallThinker is publicly available at hf.co/PowerInfer/SmallThinker-4BA0.6B-Instruct and hf.co/PowerInfer/SmallThinker-21BA3B-Instruct.
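The pre-attention router described above can be illustrated with a minimal sketch: because expert selection happens *before* the attention block, the inference engine can fetch expert weights from slow storage in a background thread while attention computes, so the total layer time approaches max(I/O, compute) rather than their sum. All names here (`route`, `fetch_expert`, `attention`, the timing constants) are hypothetical illustrations, not the SmallThinker implementation.

```python
import time
from concurrent.futures import ThreadPoolExecutor

STORAGE_LATENCY = 0.05   # pretend per-expert read time from storage (seconds)
ATTENTION_TIME = 0.05    # pretend attention compute time (seconds)

def route(hidden_state):
    """Pre-attention router: pick experts from the layer *input*,
    so expert IDs are known before attention runs."""
    return [hash(hidden_state) % 4]  # toy top-1 routing over 4 experts

def fetch_expert(expert_id):
    """Simulated blocking read of one expert's weights from storage."""
    time.sleep(STORAGE_LATENCY)
    return f"weights[{expert_id}]"

def attention(hidden_state):
    """Simulated attention computation."""
    time.sleep(ATTENTION_TIME)
    return hidden_state

def layer_forward(hidden_state, pool):
    # 1. Route first, using the pre-attention hidden state.
    expert_ids = route(hidden_state)
    # 2. Kick off weight prefetch in the background...
    futures = [pool.submit(fetch_expert, e) for e in expert_ids]
    # 3. ...while attention computes concurrently.
    h = attention(hidden_state)
    # 4. By the time the MoE FFN needs them, the weights are (mostly) resident.
    weights = [f.result() for f in futures]
    return h, weights

with ThreadPoolExecutor(max_workers=2) as pool:
    start = time.time()
    h, weights = layer_forward("token", pool)
    elapsed = time.time() - start

# Because I/O and compute overlap, elapsed is well under their sum (0.1 s).
print(f"layer time: {elapsed:.3f}s, experts loaded: {len(weights)}")
```

A sequential engine would pay `STORAGE_LATENCY + ATTENTION_TIME` per layer; the overlap is what makes storage-resident experts viable on devices whose RAM cannot hold all parameters.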
