SmallThinker: A Family of Efficient Large Language Models Natively Trained for Local Deployment


Abstract

While frontier large language models (LLMs) continue to push capability boundaries, their deployment remains confined to GPU-powered cloud infrastructure. We challenge this paradigm with SmallThinker, a family of LLMs natively designed - not adapted - for the unique constraints of local devices: weak computational power, limited memory, and slow storage. Unlike traditional approaches that mainly compress existing models built for clouds, we architect SmallThinker from the ground up to thrive within these limitations. Our innovation lies in a deployment-aware architecture that transforms constraints into design principles. First, we introduce a two-level sparse structure combining fine-grained Mixture-of-Experts (MoE) with sparse feed-forward networks, drastically reducing computational demands without sacrificing model capacity. Second, to conquer the I/O bottleneck of slow storage, we design a pre-attention router that enables our co-designed inference engine to prefetch expert parameters from storage while computing attention, effectively hiding the storage latency that would otherwise cripple on-device inference. Third, for memory efficiency, we employ a NoPE-RoPE hybrid sparse attention mechanism to slash KV cache requirements. We release SmallThinker-4B-A0.6B and SmallThinker-21B-A3B, which achieve state-of-the-art performance scores and even outperform larger LLMs. Remarkably, our co-designed system mostly eliminates the need for expensive GPU hardware: with Q4_0 quantization, both models exceed 20 tokens/s on ordinary consumer CPUs while consuming only 1 GB and 8 GB of memory, respectively. SmallThinker is publicly available at hf.co/PowerInfer/SmallThinker-4BA0.6B-Instruct and hf.co/PowerInfer/SmallThinker-21BA3B-Instruct.
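The pre-attention router described above can be illustrated with a minimal sketch: because expert selection happens *before* the attention block, the inference engine can fetch expert weights from slow storage in a background thread while attention computes, so the total layer time approaches max(I/O, compute) rather than their sum. All names here (`route`, `fetch_expert`, `attention`, the timing constants) are hypothetical illustrations, not the SmallThinker implementation.

```python
import time
from concurrent.futures import ThreadPoolExecutor

STORAGE_LATENCY = 0.05   # pretend per-expert read time from storage (seconds)
ATTENTION_TIME = 0.05    # pretend attention compute time (seconds)

def route(hidden_state):
    """Pre-attention router: pick experts from the layer *input*,
    so expert IDs are known before attention runs."""
    return [hash(hidden_state) % 4]  # toy top-1 routing over 4 experts

def fetch_expert(expert_id):
    """Simulated blocking read of one expert's weights from storage."""
    time.sleep(STORAGE_LATENCY)
    return f"weights[{expert_id}]"

def attention(hidden_state):
    """Simulated attention computation."""
    time.sleep(ATTENTION_TIME)
    return hidden_state

def layer_forward(hidden_state, pool):
    # 1. Route first, using the pre-attention hidden state.
    expert_ids = route(hidden_state)
    # 2. Kick off weight prefetch in the background...
    futures = [pool.submit(fetch_expert, e) for e in expert_ids]
    # 3. ...while attention computes concurrently.
    h = attention(hidden_state)
    # 4. By the time the MoE FFN needs them, the weights are (mostly) resident.
    weights = [f.result() for f in futures]
    return h, weights

with ThreadPoolExecutor(max_workers=2) as pool:
    start = time.time()
    h, weights = layer_forward("token", pool)
    elapsed = time.time() - start

# Because I/O and compute overlap, elapsed is well under their sum (0.1 s).
print(f"layer time: {elapsed:.3f}s, experts loaded: {len(weights)}")
```

A sequential engine would pay `STORAGE_LATENCY + ATTENTION_TIME` per layer; the overlap is what makes storage-resident experts viable on devices whose RAM cannot hold all parameters.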
