A ConvNet for the 2020s
Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie

Abstract
The "Roaring 20s" of visual recognition began with the introduction of Vision Transformers (ViTs), which quickly superseded ConvNets as the state-of-the-art image classification model. A vanilla ViT, on the other hand, faces difficulties when applied to general computer vision tasks such as object detection and semantic segmentation. It is the hierarchical Transformers (e.g., Swin Transformers) that reintroduced several ConvNet priors, making Transformers practically viable as a generic vision backbone and demonstrating remarkable performance on a wide variety of vision tasks. However, the effectiveness of such hybrid approaches is still largely credited to the intrinsic superiority of Transformers, rather than the inherent inductive biases of convolutions. In this work, we reexamine the design spaces and test the limits of what a pure ConvNet can achieve. We gradually "modernize" a standard ResNet toward the design of a vision Transformer, and discover several key components that contribute to the performance difference along the way. The outcome of this exploration is a family of pure ConvNet models dubbed ConvNeXt. Constructed entirely from standard ConvNet modules, ConvNeXts compete favorably with Transformers in terms of accuracy and scalability, achieving 87.8% ImageNet top-1 accuracy and outperforming Swin Transformers on COCO detection and ADE20K segmentation, while maintaining the simplicity and efficiency of standard ConvNets.
Code Repositories
facebookresearch/ConvNeXt (official PyTorch implementation): https://github.com/facebookresearch/ConvNeXt
Benchmarks
| Benchmark | Model | Metrics |
|---|---|---|
| classification-on-indl | ConvNeXt | Average Recall: 93.47% |
| domain-generalization-on-imagenet-a | ConvNeXt-XL (Im21k, 384) | Top-1 Accuracy (%): 69.3 |
| domain-generalization-on-imagenet-c | ConvNeXt-XL (Im21k; augmentation overlap with ImageNet-C) | Number of params: 350M; mean Corruption Error (mCE): 38.8 |
| domain-generalization-on-imagenet-r | ConvNeXt-XL (Im21k, 384) | Top-1 Error Rate: 31.8 |
| domain-generalization-on-imagenet-sketch | ConvNeXt-XL (Im21k, 384) | Top-1 Accuracy: 55.0 |
| domain-generalization-on-vizwiz | ConvNeXt-B | Accuracy (All Images): 53.5; Accuracy (Clean Images): 56.0; Accuracy (Corrupted Images): 46.9 |
| image-classification-on-imagenet | ConvNeXt-XL (ImageNet-22k) | GFLOPs: 179; Number of params: 350M; Top-1 Accuracy: 87.8% |
| image-classification-on-imagenet | Adlik-ViT-SG+Swin_large+Convnext_xlarge(384) | Number of params: 1827M; Top-1 Accuracy: 88.36% |
| image-classification-on-imagenet | ConvNeXt-L (384 res) | GFLOPs: 101; Number of params: 198M; Top-1 Accuracy: 85.5% |
| image-classification-on-imagenet | ConvNeXt-T | GFLOPs: 4.5; Number of params: 29M; Top-1 Accuracy: 82.1% |
| object-detection-on-coco-o | ConvNeXt-XL (Cascade Mask R-CNN) | Average mAP: 37.5; Effective Robustness: 12.68 |
| semantic-segmentation-on-ade20k | ConvNeXt-S | GFLOPs (512 x 512): 1027; Params (M): 82; Validation mIoU: 49.6 |
| semantic-segmentation-on-ade20k | ConvNeXt-B++ | GFLOPs (512 x 512): 1828; Params (M): 122; Validation mIoU: 53.1 |
| semantic-segmentation-on-ade20k | ConvNeXt-B | GFLOPs (512 x 512): 1170; Params (M): 122; Validation mIoU: 49.9 |
| semantic-segmentation-on-ade20k | ConvNeXt-T | GFLOPs (512 x 512): 939; Params (M): 60; Validation mIoU: 46.7 |
| semantic-segmentation-on-ade20k | ConvNeXt-L++ | GFLOPs (512 x 512): 2458; Params (M): 235; Validation mIoU: 53.7 |
| semantic-segmentation-on-ade20k | ConvNeXt-XL++ | GFLOPs (512 x 512): 3335; Params (M): 391; Validation mIoU: 54.0 |
| semantic-segmentation-on-imagenet-s | ConvNeXt-Tiny (P4, 224x224, SUP) | mIoU (test): 48.8; mIoU (val): 48.7 |
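For reference, the ImageNet classification rows above can be approximately reproduced with the pretrained ConvNeXt weights that ship in torchvision (0.13 or later). A minimal usage sketch follows, with the caveat that torchvision's checkpoints and preprocessing may differ slightly from the paper's original release:

```python
# Hedged usage sketch: torchvision's pretrained ConvNeXt-T, which should
# roughly match the ConvNeXt-T ImageNet row above (82.1% top-1).
import torch
from torchvision.models import convnext_tiny, ConvNeXt_Tiny_Weights

weights = ConvNeXt_Tiny_Weights.IMAGENET1K_V1
model = convnext_tiny(weights=weights).eval()
preprocess = weights.transforms()  # resize, center-crop, normalize

# Classify a dummy uint8 image tensor (substitute a real image in practice).
img = torch.zeros(3, 256, 256, dtype=torch.uint8)
with torch.no_grad():
    logits = model(preprocess(img).unsqueeze(0))
top1 = logits.argmax(dim=1).item()
print(weights.meta["categories"][top1])
```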