
Abstract
The "Roaring 20s" of visual recognition began with the introduction of Vision Transformers (ViTs), which quickly superseded ConvNets as the state-of-the-art image classification model. A vanilla ViT, however, faces difficulties when applied to general computer vision tasks such as object detection and semantic segmentation. It is the hierarchical Transformers (e.g., Swin Transformers) that reintroduced several ConvNet priors, making Transformers practically viable as a generic vision backbone and demonstrating remarkable performance on a wide variety of vision tasks. However, the effectiveness of such hybrid approaches is still largely credited to the intrinsic superiority of Transformers, rather than the inherent inductive biases of convolutions. In this work, we reexamine the design spaces and test the limits of what a pure ConvNet can achieve. We gradually "modernize" a standard ResNet toward the design of a vision Transformer, and discover several key components that contribute to the performance difference along the way. The outcome of this exploration is a family of pure ConvNet models dubbed ConvNeXt. Constructed entirely from standard ConvNet modules, ConvNeXts compete favorably with Transformers in terms of accuracy and scalability, achieving 87.8% ImageNet top-1 accuracy and outperforming Swin Transformers on COCO detection and ADE20K segmentation, while maintaining the simplicity and efficiency of standard ConvNets.
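To make the "modernization" concrete, here is a minimal PyTorch sketch of a single ConvNeXt block as described in the paper: a 7×7 depthwise convolution, LayerNorm, and an inverted-bottleneck MLP with GELU, wrapped in a residual connection. This is a simplified illustration, not the official implementation; the class name is ours, and the layer scale and stochastic depth used in facebookresearch/ConvNeXt are omitted for brevity.

```python
import torch
import torch.nn as nn


class ConvNeXtBlock(nn.Module):
    """Minimal sketch of one ConvNeXt block (layer scale / drop path omitted)."""

    def __init__(self, dim: int):
        super().__init__()
        # 7x7 depthwise conv: large kernel, one filter per channel
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)           # applied in channels-last layout
        self.pwconv1 = nn.Linear(dim, 4 * dim)  # pointwise expansion (1x1 conv as Linear)
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)  # pointwise projection back to dim

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)               # (N, C, H, W) -> (N, H, W, C)
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)               # back to (N, C, H, W)
        return shortcut + x                     # residual connection


# Quick shape check with ConvNeXt-T stage-1 width (96 channels)
block = ConvNeXtBlock(96)
print(block(torch.randn(2, 96, 56, 56)).shape)  # torch.Size([2, 96, 56, 56])
```

Note the inverted-bottleneck ordering (expand, then project) and the single normalization and activation per block, both of which the paper adopts from Transformer MLP blocks in place of the ResNet bottleneck's repeated BatchNorm and ReLU.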
Code Repositories

| Repository | Framework | Notes |
|---|---|---|
| k-h-ismail/convnext-dcls | pytorch | Mentioned on GitHub |
| dongkyuk/ConvNext-tensorflow | tf | Mentioned on GitHub |
| hmichaeli/alias_free_convnets | pytorch | Mentioned on GitHub |
| kingcong/convnext- | mindspore | |
| mzeromiko/vmamba | pytorch | Mentioned on GitHub |
| sayakpaul/ConvNeXt-TF | tf | Mentioned on GitHub |
| james77777778/keras-image-models | pytorch | Mentioned on GitHub |
| pytorch/vision | pytorch | |
| PaddlePaddle/PASSL | paddle | |
| frgfm/Holocron | pytorch | Mentioned on GitHub |
| rwightman/pytorch-image-models | pytorch | Mentioned on GitHub |
| Westlake-AI/openmixup | pytorch | Mentioned on GitHub |
| Owais-Ansari/Unet3plus | pytorch | Mentioned on GitHub |
| PaddlePaddle/PaddleClas | paddle | |
| hanfried/hanfried-bookmarks | pytorch | Mentioned on GitHub |
| duyhominhnguyen/LVM-Med | pytorch | Mentioned on GitHub |
| jmnolte/hccnet | pytorch | Mentioned on GitHub |
| martinsbruveris/tensorflow-image-models | tf | Mentioned on GitHub |
| AlassaneSakande/A-ConvNet-of-2020s | pytorch | Mentioned on GitHub |
| IMvision12/keras-vision-models | pytorch | Mentioned on GitHub |
| open-mmlab/mmclassification | pytorch | |
| waterdisappear/nudt4mstar | pytorch | Mentioned on GitHub |
| lucidrains/denoising-diffusion-pytorch | pytorch | Mentioned on GitHub |
| yaya-yns/tart | pytorch | Mentioned on GitHub |
| avocardio/resnet_vs_convnext | tf | Mentioned on GitHub |
| MindCode-4/code-3/tree/main/convnext | mindspore | |
| facebookresearch/ConvNeXt | pytorch | Official; Mentioned on GitHub |
| mit-han-lab/litepose | pytorch | Mentioned on GitHub |
| tuanio/nextformer | pytorch | Mentioned on GitHub |
| kingcong/convnext | mindspore | |
| facebookresearch/ppuda | pytorch | Mentioned on GitHub |
| Raghvender1205/ConvNeXt | pytorch | Mentioned on GitHub |
| DarshanDeshpande/jax-models | jax | Mentioned on GitHub |
| bamps53/convnext-tf | pytorch | |
| flytocc/ConvNeXt-paddle | paddle | Mentioned on GitHub |
| protonx-tf-04-projects/ConvNext-2020s | tf | Mentioned on GitHub |
| sithu31296/semantic-segmentation | pytorch | Mentioned on GitHub |
| towhee-io/towhee | pytorch | |
| 0jason000/convnext | mindspore | Mentioned on GitHub |
| lyqcom/convnext | mindspore | |
| zibbini/convnext-v2_tensorflow | tf | Mentioned on GitHub |
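Several of the repositories above ship ready-to-use ConvNeXt implementations. As one example, the following sketch loads a pretrained ConvNeXt-T through pytorch/vision (listed above); it assumes torchvision >= 0.12, the release in which the ConvNeXt models were added.

```python
import torch
from torchvision.models import convnext_tiny, ConvNeXt_Tiny_Weights

# Load an ImageNet-1k pretrained ConvNeXt-T and switch to inference mode
model = convnext_tiny(weights=ConvNeXt_Tiny_Weights.IMAGENET1K_V1).eval()

# Run a dummy 224x224 RGB batch through the classifier
with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 1000]) — one logit per ImageNet-1k class
```

The same model family is also available through timm (rwightman/pytorch-image-models above), e.g. `timm.create_model("convnext_tiny", pretrained=True)`.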
Benchmarks

| Benchmark | Method | Metrics |
|---|---|---|
| classification-on-indl | ConvNeXt | Average Recall: 93.47% |
| domain-generalization-on-imagenet-a | ConvNeXt-XL (Im21k, 384) | Top-1 accuracy (%): 69.3 |
| domain-generalization-on-imagenet-c | ConvNeXt-XL (Im21k) (augmentation overlap with ImageNet-C) | Number of params: 350M; mean Corruption Error (mCE): 38.8 |
| domain-generalization-on-imagenet-r | ConvNeXt-XL (Im21k, 384) | Top-1 Error Rate: 31.8 |
| domain-generalization-on-imagenet-sketch | ConvNeXt-XL (Im21k, 384) | Top-1 accuracy: 55.0 |
| domain-generalization-on-vizwiz | ConvNeXt-B | Accuracy (All Images): 53.5; Accuracy (Clean Images): 56; Accuracy (Corrupted Images): 46.9 |
| image-classification-on-imagenet | ConvNeXt-XL (ImageNet-22k) | GFLOPs: 179; Number of params: 350M; Top-1 Accuracy: 87.8% |
| image-classification-on-imagenet | Adlik-ViT-SG+Swin_large+Convnext_xlarge(384) | Number of params: 1827M; Top-1 Accuracy: 88.36% |
| image-classification-on-imagenet | ConvNeXt-L (384 res) | GFLOPs: 101; Number of params: 198M; Top-1 Accuracy: 85.5% |
| image-classification-on-imagenet | ConvNeXt-T | GFLOPs: 4.5; Number of params: 29M; Top-1 Accuracy: 82.1% |
| object-detection-on-coco-o | ConvNeXt-XL (Cascade Mask R-CNN) | Average mAP: 37.5; Effective Robustness: 12.68 |
| semantic-segmentation-on-ade20k | ConvNeXt-S | GFLOPs (512 x 512): 1027; Params (M): 82; Validation mIoU: 49.6 |
| semantic-segmentation-on-ade20k | ConvNeXt-B++ | GFLOPs (512 x 512): 1828; Params (M): 122; Validation mIoU: 53.1 |
| semantic-segmentation-on-ade20k | ConvNeXt-B | GFLOPs (512 x 512): 1170; Params (M): 122; Validation mIoU: 49.9 |
| semantic-segmentation-on-ade20k | ConvNeXt-T | GFLOPs (512 x 512): 939; Params (M): 60; Validation mIoU: 46.7 |
| semantic-segmentation-on-ade20k | ConvNeXt-L++ | GFLOPs (512 x 512): 2458; Params (M): 235; Validation mIoU: 53.7 |
| semantic-segmentation-on-ade20k | ConvNeXt-XL++ | GFLOPs (512 x 512): 3335; Params (M): 391; Validation mIoU: 54 |
| semantic-segmentation-on-imagenet-s | ConvNeXt-Tiny (P4, 224x224, SUP) | mIoU (test): 48.8; mIoU (val): 48.7 |