Wenhai Wang, Jifeng Dai, Zhe Chen, Zhenhang Huang, Zhiqi Li, Xizhou Zhu, Xiaowei Hu, Tong Lu, Lewei Lu, Hongsheng Li, Xiaogang Wang, Yu Qiao

Abstract
In recent years, large-scale Vision Transformers (ViTs) have made remarkable progress, while large-scale models based on convolutional neural networks (CNNs) are still in an early stage of development. This paper presents InternImage, a new large-scale CNN foundation model that, like ViTs, gains performance from increased parameters and training data. Unlike recent CNNs that focus on large dense convolution kernels, InternImage takes deformable convolution as its core operator, so that the model not only has the large effective receptive field required by downstream tasks such as object detection and segmentation, but also performs adaptive spatial aggregation conditioned on the input and task information. As a result, InternImage substantially reduces the strict inductive biases of traditional CNNs, making it possible to learn stronger and more robust feature representations from massive parameters and large-scale data, on par with ViTs. We validate the effectiveness of the model on challenging benchmarks including ImageNet, COCO, and ADE20K. Notably, InternImage-H sets a new record of 65.4 mAP on COCO test-dev and reaches 62.9 mIoU on ADE20K, outperforming current leading CNNs and ViTs. The code will be released at https://github.com/OpenGVLab/InternImage.
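The core operator described above — deformable convolution, which samples the input at learned per-point offsets with learned modulation scalars — can be illustrated with a minimal single-channel NumPy sketch. This is our own illustration, not the paper's DCNv3 implementation (which is multi-group, multi-channel, and GPU-optimized); the names `bilinear_sample` and `deform_conv_point` are hypothetical:

```python
import numpy as np

def bilinear_sample(img, y, x):
    """Bilinearly sample a 2D array at fractional coords (y, x); zero outside."""
    H, W = img.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    val = 0.0
    for dy in (0, 1):
        for dx in (0, 1):
            yi, xi = y0 + dy, x0 + dx
            if 0 <= yi < H and 0 <= xi < W:
                # Weight each corner by its proximity to the sample point.
                val += (1 - abs(y - yi)) * (1 - abs(x - xi)) * img[yi, xi]
    return val

def deform_conv_point(img, weights, offsets, masks, cy, cx):
    """One output value of a 3x3 deformable conv (single channel) at (cy, cx).

    weights: (9,)   kernel weights
    offsets: (9, 2) learned (dy, dx) offset per sampling point
    masks:   (9,)   learned modulation scalars
    """
    out, k = 0.0, 0
    for dy in (-1, 0, 1):          # regular 3x3 grid...
        for dx in (-1, 0, 1):
            y = cy + dy + offsets[k, 0]   # ...shifted by a learned offset
            x = cx + dx + offsets[k, 1]
            out += weights[k] * masks[k] * bilinear_sample(img, y, x)
            k += 1
    return out
```

With zero offsets and unit masks this reduces to an ordinary 3x3 convolution; in the model, the offsets and masks are predicted from the input, which is what gives the operator its adaptive, input-dependent receptive field.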
Code Repositories
- opengvlab/internimage (official, PyTorch, mentioned on GitHub)
- OpenGVLab/M3I-Pretraining (mentioned on GitHub)
- chenller/mmseg-extension (PyTorch, mentioned on GitHub)
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| 2d-object-detection-on-bdd100k-val | InternImage-H | mAP: 38.8 |
| image-classification-on-imagenet | InternImage-S | GFLOPs: 8 Number of params: 50M Top 1 Accuracy: 84.2% |
| image-classification-on-imagenet | InternImage-B | GFLOPs: 16 Number of params: 97M Top 1 Accuracy: 84.9% |
| image-classification-on-imagenet | InternImage-DCNv3-G (M3I Pre-training) | Number of params: 3000M Top 1 Accuracy: 90.1% |
| image-classification-on-imagenet | InternImage-T | GFLOPs: 5 Number of params: 30M Top 1 Accuracy: 83.5% |
| image-classification-on-imagenet | InternImage-L | GFLOPs: 108 Number of params: 223M Top 1 Accuracy: 87.7% |
| image-classification-on-imagenet | InternImage-H | GFLOPs: 1478 Number of params: 1080M Top 1 Accuracy: 89.6% |
| image-classification-on-imagenet | InternImage-XL | GFLOPs: 163 Number of params: 335M Top 1 Accuracy: 88.0% |
| image-classification-on-inaturalist-2018 | InternImage-H | Top-1 Accuracy: 92.6% |
| image-classification-on-places205 | InternImage-H | Top 1 Accuracy: 71.7% |
| image-classification-on-places365 | InternImage-H(CNN) | Top 1 Accuracy: 61.2% |
| instance-segmentation-on-coco | InternImage-H | AP50: 80.8 AP75: 62.2 APL: 70.3 APM: 58.9 APS: 41.0 |
| instance-segmentation-on-coco-minival | InternImage-S | GFLOPs: 340 Params (M): 69 box AP: 49.7 mask AP: 44.5 |
| instance-segmentation-on-coco-minival | InternImage-T | GFLOPs: 270 Params (M): 49 box AP: 49.1 mask AP: 43.7 |
| instance-segmentation-on-coco-minival | InternImage-XL | GFLOPs: 1782 Params (M): 387 mask AP: 48.8 |
| instance-segmentation-on-coco-minival | InternImage-H | AP50: 80.1 AP75: 61.5 APL: 74.4 APM: 58.4 APS: 37.9 mask AP: 55.4 |
| instance-segmentation-on-coco-minival | InternImage-B | GFLOPs: 501 Params (M): 115 |
| instance-segmentation-on-coco-minival | InternImage-L | GFLOPs: 1399 Params (M): 277 box AP: 56.1 mask AP: 48.5 |
| object-detection-on-coco | InternImage-XL | Params (M): 602 box mAP: 64.3 |
| object-detection-on-coco | InternImage-H (M3I Pre-training) | Params (M): 2180 |
| object-detection-on-coco-minival | InternImage-H | box AP: 65.0 |
| object-detection-on-coco-minival | InternImage-XL | box AP: 64.2 |
| object-detection-on-coco-o | InternImage-L (Cascade Mask R-CNN) | Average mAP: 37.0 Effective Robustness: 11.72 |
| object-detection-on-crowdhuman-full-body | InternImage-H | AP: 97.2 |
| object-detection-on-lvis-v1-0-minival | InternImage-H | box AP: 65.8 |
| object-detection-on-lvis-v1-0-val | InternImage-H | box AP: 63.2 |
| object-detection-on-openimages-v6 | InternImage-H | box AP: 74.1 |
| object-detection-on-pascal-voc-2012 | InternImage-H | mAP: 97.2 |
| semantic-segmentation-on-ade20k | InternImage-L | GFLOPs: 2526 Params (M): 256 Validation mIoU: 54.1 |
| semantic-segmentation-on-ade20k | InternImage-H | GFLOPs: 4635 Params (M): 1310 Validation mIoU: 62.9 |
| semantic-segmentation-on-ade20k | InternImage-XL | GFLOPs: 3142 Params (M): 368 Validation mIoU: 55.3 |
| semantic-segmentation-on-ade20k | InternImage-S | GFLOPs: 1017 Params (M): 80 Validation mIoU: 50.9 |
| semantic-segmentation-on-ade20k | InternImage-H (M3I Pre-training) | Params (M): 1310 |
| semantic-segmentation-on-ade20k | InternImage-B | GFLOPs: 1185 Params (M): 128 Validation mIoU: 51.3 |
| semantic-segmentation-on-ade20k | InternImage-T | GFLOPs: 944 Params (M): 59 Validation mIoU: 48.1 |
| semantic-segmentation-on-cityscapes | InternImage-H | Mean IoU (class): 86.1% |
| semantic-segmentation-on-cityscapes-val | InternImage-H | mIoU: 87.0 |
| semantic-segmentation-on-cityscapes-val | InternImage-XL | mIoU: 86.4 |
| semantic-segmentation-on-pascal-context | InternImage-H | mIoU: 70.3 |
| semantic-segmentation-on-replica | InternImage | mIoU: 38.4 |