
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions

Abstract

Compared with the remarkable progress of large-scale vision transformers (ViTs) in recent years, large-scale models based on convolutional neural networks (CNNs) are still at an early stage. This work presents a new large-scale CNN-based foundation model, InternImage, which, like ViTs, gains performance from increased parameters and training data. Unlike recent CNNs that focus on large dense kernels, InternImage takes deformable convolution as its core operator, so the model not only has the large effective receptive field required by downstream tasks such as object detection and segmentation, but also performs adaptive spatial aggregation conditioned on the input and task information. As a result, InternImage relaxes the strict inductive bias of traditional CNNs and makes it possible to learn stronger, more robust representations from massive data at large parameter scales, on par with ViTs. The effectiveness of the model is verified on challenging benchmarks including ImageNet, COCO, and ADE20K. Notably, InternImage-H sets a new record of 65.4 mAP on COCO test-dev and reaches 62.9 mIoU on ADE20K, surpassing current leading CNNs and ViTs. The code will be released at https://github.com/OpenGVLab/InternImage.
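The adaptive spatial aggregation described above can be sketched as per-location deformable sampling: each of the K kernel points is shifted by a learned offset and rescaled by a learned modulation scalar before being aggregated. Below is a minimal single-channel NumPy sketch of this idea (illustrative only, not the paper's optimized DCNv3 operator; the function names are our own):

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Bilinearly sample feat (H, W) at fractional coordinates (y, x);
    locations outside the map contribute 0 (zero padding)."""
    H, W = feat.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    val = 0.0
    for dy in (0, 1):
        for dx in (0, 1):
            yy, xx = y0 + dy, x0 + dx
            if 0 <= yy < H and 0 <= xx < W:
                # weight decays linearly with distance to the corner
                val += (1 - abs(y - yy)) * (1 - abs(x - xx)) * feat[yy, xx]
    return val

def deform_conv_point(feat, weight, offsets, modulation, py, px):
    """Deformable-convolution response at one output location (py, px).

    feat:       (H, W) single-channel input
    weight:     (9,) kernel weights, one per sampling point
    offsets:    (9, 2) learned (dy, dx) offsets for this location
    modulation: (9,) learned per-point scalars in [0, 1]

    The nine base sampling points form a 3x3 grid around (py, px);
    each point is shifted by its offset and rescaled by its modulation.
    """
    base = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    out = 0.0
    for k, (dy, dx) in enumerate(base):
        y = py + dy + offsets[k, 0]
        x = px + dx + offsets[k, 1]
        out += weight[k] * modulation[k] * bilinear_sample(feat, y, x)
    return out

# With zero offsets and unit modulation this reduces to a plain 3x3 conv:
feat = np.arange(25, dtype=float).reshape(5, 5)
print(deform_conv_point(feat, np.ones(9) / 9.0,
                        np.zeros((9, 2)), np.ones(9), 2, 2))  # 12.0
```

Because the offsets are fractional and produced by the network itself, the sampling pattern (and hence the receptive field) can stretch or contract per pixel, which is what lets the operator mimic the data-dependent aggregation of attention while keeping convolution's efficiency.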

Code Repositories

opengvlab/internimage
Official · PyTorch · Mentioned on GitHub

chenller/mmseg-extension
PyTorch · Mentioned on GitHub

Benchmarks

| Benchmark | Method | Metrics |
|---|---|---|
| 2d-object-detection-on-bdd100k-val | InternImage-H | mAP: 38.8 |
| image-classification-on-imagenet | InternImage-T | GFLOPs: 5, Params: 30M, Top-1: 83.5% |
| image-classification-on-imagenet | InternImage-S | GFLOPs: 8, Params: 50M, Top-1: 84.2% |
| image-classification-on-imagenet | InternImage-B | GFLOPs: 16, Params: 97M, Top-1: 84.9% |
| image-classification-on-imagenet | InternImage-L | GFLOPs: 108, Params: 223M, Top-1: 87.7% |
| image-classification-on-imagenet | InternImage-XL | GFLOPs: 163, Params: 335M, Top-1: 88.0% |
| image-classification-on-imagenet | InternImage-H | GFLOPs: 1478, Params: 1080M, Top-1: 89.6% |
| image-classification-on-imagenet | InternImage-DCNv3-G (M3I Pre-training) | Params: 3000M, Top-1: 90.1% |
| image-classification-on-inaturalist-2018 | InternImage-H | Top-1: 92.6% |
| image-classification-on-places205 | InternImage-H | Top-1: 71.7% |
| image-classification-on-places365 | InternImage-H (CNN) | Top-1: 61.2% |
| instance-segmentation-on-coco | InternImage-H | AP50: 80.8, AP75: 62.2, APL: 70.3, APM: 58.9, APS: 41.0 |
| instance-segmentation-on-coco-minival | InternImage-T | GFLOPs: 270, Params: 49M, box AP: 49.1, mask AP: 43.7 |
| instance-segmentation-on-coco-minival | InternImage-S | GFLOPs: 340, Params: 69M, box AP: 49.7, mask AP: 44.5 |
| instance-segmentation-on-coco-minival | InternImage-B | GFLOPs: 501, Params: 115M |
| instance-segmentation-on-coco-minival | InternImage-L | GFLOPs: 1399, Params: 277M, box AP: 56.1, mask AP: 48.5 |
| instance-segmentation-on-coco-minival | InternImage-XL | GFLOPs: 1782, Params: 387M, mask AP: 48.8 |
| instance-segmentation-on-coco-minival | InternImage-H | AP50: 80.1, AP75: 61.5, APL: 74.4, APM: 58.4, APS: 37.9, mask AP: 55.4 |
| object-detection-on-coco | InternImage-XL | Params: 602M, box mAP: 64.3 |
| object-detection-on-coco | InternImage-H (M3I Pre-training) | Params: 2180M |
| object-detection-on-coco-minival | InternImage-XL | box AP: 64.2 |
| object-detection-on-coco-minival | InternImage-H | box AP: 65.0 |
| object-detection-on-coco-o | InternImage-L (Cascade Mask R-CNN) | Average mAP: 37.0, Effective Robustness: 11.72 |
| object-detection-on-crowdhuman-full-body | InternImage-H | AP: 97.2 |
| object-detection-on-lvis-v1-0-minival | InternImage-H | box AP: 65.8 |
| object-detection-on-lvis-v1-0-val | InternImage-H | box AP: 63.2 |
| object-detection-on-openimages-v6 | InternImage-H | box AP: 74.1 |
| object-detection-on-pascal-voc-2012 | InternImage-H | mAP: 97.2 |
| semantic-segmentation-on-ade20k | InternImage-T | GFLOPs: 944, Params: 59M, Validation mIoU: 48.1 |
| semantic-segmentation-on-ade20k | InternImage-S | GFLOPs: 1017, Params: 80M, Validation mIoU: 50.9 |
| semantic-segmentation-on-ade20k | InternImage-B | GFLOPs: 1185, Params: 128M, Validation mIoU: 51.3 |
| semantic-segmentation-on-ade20k | InternImage-L | GFLOPs: 2526, Params: 256M, Validation mIoU: 54.1 |
| semantic-segmentation-on-ade20k | InternImage-XL | GFLOPs: 3142, Params: 368M, Validation mIoU: 55.3 |
| semantic-segmentation-on-ade20k | InternImage-H | GFLOPs: 4635, Params: 1310M, Validation mIoU: 62.9 |
| semantic-segmentation-on-ade20k | InternImage-H (M3I Pre-training) | Params: 1310M |
| semantic-segmentation-on-cityscapes | InternImage-H | Mean IoU (class): 86.1% |
| semantic-segmentation-on-cityscapes-val | InternImage-XL | mIoU: 86.4 |
| semantic-segmentation-on-cityscapes-val | InternImage-H | mIoU: 87.0 |
| semantic-segmentation-on-pascal-context | InternImage-H | mIoU: 70.3 |
| semantic-segmentation-on-replica | InternImage | mIoU: 38.4 |

