3 months ago

MogaNet: Multi-order Gated Aggregation Network

Abstract

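By making the context of the convolution kernel as global as possible, modern convolutional networks (ConvNets) have shown great potential in computer vision tasks. However, recent work on multi-order game-theoretic interactions within deep neural networks (DNNs) reveals a representation bottleneck in modern ConvNets: expressive interactions are not encoded effectively as the kernel size increases. To address this, the paper proposes a new family of modern ConvNets, named MogaNet, for discriminative visual representation learning in pure ConvNet-based models with a favorable trade-off between complexity and performance. MogaNet packs conceptually simple yet effective convolutions and gated aggregation into a compact module that gathers discriminative features efficiently and contextualizes them adaptively. MogaNet shows strong scalability, parameter efficiency, and competitive performance on ImageNet and a range of downstream vision benchmarks, including COCO object detection, ADE20K semantic segmentation, 2D and 3D human pose estimation, and video prediction. Notably, MogaNet reaches 80.0% and 87.8% accuracy on ImageNet-1K with 5.2M and 181M parameters, outperforming ParC-Net and ConvNeXt-L while using 59% fewer floating-point operations (FLOPs) and 17M fewer parameters, respectively. The source code is available at https://github.com/Westlake-AI/MogaNet.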
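
The abstract only names the mechanism, so the following is a toy 1D illustration of what "gated aggregation" can look like, not the authors' module. A context branch averages depthwise-style convolutions at several dilation rates, a gate branch is computed from the input, and the two are multiplied elementwise. The kernel weights, the dilation rates (1, 2, 3), the SiLU activation, and every function name here are assumptions chosen for illustration.

```python
import math


def silu(v: float) -> float:
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def dilated_conv1d(signal, kernel, dilation):
    """Same-length 1D cross-correlation with zero padding and a dilation rate."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + (k - half) * dilation
            if 0 <= j < len(signal):  # positions outside the signal count as zero
                acc += w * signal[j]
        out.append(acc)
    return out


def gated_aggregation(signal, kernel, dilations=(1, 2, 3), gate_weight=1.0):
    """Average context over several dilation rates, then modulate it with a gate."""
    branches = [dilated_conv1d(signal, kernel, d) for d in dilations]
    context = [sum(vals) / len(vals) for vals in zip(*branches)]
    gate = [silu(gate_weight * x) for x in signal]  # hypothetical 1x1 "gate" branch
    return [g * silu(c) for g, c in zip(gate, context)]


features = [0.0, 1.0, 0.0, 0.0, 2.0, 0.0, 0.0, 1.0]
smoothing_kernel = [0.25, 0.5, 0.25]  # illustrative weights, not learned
output = gated_aggregation(features, smoothing_kernel)
print(output)
```

Because the gate is computed from the input itself, positions where the input is zero stay zero in the output regardless of how much context the convolutions gather there. In a real network both branches would use learned weights over many channels.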

Benchmarks

image-classification-on-imagenet

| Method | GFLOPs | Number of params | Top 1 Accuracy |
| --- | --- | --- | --- |
| MogaNet-XT (256res) | 1.04 | 3M | 77.2% |
| MogaNet-L | 15.9 | 83M | 84.7% |
| MogaNet-S | 5 | 25M | 83.4% |
| MogaNet-T (256res) | 1.44 | 5.2M | 80% |
| MogaNet-B | 9.9 | 44M | 84.3% |
| MogaNet-XL (384res) | 102 | 181M | 87.8% |
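
To make the scaling behaviour in the ImageNet rows above easier to read, the short sketch below sorts the variants by parameter count and prints the top-1 gain for each step up in size. The numbers are copied from the listing; the dictionary layout and variable names are illustrative. Note that the XT, T, and XL entries use different input resolutions (256 and 384), so the steps are not a strictly like-for-like comparison.

```python
# (GFLOPs, params in millions, top-1 %) copied from the ImageNet rows above
imagenet = {
    "MogaNet-XT (256res)": (1.04, 3.0, 77.2),
    "MogaNet-T (256res)": (1.44, 5.2, 80.0),
    "MogaNet-S": (5.0, 25.0, 83.4),
    "MogaNet-B": (9.9, 44.0, 84.3),
    "MogaNet-L": (15.9, 83.0, 84.7),
    "MogaNet-XL (384res)": (102.0, 181.0, 87.8),
}

# Order the variants from smallest to largest by parameter count.
by_size = sorted(imagenet.items(), key=lambda item: item[1][1])

steps = []
for (small, small_stats), (large, large_stats) in zip(by_size, by_size[1:]):
    gain = round(large_stats[2] - small_stats[2], 1)          # top-1 points gained
    extra_params = round(large_stats[1] - small_stats[1], 1)  # millions of params added
    steps.append((small, large, gain, extra_params))
    print(f"{small} -> {large}: +{gain} top-1 for +{extra_params}M params")
```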

instance-segmentation-on-coco

| Method | mask AP |
| --- | --- |
| MogaNet-B (Cascade Mask R-CNN) | 46 |
| MogaNet-T | 35.8 |
| MogaNet-B (Mask R-CNN 1x) | 43.2 |
| MogaNet-L (Cascade Mask R-CNN) | 46.1 |
| MogaNet-S (Mask R-CNN 1x) | 42.2 |
| MogaNet-S (Cascade Mask R-CNN) | 45.1 |
| MogaNet-L (Mask R-CNN 1x) | 44.1 |
| MogaNet-XT | 37.6 |
| MogaNet-XL (Cascade Mask R-CNN) | 48.8 |
| MogaNet-T (Mask R-CNN 1x) | 39.1 |

instance-segmentation-on-coco-val2017

| Method | AP50 | AP75 |
| --- | --- | --- |
| MogaNet-S (256x192) | 90.7 | 82.8 |

object-detection-on-coco-2017-val

| Method | AP |
| --- | --- |
| MogaNet-XL (Cascade Mask R-CNN) | 56.2 |
| MogaNet-S (RetinaNet 1x) | 45.8 |
| MogaNet-B (Cascade Mask R-CNN) | 52.6 |
| MogaNet-L (Mask R-CNN 1x) | 49.4 |
| MogaNet-S (Mask R-CNN 1x) | 46.7 |
| MogaNet-B (RetinaNet 1x) | 47.7 |
| MogaNet-L (Cascade Mask R-CNN) | 53.3 |
| MogaNet-XT (RetinaNet 1x) | 39.7 |
| MogaNet-L (RetinaNet 1x) | 48.7 |
| MogaNet-XT (Mask R-CNN 1x) | 40.7 |
| MogaNet-T (Mask R-CNN 1x) | 42.6 |
| MogaNet-B (Mask R-CNN 1x) | 47.9 |
| MogaNet-T (RetinaNet 1x) | 41.4 |
| MogaNet-S (Cascade Mask R-CNN) | 51.6 |

pose-estimation-on-coco-val2017

| Method | AP | AP50 | AP75 | AR |
| --- | --- | --- | --- | --- |
| MogaNet-S (256x192) | 74.9 | - | - | 80.1 |
| MogaNet-T (256x192) | 73.2 | 90.1 | 81 | 78.8 |
| MogaNet-B (384x288) | 77.3 | 91.4 | 84 | 82.2 |
| MogaNet-S (384x288) | 76.4 | 91 | 83.3 | 81.4 |

semantic-segmentation-on-ade20k

| Method | GFLOPs (512 x 512) | Validation mIoU |
| --- | --- | --- |
| MogaNet-B (UperNet) | 1050 | 50.1 |
| MogaNet-L (UperNet) | 1176 | 50.9 |
| MogaNet-S (Semantic FPN) | 189 | 47.7 |
| MogaNet-S (UperNet) | 946 | 49.2 |
| MogaNet-XL (UperNet) | - | 54 |

video-prediction-on-moving-mnist

| Method | MAE | MSE | SSIM |
| --- | --- | --- | --- |
| VAN (SimVP 10x) | 53.57 | 16.21 | 0.9646 |
| Swin (SimVP 10x) | 59.84 | 19.11 | - |
| ConvMixer (SimVP 10x) | 67.37 | 22.3 | - |
| Uniformer (SimVP 10x) | 57.52 | 18.01 | - |
| MLP-Mixer (SimVP 10x) | 59.86 | 18.85 | - |
| ViT (SimVP 10x) | 61.65 | 19.74 | 0.9539 |
| MogaNet (SimVP 10x) | 51.84 | 15.67 | 0.9661 |
| ConvNeXt (SimVP 10x) | 55.76 | 17.58 | 0.9617 |
| HorNet (SimVP 10x) | 55.7 | 17.4 | 0.9624 |
| Poolformer (SimVP 10x) | 64.31 | 20.96 | - |
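
The Moving MNIST rows above are listed in no particular order, so a small sketch like the following can rank the SimVP backbones by MSE and show how much lower the best entry is than each of the others. The values are copied from the listing; the dictionary layout and variable names are illustrative, and lower is better for both MAE and MSE.

```python
# (MAE, MSE) on Moving MNIST, copied from the benchmark listing above
moving_mnist = {
    "VAN": (53.57, 16.21),
    "Swin": (59.84, 19.11),
    "ConvMixer": (67.37, 22.3),
    "Uniformer": (57.52, 18.01),
    "MLP-Mixer": (59.86, 18.85),
    "ViT": (61.65, 19.74),
    "MogaNet": (51.84, 15.67),
    "ConvNeXt": (55.76, 17.58),
    "HorNet": (55.7, 17.4),
    "Poolformer": (64.31, 20.96),
}

# Rank by MSE, lowest (best) first.
ranked = sorted(moving_mnist.items(), key=lambda item: item[1][1])
best_name, (best_mae, best_mse) = ranked[0]

for name, (mae, mse) in ranked:
    # How far below this method's MSE the best MSE sits, in percent.
    reduction = (mse - best_mse) / mse * 100
    print(f"{name:<11} MSE={mse:6.2f}  MAE={mae:6.2f}  best MSE is {reduction:4.1f}% lower")
```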
