
Abstract
By contextualizing the convolution kernel as globally as possible within its receptive field, modern convolutional neural networks (ConvNets) have shown great potential in computer vision tasks. However, recent work on multi-order game-theoretic interactions in deep neural networks (DNNs) reveals a representation bottleneck of modern ConvNets: expressive interactions are not effectively encoded as the kernel size increases. To tackle this challenge, this paper proposes a new family of modern ConvNets, named MogaNet, for discriminative visual representation learning with a pure ConvNet architecture and a favorable trade-off between model complexity and performance. MogaNet encapsulates conceptually simple yet effective convolutions and a gated aggregation mechanism into a compact module, which efficiently gathers and adaptively contextualizes discriminative features. MogaNet exhibits excellent scalability, parameter efficiency, and competitive performance on ImageNet and a range of downstream vision benchmarks, including COCO object detection, ADE20K semantic segmentation, 2D and 3D human pose estimation, and video prediction. Notably, MogaNet reaches 80.0% and 87.8% top-1 accuracy on ImageNet-1K with 5.2M and 181M parameters, respectively, outperforming ParC-Net and ConvNeXt-L while using 59% fewer FLOPs and 17M fewer parameters. The source code is available at https://github.com/Westlake-AI/MogaNet.
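The abstract's key design, combining plain convolutions with gated aggregation in one compact module, can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the kernel sizes, dilation, activation, and branch layout are all assumptions, and the official code at https://github.com/Westlake-AI/MogaNet should be consulted for the real block.

```python
import torch
import torch.nn as nn

class GatedAggregationSketch(nn.Module):
    """A simplified sketch of convolution + gated aggregation.

    NOT the official MogaNet block: kernel sizes, dilation, and the
    branch layout are illustrative assumptions only.
    """

    def __init__(self, dim: int):
        super().__init__()
        # Gating branch: a 1x1 conv produces per-location gates.
        self.gate = nn.Conv2d(dim, dim, kernel_size=1)
        # Context branch: stacked depthwise convolutions gather local
        # and longer-range context (sizes/dilations assumed).
        self.dw_small = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)
        self.dw_large = nn.Conv2d(dim, dim, 7, padding=9, dilation=3, groups=dim)
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Aggregate multi-scale context, then modulate it with gates.
        ctx = self.dw_large(self.dw_small(x))
        out = self.act(self.gate(x)) * self.act(ctx)
        return self.proj(out)

# Shape check: the module is spatially size-preserving.
y = GatedAggregationSketch(64)(torch.randn(1, 64, 56, 56))
assert y.shape == (1, 64, 56, 56)
```

The gating branch lets each location rescale the aggregated context, which is the mechanism the paper credits for adaptively contextualizing discriminative features.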
Code Repositories
| Repository | Framework | Notes |
|---|---|---|
| shanglianlm0525/CvPytorch | PyTorch | |
| Westlake-AI/openmixup | PyTorch | Official; mentioned in GitHub |
| chengtan9907/simvpv2 | PyTorch | Mentioned in GitHub |
| chengtan9907/OpenSTL | PyTorch | Official; mentioned in GitHub |
| Westlake-AI/MogaNet | PyTorch | Official; mentioned in GitHub |
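Of the repositories above, Westlake-AI/MogaNet hosts the reference image-classification models. Below is a minimal usage sketch assuming the repo's model definitions are registered with timm; the module name `moganet` and the registry name `moganet_tiny` are assumptions, so check the repository for the identifiers it actually exposes.

```python
import torch
import timm

# The official repo defines its architectures with timm's
# @register_model decorator; importing its model file (assumed here
# to be importable as `moganet`) makes them visible to timm.
import moganet  # noqa: F401  -- assumed module name from the repo

# 'moganet_tiny' is an assumed registry name.
model = timm.create_model('moganet_tiny', num_classes=1000)
model.eval()

with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))  # standard ImageNet input
print(logits.argmax(dim=1))  # predicted class index
```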
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| image-classification-on-imagenet | MogaNet-XT (256res) | GFLOPs: 1.04 Number of params: 3M Top 1 Accuracy: 77.2% |
| image-classification-on-imagenet | MogaNet-L | GFLOPs: 15.9 Number of params: 83M Top 1 Accuracy: 84.7% |
| image-classification-on-imagenet | MogaNet-S | GFLOPs: 5.0 Number of params: 25M Top 1 Accuracy: 83.4% |
| image-classification-on-imagenet | MogaNet-T (256res) | GFLOPs: 1.44 Number of params: 5.2M Top 1 Accuracy: 80.0% |
| image-classification-on-imagenet | MogaNet-B | GFLOPs: 9.9 Number of params: 44M Top 1 Accuracy: 84.3% |
| image-classification-on-imagenet | MogaNet-XL (384res) | GFLOPs: 102 Number of params: 181M Top 1 Accuracy: 87.8% |
| instance-segmentation-on-coco | MogaNet-B (Cascade Mask R-CNN) | mask AP: 46.0 |
| instance-segmentation-on-coco | MogaNet-T | mask AP: 35.8 |
| instance-segmentation-on-coco | MogaNet-B (Mask R-CNN 1x) | mask AP: 43.2 |
| instance-segmentation-on-coco | MogaNet-L (Cascade Mask R-CNN) | mask AP: 46.1 |
| instance-segmentation-on-coco | MogaNet-S (Mask R-CNN 1x) | mask AP: 42.2 |
| instance-segmentation-on-coco | MogaNet-S (Cascade Mask R-CNN) | mask AP: 45.1 |
| instance-segmentation-on-coco | MogaNet-L (Mask R-CNN 1x) | mask AP: 44.1 |
| instance-segmentation-on-coco | MogaNet-XT | mask AP: 37.6 |
| instance-segmentation-on-coco | MogaNet-XL (Cascade Mask R-CNN) | mask AP: 48.8 |
| instance-segmentation-on-coco | MogaNet-T (Mask R-CNN 1x) | mask AP: 39.1 |
| object-detection-on-coco-2017-val | MogaNet-XL (Cascade Mask R-CNN) | AP: 56.2 |
| object-detection-on-coco-2017-val | MogaNet-S (RetinaNet 1x) | AP: 45.8 |
| object-detection-on-coco-2017-val | MogaNet-B (Cascade Mask R-CNN) | AP: 52.6 |
| object-detection-on-coco-2017-val | MogaNet-L (Mask R-CNN 1x) | AP: 49.4 |
| object-detection-on-coco-2017-val | MogaNet-S (Mask R-CNN 1x) | AP: 46.7 |
| object-detection-on-coco-2017-val | MogaNet-B (RetinaNet 1x) | AP: 47.7 |
| object-detection-on-coco-2017-val | MogaNet-L (Cascade Mask R-CNN) | AP: 53.3 |
| object-detection-on-coco-2017-val | MogaNet-XT (RetinaNet 1x) | AP: 39.7 |
| object-detection-on-coco-2017-val | MogaNet-L (RetinaNet 1x) | AP: 48.7 |
| object-detection-on-coco-2017-val | MogaNet-XT (Mask R-CNN 1x) | AP: 40.7 |
| object-detection-on-coco-2017-val | MogaNet-T (Mask R-CNN 1x) | AP: 42.6 |
| object-detection-on-coco-2017-val | MogaNet-B (Mask R-CNN 1x) | AP: 47.9 |
| object-detection-on-coco-2017-val | MogaNet-T (RetinaNet 1x) | AP: 41.4 |
| object-detection-on-coco-2017-val | MogaNet-S (Cascade Mask R-CNN) | AP: 51.6 |
| pose-estimation-on-coco-val2017 | MogaNet-S (256x192) | AP: 74.9 AP50: 90.7 AP75: 82.8 AR: 80.1 |
| pose-estimation-on-coco-val2017 | MogaNet-T (256x192) | AP: 73.2 AP50: 90.1 AP75: 81.0 AR: 78.8 |
| pose-estimation-on-coco-val2017 | MogaNet-B (384x288) | AP: 77.3 AP50: 91.4 AP75: 84.0 AR: 82.2 |
| pose-estimation-on-coco-val2017 | MogaNet-S (384x288) | AP: 76.4 AP50: 91.0 AP75: 83.3 AR: 81.4 |
| semantic-segmentation-on-ade20k | MogaNet-B (UperNet) | GFLOPs (512 x 512): 1050 Validation mIoU: 50.1 |
| semantic-segmentation-on-ade20k | MogaNet-L (UperNet) | GFLOPs (512 x 512): 1176 Validation mIoU: 50.9 |
| semantic-segmentation-on-ade20k | MogaNet-S (Semantic FPN) | GFLOPs (512 x 512): 189 Validation mIoU: 47.7 |
| semantic-segmentation-on-ade20k | MogaNet-S (UperNet) | GFLOPs (512 x 512): 946 Validation mIoU: 49.2 |
| semantic-segmentation-on-ade20k | MogaNet-XL (UperNet) | Validation mIoU: 54.0 |
| video-prediction-on-moving-mnist | VAN (SimVP 10x) | MAE: 53.57 MSE: 16.21 SSIM: 0.9646 |
| video-prediction-on-moving-mnist | Swin (SimVP 10x) | MAE: 59.84 MSE: 19.11 |
| video-prediction-on-moving-mnist | ConvMixer (SimVP 10x) | MAE: 67.37 MSE: 22.3 |
| video-prediction-on-moving-mnist | Uniformer (SimVP 10x) | MAE: 57.52 MSE: 18.01 |
| video-prediction-on-moving-mnist | MLP-Mixer (SimVP 10x) | MAE: 59.86 MSE: 18.85 |
| video-prediction-on-moving-mnist | ViT (SimVP 10x) | MAE: 61.65 MSE: 19.74 SSIM: 0.9539 |
| video-prediction-on-moving-mnist | MogaNet (SimVP 10x) | MAE: 51.84 MSE: 15.67 SSIM: 0.9661 |
| video-prediction-on-moving-mnist | ConvNeXt (SimVP 10x) | MAE: 55.76 MSE: 17.58 SSIM: 0.9617 |
| video-prediction-on-moving-mnist | HorNet (SimVP 10x) | MAE: 55.7 MSE: 17.4 SSIM: 0.9624 |
| video-prediction-on-moving-mnist | Poolformer (SimVP 10x) | MAE: 64.31 MSE: 20.96 |
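The Moving MNIST rows report MSE, MAE, and SSIM under the SimVP training schedule (10x epochs). As a rough sketch of how such metrics can be computed, the snippet below evaluates a predicted sequence; the exact per-pixel reduction (summing versus averaging over pixels) is an assumption here, so consult the OpenSTL evaluation code for the authoritative version.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def evaluate(pred: np.ndarray, true: np.ndarray):
    """pred, true: (frames, H, W) float arrays in [0, 1].

    Sketch only: per-frame error is summed over pixels and then
    averaged over frames, an assumed reduction that yields numbers
    on the scale of the table above.
    """
    mse = float(np.mean(np.sum((pred - true) ** 2, axis=(1, 2))))
    mae = float(np.mean(np.sum(np.abs(pred - true), axis=(1, 2))))
    # SSIM averaged over frames, on the [0, 1] data range.
    s = float(np.mean([ssim(p, t, data_range=1.0) for p, t in zip(pred, true)]))
    return mse, mae, s

# Toy check with random "predictions" for a 10-frame 64x64 sequence.
pred = np.random.rand(10, 64, 64)
true = np.random.rand(10, 64, 64)
print(evaluate(pred, true))
```

Lower MSE/MAE and higher SSIM are better; among the listed backbones, MogaNet attains the best value on all three metrics.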