MogaNet: Multi-order Gated Aggregation Network
Siyuan Li, Zedong Wang, Zicheng Liu, Cheng Tan, Haitao Lin, Di Wu, Zhiyuan Chen, Jiangbin Zheng, Stan Z. Li

Abstract
By contextualizing the kernel as globally as possible, modern ConvNets have shown great potential in computer vision tasks. However, recent progress on multi-order game-theoretic interaction within deep neural networks (DNNs) reveals a representation bottleneck in modern ConvNets: expressive interactions are not effectively encoded as the kernel size increases. To tackle this challenge, we propose a new family of modern ConvNets, dubbed MogaNet, for discriminative visual representation learning in pure ConvNet-based models with favorable complexity-performance trade-offs. MogaNet encapsulates conceptually simple yet effective convolutions and gated aggregation into a compact module, in which discriminative features are efficiently gathered and contextualized adaptively. MogaNet exhibits great scalability, impressive parameter efficiency, and competitive performance compared to state-of-the-art ViTs and ConvNets on ImageNet and various downstream vision benchmarks, including COCO object detection, ADE20K semantic segmentation, 2D and 3D human pose estimation, and video prediction. Notably, MogaNet attains 80.0% and 87.8% top-1 accuracy with 5.2M and 181M parameters on ImageNet-1K, outperforming ParC-Net and ConvNeXt-L while saving 59% FLOPs and 17M parameters, respectively. The source code is available at https://github.com/Westlake-AI/MogaNet.
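The gist of the design, depthwise convolutions gathering multi-order context that a learned gate then re-weights, can be sketched in a few lines of PyTorch. This is a minimal illustration, not the authors' implementation: the module name, kernel sizes, dilation rates, and branch layout below are assumptions; see https://github.com/Westlake-AI/MogaNet for the reference code.

```python
import torch
import torch.nn as nn

class GatedAggregation(nn.Module):
    """Illustrative sketch of a gated-aggregation block (hypothetical layout)."""

    def __init__(self, dim: int):
        super().__init__()
        # Context branch: depthwise convolutions at increasing dilation rates,
        # intended to mix local, regional, and near-global interactions.
        self.dw_local = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)
        self.dw_region = nn.Conv2d(dim, dim, 5, padding=4, dilation=2, groups=dim)
        self.dw_global = nn.Conv2d(dim, dim, 7, padding=9, dilation=3, groups=dim)
        # Gating branch: a pointwise projection with a smooth nonlinearity.
        self.gate = nn.Conv2d(dim, dim, 1)
        self.act = nn.SiLU()
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Aggregate multi-order context, then adaptively re-weight it
        # with the gate before the output projection.
        context = self.dw_local(x) + self.dw_region(x) + self.dw_global(x)
        return self.proj(self.act(self.gate(x)) * context)

block = GatedAggregation(64)
print(block(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```

All three context convolutions are depthwise (groups equal to channels), which keeps the block's cost close to that of a single large-kernel convolution while the gate supplies the adaptive re-weighting described in the abstract.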
Code Repositories
https://github.com/Westlake-AI/MogaNet (official)
Benchmarks
| Benchmark | Model | Metrics |
|---|---|---|
| Image Classification on ImageNet | MogaNet-XT (256 res.) | GFLOPs: 1.04 · Params: 3M · Top-1: 77.2% |
| Image Classification on ImageNet | MogaNet-T (256 res.) | GFLOPs: 1.44 · Params: 5.2M · Top-1: 80.0% |
| Image Classification on ImageNet | MogaNet-S | GFLOPs: 5.0 · Params: 25M · Top-1: 83.4% |
| Image Classification on ImageNet | MogaNet-B | GFLOPs: 9.9 · Params: 44M · Top-1: 84.3% |
| Image Classification on ImageNet | MogaNet-L | GFLOPs: 15.9 · Params: 83M · Top-1: 84.7% |
| Image Classification on ImageNet | MogaNet-XL (384 res.) | GFLOPs: 102 · Params: 181M · Top-1: 87.8% |
| Instance Segmentation on COCO | MogaNet-T | Mask AP: 35.8 |
| Instance Segmentation on COCO | MogaNet-XT | Mask AP: 37.6 |
| Instance Segmentation on COCO | MogaNet-T (Mask R-CNN 1x) | Mask AP: 39.1 |
| Instance Segmentation on COCO | MogaNet-S (Mask R-CNN 1x) | Mask AP: 42.2 |
| Instance Segmentation on COCO | MogaNet-B (Mask R-CNN 1x) | Mask AP: 43.2 |
| Instance Segmentation on COCO | MogaNet-L (Mask R-CNN 1x) | Mask AP: 44.1 |
| Instance Segmentation on COCO | MogaNet-S (Cascade Mask R-CNN) | Mask AP: 45.1 |
| Instance Segmentation on COCO | MogaNet-B (Cascade Mask R-CNN) | Mask AP: 46.0 |
| Instance Segmentation on COCO | MogaNet-L (Cascade Mask R-CNN) | Mask AP: 46.1 |
| Instance Segmentation on COCO | MogaNet-XL (Cascade Mask R-CNN) | Mask AP: 48.8 |
| Instance Segmentation on COCO val2017 | MogaNet-S (256x192) | AP50: 90.7 · AP75: 82.8 |
| Object Detection on COCO 2017 val | MogaNet-XT (RetinaNet 1x) | AP: 39.7 |
| Object Detection on COCO 2017 val | MogaNet-T (RetinaNet 1x) | AP: 41.4 |
| Object Detection on COCO 2017 val | MogaNet-S (RetinaNet 1x) | AP: 45.8 |
| Object Detection on COCO 2017 val | MogaNet-B (RetinaNet 1x) | AP: 47.7 |
| Object Detection on COCO 2017 val | MogaNet-L (RetinaNet 1x) | AP: 48.7 |
| Object Detection on COCO 2017 val | MogaNet-XT (Mask R-CNN 1x) | AP: 40.7 |
| Object Detection on COCO 2017 val | MogaNet-T (Mask R-CNN 1x) | AP: 42.6 |
| Object Detection on COCO 2017 val | MogaNet-S (Mask R-CNN 1x) | AP: 46.7 |
| Object Detection on COCO 2017 val | MogaNet-B (Mask R-CNN 1x) | AP: 47.9 |
| Object Detection on COCO 2017 val | MogaNet-L (Mask R-CNN 1x) | AP: 49.4 |
| Object Detection on COCO 2017 val | MogaNet-S (Cascade Mask R-CNN) | AP: 51.6 |
| Object Detection on COCO 2017 val | MogaNet-B (Cascade Mask R-CNN) | AP: 52.6 |
| Object Detection on COCO 2017 val | MogaNet-L (Cascade Mask R-CNN) | AP: 53.3 |
| Object Detection on COCO 2017 val | MogaNet-XL (Cascade Mask R-CNN) | AP: 56.2 |
| Pose Estimation on COCO val2017 | MogaNet-T (256x192) | AP: 73.2 · AP50: 90.1 · AP75: 81.0 · AR: 78.8 |
| Pose Estimation on COCO val2017 | MogaNet-S (256x192) | AP: 74.9 · AR: 80.1 |
| Pose Estimation on COCO val2017 | MogaNet-S (384x288) | AP: 76.4 · AP50: 91.0 · AP75: 83.3 · AR: 81.4 |
| Pose Estimation on COCO val2017 | MogaNet-B (384x288) | AP: 77.3 · AP50: 91.4 · AP75: 84.0 · AR: 82.2 |
| Semantic Segmentation on ADE20K | MogaNet-S (Semantic FPN) | GFLOPs (512x512): 189 · Val mIoU: 47.7 |
| Semantic Segmentation on ADE20K | MogaNet-S (UperNet) | GFLOPs (512x512): 946 · Val mIoU: 49.2 |
| Semantic Segmentation on ADE20K | MogaNet-B (UperNet) | GFLOPs (512x512): 1050 · Val mIoU: 50.1 |
| Semantic Segmentation on ADE20K | MogaNet-L (UperNet) | GFLOPs (512x512): 1176 · Val mIoU: 50.9 |
| Semantic Segmentation on ADE20K | MogaNet-XL (UperNet) | Val mIoU: 54.0 |
| Video Prediction on Moving MNIST | MogaNet (SimVP 10x) | MSE: 15.67 · MAE: 51.84 · SSIM: 0.9661 |
| Video Prediction on Moving MNIST | VAN (SimVP 10x) | MSE: 16.21 · MAE: 53.57 · SSIM: 0.9646 |
| Video Prediction on Moving MNIST | HorNet (SimVP 10x) | MSE: 17.40 · MAE: 55.70 · SSIM: 0.9624 |
| Video Prediction on Moving MNIST | ConvNeXt (SimVP 10x) | MSE: 17.58 · MAE: 55.76 · SSIM: 0.9617 |
| Video Prediction on Moving MNIST | Uniformer (SimVP 10x) | MSE: 18.01 · MAE: 57.52 |
| Video Prediction on Moving MNIST | MLP-Mixer (SimVP 10x) | MSE: 18.85 · MAE: 59.86 |
| Video Prediction on Moving MNIST | Swin (SimVP 10x) | MSE: 19.11 · MAE: 59.84 |
| Video Prediction on Moving MNIST | ViT (SimVP 10x) | MSE: 19.74 · MAE: 61.65 · SSIM: 0.9539 |
| Video Prediction on Moving MNIST | Poolformer (SimVP 10x) | MSE: 20.96 · MAE: 64.31 |
| Video Prediction on Moving MNIST | ConvMixer (SimVP 10x) | MSE: 22.30 · MAE: 67.37 |
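For the Moving MNIST rows above, a common SimVP-style convention is to report MSE and MAE as per-frame sums over pixels, averaged across frames and sequences, with SSIM averaged over frames. The snippet below assumes that convention as an illustration; the benchmark's own evaluation code is the authoritative definition.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def moving_mnist_metrics(pred: np.ndarray, true: np.ndarray):
    """pred, true: arrays in [0, 1] with shape (N, T, H, W)."""
    diff = pred - true
    # Squared/absolute error summed over each frame's pixels,
    # then averaged over all N * T frames (assumed convention).
    mse = float(np.mean(np.sum(diff ** 2, axis=(2, 3))))
    mae = float(np.mean(np.sum(np.abs(diff), axis=(2, 3))))
    # SSIM computed per frame and averaged.
    scores = [ssim(true[n, t], pred[n, t], data_range=1.0)
              for n in range(pred.shape[0]) for t in range(pred.shape[1])]
    return mse, mae, float(np.mean(scores))
```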