Carlos Riquelme, Joan Puigcerver, Basil Mustafa, Maxim Neumann, Rodolphe Jenatton, André Susano Pinto, Daniel Keysers, Neil Houlsby

Abstract
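Sparsely-gated Mixture of Experts (MoE) networks have demonstrated excellent scalability in natural language processing. In computer vision, however, almost all high-performing networks are still "dense": every input is processed by every parameter. This paper presents the Vision MoE (V-MoE), a sparse version of the Vision Transformer that is scalable and competitive with the largest dense networks. On image recognition, V-MoE matches the performance of state-of-the-art networks while requiring as little as half the compute at inference time. The authors also extend the routing algorithm so that it can prioritize subsets of each input across the entire batch, which yields adaptive per-image compute and lets V-MoE smoothly trade off performance against compute at test time. Finally, they demonstrate the potential of V-MoE for scaling vision models by training a 15-billion-parameter model that reaches 90.35% accuracy on ImageNet.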
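The routing idea in the abstract can be illustrated with a small, pure-Python sketch. It is not the google-research/vmoe implementation: the function names (`top_k_gates`, `route_batch`), the choice of a token's largest gate weight as its priority score, and the fixed per-expert `capacity` are assumptions made for illustration. Each token picks its top-k experts from softmax router scores; each expert accepts at most `capacity` tokens; with `prioritize=True`, tokens are served in order of decreasing priority across the whole batch instead of in their original order.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_gates(logits, k):
    """Return [(expert_index, gate_weight), ...] for the k largest gates."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)
    return [(e, probs[e]) for e in ranked[:k]]


def route_batch(token_logits, k, capacity, prioritize=True):
    """Assign tokens to experts under a per-expert capacity limit.

    token_logits: one list of router logits per token in the batch.
    Returns (assignments, dropped): assignments[expert] is a list of
    (token_index, gate_weight); dropped lists tokens that no expert took.
    """
    num_experts = len(token_logits[0])
    choices = [top_k_gates(logits, k) for logits in token_logits]

    order = list(range(len(token_logits)))
    if prioritize:
        # Assumed priority score: the token's largest gate weight.
        order.sort(key=lambda t: choices[t][0][1], reverse=True)

    assignments = [[] for _ in range(num_experts)]
    served = set()
    # Fill first choices for all tokens, then second choices, and so on.
    for rank in range(k):
        for t in order:
            expert, weight = choices[t][rank]
            if len(assignments[expert]) < capacity:
                assignments[expert].append((t, weight))
                served.add(t)

    dropped = [t for t in range(len(token_logits)) if t not in served]
    return assignments, dropped


# Toy batch: four tokens, two experts, one slot per expert.
logits = [
    [2.0, 0.0],  # token 0: confident, prefers expert 0
    [0.2, 0.0],  # token 1: weak preference for expert 0
    [3.0, 0.0],  # token 2: most confident, prefers expert 0
    [0.0, 1.0],  # token 3: prefers expert 1
]
_, dropped_prioritized = route_batch(logits, k=1, capacity=1)
_, dropped_in_order = route_batch(logits, k=1, capacity=1, prioritize=False)
print(dropped_prioritized)  # [0, 1]
print(dropped_in_order)     # [1, 2]
```

In this toy batch, in-order routing drops token 2, the token the router is most confident about, only because it arrives late. Prioritized routing keeps it and drops the lower-scoring tokens instead. Lowering `capacity` drops more low-priority tokens, which is the kind of knob the abstract refers to when it describes trading performance for compute at test time.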
Code repository
google-research/vmoe (official, JAX)
Benchmarks
| Benchmark | Method | Metric |
|---|---|---|
| few-shot-image-classification-on-imagenet-1-1 | ViT-H/14 | Top 1 Accuracy: 62.34 |
| few-shot-image-classification-on-imagenet-1-1 | ViT-MoE-15B (Every-2) | Top 1 Accuracy: 68.66 |
| few-shot-image-classification-on-imagenet-1-1 | V-MoE-L/16 (Every-2) | Top 1 Accuracy: 62.41 |
| few-shot-image-classification-on-imagenet-1-1 | V-MoE-H/14 (Last-5) | Top 1 Accuracy: 62.95 |
| few-shot-image-classification-on-imagenet-1-1 | V-MoE-H/14 (Every-2) | Top 1 Accuracy: 63.38 |
| few-shot-image-classification-on-imagenet-10 | ViT-MoE-15B (Every-2) | Top 1 Accuracy: 84.29 |
| few-shot-image-classification-on-imagenet-10 | V-MoE-H/14 (Last-5) | Top 1 Accuracy: 80.1 |
| few-shot-image-classification-on-imagenet-10 | V-MoE-H/14 (Every-2) | Top 1 Accuracy: 80.33 |
| few-shot-image-classification-on-imagenet-10 | ViT-H/14 | Top 1 Accuracy: 79.01 |
| few-shot-image-classification-on-imagenet-5 | V-MoE-H/14 (Every-2) | Top 1 Accuracy: 78.21 |
| few-shot-image-classification-on-imagenet-5 | ViT-MoE-15B (Every-2) | Top 1 Accuracy: 82.78 |
| few-shot-image-classification-on-imagenet-5 | V-MoE-L/16 (Every-2) | Top 1 Accuracy: 77.1 |
| few-shot-image-classification-on-imagenet-5 | V-MoE-H/14 (Last-5) | Top 1 Accuracy: 78.08 |
| few-shot-image-classification-on-imagenet-5 | ViT-H/14 | Top 1 Accuracy: 76.95 |
| image-classification-on-imagenet | V-MoE-H/14 (Every-2) | Number of params: 7200M; Top 1 Accuracy: 88.36% |
| image-classification-on-imagenet | ViT-H/14 | Number of params: 656M; Top 1 Accuracy: 88.08% |
| image-classification-on-imagenet | V-MoE-L/16 (Every-2) | Number of params: 3400M; Top 1 Accuracy: 87.41% |
| image-classification-on-jft-300m | ViT-H/14 | prec@1: 56.68 |
| image-classification-on-jft-300m | V-MoE-H/14 (Every-2) | prec@1: 60.62 |
| image-classification-on-jft-300m | V-MoE-L/16 (Every-2) | prec@1: 57.65 |
| image-classification-on-jft-300m | V-MoE-H/14 (Last-5) | prec@1: 60.12 |