
Abstract
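We present Mobile Video Networks (MoViNets), a family of computation- and memory-efficient video networks that can perform online inference on streaming video. 3D convolutional neural networks (3D CNNs) are accurate at video recognition, but they require large computation and memory budgets and do not support online inference, which makes them difficult to deploy on mobile devices. We therefore propose a three-step approach that substantially reduces the peak memory usage of 3D CNNs while greatly improving their computational efficiency. First, we design a video network search space and apply neural architecture search (NAS) to generate efficient and diverse 3D CNN architectures. Second, we introduce the Stream Buffer technique, which decouples memory requirements from video clip duration, allowing 3D CNNs to process streaming video sequences of arbitrary length with a small, constant memory footprint during both training and inference. Third, we propose a simple ensembling technique that further improves accuracy without sacrificing efficiency. Together, these three progressive techniques allow MoViNets to achieve a state-of-the-art trade-off between accuracy and efficiency on the Kinetics, Moments in Time, and Charades video action recognition datasets. For example, MoViNet-A5-Stream matches the accuracy of X3D-XL on Kinetics 600 while requiring 80% fewer floating-point operations (FLOPs) and 65% less memory. Code will be made available at https://github.com/tensorflow/models/tree/master/official/vision.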
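To make the Stream Buffer idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a simplified setting: a causal, depthwise convolution over the time axis only, applied to per-frame feature vectors. The function name `causal_temporal_conv`, the array shapes, and the chunk size are all hypothetical choices for illustration. The sketch shows that carrying the last K-1 frames between chunks reproduces the full-clip result while the memory held between chunks stays constant, independent of clip length.

```python
import numpy as np


def causal_temporal_conv(frames, kernel, buffer=None):
    """Causal depthwise convolution over the time axis (illustrative only).

    frames: array of shape (T, C), one feature vector per frame.
    kernel: array of shape (K, C), one temporal filter per channel.
    buffer: the last K-1 frames of the previous chunk, or None at stream start
            (treated as zero padding).
    Returns (outputs of shape (T, C), updated buffer of shape (K-1, C)).
    """
    k = kernel.shape[0]
    if buffer is None:
        buffer = np.zeros((k - 1, frames.shape[1]), dtype=frames.dtype)
    # Prepend the cached frames so each output only sees current and past frames.
    padded = np.concatenate([buffer, frames], axis=0)  # shape (T + K - 1, C)
    out = np.stack(
        [(padded[t:t + k] * kernel).sum(axis=0) for t in range(frames.shape[0])]
    )
    # Keep only the last K-1 frames for the next chunk: constant memory.
    new_buffer = padded[padded.shape[0] - (k - 1):]
    return out, new_buffer


rng = np.random.default_rng(0)
clip = rng.standard_normal((16, 4))   # hypothetical 16-frame clip, 4 channels
kernel = rng.standard_normal((3, 4))  # hypothetical temporal kernel, K = 3

# Reference: process the whole clip at once.
full, _ = causal_temporal_conv(clip, kernel)

# Streaming: process 4 frames at a time, carrying the buffer across chunks.
buffer = None
chunks = []
for start in range(0, clip.shape[0], 4):
    out, buffer = causal_temporal_conv(clip[start:start + 4], kernel, buffer)
    chunks.append(out)
streamed = np.concatenate(chunks, axis=0)

print(np.allclose(full, streamed))  # True
```

In this sketch the state carried between chunks has shape (K-1, C) regardless of how many frames have been seen, which is the sense in which memory is decoupled from clip duration.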
Code Repositories
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| action-classification-on-charades | MoViNet-A2 | MAP: 32.5 |
| action-classification-on-charades | MoViNet-A6 | MAP: 63.2 |
| action-classification-on-charades | MoViNet-A4 | MAP: 48.5 |
| action-classification-on-kinetics-400 | MoViNet-A4 | Acc@1: 80.5 Acc@5: 94.5 FLOPs (G) x views: 105x1 |
| action-classification-on-kinetics-400 | MoViNet-A6 | Acc@1: 81.5 FLOPs (G) x views: 386x1 |
| action-classification-on-kinetics-400 | MoViNet-A1 | Acc@1: 72.7 Acc@5: 91.2 FLOPs (G) x views: 6.0x1 |
| action-classification-on-kinetics-400 | MoViNet-A2 | Acc@1: 75.0 Acc@5: 92.3 FLOPs (G) x views: 10.3x1 |
| action-classification-on-kinetics-400 | MoViNet-A5 | Acc@1: 80.9 Acc@5: 94.9 FLOPs (G) x views: 281x1 |
| action-classification-on-kinetics-400 | MoViNet-A3 | Acc@1: 78.2 Acc@5: 93.8 FLOPs (G) x views: 56.9x1 |
| action-classification-on-kinetics-400 | MoViNet-A0 | Acc@1: 65.8 Acc@5: 87.4 FLOPs (G) x views: 2.7x1 |
| action-classification-on-kinetics-600 | MoViNet-A5 | GFLOPs: 281x1 Top-1 Accuracy: 82.7 Top-5 Accuracy: 95.7 |
| action-classification-on-kinetics-600 | MoViNet-A2 | GFLOPs: 10.3x1 Top-1 Accuracy: 77.5 Top-5 Accuracy: 93.4 |
| action-classification-on-kinetics-600 | MoViNet-A6 | GFLOPs: 386x1 Top-1 Accuracy: 83.5 Top-5 Accuracy: 96.5 |
| action-classification-on-kinetics-600 | MoViNet-A1 | GFLOPs: 6.0x1 Top-1 Accuracy: 76.0 Top-5 Accuracy: 92.6 |
| action-classification-on-kinetics-600 | MoViNet-A4 | GFLOPs: 105x1 Top-1 Accuracy: 81.2 Top-5 Accuracy: 94.9 |
| action-classification-on-kinetics-600 | MoViNet-A5 (AutoAugment) | GFLOPs: 281x1 Top-1 Accuracy: 84.3 Top-5 Accuracy: 96.4 |
| action-classification-on-kinetics-600 | MoViNet-A0 | GFLOPs: 2.7x1 Top-1 Accuracy: 71.5 Top-5 Accuracy: 90.4 |
| action-classification-on-kinetics-600 | MoViNet-A3 | GFLOPs: 56.9x1 Top-1 Accuracy: 80.8 Top-5 Accuracy: 80.8 |
| action-classification-on-kinetics-700 | MoViNet-A1 | Top-1 Accuracy: 63.5 |
| action-classification-on-kinetics-700 | MoViNet-A2 | Top-1 Accuracy: 66.7 |
| action-classification-on-kinetics-700 | MoViNet-A3 | Top-1 Accuracy: 68.0 |
| action-classification-on-kinetics-700 | MoViNet-A4 | Top-1 Accuracy: 70.7 |
| action-classification-on-kinetics-700 | MoViNet-A5 | Top-1 Accuracy: 71.7 |
| action-classification-on-kinetics-700 | MoViNet-A6 | Top-1 Accuracy: 72.3 |
| action-classification-on-kinetics-700 | MoViNet-A0 | Top-1 Accuracy: 58.5 |
| action-classification-on-moments-in-time | MoViNet-A5 | Top 1 Accuracy: 39.1 |
| action-classification-on-moments-in-time | MoViNet-A4 | Top 1 Accuracy: 37.9 |
| action-classification-on-moments-in-time | MoViNet-A0 | Top 1 Accuracy: 27.5 |
| action-classification-on-moments-in-time | MoViNet-A2 | Top 1 Accuracy: 34.3 |
| action-classification-on-moments-in-time | MoViNet-A6 | Top 1 Accuracy: 40.2 |
| action-classification-on-moments-in-time | MoViNet-A1 | Top 1 Accuracy: 32.0 |
| action-classification-on-moments-in-time | MoViNet-A3 | Top 1 Accuracy: 35.6 |
| action-recognition-in-videos-on-something | MoViNet-A0 | GFLOPs: 2.7x1 Parameters: 3.1M Top-1 Accuracy: 61.3 Top-5 Accuracy: 88.2 |
| action-recognition-in-videos-on-something | MoViNet-A1 | GFLOPs: 6.0x1 Parameters: 4.6M Top-1 Accuracy: 62.7 Top-5 Accuracy: 89.0 |
| action-recognition-in-videos-on-something | MoViNet-A3 | GFLOPs: 23.7x1 Parameters: 5.3M |
| action-recognition-in-videos-on-something | MoViNet-A2 | GFLOPs: 10.3x1 Parameters: 4.8M Top-1 Accuracy: 63.5 Top-5 Accuracy: 89.0 |
| action-recognition-on-epic-kitchens-100 | MoViNet-A5 | Action@1: 44.5 GFLOPs: 74.9x1 Noun@1: 55.1 Verb@1: 69.1 |
| action-recognition-on-epic-kitchens-100 | MoViNet-A2 | Action@1: 41.2 GFLOPs: 7.59x1 Noun@1: 52.3 Verb@1: 67.1 |
| action-recognition-on-epic-kitchens-100 | MoViNet-A6 | Action@1: 47.7 GFLOPs: 117x1 Noun@1: 57.3 Verb@1: 72.2 |
| action-recognition-on-epic-kitchens-100 | MoViNet-A4 | Action@1: 44.4 GFLOPs: 42.2x1 Noun@1: 56.2 Verb@1: 68.8 |
| action-recognition-on-epic-kitchens-100 | MoViNet-A0 | Action@1: 36.8 GFLOPs: 1.74x1 Noun@1: 47.4 Verb@1: 64.8 |