
MoViNets: Mobile Video Networks for Efficient Video Recognition


Abstract

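We present Mobile Video Networks (MoViNets), a family of computation- and memory-efficient video networks that can run online inference on streaming video. 3D convolutional neural networks (3D CNNs) are accurate at video recognition, but they require large computation and memory budgets and do not support online inference, which makes them hard to deploy on mobile devices. We propose a three-step approach that substantially reduces the peak memory usage of 3D CNNs while greatly improving their computational efficiency. First, we design a video network search space and apply neural architecture search (NAS) to generate efficient and diverse 3D CNN architectures. Second, we introduce the Stream Buffer technique, which decouples memory from video clip duration, so that 3D CNNs can process streaming video sequences of arbitrary length with a small, constant memory footprint during both training and inference. Third, we propose a simple ensembling technique that further improves accuracy without sacrificing efficiency. These three progressive techniques allow MoViNets to achieve a state-of-the-art trade-off between accuracy and efficiency on the Kinetics, Moments in Time, and Charades video action recognition datasets. For example, MoViNet-A5-Stream matches the accuracy of X3D-XL on Kinetics 600 while requiring 80% fewer FLOPs and 65% less memory. Code will be made available at https://github.com/tensorflow/models/tree/master/official/vision.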
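The Stream Buffer idea from the abstract can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes a single causal temporal convolution over scalar per-frame features (stand-ins for feature maps), and the function names `causal_conv_full` and `causal_conv_stream` are hypothetical. It shows that carrying only the last k-1 frames between chunks reproduces the full-clip output, so the memory needed per step does not grow with clip length.

```python
from typing import List, Sequence, Tuple


def causal_conv_full(frames: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Causal temporal convolution over a whole clip (zero-padded on the left)."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(frames)
    return [
        sum(w * padded[t + i] for i, w in enumerate(kernel))
        for t in range(len(frames))
    ]


def causal_conv_stream(
    chunk: Sequence[float], kernel: Sequence[float], buffer: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Process one chunk; `buffer` holds the last k-1 frames seen so far."""
    k = len(kernel)
    padded = list(buffer) + list(chunk)
    out = [
        sum(w * padded[t + i] for i, w in enumerate(kernel))
        for t in range(len(chunk))
    ]
    # Keep only the k-1 most recent frames for the next chunk.
    new_buffer = padded[len(padded) - (k - 1):] if k > 1 else []
    return out, new_buffer


# Hypothetical data: 10 scalar "frames" and a temporal kernel of size 3.
frames = [float(i) for i in range(1, 11)]
kernel = [0.5, 0.25, 0.25]

full = causal_conv_full(frames, kernel)

# Stream the same clip in chunks of 4, 4, and 2 frames.
buffer = [0.0] * (len(kernel) - 1)  # zero-initialised buffer
streamed: List[float] = []
for start in range(0, len(frames), 4):
    out, buffer = causal_conv_stream(frames[start:start + 4], kernel, buffer)
    streamed.extend(out)

print(streamed == full)
```

The buffer size depends only on the kernel's temporal extent, not on how many frames have been streamed, which is the decoupling of memory from clip duration that the abstract describes.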

Code Repositories

Atze00/MoViNet-pytorch (PyTorch; mentioned on GitHub)

Benchmarks

Benchmark | Method | Metrics
action-classification-on-charades | MoViNet-A2 | MAP: 32.5
action-classification-on-charades | MoViNet-A6 | MAP: 63.2
action-classification-on-charades | MoViNet-A4 | MAP: 48.5
action-classification-on-kinetics-400 | MoViNet-A4 | Acc@1: 80.5; Acc@5: 94.5; FLOPs (G) x views: 105x1
action-classification-on-kinetics-400 | MoViNet-A6 | Acc@1: 81.5; FLOPs (G) x views: 386x1
action-classification-on-kinetics-400 | MoViNet-A1 | Acc@1: 72.7; Acc@5: 91.2; FLOPs (G) x views: 6.0x1
action-classification-on-kinetics-400 | MoViNet-A2 | Acc@1: 75.0; Acc@5: 92.3; FLOPs (G) x views: 10.3x1
action-classification-on-kinetics-400 | MoViNet-A5 | Acc@1: 80.9; Acc@5: 94.9; FLOPs (G) x views: 281x1
action-classification-on-kinetics-400 | MoViNet-A3 | Acc@1: 78.2; Acc@5: 93.8; FLOPs (G) x views: 56.9x1
action-classification-on-kinetics-400 | MoViNet-A0 | Acc@1: 65.8; Acc@5: 87.4; FLOPs (G) x views: 2.7x1
action-classification-on-kinetics-600 | MoViNet-A5 | GFLOPs: 281x1; Top-1 Accuracy: 82.7; Top-5 Accuracy: 95.7
action-classification-on-kinetics-600 | MoViNet-A2 | GFLOPs: 10.3x1; Top-1 Accuracy: 77.5; Top-5 Accuracy: 93.4
action-classification-on-kinetics-600 | MoViNet-A6 | GFLOPs: 386x1; Top-1 Accuracy: 83.5; Top-5 Accuracy: 96.5
action-classification-on-kinetics-600 | MoViNet-A1 | GFLOPs: 6.0x1; Top-1 Accuracy: 76.0; Top-5 Accuracy: 92.6
action-classification-on-kinetics-600 | MoViNet-A4 | GFLOPs: 105x1; Top-1 Accuracy: 81.2; Top-5 Accuracy: 94.9
action-classification-on-kinetics-600 | MoViNet-A5 (AutoAugment) | GFLOPs: 281x1; Top-1 Accuracy: 84.3; Top-5 Accuracy: 96.4
action-classification-on-kinetics-600 | MoViNet-A0 | GFLOPs: 2.7x1; Top-1 Accuracy: 71.5; Top-5 Accuracy: 90.4
action-classification-on-kinetics-600 | MoViNet-A3 | GFLOPs: 56.9x1; Top-1 Accuracy: 80.8; Top-5 Accuracy: 80.8
action-classification-on-kinetics-700 | MoViNet-A1 | Top-1 Accuracy: 63.5
action-classification-on-kinetics-700 | MoViNet-A2 | Top-1 Accuracy: 66.7
action-classification-on-kinetics-700 | MoViNet-A3 | Top-1 Accuracy: 68.0
action-classification-on-kinetics-700 | MoViNet-A4 | Top-1 Accuracy: 70.7
action-classification-on-kinetics-700 | MoViNet-A5 | Top-1 Accuracy: 71.7
action-classification-on-kinetics-700 | MoViNet-A6 | Top-1 Accuracy: 72.3
action-classification-on-kinetics-700 | MoViNet-A0 | Top-1 Accuracy: 58.5
action-classification-on-moments-in-time | MoViNet-A5 | Top 1 Accuracy: 39.1
action-classification-on-moments-in-time | MoViNet-A4 | Top 1 Accuracy: 37.9
action-classification-on-moments-in-time | MoViNet-A0 | Top 1 Accuracy: 27.5
action-classification-on-moments-in-time | MoViNet-A2 | Top 1 Accuracy: 34.3
action-classification-on-moments-in-time | MoViNet-A6 | Top 1 Accuracy: 40.2
action-classification-on-moments-in-time | MoViNet-A1 | Top 1 Accuracy: 32.0
action-classification-on-moments-in-time | MoViNet-A3 | Top 1 Accuracy: 35.6
action-recognition-in-videos-on-something | MoViNet-A0 | GFLOPs: 2.7x1; Parameters: 3.1M; Top-1 Accuracy: 61.3; Top-5 Accuracy: 88.2
action-recognition-in-videos-on-something | MoViNet-A1 | GFLOPs: 6.0x1; Parameters: 4.6M; Top-1 Accuracy: 62.7; Top-5 Accuracy: 89.0
action-recognition-in-videos-on-something | MoViNet-A3 | GFLOPs: 23.7x1; Parameters: 5.3M
action-recognition-in-videos-on-something | MoViNet-A2 | GFLOPs: 10.3x1; Parameters: 4.8M; Top-1 Accuracy: 63.5; Top-5 Accuracy: 89.0
action-recognition-on-epic-kitchens-100 | MoViNet-A5 | Action@1: 44.5; GFLOPs: 74.9x1; Noun@1: 55.1; Verb@1: 69.1
action-recognition-on-epic-kitchens-100 | MoViNet-A2 | Action@1: 41.2; GFLOPs: 7.59x1; Noun@1: 52.3; Verb@1: 67.1
action-recognition-on-epic-kitchens-100 | MoViNet-A6 | Action@1: 47.7; GFLOPs: 117x1; Noun@1: 57.3; Verb@1: 72.2
action-recognition-on-epic-kitchens-100 | MoViNet-A4 | Action@1: 44.4; GFLOPs: 42.2x1; Noun@1: 56.2; Verb@1: 68.8
action-recognition-on-epic-kitchens-100 | MoViNet-A0 | Action@1: 36.8; GFLOPs: 1.74x1; Noun@1: 47.4; Verb@1: 64.8
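
One way to read the accuracy/efficiency trade-off in the Kinetics 600 entries listed above is to compute how much Top-1 accuracy each step up the model family buys per additional GFLOP. The sketch below copies the single-view GFLOPs and Top-1 values from those entries (the AutoAugment variant is left out); the helper name `marginal_gains` is hypothetical and this is only an illustrative calculation, not an analysis from the paper.

```python
# (model, GFLOPs for a single view, Top-1 accuracy) on Kinetics 600,
# copied from the benchmark entries above.
K600 = [
    ("MoViNet-A0", 2.7, 71.5),
    ("MoViNet-A1", 6.0, 76.0),
    ("MoViNet-A2", 10.3, 77.5),
    ("MoViNet-A3", 56.9, 80.8),
    ("MoViNet-A4", 105.0, 81.2),
    ("MoViNet-A5", 281.0, 82.7),
    ("MoViNet-A6", 386.0, 83.5),
]


def marginal_gains(rows):
    """Top-1 points gained per extra GFLOP between consecutive models."""
    rows = sorted(rows, key=lambda r: r[1])  # order by compute cost
    gains = []
    for (prev_name, prev_flops, prev_acc), (name, flops, acc) in zip(rows, rows[1:]):
        gains.append((prev_name, name, (acc - prev_acc) / (flops - prev_flops)))
    return gains


gains = marginal_gains(K600)
for prev_name, name, gain in gains:
    print(f"{prev_name} -> {name}: {gain:.4f} Top-1 points per extra GFLOP")
```

On these numbers the gain per extra GFLOP is far larger between the small models than between the large ones, which is the usual pattern of diminishing returns as compute grows.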
