Chaitanya Ryali; Yuan-Ting Hu; Daniel Bolya; Chen Wei; Haoqi Fan; Po-Yao Huang; Vaibhav Aggarwal; Arkabandhu Chowdhury; Omid Poursaeed; Judy Hoffman; Jitendra Malik; Yanghao Li; Christoph Feichtenhofer

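Abstract
Modern hierarchical vision transformers have added several vision-specific components in the pursuit of supervised classification performance. While these components yield strong accuracy and attractive FLOP counts, the added complexity actually makes these transformers slower than their vanilla Vision Transformer (ViT) counterparts. In this paper, we argue that this additional complexity is unnecessary. By pretraining with a strong visual pretext task such as the masked autoencoder (MAE), we can strip all of the add-on components from a state-of-the-art multi-stage vision transformer without losing accuracy. In the process, we create Hiera, an extremely simple hierarchical vision transformer that is more accurate than previous models while being significantly faster at both inference and training. We evaluate Hiera on a variety of image and video recognition tasks. Our code and models are available at https://github.com/facebookresearch/hiera.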
Benchmarks
| Benchmark | Method | Metric |
|---|---|---|
| action-classification-on-kinetics-400 | Hiera-H (no extra data) | Acc@1: 87.8 |
| action-classification-on-kinetics-600 | Hiera-H (no extra data) | Top-1 Accuracy: 88.8 |
| action-classification-on-kinetics-700 | Hiera-H (no extra data) | Top-1 Accuracy: 81.1 |
| action-recognition-in-videos-on-something | Hiera-L (no extra data) | Top-1 Accuracy: 76.5 |
| action-recognition-on-ava-v2-2 | Hiera-H (K700 PT+FT) | mAP: 43.3 |
| image-classification-on-imagenet | Hiera-H | Top-1 Accuracy: 86.9 |
| image-classification-on-inaturalist | Hiera-H (448px) | Top-1 Accuracy: 83.8 |
| image-classification-on-inaturalist-2018 | Hiera-H (448px) | Top-1 Accuracy: 87.3 |
| image-classification-on-inaturalist-2019 | Hiera-H (448px) | Top-1 Accuracy: 88.5 |
| image-classification-on-places365-standard | Hiera-H (448px) | Top-1 Accuracy: 60.6 |
| instance-segmentation-on-coco-minival | Hiera-L | mask AP: 48.6 |
| object-detection-on-coco-minival | Hiera-L | box AP: 55 |