
Abstract
We present a novel multimodal multitask network and an associated training algorithm. The method can ingest data from approximately 12 different modalities, including image, video, audio, text, depth maps, point clouds, time series, tabular data, graph-structured data, X-ray, infrared imagery, inertial measurement unit (IMU) data, and hyperspectral data. The proposed approach uses modality-specific tokenizers, a shared Transformer architecture, and cross-modal attention to map data from the different modalities into a unified embedding space. It handles multimodal and multitask scenarios by attaching modality-specific task heads for the different tasks within each modality. We further propose a novel pretraining strategy, iterative modality switching, to initialize the network, as well as a training algorithm that trades off fully joint training over all modalities against training on only a pair of modalities at a time. We provide a comprehensive evaluation across 25 datasets from 12 modalities and show state-of-the-art performance on multiple tasks, demonstrating the effectiveness of the proposed architecture, pretraining strategy, and adaptive multitask training scheme.
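To make the described pipeline concrete (modality-specific tokenizers feeding a shared Transformer, cross-modal attention, per-task heads, and training that alternates between modality pairs), the following PyTorch sketch illustrates one possible arrangement. It is a minimal illustration under assumed dimensions and module choices; the names `ModalityTokenizer`, `OmniVecStyleModel`, and the toy modality-pair schedule are hypothetical and not the authors' implementation.

```python
# Minimal sketch of the architecture described in the abstract:
# modality-specific tokenizers -> shared Transformer -> cross-modal attention
# -> task-specific heads, trained by switching between modality pairs.
# All module names, dimensions, and the toy training loop are assumptions.
import torch
import torch.nn as nn


class ModalityTokenizer(nn.Module):
    """Projects raw per-modality features into the shared token space."""

    def __init__(self, in_dim: int, embed_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, embed_dim)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):  # x: (batch, seq_len, in_dim)
        return self.norm(self.proj(x))


class OmniVecStyleModel(nn.Module):
    """Shared Transformer backbone with per-modality tokenizers and task heads."""

    def __init__(self, modality_dims: dict, task_dims: dict, embed_dim: int = 256):
        super().__init__()
        self.tokenizers = nn.ModuleDict(
            {m: ModalityTokenizer(d, embed_dim) for m, d in modality_dims.items()}
        )
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=8, batch_first=True
        )
        self.backbone = nn.TransformerEncoder(encoder_layer, num_layers=4)
        # Cross-modal attention fuses token streams from two modalities.
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)
        self.heads = nn.ModuleDict(
            {t: nn.Linear(embed_dim, d) for t, d in task_dims.items()}
        )

    def forward(self, inputs: dict, task: str):
        # Tokenize each modality, then encode with the shared backbone.
        tokens = [self.backbone(self.tokenizers[m](x)) for m, x in inputs.items()]
        fused = tokens[0]
        for other in tokens[1:]:
            fused, _ = self.cross_attn(fused, other, other)
        return self.heads[task](fused.mean(dim=1))  # pooled embedding -> task head


if __name__ == "__main__":
    model = OmniVecStyleModel(
        modality_dims={"image": 768, "audio": 128, "text": 512},
        task_dims={"classify": 10},
    )
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    # Toy "modality switching": alternate which modality pair is trained per step.
    pairs = [("image", "text"), ("image", "audio"), ("audio", "text")]
    for step in range(3):
        m1, m2 = pairs[step % len(pairs)]
        batch = {m1: torch.randn(2, 16, model.tokenizers[m1].proj.in_features),
                 m2: torch.randn(2, 16, model.tokenizers[m2].proj.in_features)}
        logits = model(batch, task="classify")
        loss = nn.functional.cross_entropy(logits, torch.randint(0, 10, (2,)))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        print(f"step {step}: pair=({m1}, {m2}) loss={loss.item():.3f}")
```

In this sketch the pair schedule stands in for the trade-off the abstract mentions between fully joint training and pairwise training; in practice one would interleave steps over all modalities with steps over selected pairs.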
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| 3d-point-cloud-classification-on-modelnet40-c | OmniVec2 | Error Rate: 0.142 |
| 3d-point-cloud-classification-on-scanobjectnn | OmniVec2 | Overall Accuracy: 97.2 |
| action-classification-on-kinetics-400 | OmniVec2 | Acc@1: 93.6 |
| action-classification-on-moments-in-time | OmniVec2 | Top 1 Accuracy: 53.1 |
| action-classification-on-moments-in-time-2 | OmniVec2 | Top 1 Accuracy: 53.1 |
| action-recognition-in-videos-on-ucf101 | OmniVec2 | 3-fold Accuracy: 99.6 |
| audio-classification-on-audioset | OmniVec2 | Test mAP: 0.558 |
| audio-classification-on-esc-50 | OmniVec2 | Accuracy (5-fold): 99.1, Top-1 Accuracy: 99.1, Pre-training Dataset: Multiple |
| fine-grained-image-classification-on-oxford-1 | OmniVec2 | Accuracy: 99.6 |
| image-classification-on-imagenet | OmniVec2 | Top 1 Accuracy: 89.3% |
| image-classification-on-inaturalist-2018 | OmniVec2 | Top-1 Accuracy: 94.6 |
| image-classification-on-places365 | OmniVec2 | Top 1 Accuracy: 65.1 |
| semantic-segmentation-on-nyu-depth-v2 | OmniVec2 | Mean IoU: 63.6 |
| text-summarization-on-dialogsum | OmniVec2 | BertScore: 72.8 Rouge1: 47.6 Rouge2: 22.1 RougeL: 41.4 |
| text-summarization-on-samsum-corpus | OmniVec2 | BertScoreF1: 65.1 ROUGE-1: 59.1 ROUGE-2: 34.1 ROUGE-L: 63.7 |
| zero-shot-video-retrieval-on-youcook2 | OmniVec2 | text-to-video R@1: 26.1 text-to-video R@10: 70.8 text-to-video R@5: 54.1 |