3 months ago

M&M Mix: A Multimodal Multiview Transformer Ensemble

Xuehan Xiong, Anurag Arnab, Arsha Nagrani, Cordelia Schmid

Abstract

This report describes the approach behind our winning solution to the 2022 Epic-Kitchens Action Recognition Challenge. Our approach builds upon our recent work, Multiview Transformer for Video Recognition (MTV), and adapts it to multimodal inputs. Our final submission consists of an ensemble of Multimodal MTV (M&M) models with varying backbone sizes and input modalities. Our approach achieved 52.8% Top-1 accuracy for action classes on the test set, which is 4.1% higher than last year's winning entry.
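
The abstract does not state how the ensemble members are combined. As a minimal sketch, assuming the common approach of averaging per-class softmax probabilities across models, the idea could look like the following. The function names, the optional weights, and the toy logits are hypothetical and are not taken from the report.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(per_model_logits, weights=None):
    """Average class probabilities over models.

    Assumption: score averaging is used here for illustration only;
    the report may combine its models differently.
    Returns (predicted class index, averaged probabilities).
    """
    n = len(per_model_logits)
    if weights is None:
        weights = [1.0 / n] * n  # uniform weighting by default
    num_classes = len(per_model_logits[0])
    avg = [0.0] * num_classes
    for w, logits in zip(weights, per_model_logits):
        for c, p in enumerate(softmax(logits)):
            avg[c] += w * p
    best = max(range(num_classes), key=avg.__getitem__)
    return best, avg


# Toy logits standing in for three hypothetical ensemble members
# (e.g. different backbone sizes or input modalities).
logits_a = [2.0, 1.0, 0.1]
logits_b = [0.5, 2.5, 0.3]
logits_c = [0.2, 2.0, 1.9]

pred, probs = ensemble_predict([logits_a, logits_b, logits_c])
print(pred, [round(p, 3) for p in probs])
```

In this toy case the first model alone prefers class 0, while the averaged scores favour class 1, which is the kind of disagreement an ensemble is meant to resolve.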

Benchmarks

Benchmark: action-recognition-on-epic-kitchens-100
Methodology: M&M (WTS 60M)
Metrics: Action@1: 53.6, Noun@1: 66.3, Verb@1: 72.0
