MOMA-LRG: Language-Refined Graphs for Multi-Object Multi-Actor Activity Parsing

Abstract

Video-language models (VLMs), large models pre-trained on numerous but noisy video-text pairs from the internet, have revolutionized activity recognition through their remarkable generalization and open-vocabulary capabilities. While complex human activities are often hierarchical and compositional, most existing tasks for evaluating VLMs focus only on high-level video understanding, making it difficult to accurately assess and interpret the ability of VLMs to understand complex and fine-grained human activities. Inspired by the recently proposed MOMA framework, we define activity graphs as a single universal representation of human activities that encompasses video understanding at the activity, sub-activity, and atomic action level. We redefine activity parsing as the overarching task of activity graph generation, requiring understanding human activities across all three levels. To facilitate the evaluation of models on activity parsing, we introduce MOMA-LRG (Multi-Object Multi-Actor Language-Refined Graphs), a large dataset of complex human activities with activity graph annotations that can be readily transformed into natural language sentences. Lastly, we present a model-agnostic and lightweight approach to adapting and evaluating VLMs by incorporating structured knowledge from activity graphs into VLMs, addressing the individual limitations of language and graphical models. We demonstrate strong performance on few-shot activity parsing, and our framework is intended to foster future research in the joint modeling of videos, graphs, and language.
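To make the three-level representation concrete, the sketch below models an activity graph and flattens it into natural-language sentences, as the abstract describes. All class and field names (ActivityGraph, SubActivity, AtomicAction) and the example content are illustrative assumptions, not MOMA-LRG's actual annotation schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AtomicAction:
    # Atomic-action level: one actor, a predicate, and an optional object.
    actor: str
    predicate: str
    target: Optional[str] = None

@dataclass
class SubActivity:
    # Sub-activity level: a temporal segment grouping atomic actions.
    label: str
    start: float  # seconds
    end: float    # seconds
    actions: List[AtomicAction] = field(default_factory=list)

@dataclass
class ActivityGraph:
    # Activity level: the whole video's activity class and its sub-activities.
    activity: str
    sub_activities: List[SubActivity] = field(default_factory=list)

    def to_sentences(self) -> List[str]:
        """Flatten the hierarchy into natural-language sentences."""
        sentences = [f"The video shows {self.activity}."]
        for sub in self.sub_activities:
            sentences.append(f"From {sub.start:.1f}s to {sub.end:.1f}s, {sub.label}.")
            for act in sub.actions:
                obj = f" {act.target}" if act.target else ""
                sentences.append(f"{act.actor} {act.predicate}{obj}.")
        return sentences

# Toy example; content invented purely for illustration.
graph = ActivityGraph(
    activity="a basketball game",
    sub_activities=[SubActivity(
        label="a player attempts a shot", start=3.0, end=7.5,
        actions=[AtomicAction("player 1", "throws", "the ball"),
                 AtomicAction("player 2", "blocks", "player 1")],
    )],
)
print("\n".join(graph.to_sentences()))
```

A graph-to-text conversion of this kind is one way structured annotations can be paired with a VLM's text encoder without architectural changes, in the spirit of the model-agnostic adaptation the abstract describes.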

Benchmarks

| Benchmark | Methodology | Activity Classification Accuracy (5-shot 5-way) | Subactivity Classification Accuracy (5-shot 5-way) |
| --- | --- | --- | --- |
| few-shot-action-recognition-on-moma-lrg | CMN | 86.3 | 66.6 |
| few-shot-action-recognition-on-moma-lrg | OTAM | 92.07 | 72.59 |
| few-shot-action-recognition-on-moma-lrg | SG-VLM | 92.5 | 32.70 |
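For reference, "5-shot 5-way" accuracy is typically computed episodically: each test episode samples 5 classes with 5 labeled support clips per class, the model classifies held-out query clips against that support set, and accuracy is averaged over many random episodes. The sketch below shows this generic protocol under those assumptions; `dataset` (class label to list of clips) and `classify` (standing in for any few-shot model such as CMN or OTAM) are hypothetical, not the benchmark's actual evaluation code.

```python
import random
from statistics import mean

def episode_accuracy(dataset, classify, n_way=5, k_shot=5, n_query=5):
    """Score one N-way K-shot episode.

    dataset:  dict mapping class label -> list of video clips
    classify: callable (clip, support) -> predicted class label
    """
    classes = random.sample(sorted(dataset), n_way)
    support, queries = {}, []
    for label in classes:
        clips = random.sample(dataset[label], k_shot + n_query)
        support[label] = clips[:k_shot]                   # labeled examples shown to the model
        queries += [(c, label) for c in clips[k_shot:]]   # held-out clips to classify
    correct = sum(classify(clip, support) == label for clip, label in queries)
    return correct / len(queries)

def few_shot_accuracy(dataset, classify, n_episodes=1000, **episode_kwargs):
    """Mean accuracy over many random episodes, the kind of number reported above."""
    return mean(episode_accuracy(dataset, classify, **episode_kwargs)
                for _ in range(n_episodes))
```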

