Kumara Kahatapitiya Kanchana Ranasinghe Jongwoo Park Michael S. Ryoo

Abstract
Language has become a prominent modality in computer vision with the rise of LLMs. Despite supporting long context lengths, the effectiveness of LLMs in handling long-term information gradually declines with input length. This becomes critical, especially in applications such as long-form video understanding. In this paper, we introduce a Language Repository (LangRepo) for LLMs that maintains concise and structured information as an interpretable (i.e., all-textual) representation. Our repository is updated iteratively based on multi-scale video chunks. We introduce write and read operations that focus on pruning redundancies in text and extracting information at various temporal scales. The proposed framework is evaluated on zero-shot visual question-answering benchmarks including EgoSchema, NExT-QA, IntentQA and NExT-GQA, showing state-of-the-art performance at its scale. Our code is available at https://github.com/kkahatapitiya/LangRepo.
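The abstract describes a textual repository updated chunk-by-chunk, with write operations that prune redundant text and read operations that extract information at multiple temporal scales. A minimal sketch of that idea is below; the class name, the exact-match deduplication heuristic, and the window-based grouping are illustrative assumptions, not the authors' implementation.

```python
class LangRepo:
    """Hypothetical all-textual repository: stores (chunk_index, caption) pairs."""

    def __init__(self):
        self.entries = []  # list of (chunk_index, caption text)

    def write(self, chunk_index, captions):
        """Add captions for a video chunk, pruning redundant (duplicate) text."""
        for cap in captions:
            normalized = cap.strip().lower()
            # illustrative redundancy check: skip exact (case-insensitive) repeats
            if all(normalized != e[1].strip().lower() for e in self.entries):
                self.entries.append((chunk_index, cap))

    def read(self, scale=1):
        """Return stored text merged into windows of `scale` chunks
        (a stand-in for reading at a coarser temporal scale)."""
        grouped = {}
        for idx, cap in self.entries:
            grouped.setdefault(idx // scale, []).append(cap)
        return [" ".join(caps) for _, caps in sorted(grouped.items())]


repo = LangRepo()
repo.write(0, ["person opens a door", "person opens a door"])  # duplicate is pruned
repo.write(1, ["person walks inside"])
print(repo.read(scale=2))  # one window covering chunks 0 and 1
```

In the paper's actual framework the pruning and multi-scale summarization are performed by an LLM over generated captions; this sketch only mirrors the data flow of iterative writes followed by scale-dependent reads.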
Benchmarks
| Benchmark (zero-shot video question answering) | Method | Metric |
|---|---|---|
| EgoSchema (subset) | LangRepo (12B) | Accuracy: 66.2 |
| EgoSchema (full set) | LangRepo (12B) | Accuracy: 41.2 |
| IntentQA | LangRepo (12B) | Accuracy: 59.1 |
| NExT-GQA | LangRepo (12B) | Acc@GQA: 17.1 |
| NExT-QA | LangRepo (12B) | Accuracy: 60.9 |