
Abstract

In this paper, we investigate the feasibility of leveraging large language models (LLMs) for integrating general knowledge and incorporating pseudo-events as priors for temporal content distribution in video moment retrieval (VMR) models. The motivation behind this study arises from the limitations of using LLMs as decoders for generating discrete textual descriptions, which hinders their direct application to continuous outputs like salience scores and inter-frame embeddings that capture inter-frame relations. To overcome these limitations, we propose utilizing LLM encoders instead of decoders. Through a feasibility study, we demonstrate that LLM encoders effectively refine inter-concept relations in multimodal embeddings, even without being trained on textual embeddings. We also show that the refinement capability of LLM encoders can be transferred to other embeddings, such as BLIP and T5, as long as these embeddings exhibit similar inter-concept similarity patterns to CLIP embeddings. We present a general framework for integrating LLM encoders into existing VMR architectures, specifically within the fusion module. Through experimental validation, we demonstrate the effectiveness of our proposed methods by achieving state-of-the-art performance in VMR. The source code can be accessed at https://github.com/fletcherjiang/LLMEPET.
