Combining Global and Local Attention with Positional Encoding for Video Summarization

Ioannis Patras, Vasileios Mezaris, Georgios Balaouras, Evlampios Apostolidis

Abstract

This paper presents a new method for supervised video summarization. Existing RNN-based summarization architectures have drawbacks in modeling long-range frame dependencies and in parallelizing the training process. To overcome them, the developed model relies on self-attention mechanisms to estimate the importance of video frames. Contrary to previous attention-based summarization approaches, which model the frames' dependencies by observing the entire frame sequence, our method combines global and local multi-head attention mechanisms to discover different modelings of the frames' dependencies at different levels of granularity. Moreover, the utilized attention mechanisms integrate a component that encodes the temporal position of video frames, which is of major importance when producing a video summary. Experiments on two datasets (SumMe and TVSum) demonstrate the effectiveness of the proposed model compared to existing attention-based methods, and its competitiveness against other state-of-the-art supervised summarization approaches. An ablation study that focuses on our main proposed components, namely the use of global and local multi-head attention mechanisms in combination with an absolute positional encoding component, shows their relative contributions to the overall summarization performance.
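The combination of global attention, local attention and absolute positional encoding described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the sinusoidal form of the positional encoding, single-head attention with identity projections instead of learned multi-head attention, non-overlapping equal-length segments for the local view, fusion of the two views by addition, and all function names.

```python
import math

def positional_encoding(n_frames, dim):
    """Sinusoidal absolute positional encoding (assumed variant)."""
    pe = []
    for pos in range(n_frames):
        row = []
        for i in range(dim):
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(frames):
    """Single-head scaled dot-product self-attention.

    Queries, keys and values are the inputs themselves (identity
    projections); a trained model would learn these projections.
    """
    dim = len(frames[0])
    out = []
    for q in frames:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in frames]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, frames))
                    for j in range(dim)])
    return out

def global_local_attention(features, n_segments):
    """Fuse attention over the whole video with attention inside segments."""
    n, dim = len(features), len(features[0])
    pe = positional_encoding(n, dim)
    # Inject the temporal position of each frame into its feature vector.
    x = [[f + p for f, p in zip(fr, pr)] for fr, pr in zip(features, pe)]
    # Global view: every frame attends to every other frame.
    global_out = self_attention(x)
    # Local view: frames attend only within their own segment.
    seg_len = math.ceil(n / n_segments)
    local_out = []
    for start in range(0, n, seg_len):
        local_out.extend(self_attention(x[start:start + seg_len]))
    # Assumed fusion: element-wise addition of the two views.
    return [[g + l for g, l in zip(gr, lr)]
            for gr, lr in zip(global_out, local_out)]

# Hypothetical toy input: 6 frames with 4-dimensional features.
toy_features = [[float(i + j) for j in range(4)] for i in range(6)]
fused = global_local_attention(toy_features, n_segments=2)
print(len(fused), len(fused[0]))  # 6 4
```

In a full model, the fused representation would feed a learned regressor that outputs one importance score per frame. The sketch only shows how the two attention scopes see the same position-aware sequence at different levels of granularity.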

Benchmarks

Benchmark | Methodology | Metrics
supervised-video-summarization-on-summe | PGL-SUM | F1-score (Canonical): 55.6
supervised-video-summarization-on-summe | PGL-SUM (maximum learning capacity) | F1-score (Canonical): 57.1
supervised-video-summarization-on-tvsum | PGL-SUM | F1-score (Canonical): 61.0; Kendall's Tau: 0.157; Spearman's Rho: 0.206
supervised-video-summarization-on-tvsum | PGL-SUM (maximum learning capacity) | F1-score (Canonical): 62.7
video-summarization-on-summe | PGL-SUM | F1-score (Canonical): 55.6
video-summarization-on-tvsum | PGL-SUM | F1-score (Canonical): 61.0
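
The benchmark entries report three kinds of metrics: F1-score, Kendall's Tau and Spearman's Rho. The following is a minimal sketch of how such metrics are commonly computed, assuming frame-level binary summaries for the F1-score and per-frame importance scores for the rank correlations. The exact evaluation protocol behind the numbers above is not stated on this page, so the function names, the tau-a variant (no tie correction) and the toy data are illustrative assumptions only.

```python
import math

def f1_score(pred, truth):
    """F1 between two binary frame selections (1 = frame is in the summary)."""
    overlap = sum(p * t for p, t in zip(pred, truth))
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred)
    recall = overlap / sum(truth)
    return 2 * precision * recall / (precision + recall)

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)

def ranks(values):
    """1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Hypothetical toy data, not taken from SumMe or TVSum.
predicted_summary = [1, 1, 0, 0]
reference_summary = [1, 0, 1, 0]
print(f1_score(predicted_summary, reference_summary))  # 0.5

predicted_scores = [0.1, 0.4, 0.9]
annotated_scores = [0.2, 0.8, 0.5]
print(kendall_tau(predicted_scores, annotated_scores))
print(spearman_rho(predicted_scores, annotated_scores))
```

Note that `f1_score` returns a fraction between 0 and 1, whereas the table lists F1-scores as percentages.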
