Jingyun Liang; Jiezhang Cao; Yuchen Fan; Kai Zhang; Rakesh Ranjan; Yawei Li; Radu Timofte; Luc Van Gool

Abstract
Video restoration (e.g., video super-resolution) aims to restore high-quality frames from low-quality frames. Unlike single image restoration, video restoration generally requires utilizing temporal information from multiple adjacent but usually misaligned video frames. Existing deep methods generally tackle this by exploiting a sliding window strategy or a recurrent architecture, which is either restricted by frame-by-frame restoration or lacks long-range modelling ability. In this paper, we propose a Video Restoration Transformer (VRT) with parallel frame prediction and long-range temporal dependency modelling abilities. More specifically, VRT is composed of multiple scales, each of which consists of two kinds of modules: temporal mutual self attention (TMSA) and parallel warping. TMSA divides the video into small clips, on which mutual attention is applied for joint motion estimation, feature alignment and feature fusion, while self attention is used for feature extraction. To enable cross-clip interactions, the video sequence is shifted for every other layer. In addition, parallel warping is used to further fuse information from neighboring frames by parallel feature warping. Experimental results on five tasks, including video super-resolution, video deblurring, video denoising, video frame interpolation and space-time video super-resolution, demonstrate that VRT outperforms the state-of-the-art methods by large margins (**up to 2.16 dB**) on fourteen benchmark datasets.
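The clip partitioning and the shift on every other layer can be illustrated with a small sketch over frame indices. This is not the authors' implementation: the function names, the clip size of 2, the cyclic (wrap-around) shift, and the shift amount of half a clip are all assumptions chosen for illustration, and the attention computation itself is omitted.

```python
def partition_into_clips(num_frames, clip_size=2, shift=0):
    """Group frame indices into non-overlapping clips after a cyclic shift.

    Hypothetical helper: the cyclic shift is an assumption, not a detail
    stated in the abstract.
    """
    if num_frames % clip_size != 0:
        raise ValueError("num_frames must be divisible by clip_size")
    # Rotate the frame order so that clip boundaries move by `shift` frames.
    order = [(i + shift) % num_frames for i in range(num_frames)]
    return [order[i:i + clip_size] for i in range(0, num_frames, clip_size)]


def clips_per_layer(num_frames, num_layers, clip_size=2):
    """Return the clip layout for each layer, shifting on every other layer."""
    layouts = []
    for layer in range(num_layers):
        # Assumed shift amount: half a clip on odd-numbered layers, none otherwise.
        shift = clip_size // 2 if layer % 2 == 1 else 0
        layouts.append(partition_into_clips(num_frames, clip_size, shift))
    return layouts


if __name__ == "__main__":
    for layer, clips in enumerate(clips_per_layer(num_frames=6, num_layers=2)):
        print(layer, clips)
```

With six frames, the unshifted layer pairs frames (0, 1), (2, 3), (4, 5), while the shifted layer pairs (1, 2), (3, 4), (5, 0). Frames that sat in different clips in one layer therefore share a clip in the next, which is how information can propagate across clip boundaries.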
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| deblurring-on-based | VRT (GoPro) | ERQAv2.0: 0.74874 LPIPS: 0.08165 PSNR: 31.42945 SSIM: 0.94503 Subjective: 2.3854 VMAF: 66.72253 |
| deblurring-on-based | VRT (REDS) | ERQAv2.0: 0.75056 LPIPS: 0.08248 PSNR: 30.97878 SSIM: 0.94601 Subjective: 1.5660 VMAF: 66.81782 |
| deblurring-on-based-1 | VRT (GoPro) | PSNR: 31.42945 VMAF: 66.72253 |
| deblurring-on-based-1 | VRT (REDS) | ERQAv2.0: 0.74874 LPIPS: 0.08248 PSNR: 30.97878 SSIM: 0.94503 VMAF: 66.81782 |
| deblurring-on-dvd-1 | VRT | PSNR: 34.27 |
| deblurring-on-gopro | VRT | PSNR: 34.81 SSIM: 0.9724 |
| deblurring-on-reds | VRT | Average PSNR: 36.79 |
| space-time-video-super-resolution-on-vimeo90k | VRT | PSNR: 36.98 SSIM: 0.9439 |
| space-time-video-super-resolution-on-vimeo90k-1 | VRT | PSNR: 36.01 SSIM: 0.9434 |
| video-denoising-on-davis-sigma10 | VRT | PSNR: 40.82 |
| video-denoising-on-davis-sigma20 | VRT | PSNR: 38.15 |
| video-denoising-on-davis-sigma30 | VRT | PSNR: 36.52 |
| video-denoising-on-davis-sigma40 | VRT | PSNR: 35.32 |
| video-denoising-on-davis-sigma50 | VRT | PSNR: 34.36 |
| video-denoising-on-set8-sigma10 | VRT | PSNR: 37.88 |
| video-denoising-on-set8-sigma20 | VRT | PSNR: 35.02 |
| video-denoising-on-set8-sigma30 | VRT | PSNR: 33.35 |
| video-denoising-on-set8-sigma40 | VRT | PSNR: 32.15 |
| video-denoising-on-set8-sigma50 | VRT | PSNR: 31.22 |
| video-frame-interpolation-on-vid4-4x | VRT | PSNR: 27.46 Parameters: 4450000 SSIM: 0.8392 |
| video-super-resolution-on-msu-super-1 | VRT + uavs3e | BSQ-rate over ERQA: 6.619 BSQ-rate over LPIPS: 4.003 BSQ-rate over MS-SSIM: 1.982 BSQ-rate over PSNR: 5.862 BSQ-rate over Subjective Score: 2.511 BSQ-rate over VMAF: 1.425 |
| video-super-resolution-on-msu-super-1 | VRT + aomenc | BSQ-rate over ERQA: 12.289 BSQ-rate over LPIPS: 4.429 BSQ-rate over MS-SSIM: 2.797 BSQ-rate over PSNR: 10.075 BSQ-rate over Subjective Score: 2.631 BSQ-rate over VMAF: 1.733 |
| video-super-resolution-on-msu-super-1 | VRT + vvenc | BSQ-rate over ERQA: 18.333 BSQ-rate over LPIPS: 11.496 BSQ-rate over MS-SSIM: 0.836 BSQ-rate over PSNR: 5.777 BSQ-rate over Subjective Score: 2.235 BSQ-rate over VMAF: 0.652 |
| video-super-resolution-on-msu-super-1 | VRT + x265 | BSQ-rate over ERQA: 8.92 BSQ-rate over LPIPS: 11.329 BSQ-rate over MS-SSIM: 1.257 BSQ-rate over PSNR: 6.634 BSQ-rate over Subjective Score: 2.023 BSQ-rate over VMAF: 1.217 |
| video-super-resolution-on-msu-super-1 | VRT + x264 | BSQ-rate over ERQA: 1.578 BSQ-rate over LPIPS: 1.259 BSQ-rate over MS-SSIM: 0.662 BSQ-rate over PSNR: 1.09 BSQ-rate over Subjective Score: 1.245 BSQ-rate over VMAF: 0.7 |
| video-super-resolution-on-msu-video-upscalers | VRT-Reds-L | LPIPS: 0.343 PSNR: 31.01 SSIM: 0.869 |
| video-super-resolution-on-msu-vsr-benchmark | VRT | 1 - LPIPS: 0.929 ERQAv1.0: 0.758 FPS: 2.778 PSNR: 31.669 QRCRv1.0: 0.722 SSIM: 0.902 Subjective score: 7.628 |
| video-super-resolution-on-udm10-4x-upscaling | VRT | PSNR: 41.05 SSIM: 0.9737 |
| video-super-resolution-on-vid4-4x-upscaling | VRT | PSNR: 27.93 SSIM: 0.8425 |
| video-super-resolution-on-vid4-4x-upscaling-1 | VRT | PSNR: 29.42 SSIM: 0.8795 |
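Most rows above report PSNR in dB. As a reminder of what that number measures, here is a minimal sketch of the standard definition, PSNR = 10 · log10(MAX² / MSE), over flat lists of pixel values. The function name and the 8-bit peak value of 255 are assumptions. Individual benchmarks may differ in details such as color channel (RGB versus luminance) or border cropping, so this sketch is not expected to reproduce the table values.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists.

    Hypothetical helper; max_value=255.0 assumes 8-bit pixel values.
    """
    if len(reference) != len(restored) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixel positions.
    mse = sum((r - x) ** 2 for r, x in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because the scale is logarithmic, a fixed gain in dB corresponds to a fixed multiplicative reduction in mean squared error, regardless of the starting point.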