
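Abstract
Video restoration (e.g., video super-resolution) aims to recover high-quality frames from low-quality ones. Unlike single-image restoration, video restoration generally has to exploit temporal information from multiple adjacent, but usually misaligned, video frames. Existing deep learning methods typically address this with either a sliding-window strategy or a recurrent architecture; the former is restricted to frame-by-frame restoration, and the latter lacks long-range modelling ability. In this paper, we propose the Video Restoration Transformer (VRT), which offers both parallel frame prediction and long-range temporal dependency modelling. Specifically, VRT is composed of multiple scales, each of which contains two kinds of modules: temporal mutual self attention (TMSA) and parallel warping. TMSA divides the video into small clips and applies mutual attention on them for joint motion estimation, feature alignment and feature fusion, while self attention is used for feature extraction. To enable cross-clip interaction, the video sequence is shifted in every other layer. In addition, parallel warping further fuses information from neighbouring frames through parallel feature warping. Experimental results on five tasks (video super-resolution, video deblurring, video denoising, video frame interpolation and space-time video super-resolution) show that VRT outperforms state-of-the-art methods by large margins (up to 2.16 dB) on fourteen benchmark datasets.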
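
The clip partitioning and alternating shift described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the clip size of 2, the half-clip shift and the cyclic (wrap-around) shift are all assumptions chosen for illustration, and frames are stood in for by their indices.

```python
from typing import List, Sequence


def partition_into_clips(frames: Sequence[int], clip_size: int = 2,
                         shift: int = 0) -> List[List[int]]:
    """Cyclically shift the frame sequence, then cut it into equal clips.

    Hypothetical helper: the cyclic shift is an assumption, not a detail
    stated in the abstract.
    """
    n = len(frames)
    if clip_size <= 0 or n % clip_size != 0:
        raise ValueError("sequence length must be a multiple of clip_size")
    shift %= n
    shifted = list(frames[shift:]) + list(frames[:shift])
    return [shifted[i:i + clip_size] for i in range(0, n, clip_size)]


def clip_schedule(frames: Sequence[int], num_layers: int,
                  clip_size: int = 2) -> List[List[List[int]]]:
    """Return the clip layout for each layer.

    Even-numbered layers use the unshifted sequence; odd-numbered layers
    shift by half a clip (assumed) so that frames which were in different
    clips now share one.
    """
    schedule = []
    for layer in range(num_layers):
        shift = clip_size // 2 if layer % 2 == 1 else 0
        schedule.append(partition_into_clips(frames, clip_size, shift))
    return schedule


if __name__ == "__main__":
    for layer, clips in enumerate(clip_schedule(list(range(6)), num_layers=2)):
        print(f"layer {layer}: {clips}")
```

With six frames, the unshifted layer groups frames (0, 1), (2, 3) and (4, 5), while the shifted layer groups (1, 2), (3, 4) and (5, 0). Attention restricted to a clip therefore reaches a different neighbour in alternating layers, which is how information can propagate across clip boundaries as layers are stacked.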
Code repository
jingyunliang/vrt
Official
pytorch
Mentioned in GitHub
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| deblurring-on-based | VRT (GoPro) | ERQAv2.0: 0.74874; LPIPS: 0.08165; PSNR: 31.42945; SSIM: 0.94503; Subjective: 2.3854; VMAF: 66.72253 |
| deblurring-on-based | VRT (REDS) | ERQAv2.0: 0.75056; LPIPS: 0.08248; PSNR: 30.97878; SSIM: 0.94601; Subjective: 1.5660; VMAF: 66.81782 |
| deblurring-on-based-1 | VRT (GoPro) | PSNR: 31.42945; VMAF: 66.72253 |
| deblurring-on-based-1 | VRT (REDS) | ERQAv2.0: 0.74874; LPIPS: 0.08248; PSNR: 30.97878; SSIM: 0.94503; VMAF: 66.81782 |
| deblurring-on-dvd-1 | VRT | PSNR: 34.27 |
| deblurring-on-gopro | VRT | PSNR: 34.81; SSIM: 0.9724 |
| deblurring-on-reds | VRT | Average PSNR: 36.79 |
| space-time-video-super-resolution-on-vimeo90k | VRT | PSNR: 36.98; SSIM: 0.9439 |
| space-time-video-super-resolution-on-vimeo90k-1 | VRT | PSNR: 36.01; SSIM: 0.9434 |
| video-denoising-on-davis-sigma10 | VRT | PSNR: 40.82 |
| video-denoising-on-davis-sigma20 | VRT | PSNR: 38.15 |
| video-denoising-on-davis-sigma30 | VRT | PSNR: 36.52 |
| video-denoising-on-davis-sigma40 | VRT | PSNR: 35.32 |
| video-denoising-on-davis-sigma50 | VRT | PSNR: 34.36 |
| video-denoising-on-set8-sigma10 | VRT | PSNR: 37.88 |
| video-denoising-on-set8-sigma20 | VRT | PSNR: 35.02 |
| video-denoising-on-set8-sigma30 | VRT | PSNR: 33.35 |
| video-denoising-on-set8-sigma40 | VRT | PSNR: 32.15 |
| video-denoising-on-set8-sigma50 | VRT | PSNR: 31.22 |
| video-frame-interpolation-on-vid4-4x | VRT | PSNR: 27.46; Parameters: 4450000; SSIM: 0.8392 |
| video-super-resolution-on-msu-super-1 | VRT + uavs3e | BSQ-rate over ERQA: 6.619; BSQ-rate over LPIPS: 4.003; BSQ-rate over MS-SSIM: 1.982; BSQ-rate over PSNR: 5.862; BSQ-rate over Subjective Score: 2.511; BSQ-rate over VMAF: 1.425 |
| video-super-resolution-on-msu-super-1 | VRT + aomenc | BSQ-rate over ERQA: 12.289; BSQ-rate over LPIPS: 4.429; BSQ-rate over MS-SSIM: 2.797; BSQ-rate over PSNR: 10.075; BSQ-rate over Subjective Score: 2.631; BSQ-rate over VMAF: 1.733 |
| video-super-resolution-on-msu-super-1 | VRT + vvenc | BSQ-rate over ERQA: 18.333; BSQ-rate over LPIPS: 11.496; BSQ-rate over MS-SSIM: 0.836; BSQ-rate over PSNR: 5.777; BSQ-rate over Subjective Score: 2.235; BSQ-rate over VMAF: 0.652 |
| video-super-resolution-on-msu-super-1 | VRT + x265 | BSQ-rate over ERQA: 8.92; BSQ-rate over LPIPS: 11.329; BSQ-rate over MS-SSIM: 1.257; BSQ-rate over PSNR: 6.634; BSQ-rate over Subjective Score: 2.023; BSQ-rate over VMAF: 1.217 |
| video-super-resolution-on-msu-super-1 | VRT + x264 | BSQ-rate over ERQA: 1.578; BSQ-rate over LPIPS: 1.259; BSQ-rate over MS-SSIM: 0.662; BSQ-rate over PSNR: 1.09; BSQ-rate over Subjective Score: 1.245; BSQ-rate over VMAF: 0.7 |
| video-super-resolution-on-msu-video-upscalers | VRT-Reds-L | LPIPS: 0.343; PSNR: 31.01; SSIM: 0.869 |
| video-super-resolution-on-msu-vsr-benchmark | VRT | 1 - LPIPS: 0.929; ERQAv1.0: 0.758; FPS: 2.778; PSNR: 31.669; QRCRv1.0: 0.722; SSIM: 0.902; Subjective score: 7.628 |
| video-super-resolution-on-udm10-4x-upscaling | VRT | PSNR: 41.05; SSIM: 0.9737 |
| video-super-resolution-on-vid4-4x-upscaling | VRT | PSNR: 27.93; SSIM: 0.8425 |
| video-super-resolution-on-vid4-4x-upscaling-1 | VRT | PSNR: 29.42; SSIM: 0.8795 |
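
Most rows above report PSNR, so a short sketch of how that metric is usually computed may help in reading the numbers. It uses the standard definition, PSNR = 10 · log10(MAX² / MSE), on flat lists of pixel values. The function name and the 8-bit peak value of 255 are assumptions; the benchmarks listed here may use different colour spaces, cropping or averaging conventions that this sketch does not reproduce.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], restored: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized signals.

    Hypothetical helper; max_value=255 assumes 8-bit pixel intensities.
    """
    if len(reference) != len(restored) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixel positions.
    mse = sum((r - x) ** 2 for r, x in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    ground_truth = [100.0, 120.0, 140.0, 160.0]
    estimate = [101.0, 118.0, 141.0, 159.0]
    print(f"PSNR: {psnr(ground_truth, estimate):.2f} dB")
```

Because PSNR is logarithmic in the mean squared error, a gain of 2.16 dB corresponds to the MSE falling by a factor of 10^0.216, roughly 1.64, under the same peak value.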