Learning Joint Spatial-Temporal Transformations for Video Inpainting
Yanhong Zeng; Jianlong Fu; Hongyang Chao

Abstract
High-quality video inpainting that completes missing regions in video frames is a promising yet challenging task. State-of-the-art approaches adopt attention models to complete a frame by searching for missing contents in reference frames, and then complete whole videos frame by frame. However, these approaches can suffer from inconsistent attention results along spatial and temporal dimensions, which often leads to blurriness and temporal artifacts in videos. In this paper, we propose to learn a joint Spatial-Temporal Transformer Network (STTN) for video inpainting. Specifically, we simultaneously fill missing regions in all input frames by self-attention, and propose to optimize STTN by a spatial-temporal adversarial loss. To show the superiority of the proposed model, we conduct both quantitative and qualitative evaluations using standard stationary masks and more realistic moving object masks. Demo videos are available at https://github.com/researchmm/STTN.
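As the abstract describes, the key operation in STTN is joint spatial-temporal self-attention: deep features of all input frames are split into patches, and every patch attends to patches from every frame, so that missing regions in all frames are filled simultaneously. The sketch below illustrates that idea in PyTorch; it is not the authors' released code, and the class name, patch size, head count, and tensor shapes are assumptions made for clarity.

```python
# Minimal sketch of joint spatial-temporal self-attention over video patches.
# Illustrative only, NOT the official STTN implementation; names and
# hyperparameters below are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatialTemporalAttention(nn.Module):
    """Joint space-time multi-head attention over non-overlapping patches."""

    def __init__(self, channels=256, patch_size=8, num_heads=4):
        super().__init__()
        assert channels % num_heads == 0
        self.patch_size = patch_size
        self.num_heads = num_heads
        self.head_dim = channels // num_heads
        # 1x1 convolutions produce per-pixel query/key/value features.
        self.to_q = nn.Conv2d(channels, channels, 1)
        self.to_k = nn.Conv2d(channels, channels, 1)
        self.to_v = nn.Conv2d(channels, channels, 1)
        self.proj = nn.Conv2d(channels, channels, 1)

    def forward(self, feats):
        # feats: (B, T, C, H, W) features of all input frames; H and W must be
        # divisible by the patch size.
        b, t, c, h, w = feats.shape
        p = self.patch_size
        x = feats.reshape(b * t, c, h, w)
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)

        def to_patches(y):
            # (B*T, C, H, W) -> (B, heads, T*N, head_dim*p*p), N patches/frame.
            y = F.unfold(y, kernel_size=p, stride=p)           # (B*T, C*p*p, N)
            n = y.shape[-1]
            y = y.view(b, t, self.num_heads, self.head_dim * p * p, n)
            return y.permute(0, 2, 1, 4, 3).reshape(b, self.num_heads, t * n, -1)

        qp, kp, vp = to_patches(q), to_patches(k), to_patches(v)
        # Every patch attends to every patch of every frame (joint space-time),
        # so all frames are completed at once rather than frame by frame.
        attn = torch.softmax(
            qp @ kp.transpose(-2, -1) / (qp.shape[-1] ** 0.5), dim=-1)
        out = attn @ vp                                        # (B, heads, T*N, D)

        # Fold the attended patches back into frame-shaped feature maps.
        n = (h // p) * (w // p)
        out = out.reshape(b, self.num_heads, t, n, self.head_dim * p * p)
        out = out.permute(0, 2, 1, 4, 3).reshape(b * t, c * p * p, n)
        out = F.fold(out, output_size=(h, w), kernel_size=p, stride=p)
        out = self.proj(out).reshape(b, t, c, h, w)
        return out + feats                                     # residual connection
```

For example, `SpatialTemporalAttention()(torch.randn(1, 5, 256, 64, 64))` processes five 64×64 feature maps jointly. In the paper, a generator built from such attention blocks is further optimized with a spatial-temporal adversarial loss that penalizes unrealistic completions across both space and time.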
Code Repositories
https://github.com/researchmm/STTN
Benchmarks
| Benchmark | Methodology | PSNR | SSIM | VFID | Ewarp | LPIPS |
|---|---|---|---|---|---|---|
| seeing-beyond-the-visible-on-kitti360-ex | STTN | 18.73 (avg.) | – | – | – | – |
| video-inpainting-on-davis | STTN | 30.67 | 0.9560 | 0.149 | 0.1449 | – |
| video-inpainting-on-hqvi-240p | STTN | 29.64 | 0.9339 | 0.2594 | – | 0.0528 |
| video-inpainting-on-youtube-vos | STTN | 32.34 | 0.9655 | 0.053 | 0.0907 | – |
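The PSNR columns above follow the standard peak signal-to-noise ratio between completed frames and ground truth. A minimal sketch of such a per-frame score is shown below; the function name and the assumption that frames are floats in [0, 1] are illustrative, not the benchmarks' exact evaluation scripts.

```python
# Minimal per-frame PSNR sketch (standard definition), assuming float frames
# in [0, 1]; not the official evaluation code of any benchmark above.
import numpy as np


def psnr(pred, target, data_range=1.0):
    # Mean squared error between the completed frame and the ground truth.
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10((data_range ** 2) / mse)
```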