
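Abstract
Generating video frames that accurately predict future world states is challenging. Existing approaches either fail to capture the full distribution of possible outcomes, produce blurry images, or both. This paper introduces an unsupervised video generation model that learns a prior model of uncertainty in a given environment. Video frames are generated by drawing samples from this prior and combining them with a deterministic estimate of the future frame. The approach is simple, easy to train, and can be trained end-to-end on a variety of datasets. The generated samples are both varied and sharp, even many frames into the future, and compare favorably with those from existing approaches.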
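The generation step described in the abstract (draw a latent sample from the prior, then combine it with a deterministic frame estimate) can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: `learned_prior`, `frame_predictor`, the `WEIGHTS` dictionary, and the flat list-of-pixels frame format are all hypothetical stand-ins, and the "networks" are fixed arithmetic rather than trained models.

```python
import math
import random

# Hypothetical fixed parameters standing in for trained network weights.
WEIGHTS = {"mu_scale": 0.5, "logvar_bias": -2.0, "decay": 0.9, "z_gain": 0.1}


def learned_prior(prev_frame, weights):
    """Toy stand-in for a learned prior: maps the previous frame to the
    mean and log-variance of a Gaussian over the latent variable z_t."""
    mean_pixel = sum(prev_frame) / len(prev_frame)
    mu = weights["mu_scale"] * mean_pixel
    logvar = weights["logvar_bias"]
    return mu, logvar


def sample_latent(mu, logvar, rng):
    """Reparameterized draw: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    sigma = math.exp(0.5 * logvar)
    return mu + sigma * rng.gauss(0.0, 1.0)


def frame_predictor(prev_frame, z, weights):
    """Deterministic given (prev_frame, z): all randomness enters via z.
    Pixel values are clipped to the range [0, 1]."""
    return [
        min(1.0, max(0.0, weights["decay"] * p + weights["z_gain"] * z))
        for p in prev_frame
    ]


def generate(context_frame, num_steps, weights, seed):
    """Roll the model forward, feeding each generated frame back in."""
    rng = random.Random(seed)
    frames = [context_frame]
    for _ in range(num_steps):
        mu, logvar = learned_prior(frames[-1], weights)
        z = sample_latent(mu, logvar, rng)
        frames.append(frame_predictor(frames[-1], z, weights))
    return frames[1:]  # drop the conditioning frame


# Two different seeds give two different plausible futures
# for the same conditioning frame.
context = [0.2, 0.4, 0.6, 0.8]
sample_a = generate(context, num_steps=5, weights=WEIGHTS, seed=0)
sample_b = generate(context, num_steps=5, weights=WEIGHTS, seed=1)
```

Because the predictor is deterministic once `z` is fixed, the diversity of the generated futures comes entirely from the samples drawn from the prior.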
Code repositories
joelouismarino/amortized-variational-filtering
pytorch
Mentioned in GitHub
edenton/svg
Official
pytorch
Mentioned in GitHub
MIT-Omnipush/video-prediction
tf
Mentioned in GitHub
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| video-generation-on-bair-robot-pushing | SVG-FP (from FVD) | Cond: 2 FVD score: 315.5 Pred: 14 Train: 14 |
| video-generation-on-bair-robot-pushing | SVG-LP (from vRNN) | Cond: 2 FVD score: 256.62 LPIPS: 0.061±0.03 Pred: 28 SSIM: 0.816±0.07 Train: 10 |
| video-generation-on-bair-robot-pushing | SVG (from SRVP) | Cond: 2 FVD score: 255±4 LPIPS: 0.0609±0.0034 PSNR: 18.95±0.26 Pred: 28 SSIM: 0.8058±0.0088 Train: 12 |
| video-prediction-on-cityscapes-128x128 | SVG (from Hier-VRNN) | Cond: 2 FVD: 1300.26 LPIPS: 0.549±0.06 Pred: 28 SSIM: 0.574±0.08 Train: 10 |
| video-prediction-on-kth | SVG-LP (from Grid-keypoints) | Cond: 10 FVD: 157.9 LPIPS: 0.129 PSNR: 23.91 Params (M): 22.8 Pred: 40 SSIM: 0.800 Train: 10 |
| video-prediction-on-kth | SVG-LP (from SRVP) | Cond: 10 FVD: 377±6 LPIPS: 0.0923±0.0038 PSNR: 28.06±0.29 Pred: 30 SSIM: 0.8438±0.0054 Train: 10 |
| video-prediction-on-synpickvp | SVG-LP | LPIPS: 0.066 MSE: 51.82 PSNR: 27.38 SSIM: 0.886 |
| video-prediction-on-synpickvp | SVG-Det | LPIPS: 0.068 MSE: 60.60 PSNR: 26.92 SSIM: 0.879 |