5 months ago

L-Verse: Bidirectional Generation Between Image and Text

Taehoon Kim; Gwangmo Song; Sihaeng Lee; Sangyun Kim; Yewon Seo; Soonyoung Lee; Seung Hwan Kim; Honglak Lee; Kyunghoon Bae

Abstract

Far beyond learning long-range interactions in natural language, transformers are becoming the de facto standard for many vision tasks thanks to their power and scalability. In cross-modal tasks between image and text in particular, vector quantized variational autoencoders (VQ-VAEs) are widely used to convert a raw RGB image into a sequence of feature vectors. To better leverage the correlation between image and text, we propose L-Verse, a novel architecture consisting of a feature-augmented variational autoencoder (AugVAE) and a bidirectional auto-regressive transformer (BiART) for image-to-text and text-to-image generation. Our AugVAE achieves state-of-the-art reconstruction performance on the ImageNet1K validation set, along with robustness to unseen images in the wild. Unlike other models, BiART can distinguish between an image (or text) used as a conditional reference and one used as a generation target. L-Verse can be used directly for image-to-text or text-to-image generation without any finetuning or extra object detection framework. In quantitative and qualitative experiments, L-Verse shows impressive results against previous methods in both image-to-text and text-to-image generation on MS-COCO Captions. We furthermore assess the scalability of the L-Verse architecture on Conceptual Captions and present an initial result of bidirectional vision-language representation learning in the general domain.
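To make the tokenization step mentioned in the abstract concrete, the following is a minimal sketch of generic vector quantization: each encoder feature vector is replaced by the index of its nearest codebook entry, so an image becomes a sequence of discrete tokens that a transformer can model. This is not the AugVAE implementation; the function names (`quantize`, `dequantize`), the tiny 2-dimensional codebook, and the example features are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each feature vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: sq_dist(f, codebook[k]))
        for f in features
    ]


def dequantize(tokens: Sequence[int],
               codebook: Sequence[Sequence[float]]) -> List[List[float]]:
    """Look up the codebook vector for each token (the decoder's input)."""
    return [list(codebook[t]) for t in tokens]


# Hypothetical codebook with 4 entries of dimension 2 (real codebooks are
# far larger and are learned jointly with the encoder and decoder).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Hypothetical encoder output for a 2x2 feature grid, flattened row by row.
features = [[0.1, -0.2], [0.9, 0.8], [0.2, 1.1], [0.7, 0.1]]

tokens = quantize(features, codebook)
print(tokens)  # [0, 3, 2, 1]
```

The resulting token sequence can be concatenated with text tokens and fed to an auto-regressive transformer; `dequantize` recovers the (lossy) feature vectors that a decoder would turn back into pixels.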

Code Repositories

tgisaturday/L-Verse (official, PyTorch)

Benchmarks

| Benchmark | Methodology | Metrics |
|---|---|---|
| image-captioning-on-coco-captions | L-Verse | BLEU-4: 39.9; METEOR: 31.4; ROUGE-L: 60.4; SPICE: 23.3 |
| image-reconstruction-on-imagenet-256x256 | AugVAE-SL | FID: 3.28 |
| image-reconstruction-on-imagenet-256x256 | AugVAE-ML | FID: 1.04 |
| text-to-image-generation-on-coco | L-Verse | FID: 45.8; FID-1: 41.9; FID-2: 35.5; FID-4: 30.2; FID-8: 29.83 |
| text-to-image-generation-on-coco | L-Verse-CC | FID: 37.2; FID-1: 31.6; FID-2: 25.7; FID-4: 21.4; FID-8: 21.1 |
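
Several of the benchmark entries report FID. For reference, the standard definition of the Fréchet Inception Distance compares Gaussian fits to the feature statistics of two image sets; lower is better. The symbols below are notation introduced here, not taken from this page, and the page does not state how its scores (including the FID-1 through FID-8 variants) were computed.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Here $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the mean and covariance of the feature vectors extracted from the reference images and from the generated (or reconstructed) images, respectively.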
