Mol2Lang-VLM: Vision- and Text-Guided Generative Pre-trained Language Models for Advancing Molecule Captioning through Multimodal Fusion
Duong Tran, Nhat Truong Pham, Nguyen Nguyen, and Balachandran Manavalan

Abstract
This paper introduces Mol2Lang-VLM, an enhanced method for refining generative pre-trained language models for molecule captioning using multimodal features to achieve more accurate caption generation. Our approach leverages the encoder and decoder blocks of the Transformer-based architecture by introducing a third sub-layer into each. Specifically, we insert a sub-layer in the encoder to fuse features from SELFIES strings and molecular images, while the decoder fuses features from SMILES strings and their corresponding descriptions. Moreover, cross multi-head attention is employed instead of standard multi-head attention, enabling the decoder to attend to the encoder's output and integrate the encoded contextual information for better and more accurate caption generation. Performance evaluation on the ChEBI-20 and L+M-24 benchmark datasets demonstrates Mol2Lang-VLM's superiority, achieving higher accuracy and quality in caption generation compared to existing methods. Our code and pre-processed data are available at https://github.com/nhattruongpham/mol-lang-bridge/tree/mol2lang/.
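The sketch below illustrates the core architectural idea described in the abstract: a Transformer encoder layer extended with a third sub-layer that fuses an auxiliary modality (e.g., molecular-image features) into the token stream via cross multi-head attention. This is a minimal PyTorch sketch under assumed dimensions and module names, not the authors' implementation; see the linked repository for the actual code.

```python
# Minimal sketch (assumed shapes and names, not the authors' exact code) of an
# encoder layer with a third fusion sub-layer based on cross multi-head attention.
import torch
import torch.nn as nn


class FusionEncoderLayer(nn.Module):
    """Self-attention + feed-forward sub-layers, plus a third cross-attention
    sub-layer that attends over features from another modality."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, aux):
        # 1) Self-attention over the primary sequence (e.g., SELFIES tokens).
        h, _ = self.self_attn(x, x, x)
        x = self.norm1(x + self.dropout(h))
        # 2) Third sub-layer: the sequence queries the auxiliary features
        #    (e.g., patch embeddings of a molecular image) via cross-attention.
        h, _ = self.cross_attn(x, aux, aux)
        x = self.norm2(x + self.dropout(h))
        # 3) Position-wise feed-forward network.
        x = self.norm3(x + self.dropout(self.ffn(x)))
        return x


if __name__ == "__main__":
    layer = FusionEncoderLayer()
    selfies_tokens = torch.randn(2, 64, 512)  # (batch, seq_len, d_model)
    image_patches = torch.randn(2, 49, 512)   # (batch, num_patches, d_model)
    print(layer(selfies_tokens, image_patches).shape)  # torch.Size([2, 64, 512])
```

The decoder side follows the same pattern, with an additional cross-attention over the encoder's output so the generated caption is conditioned on the fused molecular representation.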
Benchmarks
| Benchmark | Method | BLEU-2 | BLEU-4 | METEOR | ROUGE-1 | ROUGE-2 | ROUGE-L | Text2Mol |
|---|---|---|---|---|---|---|---|---|
| Molecule Captioning on ChEBI-20 | Mol2Lang-VLM | 61.2 | 52.7 | 63.3 | 67.4 | 53.2 | 61.4 | 59.8 |
| Molecule Captioning on L+M-24 | Mol2Lang-VLM | 77.7 | 56.3 | 74.1 | 78.6 | 59.1 | 56.5 | – |
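For reference, the BLEU and ROUGE scores in the table are standard caption-quality metrics. The snippet below is a minimal sketch of how such scores are commonly computed with off-the-shelf libraries (`nltk` and `rouge-score`); the paper's own evaluation scripts may differ, and the captions here are made-up examples.

```python
# Hedged example: typical BLEU-2/4 and ROUGE-1/2/L computation for a single
# generated caption against its reference. Not the paper's evaluation code.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "The molecule is a monocarboxylic acid that is the conjugate acid of acetate."
prediction = "The molecule is a monocarboxylic acid and the conjugate acid of acetate."

# BLEU-n: n-gram precision of the generated caption against the reference.
ref_tokens = [reference.split()]
pred_tokens = prediction.split()
smooth = SmoothingFunction().method1
bleu2 = sentence_bleu(ref_tokens, pred_tokens, weights=(0.5, 0.5), smoothing_function=smooth)
bleu4 = sentence_bleu(ref_tokens, pred_tokens, weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=smooth)

# ROUGE-1/2/L: unigram, bigram, and longest-common-subsequence overlap.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, prediction)

print(f"BLEU-2: {bleu2:.3f}  BLEU-4: {bleu4:.3f}")
print({k: round(v.fmeasure, 3) for k, v in rouge.items()})
```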