
Abstract

Hateful memes are a growing menace on social media. While the image and its corresponding text in a meme are related, they do not necessarily convey the same meaning when viewed individually. Hence, detecting hateful memes requires careful consideration of both visual and textual information. Multimodal pre-training can be beneficial for this task because it effectively captures the relationship between the image and the text by representing them in a similar feature space. Furthermore, it is essential to model the interactions between the image and text features through intermediate fusion. Most existing methods employ either multimodal pre-training or intermediate fusion, but not both. In this work, we propose the Hate-CLIPper architecture, which explicitly models the cross-modal interactions between the image and text representations obtained using Contrastive Language-Image Pre-training (CLIP) encoders via a feature interaction matrix (FIM). A simple classifier based on the FIM representation is able to achieve state-of-the-art performance on the Hateful Memes Challenge (HMC) dataset with an AUROC of 85.8, which even surpasses the human performance of 82.65. Experiments on other meme datasets such as Propaganda Memes and TamilMemes also demonstrate the generalizability of the proposed approach. Finally, we analyze the interpretability of the FIM representation and show that cross-modal interactions can indeed facilitate the learning of meaningful concepts. The code for this work is available at https://github.com/gokulkarthik/hateclipper.
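To make the idea of a feature interaction matrix more concrete, the following is a minimal sketch, not the authors' implementation. It assumes the FIM can be formed as the outer product of an image feature vector and a text feature vector, and that the "simple classifier" is a logistic layer over the flattened matrix. The function names, the tiny feature dimension, and the random vectors standing in for CLIP encoder outputs are all hypothetical.

```python
import math
import random


def feature_interaction_matrix(image_feat, text_feat):
    """Outer product: entry (i, j) is image_feat[i] * text_feat[j].

    Assumption: this stands in for the FIM described in the abstract.
    """
    return [[i_val * t_val for t_val in text_feat] for i_val in image_feat]


def classify(fim, weights, bias=0.0):
    """Logistic classifier over the flattened FIM; returns a value in (0, 1)."""
    flat = [value for row in fim for value in row]
    logit = sum(w * v for w, v in zip(weights, flat)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


random.seed(0)
DIM = 4  # illustrative only; real CLIP embeddings are far larger

# Random stand-ins for the image and text embeddings (hypothetical data).
image_feat = [random.gauss(0.0, 1.0) for _ in range(DIM)]
text_feat = [random.gauss(0.0, 1.0) for _ in range(DIM)]

fim = feature_interaction_matrix(image_feat, text_feat)

# Untrained, randomly initialised weights: one per FIM entry.
weights = [random.gauss(0.0, 0.01) for _ in range(DIM * DIM)]
prob = classify(fim, weights)
```

Under this assumption, every entry of the matrix pairs one image feature with one text feature, so the classifier can weight each cross-modal pair separately rather than seeing only a concatenation or an element-wise product of the two vectors.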
