
Abstract

Effectiveness and interpretability are two essential properties for trustworthy AI systems. Most recent studies in visual reasoning are dedicated to improving the accuracy of predicted answers, and less attention is paid to explaining the rationales behind the decisions. As a result, they commonly take advantage of spurious biases instead of actually reasoning on the visual-textual data, and have yet to develop the capability to explain their decision making by considering key information from both modalities. This paper aims to close the gap from three distinct perspectives. First, we define a new type of multi-modal explanations that explain the decisions by progressively traversing the reasoning process and grounding keywords in the images. We develop a functional program to sequentially execute different reasoning steps and construct a new dataset with 1,040,830 multi-modal explanations. Second, we identify the critical need to tightly couple important components across the visual and textual modalities for explaining the decisions, and propose a novel explanation generation method that explicitly models the pairwise correspondence between words and regions of interest. It improves the visual grounding capability by a considerable margin, resulting in enhanced interpretability and reasoning performance. Finally, with our new data and method, we perform extensive analyses to study the effectiveness of our explanation under different settings, including multi-task learning and transfer learning. Our code and data are available at https://github.com/szzexpoi/rex.
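
To make the idea of pairwise correspondence between words and regions of interest more concrete, the following is a minimal illustrative sketch, not the method proposed in the paper. It assumes that words and image regions are already embedded in a shared vector space, scores every word-region pair with cosine similarity, and normalizes each word's scores into a distribution over regions. The function names (`word_region_correspondence`, `ground_words`) and the toy vectors are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def word_region_correspondence(word_vecs, region_vecs):
    """Return a words x regions matrix.

    Each row is a probability distribution over regions for one word
    (assumption: similarity in a shared embedding space stands in for
    a learned correspondence).
    """
    return [softmax([cosine(w, r) for r in region_vecs]) for w in word_vecs]


def ground_words(words, word_vecs, region_vecs):
    """Map each word to the index of its highest-scoring region."""
    corr = word_region_correspondence(word_vecs, region_vecs)
    return {
        word: max(range(len(row)), key=row.__getitem__)
        for word, row in zip(words, corr)
    }


# Toy example with hypothetical 2-d embeddings.
words = ["dog", "ball"]
word_vecs = [[1.0, 0.0], [0.0, 1.0]]
region_vecs = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]

corr = word_region_correspondence(word_vecs, region_vecs)
grounding = ground_words(words, word_vecs, region_vecs)
print(grounding)
```

In this toy setting, each keyword is grounded to the region whose embedding it most resembles. A learned model would replace the fixed cosine score with trained parameters, but the words-by-regions matrix is the same kind of object.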
