| image-sentence-alignment-on-valse | ViLBERT 12-in-1 | Average Accuracy: 63.2; average pairwise accuracy: 75.1 |
| image-sentence-alignment-on-valse | LXMERT | Average Accuracy: 53.5; average pairwise accuracy: 59.6 |
| image-sentence-alignment-on-valse | CLIP | average pairwise accuracy: 64.0 |
| image-sentence-alignment-on-valse | ViLBERT | Average Accuracy: 51.3; average pairwise accuracy: 63.7 |
| image-sentence-alignment-on-valse | VisualBERT | Average Accuracy: 48.8; average pairwise accuracy: 46.4 |
| image-sentence-alignment-on-valse | GPT1 | average pairwise accuracy: 60.7 |
| image-sentence-alignment-on-valse | GPT2 | average pairwise accuracy: 60.1 |
| image-sentence-alignment-on-valse-actant-swap | LXMERT | Accuracy (%): 48.5; pairwise accuracy: 45.8 |
| image-sentence-alignment-on-valse-actant-swap | CLIP | |
| image-sentence-alignment-on-valse-actant-swap | ViLBERT 12-in-1 | Accuracy (%): 52.2; pairwise accuracy: 58.9 |
| image-sentence-alignment-on-valse-actant-swap | VisualBERT | Accuracy (%): 49.7; pairwise accuracy: 44.4 |
| image-sentence-alignment-on-valse-actant-swap | GPT2 | |
| image-sentence-alignment-on-valse-actant-swap | ViLBERT | Accuracy (%): 50.4; pairwise accuracy: 68.3 |
| image-sentence-alignment-on-valse-actant-swap | GPT1 | |
| image-sentence-alignment-on-valse-action | GPT2 | |
| image-sentence-alignment-on-valse-action | VisualBERT | Accuracy (%): 48.8; pairwise accuracy: 49.2 |
| image-sentence-alignment-on-valse-action | GPT1 | |
| image-sentence-alignment-on-valse-action | LXMERT | Accuracy (%): 51.1; pairwise accuracy: 54.8 |
| image-sentence-alignment-on-valse-action | ViLBERT | Accuracy (%): 52.6; pairwise accuracy: 70.7 |
| image-sentence-alignment-on-valse-action | CLIP | |
| image-sentence-alignment-on-valse-action | ViLBERT 12-in-1 | Accuracy (%): 57.3; pairwise accuracy: 65.9 |
| image-sentence-alignment-on-valse-coreference | ViLBERT 12-in-1 | Accuracy (%): 54.4; pairwise accuracy: 75.7 |
| image-sentence-alignment-on-valse-coreference | CLIP | |
| image-sentence-alignment-on-valse-coreference | LXMERT | Accuracy (%): 49.8; pairwise accuracy: 46.8 |
| image-sentence-alignment-on-valse-coreference | ViLBERT | Accuracy (%): 50.0; pairwise accuracy: 47.2 |
| image-sentence-alignment-on-valse-coreference | VisualBERT | Accuracy (%): 50.0; pairwise accuracy: 49.5 |
| image-sentence-alignment-on-valse-coreference | GPT1 | |
| image-sentence-alignment-on-valse-coreference | GPT2 | |
| image-sentence-alignment-on-valse-coreference-1 | VisualBERT | Accuracy (%): 50.0; pairwise accuracy: 47.6 |
| image-sentence-alignment-on-valse-coreference-1 | ViLBERT 12-in-1 | Accuracy (%): 54.3; pairwise accuracy: 69.2 |
| image-sentence-alignment-on-valse-coreference-1 | GPT1 | |
| image-sentence-alignment-on-valse-coreference-1 | CLIP | |
| image-sentence-alignment-on-valse-coreference-1 | GPT2 | |
| image-sentence-alignment-on-valse-coreference-1 | LXMERT | Accuracy (%): 49.0; pairwise accuracy: 44.2 |
| image-sentence-alignment-on-valse-coreference-1 | ViLBERT | Accuracy (%): 50.0; pairwise accuracy: 48.1 |
| image-sentence-alignment-on-valse-counting | LXMERT | Accuracy (%): 52.0; pairwise accuracy: 62.2 |
| image-sentence-alignment-on-valse-counting | ViLBERT 12-in-1 | Accuracy (%): 64.9; pairwise accuracy: 76.7 |
| image-sentence-alignment-on-valse-counting | GPT2 | |
| image-sentence-alignment-on-valse-counting | VisualBERT | Accuracy (%): 48.3; pairwise accuracy: 48.2 |
| image-sentence-alignment-on-valse-counting | CLIP | |
| image-sentence-alignment-on-valse-counting | GPT1 | |
| image-sentence-alignment-on-valse-counting | ViLBERT | Accuracy (%): 50.7; pairwise accuracy: 58.6 |
| image-sentence-alignment-on-valse-counting-1 | VisualBERT | Accuracy (%): 47.8; pairwise accuracy: 48.2 |
| image-sentence-alignment-on-valse-counting-1 | ViLBERT | Accuracy (%): 50.6; pairwise accuracy: 62.9 |
| image-sentence-alignment-on-valse-counting-1 | CLIP | |
| image-sentence-alignment-on-valse-counting-1 | ViLBERT 12-in-1 | Accuracy (%): 69.2; pairwise accuracy: 80.2 |
| image-sentence-alignment-on-valse-counting-1 | GPT1 | |
| image-sentence-alignment-on-valse-counting-1 | LXMERT | Accuracy (%): 55.4; pairwise accuracy: 69.2 |
| image-sentence-alignment-on-valse-counting-1 | GPT2 | |
| image-sentence-alignment-on-valse-counting-2 | ViLBERT | Accuracy (%): 51.8; pairwise accuracy: 73.7 |
| image-sentence-alignment-on-valse-counting-2 | GPT1 | |
| image-sentence-alignment-on-valse-counting-2 | CLIP | |
| image-sentence-alignment-on-valse-counting-2 | GPT2 | |
| image-sentence-alignment-on-valse-counting-2 | LXMERT | Accuracy (%): 49.9; pairwise accuracy: 42.6 |
| image-sentence-alignment-on-valse-counting-2 | VisualBERT | Accuracy (%): 50.0; pairwise accuracy: 50.0 |
| image-sentence-alignment-on-valse-counting-2 | ViLBERT 12-in-1 | Accuracy (%): 66.7; pairwise accuracy: 77.3 |
| image-sentence-alignment-on-valse-existence | VisualBERT | Accuracy (%): 49.3; pairwise accuracy: 39.7 |
| image-sentence-alignment-on-valse-existence | LXMERT | Accuracy (%): 55.8; pairwise accuracy: 78.6 |
| image-sentence-alignment-on-valse-existence | CLIP | |
| image-sentence-alignment-on-valse-existence | ViLBERT 12-in-1 | Accuracy (%): 89.0; pairwise accuracy: 95.6 |
| image-sentence-alignment-on-valse-existence | ViLBERT | Accuracy (%): 2.4; pairwise accuracy: 66.5 |
| image-sentence-alignment-on-valse-existence | GPT1 | |
| image-sentence-alignment-on-valse-existence | GPT2 | |
| image-sentence-alignment-on-valse-foil-it | GPT2 | |
| image-sentence-alignment-on-valse-foil-it | ViLBERT 12-in-1 | Accuracy (%): 71.5; pairwise accuracy: 86.9 |
| image-sentence-alignment-on-valse-foil-it | GPT1 | |
| image-sentence-alignment-on-valse-foil-it | LXMERT | Accuracy (%): 70.8; pairwise accuracy: 87.1 |
| image-sentence-alignment-on-valse-foil-it | VisualBERT | Accuracy (%): 46.6; pairwise accuracy: 48.5 |
| image-sentence-alignment-on-valse-foil-it | ViLBERT | Accuracy (%): 55.9; pairwise accuracy: 86.9 |
| image-sentence-alignment-on-valse-foil-it | CLIP | |
| image-sentence-alignment-on-valse-plurality | ViLBERT 12-in-1 | Accuracy (%): 62.0; pairwise accuracy: 72.4 |
| image-sentence-alignment-on-valse-plurality | LXMERT | Accuracy (%): 55.1; pairwise accuracy: 64.4 |
| image-sentence-alignment-on-valse-plurality | CLIP | |
| image-sentence-alignment-on-valse-plurality | GPT1 | |
| image-sentence-alignment-on-valse-plurality | ViLBERT | Accuracy (%): 50.3; pairwise accuracy: 61.2 |
| image-sentence-alignment-on-valse-plurality | VisualBERT | Accuracy (%): 46.5; pairwise accuracy: 45.7 |
| image-sentence-alignment-on-valse-plurality | GPT2 | |
| image-sentence-alignment-on-valse-spatial | VisualBERT | Accuracy (%): 49.3; pairwise accuracy: 39.7 |
| image-sentence-alignment-on-valse-spatial | GPT2 | |
| image-sentence-alignment-on-valse-spatial | CLIP | |
| image-sentence-alignment-on-valse-spatial | LXMERT | Accuracy (%): 50.8; pairwise accuracy: 60.2 |
| image-sentence-alignment-on-valse-spatial | ViLBERT | Accuracy (%): 49.9; pairwise accuracy: 57.2 |
| image-sentence-alignment-on-valse-spatial | ViLBERT 12-in-1 | Accuracy (%): 53.4; pairwise accuracy: 67.7 |
| image-sentence-alignment-on-valse-spatial | GPT1 | |