| ADDS(ViT-L-336, resolution 1344) | 93.54 | Open Vocabulary Multi-Label Classification with Dual-Modal Decoder on Aligned Visual-Textual Features | - |
| ADDS(ViT-L-336, resolution 640) | 93.41 | Open Vocabulary Multi-Label Classification with Dual-Modal Decoder on Aligned Visual-Textual Features | - |
| ADDS(ViT-L-336, resolution 336) | 91.76 | Open Vocabulary Multi-Label Classification with Dual-Modal Decoder on Aligned Visual-Textual Features | - |
| ML-Decoder(TResNet-XL, resolution 640) | 91.4 | ML-Decoder: Scalable and Versatile Classification Head | |
| Q2L-CvT(ImageNet-21K pretraining, resolution 384) | 91.3 | Query2Label: A Simple Transformer Way to Multi-Label Classification | |
| MLD-TResNet-L-AAM[640x640] | 91.30 | Combining Metric Learning and Attention Heads For Accurate and Efficient Multilabel Image Classification | |
| ML-Decoder(TResNet-L, resolution 640) | 91.1 | ML-Decoder: Scalable and Versatile Classification Head | |
| Q2L-SwinL(ImageNet-21K pretraining, resolution 384) | 90.5 | Query2Label: A Simple Transformer Way to Multi-Label Classification | |
| IDA-SwinL | 90.3 | Causality Compensated Attention for Contextual Biased Visual Recognition | - |
| CCD-SwinL | 90.3 | Contextual Debiasing for Visual Recognition With Causal Mechanisms | - |
| Q2L-TResL(ImageNet-21K pretraining, resolution 640) | 90.3 | Query2Label: A Simple Transformer Way to Multi-Label Classification | |
| MlTr-XL(ImageNet-21K pretraining, resolution 384) | 90.0 | MlTr: Multi-label Classification with Transformer | |
| TResNet-L-V2, (ImageNet-21K-P pretraining, resolution 640) | 89.8 | ImageNet-21K Pretraining for the Masses | |
| MlTr-L(ImageNet-21K pretraining, resolution 384) | 88.5 | MlTr: Multi-label Classification with Transformer | |
| TResNet-XL (resolution 640) | 88.4 | Asymmetric Loss For Multi-Label Classification | |
| TResNet-L-V2, (ImageNet-21K-P pretraining, resolution 448) | 88.4 | ImageNet-21K Pretraining for the Masses | |
| GKGNet(resolution 576) | 87.7 | GKGNet: Group K-Nearest Neighbor based Graph Convolutional Network for Multi-Label Image Recognition | |
| M3TR(ImageNet-21K-P pretraining, resolution 448) | 87.5 | M3TR: Multi-modal Multi-label Recognition with Transformer | - |
| GKGNet(resolution 448) | 86.7 | GKGNet: Group K-Nearest Neighbor based Graph Convolutional Network for Multi-Label Image Recognition | |
| TResNet-L (resolution 448) | 86.6 | Asymmetric Loss For Multi-Label Classification | |