
Abstract
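We introduce NOVIC, a real-time unconstrained open-vocabulary image classifier that uses an autoregressive transformer to generate its classification labels as language. Leveraging the broad knowledge of CLIP models, NOVIC exploits the embedding space to enable zero-shot transfer from pure text to images. Although conventional CLIP models are capable of open-vocabulary classification, they require an exhaustive prompt of candidate class labels, which restricts them to images whose content or context is already known. To address this, we propose an "object decoder" model trained on a large-scale dataset of 92 million targets, made up of templated object noun sets and captions generated by a large language model (LLM), so that it always outputs the object noun in question. This effectively inverts the CLIP text encoder and allows textual object labels from essentially the entire English language to be generated directly from image-derived embedding vectors, without any prior knowledge of the potential content of an image and without any label bias. The trained decoders are tested on a mix of manually curated and web-curated datasets, as well as standard image classification benchmarks, and achieve fine-grained prompt-free prediction scores of up to 87.5%. This is a strong result given that the model must work for any conceivable image and receives no contextual clues.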
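To make the limitation of prompt-based classification concrete, the following is a minimal sketch of the conventional approach that the abstract contrasts NOVIC with: the image embedding is compared against the embeddings of a fixed list of candidate labels, and the closest one wins. It is not code from the NOVIC repository. The function names, the three-dimensional toy vectors, and the labels are all hypothetical stand-ins for real CLIP encoder outputs.

```python
import numpy as np


def normalize(v):
    """Scale vectors to unit length along the last axis."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v, axis=-1, keepdims=True)


def classify_with_prompts(image_embedding, label_embeddings, labels):
    """Return the candidate label whose embedding is most similar to the image.

    The result can only ever be one of `labels`, which is the restriction
    the abstract attributes to prompt-based CLIP classification.
    """
    image_embedding = normalize(image_embedding)
    label_embeddings = normalize(label_embeddings)
    similarities = label_embeddings @ image_embedding  # cosine similarities
    return labels[int(np.argmax(similarities))]


# Toy 3-dimensional stand-ins for real encoder outputs (hypothetical values).
toy_text_embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.1, 0.9, 0.0],
    "teapot": [0.0, 0.1, 0.9],
}
toy_image_embedding = [0.05, 0.15, 0.95]  # pretend this image shows a teapot

# Case 1: the correct label is among the candidates.
all_labels = list(toy_text_embeddings)
all_vectors = [toy_text_embeddings[name] for name in all_labels]
full_prediction = classify_with_prompts(toy_image_embedding, all_vectors, all_labels)

# Case 2: "teapot" is missing from the candidate list, so the classifier
# must still pick one of the remaining labels, however poor the match.
limited_labels = ["cat", "dog"]
limited_vectors = [toy_text_embeddings[name] for name in limited_labels]
limited_prediction = classify_with_prompts(
    toy_image_embedding, limited_vectors, limited_labels
)

print(full_prediction, limited_prediction)
```

With these toy values the script prints `teapot dog`: once the correct label is left out of the candidate list, the prompt-based classifier falls back to the nearest wrong label. An object decoder that generates the noun directly from the image embedding, as described in the abstract, does not need such a list.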
Code Repositories
pallgeuer/object_noun_dictionary (official; mentioned in GitHub)
pallgeuer/novic (official; PyTorch; mentioned in GitHub)
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| image-classification-on-imagenet | DFN-5B H/14-378 + PrefixedIter Decoder | Top 1 Accuracy: 88.21% |
| image-classification-on-imagenet | SigLIP B/16 + PrefixedIter Decoder | Top 1 Accuracy: 83.46% |
| open-vocabulary-image-classification-on-ovic | DFN-5B H/14-378 + PrefixedIter Decoder (FT2) | Overall Score: 87.13; Prediction Score: 87.94; Prediction Score (mean of 3): 87.08; Top 1 Accuracy: 86.77 |
| open-vocabulary-image-classification-on-ovic | DFN-5B H/14-378 + PrefixedIter Decoder (FT0) | Overall Score: 87.90; Prediction Score: 88.27; Prediction Score (mean of 3): 86.41; Top 1 Accuracy: 86.95 |
| open-vocabulary-image-classification-on-ovic | SigLIP SO/14 + PrefixedIter Decoder (FT2) | Prediction Score (mean of 3): 87.49 |
| open-vocabulary-image-classification-on-ovic-1 | SigLIP B/16 + PrefixedIter Decoder (FT2) | Prediction Score (mean of 3): 72.03 |
| open-vocabulary-image-classification-on-ovic-1 | DFN-5B H/14-378 + PrefixedIter Decoder (FT0) | Overall Score: 78.21; Prediction Score: 79.18; Top 1 Accuracy: 77.10 |
| open-vocabulary-image-classification-on-ovic-1 | DFN-5B H/14-378 + PrefixedIter Decoder (FT2) | Overall Score: 79.02; Prediction Score: 80.13; Top 1 Accuracy: 77.05 |
| open-vocabulary-image-classification-on-ovic-2 | SigLIP B/16 + PrefixedIter Decoder (FT2) | Prediction Score (mean of 3): 74.35; Top 1 Accuracy (mean of 3): 72.90 |
| open-vocabulary-image-classification-on-ovic-2 | SigLIP B/16 + PrefixedIter Decoder (FT6) | Prediction Score (mean of 3): 76.50; Top 1 Accuracy (mean of 3): 75.04 |
| open-vocabulary-image-classification-on-ovic-3 | DFN-5B H/14-378 + PrefixedIter Decoder (FT0) | Prediction Score (mean of 3): 74.48 |
| open-vocabulary-image-classification-on-ovic-3 | DFN-5B H/14-378 + PrefixedIter Decoder (FT2) | Prediction Score (mean of 3): 74.88 |