
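Abstract
We present CLIP-EBC, the first fully CLIP-based model for accurate crowd density estimation. Although CLIP has shown remarkable performance on recognition tasks such as zero-shot image classification, its potential for counting remains largely unexplored, mainly because of the inherent difficulty of recasting a regression problem such as counting as a recognition task. In this work, we systematically investigate and improve CLIP's counting ability, focusing on estimating crowd size from images. Existing classification-based counting frameworks have significant limitations: they quantize count values into bordering real-valued bins, and they focus solely on classification error. These practices cause label ambiguity near shared bin borders and lead to inaccurate count predictions. Applying CLIP directly within such frameworks may therefore yield suboptimal performance. To address these problems, we first propose the Enhanced Blockwise Classification (EBC) framework. Unlike previous methods, EBC uses integer-valued bins, which effectively reduces ambiguity near bin borders; it also introduces a regression loss based on density maps to further improve count prediction. Building on this backbone-agnostic EBC framework, we then construct CLIP-EBC to fully exploit CLIP's recognition capability for crowd density estimation. Extensive experiments demonstrate the effectiveness of the EBC framework and the strong performance of CLIP-EBC. Specifically, the EBC framework improves existing classification-based methods by up to 44.5% on the UCF-QNRF dataset, and CLIP-EBC achieves state-of-the-art performance on the NWPU-Crowd test set, with a mean absolute error (MAE) of 58.2 and a root mean square error (RMSE) of 268.5, improvements of 8.6% and 13.3%, respectively, over the previous best method, STEERER. Code and model weights are available at https://github.com/Yiming-M/CLIP-EBC.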
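The two ideas named in the abstract, integer-valued bins and a classification loss paired with a density-map regression term, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the bin ranges in `BINS`, the midpoint values in `ANCHORS`, the function names, and the plain L1 regression term are all assumptions chosen for illustration, and the abstract does not specify the actual loss.

```python
import numpy as np

# Hypothetical integer-valued bins: each is an inclusive [low, high] range of
# non-negative block counts. The ranges are illustrative, not taken from the paper.
BINS = [(0, 0), (1, 1), (2, 2), (3, 4), (5, 8)]
# Assumed representative value per bin: the midpoint of its range.
ANCHORS = np.array([(lo + hi) / 2 for lo, hi in BINS])


def count_to_bin(count: int) -> int:
    """Map an integer block count to a bin index; counts above the last bin are clamped."""
    for idx, (lo, hi) in enumerate(BINS):
        if lo <= count <= hi:
            return idx
    return len(BINS) - 1


def softmax_over_bins(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax along the bin axis of a (num_bins, H, W) array."""
    shifted = logits - logits.max(axis=0, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=0, keepdims=True)


def predicted_density(logits: np.ndarray) -> np.ndarray:
    """Blockwise density map: the probability-weighted average of the bin anchors."""
    return np.einsum("k,khw->hw", ANCHORS, softmax_over_bins(logits))


def combined_loss(logits: np.ndarray, block_counts: np.ndarray, weight: float = 1.0) -> float:
    """Cross-entropy on bin labels plus an L1 term on the density map (a stand-in regression loss)."""
    probs = softmax_over_bins(logits)
    labels = np.vectorize(count_to_bin)(block_counts)
    rows, cols = np.indices(block_counts.shape)
    cross_entropy = -np.log(probs[labels, rows, cols] + 1e-12).mean()
    regression = np.abs(predicted_density(logits) - block_counts).mean()
    return float(cross_entropy + weight * regression)
```

Because every integer count falls inside exactly one bin, no ground-truth value sits on a border shared by two bins, which is the ambiguity the abstract attributes to real-valued bins. The regression term penalizes the predicted count itself, so a prediction in the correct bin can still be pushed closer to the true value.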
Code Repositories
Yiming-M/CLIP-EBC (official, PyTorch; mentioned in GitHub)
Benchmarks
| Benchmark | Method | MAE | RMSE |
|---|---|---|---|
| crowd-counting-on-nwpu-crowd-val | CLIP-EBC (ViT-L/14) | 32.3 | 79.7 |
| crowd-counting-on-nwpu-crowd-val | DMCount-EBC | 39.6 | 95.8 |
| crowd-counting-on-nwpu-crowd-val | CSRNet-EBC | 42.9 | 100.1 |
| crowd-counting-on-nwpu-crowd-val | CLIP-EBC (ResNet50) | 38.6 | 90.3 |
| crowd-counting-on-nwpu-crowd-val | CLIP-EBC (ViT-B/16) | 36.6 | 81.7 |
| crowd-counting-on-shanghaitech-a | CSRNet-EBC | 66.3 | 105.0 |
| crowd-counting-on-shanghaitech-a | CLIP-EBC (ViT-B/16) | 52.5 | 85.9 |
| crowd-counting-on-shanghaitech-a | DMCount-EBC | 62.3 | 98.9 |
| crowd-counting-on-shanghaitech-a | CLIP-EBC (ResNet50) | 54.0 | 83.2 |
| crowd-counting-on-shanghaitech-b | CLIP-EBC (ViT-B/16) | 6.6 | 10.5 |
| crowd-counting-on-shanghaitech-b | CLIP-EBC (ResNet50) | 6.0 | 10.1 |
| crowd-counting-on-shanghaitech-b | CSRNet-EBC | 6.9 | 11.3 |
| crowd-counting-on-shanghaitech-b | DMCount-EBC | 7.0 | 10.9 |
| crowd-counting-on-shanghaitech-b | CLIP-EBC (ViT-L/14) | 5.9 | 9.2 |
| crowd-counting-on-ucf-qnrf | CLIP-EBC (ResNet50) | 80.5 | 136.6 |
| crowd-counting-on-ucf-qnrf | CSRNet-EBC | 79.3 | 135.8 |
| crowd-counting-on-ucf-qnrf | DMCount-EBC (32, dynamic) | 76.06 | 127.72 |
| crowd-counting-on-ucf-qnrf | DMCount-EBC | 77.2 | 130.4 |
| crowd-counting-on-ucf-qnrf | DMCount-EBC (16, dynamic) | 75.90 | 130.48 |
| crowd-counting-on-ucf-qnrf | CLIP-EBC (ViT-B/16) | 80.3 | 139.3 |