
Abstract
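This paper presents the Reshape Dimensions Network (ReDimNet), a novel neural network architecture for extracting utterance-level speaker representations. The approach reshapes 2D feature maps along the time-frequency dimensions to convert them into a 1D signal representation and back, which allows 1D and 2D blocks to be used together in one model. We design a network topology that preserves the channel-time-frequency volume of the outputs of both 1D and 2D blocks, which facilitates efficient aggregation of residual feature maps. ReDimNet is also scalable: we build a family of models of different sizes, ranging from 1 million to 15 million parameters and from 0.5 to 20 GMACs of computation. Experimental results show that ReDimNet achieves state-of-the-art performance in speaker recognition while significantly reducing computational complexity and the number of model parameters.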
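The 2D-to-1D reshaping described in the abstract can be illustrated with a minimal sketch. It assumes a feature map laid out as (batch, channels, frequency, time) and folds the frequency axis into the channel axis to obtain a 1D signal over time; the axis order and the helper names `to_1d` and `to_2d` are illustrative assumptions, not taken from the official repository. NumPy is used here only to keep the example dependency-light.

```python
import numpy as np


def to_1d(x2d: np.ndarray) -> np.ndarray:
    """Fold (batch, C, F, T) into (batch, C * F, T).

    Hypothetical helper: the frequency axis is merged into the channel
    axis, so a 1D block can treat the result as a multi-channel signal
    over time.
    """
    b, c, f, t = x2d.shape
    return x2d.reshape(b, c * f, t)


def to_2d(x1d: np.ndarray, channels: int) -> np.ndarray:
    """Unfold (batch, C * F, T) back into (batch, C, F, T).

    Hypothetical helper: `channels` is the number of 2D channels C;
    the frequency size F is inferred from the folded axis.
    """
    b, cf, t = x1d.shape
    if cf % channels != 0:
        raise ValueError("folded axis is not divisible by the channel count")
    return x1d.reshape(b, channels, cf // channels, t)


# Example sizes are arbitrary and chosen only for illustration.
x = np.arange(2 * 4 * 8 * 10, dtype=np.float32).reshape(2, 4, 8, 10)
y = to_1d(x)              # 1D view: 32 channels over 10 time steps
z = to_2d(y, channels=4)  # back to the 2D view

print(x.shape, y.shape, z.shape)
```

Because both directions are plain reshapes, the number of elements (the channel-time-frequency volume) is unchanged, which is what lets the outputs of 1D and 2D blocks be summed as residuals in this sketch.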
Code repository
IDRnD/ReDimNet (official, PyTorch, mentioned in GitHub)
Benchmarks
| Benchmark | Method | Metric |
|---|---|---|
| speaker-verification-on-voxceleb | ReDimNet-B2-SF2-LM (4.7M) | EER: 0.57 |
| speaker-verification-on-voxceleb | ReDimNet-B3-LM (3.0M) | EER: 0.5 |
| speaker-verification-on-voxceleb | ReDimNet-B6-SF2-LM-ASNorm (15.0M) | EER: 0.37 |
| speaker-verification-on-voxceleb | ReDimNet-B0-LM-ASNorm (1.0M) | EER: 1.07 |
| speaker-verification-on-voxceleb | ReDimNet-B3-LM-ASNorm (3.0M) | EER: 0.47 |
| speaker-verification-on-voxceleb | ReDimNet-B0-LM (1.0M) | EER: 1.16 |
| speaker-verification-on-voxceleb | ReDimNet-B4-LM-ASNorm (6.3M) | EER: 0.44 |
| speaker-verification-on-voxceleb | ReDimNet-B4-LM (6.3M) | EER: 0.51 |
| speaker-verification-on-voxceleb | ReDimNet-B5-SF2-LM-ASNorm (9.2M) | EER: 0.39 |
| speaker-verification-on-voxceleb | ReDimNet-B6-SF2-LM (15.0M) | EER: 0.4 |
| speaker-verification-on-voxceleb | ReDimNet-B1-LM (2.2M) | EER: 0.85 |
| speaker-verification-on-voxceleb | ReDimNet-B5-SF2-LM (9.2M) | EER: 0.43 |
| speaker-verification-on-voxceleb | ReDimNet-B2-SF2-LM-ASNorm (4.7M) | EER: 0.52 |
| speaker-verification-on-voxceleb | ReDimNet-B1-LM-ASNorm (2.2M) | EER: 0.73 |
| speaker-verification-on-voxceleb1 | ReDimNet-B4-LM-ASNorm (6.3M) | EER: 0.44 |
| speaker-verification-on-voxceleb1 | ReDimNet-B4-LM (6.3M) | EER: 0.51 |
| speaker-verification-on-voxceleb1 | ReDimNet-B1-LM (2.2M) | EER: 0.85 |
| speaker-verification-on-voxceleb1 | ReDimNet-B6-SF2-LM-ASNorm (15.0M) | EER: 0.37 |
| speaker-verification-on-voxceleb1 | ReDimNet-B2-SF2-LM-ASNorm (4.7M) | EER: 0.52 |
| speaker-verification-on-voxceleb1 | ReDimNet-B3-LM-ASNorm (3.0M) | EER: 0.47 |
| speaker-verification-on-voxceleb1 | ReDimNet-B5-SF2-LM (9.2M) | EER: 0.43 |
| speaker-verification-on-voxceleb1 | ReDimNet-B1-LM-ASNorm (2.2M) | EER: 0.73 |
| speaker-verification-on-voxceleb1 | ReDimNet-B5-SF2-LM-ASNorm (9.2M) | EER: 0.39 |
| speaker-verification-on-voxceleb1 | ReDimNet-B0-LM (1.0M) | EER: 1.16 |
| speaker-verification-on-voxceleb1 | ReDimNet-B3-LM (3.0M) | EER: 0.5 |
| speaker-verification-on-voxceleb1 | ReDimNet-B0-LM-ASNorm (1.0M) | EER: 1.07 |
| speaker-verification-on-voxceleb1 | ReDimNet-B6-SF2-LM (15.0M) | EER: 0.4 |
| speaker-verification-on-voxceleb1 | ReDimNet-B2-SF2-LM (4.7M) | EER: 0.57 |