Lipreading on LRS3-TED

Evaluation Metric

Word Error Rate (WER)
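As a rough illustration of the metric, WER is conventionally computed as (substitutions + deletions + insertions) / number of reference words, using a word-level edit distance. The sketch below assumes plain whitespace tokenization and no text normalization; the function name `word_error_rate` is hypothetical, and the papers listed on this page may normalize case or punctuation differently before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent between a reference and a hypothesis transcript.

    Assumption: words are split on whitespace and compared exactly.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr

    return 100.0 * prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{wer:.2f}")
```

Lower is better. Because insertions are counted against the reference length, WER can exceed 100 when the hypothesis is much longer than the reference.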

Results

Performance of each model on this benchmark

| Model Name | Word Error Rate (WER, %) | Paper Title | Repository |
|---|---|---|---|
| Conv-seq2seq | 60.1 | Spatio-Temporal Fusion Based Convolutional Sequence Learning for Lip Reading | - |
| CTC + KD | 59.8 | ASR is all you need: cross-modal distillation for lip reading | - |
| TM-seq2seq | 58.9 | Deep Audio-Visual Speech Recognition | |
| EG-seq2seq | 57.8 | Discriminative Multi-modality Speech Recognition | |
| CTC-V2P | 55.1 | Large-Scale Visual Speech Recognition | - |
| Hyb + Conformer | 43.3 | End-to-end Audio-visual Speech Recognition with Conformers | |
| VTP | 40.6 | Sub-word Level Lip Reading With Visual Attention | - |
| ES³ Base | 40.3 | ES3: Evolving Self-Supervised Learning of Robust Audio-Visual Speech Representations | - |
| ES³ Large | 37.1 | ES3: Evolving Self-Supervised Learning of Robust Audio-Visual Speech Representations | - |
| RNN-T | 33.6 | Recurrent Neural Network Transducer for Audio-Visual Speech Recognition | |
| CTC/Attention (LRW+LRS2/3+AVSpeech) | 31.5 | Visual Speech Recognition for Multiple Languages in the Wild | |
| SyncVSR | 31.2 | SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization | |
| VTP (more data) | 30.7 | Sub-word Level Lip Reading With Visual Attention | - |
| AV-HuBERT Large | 26.9 | Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction | |
| DistillAV | 26.2 | Audio-Visual Representation Learning via Knowledge Distillation from Speech Foundation Models | |
| AV-HuBERT Large + Relaxed Attention + LM | 25.51 | Relaxed Attention for Transformer Models | |
| VSP-LLM | 25.4 | Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing | |
| RAVEn Large | 23.4 | Jointly Learning Visual and Auditory Speech Representations from Raw Data | |
| USR (self-supervised) | 22.3 | Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs | |
| SyncVSR | 21.5 | SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization | |