Pingchuan Ma, Yujiang Wang, Stavros Petridis, Jie Shen, Maja Pantic

Abstract
Several training strategies and temporal models have been recently proposed for isolated word lip-reading in a series of independent works. However, the potential of combining the best strategies and investigating the impact of each of them has not been explored. In this paper, we systematically investigate the performance of state-of-the-art data augmentation approaches, temporal models and other training strategies, like self-distillation and using word boundary indicators. Our results show that Time Masking (TM) is the most important augmentation, followed by mixup, and that Densely-Connected Temporal Convolutional Networks (DC-TCN) are the best temporal model for lip-reading of isolated words. Using self-distillation and word boundary indicators is also beneficial, but to a lesser extent. A combination of all the above methods results in a classification accuracy of 93.4%, an absolute improvement of 4.6% over the current state-of-the-art performance on the LRW dataset. The performance can be further improved to 94.1% by pre-training on additional datasets. An error analysis of the various training strategies reveals that the performance improves by increasing the classification accuracy of hard-to-recognise words.
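To make the two most impactful strategies concrete, below is a minimal PyTorch sketch of Time Masking and mixup applied to video clips. This is not the authors' implementation: the `(T, C, H, W)` clip layout, the `max_mask_len` and `alpha` values, and the function names are illustrative assumptions, not the paper's exact settings.

```python
# Minimal sketch (assumed shapes/parameters, not the paper's code) of
# Time Masking (TM) and mixup for isolated-word lip-reading training.
import torch
import torch.nn.functional as F

def time_mask(clip: torch.Tensor, max_mask_len: int = 10) -> torch.Tensor:
    """Zero out a random contiguous span of frames.

    clip: video tensor of shape (T, C, H, W); max_mask_len is an
    illustrative upper bound on the masked span, in frames.
    """
    T = clip.size(0)
    mask_len = torch.randint(1, max_mask_len + 1, (1,)).item()
    start = torch.randint(0, max(T - mask_len, 1), (1,)).item()
    clip = clip.clone()
    clip[start:start + mask_len] = 0.0  # blank the selected frames
    return clip

def mixup(x1, y1, x2, y2, num_classes: int, alpha: float = 0.4):
    """Convexly combine two clips and their one-hot labels.

    The mixing weight is drawn from Beta(alpha, alpha); alpha=0.4 is a
    common choice, assumed here for illustration.
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    x = lam * x1 + (1 - lam) * x2
    y = lam * F.one_hot(y1, num_classes).float() + \
        (1 - lam) * F.one_hot(y2, num_classes).float()
    return x, y

# Example usage on a dummy 29-frame grayscale mouth-crop clip:
clip = time_mask(torch.randn(29, 1, 88, 88))
x, y = mixup(clip, torch.tensor(3), torch.randn(29, 1, 88, 88),
             torch.tensor(7), num_classes=500)
```

Because mixup produces soft labels, the training loss would be computed against the blended one-hot targets (e.g. a soft cross-entropy) rather than hard class indices.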
Benchmarks
| Benchmark | Method | Metric |
|---|---|---|
| Lipreading on Lip Reading in the Wild (LRW) | 3D Conv + ResNet-18 + DC-TCN + KD (Ensemble & Word Boundary) | Top-1 Accuracy: 94.1% |