Cross-Enhancement Transformer for Action Segmentation

Jiahui Wang, Zhenyou Wang, Shanna Zhuang, Hui Wang

Abstract
Temporal convolutions have been the paradigm of choice in action segmentation, enlarging long-term receptive fields by stacking convolution layers. However, deep stacks of layers lose the local information necessary for frame recognition. To solve this problem, a novel encoder-decoder structure, the Cross-Enhancement Transformer, is proposed in this paper. Our approach enables effective learning of temporal structure representations through an interactive self-attention mechanism: the convolutional feature maps of each encoder layer are concatenated with the corresponding set of decoder features produced via self-attention, so that local and global information are used simultaneously across a series of frame actions. In addition, a new loss function is proposed that enhances training by penalizing over-segmentation errors. Experiments show that our framework achieves state-of-the-art performance on three challenging datasets: 50Salads, Georgia Tech Egocentric Activities (GTEA), and Breakfast.
Benchmarks
| Benchmark | Methodology | F1@10 (%) | F1@25 (%) | F1@50 (%) | Edit | Acc (%) | Avg F1 |
|---|---|---|---|---|---|---|---|
| action-segmentation-on-50-salads-1 | CETNet | 87.6 | 86.5 | 80.1 | 81.7 | 86.9 | — |
| action-segmentation-on-breakfast-1 | CETNet | 79.3 | 74.3 | 61.9 | 77.8 | 74.9 | 71.8 |
| action-segmentation-on-gtea-1 | CETNet | 91.8 | 91.2 | 81.3 | 87.9 | 80.3 | — |