Multimodal Fusion via Teacher-Student Network for Indoor Action Recognition

Bruce X.B. Yu, Yan Liu, Keith C.C. Chan

Abstract

Indoor action recognition plays an important role in modern society, for example in intelligent healthcare in large mobile cabin hospitals. With the wide adoption of depth sensors such as Kinect, multimodal information including the skeleton and RGB modalities offers a promising way to improve recognition performance. However, existing methods either focus on a single data modality or fail to take full advantage of multiple modalities. In this paper, we propose a Teacher-Student Multimodal Fusion (TSMF) model that fuses the skeleton and RGB modalities at the model level for indoor action recognition. In TSMF, a teacher network transfers the structural knowledge of the skeleton modality to a student network for the RGB modality. Extensive experiments on two benchmark datasets, NTU RGB+D and PKU-MMD, show that the proposed TSMF consistently outperforms state-of-the-art single-modal and multimodal methods. The results also indicate that TSMF not only improves the accuracy of the student network but also significantly improves the ensemble accuracy.
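The abstract describes a teacher-student scheme in which a skeleton-modality teacher guides an RGB-modality student, and an ensemble of the two streams. The paper's exact objective is not reproduced on this page; the following is a minimal generic sketch of such a setup, assuming a standard soft-target distillation loss and weighted score-level fusion (the function names, temperature, and weighting scheme are illustrative, not taken from the paper):

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Blend hard-label cross-entropy on the student with a soft-target
    term that pulls the student's distribution toward the teacher's."""
    # Soft-target term: cross-entropy between softened distributions,
    # scaled by T^2 as is conventional in distillation.
    p_teacher = softmax(teacher_logits, temperature)
    log_p_student_t = np.log(softmax(student_logits, temperature) + 1e-12)
    soft = -(p_teacher * log_p_student_t).sum(axis=-1).mean() * temperature**2
    # Hard-label term: ordinary cross-entropy against ground truth.
    log_p_student = np.log(softmax(student_logits) + 1e-12)
    hard = -log_p_student[np.arange(len(labels)), labels].mean()
    return alpha * soft + (1 - alpha) * hard

def late_fusion(teacher_logits, student_logits, weight=0.5):
    """Ensemble the skeleton (teacher) and RGB (student) streams by
    weighted averaging of their class probabilities."""
    return (weight * softmax(teacher_logits)
            + (1 - weight) * softmax(student_logits))
```

In this sketch the teacher (skeleton stream) would typically be trained first and then frozen while the student (RGB stream) is trained with `distillation_loss`; at test time `late_fusion` combines the two streams' scores, which is one common way an ensemble of this kind can outperform either stream alone.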

Benchmarks

Action Recognition in Videos on NTU RGB+D — TSMF (RGB + Pose): Accuracy (CS) 92.5, Accuracy (CV) 97.4
Action Recognition in Videos on PKU-MMD — TSMF: X-Sub 95.8, X-View 97.8
