8 months ago

Multi-Task Learning

Multimodal Representation

Computer Vision

Method/Architecture

Wentao Zhu, Xiaoxuan Ma, Zhaoyang Liu, Libin Liu, Wayne Wu, Yizhou Wang

Abstract

We present a unified perspective on tackling various human-centric video tasks by learning human motion representations from large-scale and heterogeneous data resources. Specifically, we propose a pretraining stage in which a motion encoder is trained to recover the underlying 3D motion from noisy partial 2D observations. The motion representations acquired in this way incorporate geometric, kinematic, and physical knowledge about human motion, which can be easily transferred to multiple downstream tasks. We implement the motion encoder with a Dual-stream Spatio-temporal Transformer (DSTformer) neural network. It could capture long-range spatio-temporal relationships among the skeletal joints comprehensively and adaptively, exemplified by the lowest 3D pose estimation error so far when trained from scratch. Furthermore, our proposed framework achieves state-of-the-art performance on all three downstream tasks by simply finetuning the pretrained motion encoder with a simple regression head (1-2 layers), which demonstrates the versatility of the learned motion representations. Code and models are available at https://motionbert.github.io/
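The pretraining idea in the abstract, recovering 3D motion from noisy partial 2D observations, can be illustrated with a small data-preparation sketch. This is not the authors' implementation. The function names `corrupt_2d_from_3d` and `reconstruction_loss`, the orthographic projection, the Gaussian noise model, the masking probabilities, and the loss are all assumptions chosen for illustration. The motion encoder itself (DSTformer) is not shown.

```python
import numpy as np


def corrupt_2d_from_3d(motion_3d, noise_std=0.02, joint_mask_prob=0.15,
                       frame_mask_prob=0.1, rng=None):
    """Build a (noisy partial 2D input, visibility mask, clean 3D target) triple.

    motion_3d: array of shape (T, J, 3) with T frames and J joints.
    All default values are illustrative assumptions, not the paper's settings.
    """
    rng = np.random.default_rng() if rng is None else rng
    motion_3d = np.asarray(motion_3d, dtype=np.float64)
    num_frames, num_joints, _ = motion_3d.shape

    # Assumed orthographic projection: keep x and y, drop depth.
    clean_2d = motion_3d[..., :2]

    # Simulate detector jitter with additive Gaussian noise (assumption).
    noisy_2d = clean_2d + rng.normal(0.0, noise_std, size=clean_2d.shape)

    # Simulate occlusion by hiding individual joints and whole frames.
    joint_visible = rng.random((num_frames, num_joints)) >= joint_mask_prob
    frame_visible = rng.random((num_frames, 1)) >= frame_mask_prob
    visible = joint_visible & frame_visible

    # Hidden joints are zeroed, so the input is both noisy and partial.
    partial_2d = np.where(visible[..., None], noisy_2d, 0.0)
    return partial_2d, visible, motion_3d


def reconstruction_loss(pred_3d, target_3d):
    """Mean per-joint Euclidean distance between predicted and true 3D joints."""
    return float(np.linalg.norm(pred_3d - target_3d, axis=-1).mean())


# Usage: a random 16-frame, 17-joint clip stands in for real motion data.
rng = np.random.default_rng(0)
clip = rng.normal(size=(16, 17, 3))
partial_2d, visible, target_3d = corrupt_2d_from_3d(clip, rng=rng)
```

In a setup like this, an encoder would take `partial_2d` as input and be trained to minimize `reconstruction_loss` against `target_3d`. The training objective actually used in the paper may differ from this single distance term.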
