CoordViT: A Novel Method to Improve Vision Transformer-Based Speech Emotion Recognition Using Coordinate Information Concatenation
Seung-Ho Lee, Jeongyoon Kim
Abstract
Recently, in speech emotion recognition, a Transformer-based method using spectrogram images instead of raw sound data achieved higher accuracy than Convolutional Neural Networks (CNNs). Vision Transformer (ViT), a Transformer-based method, attains high classification accuracy by operating on patches divided from the input image, but it has the problem that pixel position information is not retained through embedding layers such as linear projection. Therefore, in this paper, we propose a novel method to improve ViT-based speech emotion recognition using coordinate information concatenation. Since the proposed method retains pixel position information by concatenating coordinate information to the input image, accuracy on CREMA-D is greatly improved, reaching 82.96% and surpassing the previous state of the art on CREMA-D. These results demonstrate that the coordinate information concatenation proposed in this paper is effective not only for CNNs but also for Transformers.
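The core idea, concatenating per-pixel coordinate maps to the input so that position information survives the patch embedding, can be sketched in a few lines. The snippet below is a minimal illustration assuming PyTorch; the `CoordConcat` module and the normalization range are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn


class CoordConcat(nn.Module):
    """Concatenate normalized (x, y) coordinate channels to an image tensor.

    Illustrative sketch of coordinate information concatenation: two extra
    channels encode each pixel's position, so that position information is
    still present after a position-agnostic embedding such as the linear
    projection in a ViT patch embedding.
    """

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x.shape
        # Coordinate grids normalized to [-1, 1] (assumed range).
        ys = torch.linspace(-1.0, 1.0, h, device=x.device, dtype=x.dtype)
        xs = torch.linspace(-1.0, 1.0, w, device=x.device, dtype=x.dtype)
        yy, xx = torch.meshgrid(ys, xs, indexing="ij")
        # Stack into (2, H, W), then broadcast across the batch: (B, 2, H, W).
        coords = torch.stack([xx, yy]).expand(b, -1, -1, -1)
        # Append the coordinate channels to the input: (B, C+2, H, W).
        return torch.cat([x, coords], dim=1)


if __name__ == "__main__":
    # Example: a batch of single-channel spectrogram images.
    spec = torch.randn(4, 1, 224, 224)
    out = CoordConcat()(spec)
    print(out.shape)  # torch.Size([4, 3, 224, 224])
```

A ViT consuming this output only needs its patch-embedding layer configured for the extra input channels (e.g., 3 instead of 1 for single-channel spectrograms); the rest of the architecture is unchanged.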
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| speech-emotion-recognition-on-crema-d | CoordViT | Accuracy: 82.96% |