7 months ago

Image Classification

Multimodal Representation

Computer Vision

Marçal Rusiñol Mickael Coustaty Ziheng Ming Souhail Bakkali

Abstract

The topic of text document image classification has been explored extensively over the past few years. Most recent approaches handled this task by jointly learning the visual features of document images and their corresponding textual contents. Due to the various structures of document images, the extraction of semantic information from its textual content is beneficial for document image processing tasks such as document retrieval, information extraction, and text classification. In this work, a two-stream neural architecture is proposed to perform the document image classification task. We conduct an exhaustive investigation of nowadays widely used neural networks as well as word embedding procedures used as backbones, in order to extract both visual and textual features from document images. Moreover, a joint feature learning approach that combines image features and text embeddings is introduced as a late fusion methodology. Both the theoretical analysis and the experimental results demonstrate the superiority of our proposed joint feature learning method comparatively to the single modalities. This joint learning approach outperforms the state-of-the-art results with a classification accuracy of 97.05% on the large-scale RVL-CDIP dataset.

Build AI with AI

From idea to launch — accelerate your AI development with free AI co-coding, out-of-the-box environment and best price of GPUs.

AI Co-coding

Ready-to-use GPUs

Best Pricing

Get Started View Pricing

HyperAI Newsletters

Subscribe to our latest updates

We will deliver the latest updates of the week to your inbox at nine o'clock every Monday morning

Powered by MailChimp

7 months ago

Image Classification

Multimodal Representation

Computer Vision

Marçal Rusiñol Mickael Coustaty Ziheng Ming Souhail Bakkali

Abstract

The topic of text document image classification has been explored extensively over the past few years. Most recent approaches handled this task by jointly learning the visual features of document images and their corresponding textual contents. Due to the various structures of document images, the extraction of semantic information from its textual content is beneficial for document image processing tasks such as document retrieval, information extraction, and text classification. In this work, a two-stream neural architecture is proposed to perform the document image classification task. We conduct an exhaustive investigation of nowadays widely used neural networks as well as word embedding procedures used as backbones, in order to extract both visual and textual features from document images. Moreover, a joint feature learning approach that combines image features and text embeddings is introduced as a late fusion methodology. Both the theoretical analysis and the experimental results demonstrate the superiority of our proposed joint feature learning method comparatively to the single modalities. This joint learning approach outperforms the state-of-the-art results with a classification accuracy of 97.05% on the large-scale RVL-CDIP dataset.

Build AI with AI

From idea to launch — accelerate your AI development with free AI co-coding, out-of-the-box environment and best price of GPUs.

AI Co-coding

Ready-to-use GPUs

Best Pricing

Get Started View Pricing

HyperAI Newsletters

Subscribe to our latest updates

We will deliver the latest updates of the week to your inbox at nine o'clock every Monday morning

Powered by MailChimp

Visual and Textual Deep Feature Fusion for Document Image Classification | Papers | HyperAI