6 months ago

Abstract

Fine-grained visual categorization (FGVC) aims to automatically recognize objects from different sub-ordinate categories. Despite attracting considerable attention from both academia and industry, it remains a challenging task due to subtle visual differences among different classes. Cross-layer feature aggregation and crossimage pairwise learning become prevailing in improving the performance of FGVC by extracting discriminative class-specific features. However, they are still inefficient to fully use the cross-layer information based on the simple aggregation strategy, while existing pairwise learning methods also fail to explore long-range interactions between different images. To address these problems, we propose a novel Alignment Enhancement Network (AENet), including two-level alignments, Cross-layer Alignment (CLA) and Cross-image Alignment (CIA). The CLA module exploits the cross-layer relationship between low-level spatial information and highlevel semantic information, which contributes to cross-layer feature aggregation to improve the capacity of feature representation for input images. The new CIA module is further introduced to produce the aligned feature map, which can enhance the relevant information as well as suppress the irrelevant information across the whole spatial region. Our method is based on an underlying assumption that the aligned feature map should be closer to the inputs of CIA when they belong to the same category. Accordingly, we establish Semantic Affinity Loss to supervise the feature alignment within each CIA block. Experimental results on four challenging datasets show that the proposed AENet achieves the state-of-the-art results over prior arts.

Source PDF