
Abstract

Humans tend to mine objects by learning from a group of images or several frames of video, since we live in a dynamic world. In computer vision, much research focuses on co-segmentation (CoS), co-saliency detection (CoSD) and video salient object detection (VSOD) to discover co-occurring objects. However, previous approaches design separate networks for these similar tasks, and the networks are difficult to transfer from one task to another, which limits the transferability of deep learning frameworks. In addition, they fail to take full advantage of the inter- and intra-feature cues within a group of images. In this paper, we introduce a unified framework to tackle these issues, termed UFO (Unified Framework for Co-Object Segmentation). Specifically, we first introduce a transformer block, which views each image feature as a patch token and then captures the long-range dependencies among tokens through the self-attention mechanism. This helps the network uncover patch-structured similarities among the relevant objects. Furthermore, we propose an intra-MLP learning module that produces a self-mask to help the network avoid partial activation. Extensive experiments on four CoS benchmarks (PASCAL, iCoseg, Internet and MSRC), three CoSD benchmarks (Cosal2015, CoSOD3k and CocA) and four VSOD benchmarks (DAVIS16, FBMS, ViSal and SegV2) show that our method outperforms other state-of-the-art methods on all three tasks in both accuracy and speed while using the same network architecture, which can reach 140 FPS in real time.
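
As a rough illustration of the self-attention step mentioned in the abstract, the sketch below treats one feature vector per image as a token and applies standard scaled dot-product attention across a group of images. It is not the UFO implementation: the function names, the group size, the feature dimension and the random projection matrices are all assumptions chosen only to make the example runnable.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a set of tokens.

    tokens: array of shape (n_tokens, dim), here one token per image.
    Returns the attended tokens and the (n_tokens, n_tokens) attention weights.
    """
    q = tokens @ w_q
    k = tokens @ w_k
    v = tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights

# Hypothetical sizes: a group of 4 images, each with an 8-dimensional feature.
rng = np.random.default_rng(0)
n_images, dim = 4, 8
tokens = rng.standard_normal((n_images, dim))

# Random projections stand in for learned weights (an assumption).
w_q, w_k, w_v = (rng.standard_normal((dim, dim)) * 0.1 for _ in range(3))

out, weights = self_attention(tokens, w_q, w_k, w_v)
```

Each row of `weights` indicates how strongly one image's token attends to every token in the group, which is the mechanism by which features shared across images can reinforce one another.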
