Abstract

We consider the challenging problem of zero-shot video object segmentation (VOS): segmenting and tracking multiple moving objects within a video fully automatically, without any manual initialization. We treat this as a grouping problem by exploiting object proposals and making a joint inference about grouping over both space and time. We propose a network architecture for tractably performing proposal selection and joint grouping. Crucially, we then show how to train this network with reinforcement learning so that it learns to perform the optimal non-myopic sequence of grouping decisions to segment the whole video. Unlike standard supervised techniques, this also enables us to directly optimize for the non-differentiable overlap-based metrics used to evaluate VOS. We show that the proposed method, which we call ALBA, outperforms the previous state-of-the-art on three benchmarks: DAVIS 2017 [2], FBMS [20] and Youtube-VOS [27].
