
Abstract

In this paper, we build on the concept of self-supervision by taking RGB frames as input and learning to predict both action concepts and auxiliary descriptors, e.g., object descriptors. So-called hallucination streams are trained to predict auxiliary cues, which are simultaneously fed into classification layers and then hallucinated at the testing stage to aid the network. We design and hallucinate two descriptors, one leveraging four popular object detectors applied to training videos, and the other leveraging image- and video-level saliency detectors. The first descriptor encodes the detector- and ImageNet-wise class prediction scores, confidence scores, and spatial locations of bounding boxes and frame indexes to capture the spatio-temporal distribution of features per video. The second descriptor encodes spatio-angular gradient distributions of saliency maps and intensity patterns. Inspired by the characteristic function of the probability distribution, we capture four statistical moments on the above intermediate descriptors. As the numbers of coefficients in the mean, covariance, coskewness and cokurtosis grow linearly, quadratically, cubically and quartically w.r.t. the dimension of feature vectors, we describe the covariance matrix by its leading n' eigenvectors (the so-called subspace) and we capture skewness/kurtosis rather than the costly coskewness/cokurtosis. We obtain state-of-the-art results on five popular datasets, including Charades and EPIC-Kitchens.
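
The compact moment representation described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `moment_descriptor` and `full_moment_size`, the parameter `n_sub` (standing in for n'), the 1/N normalization, the use of non-excess kurtosis, and the simple concatenation of the four parts are all assumptions made for illustration.

```python
import numpy as np

def full_moment_size(d):
    """Coefficient count of mean, covariance, coskewness and cokurtosis
    for d-dimensional features (symmetry-induced duplicates not removed)."""
    return d + d**2 + d**3 + d**4

def moment_descriptor(X, n_sub=2, eps=1e-8):
    """Hypothetical compact descriptor for one video.

    X     : (N, d) array of N intermediate feature vectors.
    n_sub : number of leading covariance eigenvectors to keep (n' in the text).
    """
    X = np.asarray(X, dtype=np.float64)
    mu = X.mean(axis=0)                  # first moment, d values
    Xc = X - mu                          # centered features
    cov = Xc.T @ Xc / X.shape[0]         # (d, d) covariance, 1/N normalization (assumed)

    # eigh returns eigenvalues in ascending order, so reverse the columns
    # to put the leading eigenvectors first. Eigenvector signs are arbitrary.
    _, evecs = np.linalg.eigh(cov)
    U = evecs[:, ::-1][:, :n_sub]        # (d, n_sub) subspace basis

    # Per-dimension standardized third and fourth moments, used instead of
    # the full coskewness (d^3) and cokurtosis (d^4) tensors.
    sigma = np.sqrt(np.diag(cov)) + eps
    Z = Xc / sigma
    skew = (Z**3).mean(axis=0)           # d values
    kurt = (Z**4).mean(axis=0)           # d values (non-excess kurtosis, assumed)

    return np.concatenate([mu, U.ravel(), skew, kurt])

# Toy data standing in for one video's intermediate descriptors.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
desc = moment_descriptor(X, n_sub=2)
```

With d = 6 and n' = 2, this sketch keeps d(n' + 3) = 30 coefficients, whereas the four full moments would need 6 + 36 + 216 + 1296 = 1554, which is the growth the abstract refers to.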
