Associating Neural Word Embeddings With Deep Image Representations Using Fisher Vectors
Gil Sadeh, Benjamin Klein, Lior Wolf, Guy Lev

Abstract
In recent years, the problem of associating a sentence with an image has gained a lot of attention. This work continues to push the envelope and makes further progress in the performance of the image annotation and image search by sentence tasks. We use the Fisher Vector as a sentence representation by pooling the word2vec embedding of each word in the sentence. The Fisher Vector is typically taken as the gradients of the log-likelihood of descriptors with respect to the parameters of a Gaussian Mixture Model (GMM). In this work we present two other mixture models and derive their Expectation-Maximization and Fisher Vector expressions. The first is a Laplacian Mixture Model (LMM), which is based on the Laplacian distribution. The second is a Hybrid Gaussian-Laplacian Mixture Model (HGLMM), which is based on a weighted geometric mean of the Gaussian and Laplacian distributions. Finally, by using the new Fisher Vectors derived from HGLMMs to represent sentences, we achieve state-of-the-art results for both the image annotation and the image search by sentence tasks on four benchmarks: Pascal1K, Flickr8K, Flickr30K, and COCO.
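The abstract describes the sentence representation as a Fisher Vector: the gradients of the mean log-likelihood of a set of descriptors (here, the word2vec embeddings of the words in a sentence) with respect to the parameters of a mixture model. A minimal sketch of this pooling for the standard diagonal-covariance GMM case is below; the function name and the normalization (per Perronnin and Dance's improved Fisher Vector) are illustrative assumptions, not the paper's exact implementation, and the LMM/HGLMM variants derived in the paper would replace the Gaussian gradients with their Laplacian/hybrid counterparts.

```python
import numpy as np

def fisher_vector(X, means, covs, priors):
    """Sketch of a Fisher Vector for descriptors X of shape (n, d)
    under a diagonal-covariance GMM with k components.
    means, covs: (k, d); priors: (k,). Returns a vector of length 2*k*d
    (gradients w.r.t. the means and the diagonal variances)."""
    n, d = X.shape
    k = len(priors)
    # Posterior responsibilities gamma of shape (n, k), computed in log space.
    diff = X[:, None, :] - means[None, :, :]                    # (n, k, d)
    log_prob = -0.5 * (np.sum(diff ** 2 / covs[None], axis=2)
                       + np.sum(np.log(2 * np.pi * covs), axis=1))
    log_w = np.log(priors) + log_prob
    log_w -= log_w.max(axis=1, keepdims=True)                   # stability
    gamma = np.exp(log_w)
    gamma /= gamma.sum(axis=1, keepdims=True)
    # Gradients of the mean log-likelihood w.r.t. means and variances,
    # with the usual per-component Fisher normalization.
    sigma = np.sqrt(covs)                                       # (k, d)
    g_mu = (gamma[:, :, None] * diff / sigma[None]).sum(axis=0)
    g_mu /= n * np.sqrt(priors)[:, None]
    g_sig = (gamma[:, :, None] * (diff ** 2 / covs[None] - 1)).sum(axis=0)
    g_sig /= n * np.sqrt(2 * priors)[:, None]
    return np.concatenate([g_mu.ravel(), g_sig.ravel()])
```

To represent a sentence, `X` would hold its word2vec vectors (one row per word), so every sentence maps to a fixed-length vector of size 2·k·d regardless of sentence length.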
Benchmarks
| Benchmark | Methodology | text-to-video Median Rank | text-to-video R@1 | text-to-video R@5 | text-to-video R@10 |
|---|---|---|---|---|---|
| video-retrieval-on-youcook2 | HGLMM FV CCA | 75 | 4.6 | 14.3 | 21.6 |