5 months ago

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Abstract

In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February version on the great majority of capabilities and benchmarks; (2) Gemini 1.5 Flash, a more lightweight variant designed for efficiency with minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state-of-the-art in long-document QA, long-video QA and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 3.0 (200k) and GPT-4 Turbo (128k). Finally, we highlight real-world use cases, such as Gemini 1.5 collaborating with professionals on completing their tasks achieving 26 to 75% time savings across 10 different job categories, as well as surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content.

Code Repositories

dlvuldet/primevul
pytorch
Mentioned in GitHub

Benchmarks

Benchmark | Methodology | Metrics
fs-mevqa-on-sme | Gemini-1.5 Pro | #Learning Samples (N): 16; ACC: 40.88; BLEU-4: 41.87; CIDEr: 276.14; Detection: 1.40; METEOR: 34.61; ROUGE-L: 55.90; SPICE: 40.58
long-context-understanding-on-mmneedle | Gemini Pro 1.5 | 1 Image, 2*2 Stitching, Exact Accuracy: 90.34; 1 Image, 4*4 Stitching, Exact Accuracy: 39.85; 1 Image, 8*8 Stitching, Exact Accuracy: 29.81; 10 Images, 1*1 Stitching, Exact Accuracy: 89.94; 10 Images, 2*2 Stitching, Exact Accuracy: 45.21; 10 Images, 4*4 Stitching, Exact Accuracy: 6.09; 10 Images, 8*8 Stitching, Exact Accuracy: 0.62
question-answering-on-newsqa | Google/Gemini 1.5 Flash | EM: 68.75; F1: 79.91
temporal-relation-extraction-on-vinoground | Gemini-1.5-Pro (CoT) | Group Score: 12.4; Text Score: 37; Video Score: 27.6
temporal-relation-extraction-on-vinoground | Gemini-1.5-Pro | Group Score: 10.2; Text Score: 35.8; Video Score: 22.6
video-question-answering-on-tvbench | Gemini 1.5 Pro | Average Accuracy: 47.6
visual-question-answering-on-mm-vet | Gemini 1.5 Pro (gemini-1.5-pro) | GPT-4 score: 65.8±0.1
visual-question-answering-on-mm-vet | Gemini 1.5 Pro (gemini-1.5-pro-002) | GPT-4 score: 76.9±0.1
visual-question-answering-on-mm-vet-v2 | Gemini 1.5 Pro | GPT-4 score: 66.9±0.2
zero-shot-video-question-answer-on-video-mme | Gemini 1.5 Flash | Accuracy (%): 66.3
zero-shot-video-question-answer-on-video-mme | Gemini 1.5 Pro | Accuracy (%): 71.9
zero-shot-video-question-answer-on-video-mme-1 | Gemini 1.5 Pro | Accuracy (%): 81.3
zero-shot-video-question-answer-on-video-mme-1 | Gemini 1.5 Flash | Accuracy (%): 75.0
zero-shot-video-question-answer-on-zero-shot | Gemini 1.5 Pro | Accuracy (%): 66.7
