Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Abstract
In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February version on the great majority of capabilities and benchmarks; (2) Gemini 1.5 Flash, a more lightweight variant designed for efficiency with minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state-of-the-art in long-document QA, long-video QA and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 3.0 (200k) and GPT-4 Turbo (128k). Finally, we highlight real-world use cases, such as Gemini 1.5 collaborating with professionals on completing their tasks achieving 26 to 75% time savings across 10 different job categories, as well as surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content.
Benchmarks
| Benchmark | Model | Metrics |
|---|---|---|
| fs-mevqa-on-sme | Gemini-1.5 Pro | #Learning Samples (N): 16; ACC: 40.88; BLEU-4: 41.87; CIDEr: 276.14; Detection: 1.40; METEOR: 34.61; ROUGE-L: 55.90; SPICE: 40.58 |
| long-context-understanding-on-mmneedle | Gemini Pro 1.5 | 1 Image, 2*2 Stitching, Exact Accuracy: 90.34; 1 Image, 4*4 Stitching, Exact Accuracy: 39.85; 1 Image, 8*8 Stitching, Exact Accuracy: 29.81; 10 Images, 1*1 Stitching, Exact Accuracy: 89.94; 10 Images, 2*2 Stitching, Exact Accuracy: 45.21; 10 Images, 4*4 Stitching, Exact Accuracy: 6.09; 10 Images, 8*8 Stitching, Exact Accuracy: 0.62 |
| question-answering-on-newsqa | Google/Gemini 1.5 Flash | EM: 68.75; F1: 79.91 |
| temporal-relation-extraction-on-vinoground | Gemini-1.5-Pro (CoT) | Group Score: 12.4; Text Score: 37; Video Score: 27.6 |
| temporal-relation-extraction-on-vinoground | Gemini-1.5-Pro | Group Score: 10.2; Text Score: 35.8; Video Score: 22.6 |
| video-question-answering-on-tvbench | Gemini 1.5 Pro | Average Accuracy: 47.6 |
| visual-question-answering-on-mm-vet | Gemini 1.5 Pro (gemini-1.5-pro) | GPT-4 score: 65.8±0.1 |
| visual-question-answering-on-mm-vet | Gemini 1.5 Pro (gemini-1.5-pro-002) | GPT-4 score: 76.9±0.1 |
| visual-question-answering-on-mm-vet-v2 | Gemini 1.5 Pro | GPT-4 score: 66.9±0.2 |
| zero-shot-video-question-answer-on-video-mme | Gemini 1.5 Flash | Accuracy (%): 66.3 |
| zero-shot-video-question-answer-on-video-mme | Gemini 1.5 Pro | Accuracy (%): 71.9 |
| zero-shot-video-question-answer-on-video-mme-1 | Gemini 1.5 Pro | Accuracy (%): 81.3 |
| zero-shot-video-question-answer-on-video-mme-1 | Gemini 1.5 Flash | Accuracy (%): 75.0 |
| zero-shot-video-question-answer-on-zero-shot | Gemini 1.5 Pro | Accuracy (%): 66.7 |