Mixtral of Experts

Albert Q. Jiang; Alexandre Sablayrolles; Antoine Roux; Arthur Mensch; Blanche Savary; Chris Bamford; Devendra Singh Chaplot; Diego de las Casas; Emma Bou Hanna; Florian Bressand; Gianna Lengyel; Guillaume Bour; Guillaume Lample; Lélio Renard Lavaud; Lucile Saulnier; Marie-Anne Lachaux; Pierre Stock; Sandeep Subramanian; Sophia Yang; Szymon Antoniak; Teven Le Scao; Théophile Gervet; Thibaut Lavril; Thomas Wang; Timothée Lacroix; William El Sayed

Abstract

We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, except that each layer is composed of 8 feedforward blocks (i.e., experts). For every token, at each layer, a router network selects two experts to process the current state and combines their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters but uses only 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens, and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks. We also provide a model fine-tuned to follow instructions, Mixtral 8x7B - Instruct, that surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and the Llama 2 70B chat model on human benchmarks. Both the base and instruct models are released under the Apache 2.0 license.
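The routing described above can be illustrated with a short PyTorch sketch of a top-2 sparse MoE feed-forward layer. The hyperparameters (hidden_dim, ffn_dim, num_experts=8, top_k=2), module names, and the use of plain SiLU MLPs as experts are illustrative assumptions for this sketch, not the released Mixtral implementation.

```python
# Hedged sketch of top-2 expert routing, as summarized in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoELayer(nn.Module):
    def __init__(self, hidden_dim: int = 4096, ffn_dim: int = 14336,
                 num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # A linear "router" (gate) that scores each expert per token.
        self.router = nn.Linear(hidden_dim, num_experts, bias=False)
        # 8 independent feed-forward "experts" (plain MLPs here for brevity).
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_dim, ffn_dim, bias=False),
                nn.SiLU(),
                nn.Linear(ffn_dim, hidden_dim, bias=False),
            )
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden_dim) -> flatten tokens for routing.
        batch, seq_len, hidden_dim = x.shape
        tokens = x.reshape(-1, hidden_dim)

        # The router picks the top-2 experts for each token; their scores
        # are renormalized with a softmax and used as mixing weights.
        logits = self.router(tokens)                          # (T, E)
        top_vals, top_idx = logits.topk(self.top_k, dim=-1)   # (T, k)
        weights = F.softmax(top_vals, dim=-1)                 # (T, k)

        out = torch.zeros_like(tokens)
        for expert_id, expert in enumerate(self.experts):
            # Find the (token, slot) pairs that selected this expert.
            token_ids, slot_ids = (top_idx == expert_id).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            expert_out = expert(tokens[token_ids])
            out[token_ids] += weights[token_ids, slot_ids].unsqueeze(-1) * expert_out

        return out.reshape(batch, seq_len, hidden_dim)
```

Because only two of the eight expert feed-forward blocks run for any given token, the per-token compute corresponds to roughly 13B of the 47B total parameters, which is the sparsity advantage the abstract highlights.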

Code Repositories

consequentai/fneval (Mentioned in GitHub)
jingyaogong/minimind (PyTorch; Mentioned in GitHub)
ymcui/chinese-mixtral (PyTorch; Mentioned in GitHub)
kamanphoebe/look-into-moes (PyTorch; Mentioned in GitHub)
hit-scir/chinese-mixtral-8x7b (PyTorch; Mentioned in GitHub)

Benchmarks

| Benchmark | Methodology | Metrics |
| --- | --- | --- |
| code-generation-on-mbpp | Mixtral 8x7B (3-shot) | Accuracy: 60.7 |
| common-sense-reasoning-on-arc-easy | Mistral 7B (0-shot) | Accuracy: 80.5 |
| common-sense-reasoning-on-arc-easy | Mixtral 8x7B (0-shot) | Accuracy: 83.1 |
| common-sense-reasoning-on-winogrande | Mistral 7B (0-shot) | Accuracy: 74.2 |
| common-sense-reasoning-on-winogrande | Mixtral 8x7B (0-shot) | Accuracy: 77.2 |
| math-word-problem-solving-on-math | Mixtral 8x7B (maj@4) | Accuracy: 28.4 |
| math-word-problem-solving-on-math | Mistral 7B (maj@4) | Accuracy: 12.7; Parameters (Billions): 7 |
| multi-task-language-understanding-on-mmlu | Mixtral 8x7B (5-shot) | Average (%): 70.6 |
| multi-task-language-understanding-on-mmlu | Mistral 7B (5-shot) | Average (%): 62.5 |
| question-answering-on-piqa | Mistral 7B (0-shot) | Accuracy: 82.2 |
| question-answering-on-piqa | Mixtral 8x7B (0-shot) | Accuracy: 83.6 |
