- Code: Mistral GitHub
- Webpage: Mistral News
Mixtral 8x7B is a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts).
In this paper, the authors did not provide any information about the data used for model training; it is only mentioned that the data is “multilingual.” This may not be very scientific, but there is probably a business reason behind it.
Notably, this model uses far fewer active parameters than LLaMA 2 70B and GPT-3.5, because only 2 experts are used for each token at inference, and it still outperforms these models on most benchmarks.
Architecture - decoder-only transformer.
Context size is huge - 32k tokens.
Each input vector is assigned to 2 of the 8 experts by a router. The layer’s output is the weighted sum of the outputs of the 2 experts. An expert is a standard feedforward block as in a vanilla transformer architecture.
Sparse Mixture of Experts
The output of the “MoE module” for a given input “x”, using “n” (= 8) expert networks {E0, E1, …, En−1} and a single gating network “G”, is the weighted sum: y = Σi G(x)i · Ei(x), with i going from 0 to n−1.
The authors say, «the “G(x)i” denotes the n-dimensional output of the gating network for the i-th expert». But I believe it’s a mistake, because “G(x)i” can’t be n-dimensional. Most likely, G(x) is n-dimensional, and “G(x)i” (its i-th component) is a scalar.
The “G(x)” implementation is pretty simple: a softmax over the Top-K logits of a linear layer. Here K = 2, because we use only 2 experts per token.
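To make the gating concrete, here is a minimal PyTorch sketch of “softmax over the Top-K logits of a linear layer”. The function name, shapes, and the dense gate matrix are my own assumptions, not Mistral’s reference code:

```python
import torch
import torch.nn.functional as F

def top_k_gating(x: torch.Tensor, w_gate: torch.Tensor, k: int = 2) -> torch.Tensor:
    """x: (tokens, d_model), w_gate: (d_model, n_experts) -> gates: (tokens, n_experts)."""
    logits = x @ w_gate                           # one router logit per expert
    topk_vals, topk_idx = logits.topk(k, dim=-1)  # keep only the K best experts per token
    topk_gates = F.softmax(topk_vals, dim=-1)     # softmax over the K selected logits only
    # All non-selected experts get a gate of exactly 0, which is what makes the layer sparse.
    return torch.zeros_like(logits).scatter_(-1, topk_idx, topk_gates)

# Example: 4 tokens, d_model = 16, 8 experts, 2 experts per token.
gates = top_k_gating(torch.randn(4, 16), torch.randn(16, 8))
# Each row of `gates` has exactly 2 non-zero entries that sum to 1.
```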
«Distinction between the model’s total parameter count (commonly referenced as the sparse parameter count), which grows with n, and the number of parameters used for processing an individual token (called the active parameter count), which grows with K up to n.»(c)
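A quick back-of-the-envelope calculation makes this distinction concrete. The config values below are taken from the publicly released Mixtral 8x7B checkpoint; the count is approximate (norms and router weights are ignored):

```python
# Rough sparse vs. active parameter count for Mixtral 8x7B (approximate).
d_model, d_ffn = 4096, 14336
n_layers, n_experts, k_active = 32, 8, 2
vocab, n_heads, n_kv_heads, head_dim = 32000, 32, 8, 128

attn = 2 * d_model * (n_heads * head_dim) + 2 * d_model * (n_kv_heads * head_dim)  # Q, O + K, V
expert = 3 * d_model * d_ffn              # one SwiGLU expert: gate, up and down projections
embeddings = 2 * vocab * d_model          # input embedding + output head

sparse = embeddings + n_layers * (attn + n_experts * expert)  # all experts counted
active = embeddings + n_layers * (attn + k_active * expert)   # only the 2 routed experts
print(f"sparse ~ {sparse / 1e9:.1f}B, active ~ {active / 1e9:.1f}B")  # ~46.7B vs ~12.9B
```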
MoE layers can be run on a single GPU or across multiple GPUs.
In the first case - “Megablocks” casts the feed-forward network (FFN) operations of the MoE layer as large sparse matrix multiplications.
In the second case - a strategy called Expert Parallelism is used.
We put each expert (its weights W1, W2, W3, …) on a different GPU, shard the remaining model blocks, and use the gating network G to decide which GPU to send each token to.
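Below is a toy illustration of the expert-parallel idea, not MegaBlocks and not Mistral’s implementation: each (stand-in) expert lives on its own device, and the router decides where each token gets shipped. The device list, the Linear stand-in experts, and the dispatch loop are all assumptions made purely for illustration:

```python
import torch

# Toy expert parallelism: each expert lives on its own device; the router decides
# which device each token is shipped to. Falls back to CPU when GPUs are missing.
n_experts, d_model, k = 4, 16, 2
devices = [f"cuda:{i}" if torch.cuda.device_count() > i else "cpu" for i in range(n_experts)]
experts = [torch.nn.Linear(d_model, d_model).to(dev) for dev in devices]  # stand-in experts
router = torch.nn.Linear(d_model, n_experts)

x = torch.randn(8, d_model)                   # 8 tokens on the "local" device
gates, idx = router(x).topk(k, dim=-1)        # top-2 experts per token
gates = torch.softmax(gates, dim=-1)

y = torch.zeros_like(x)
for e, (expert, dev) in enumerate(zip(experts, devices)):
    chosen = (idx == e)                                   # (token, slot) pairs routed to expert e
    token_ids = chosen.any(dim=-1).nonzero(as_tuple=True)[0]
    if token_ids.numel() == 0:
        continue                                          # no token picked this expert
    out = expert(x[token_ids].to(dev)).to(x.device)       # ship tokens to the expert's GPU and back
    weight = (gates * chosen)[token_ids].sum(dim=-1, keepdim=True)
    y[token_ids] += weight * out                          # weighted sum over the selected experts
```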
So here is the final formula: y = Σi Softmax(TopK(x · Wg))i · SwiGLUi(x), summing i from 0 to n−1 (with K = 2).
The MoE layer is applied independently to each token and replaces the feed-forward (FFN) sub-block of the transformer block. The SwiGLU architecture is used as the expert function Ei(x), and K = 2, so each token is routed to two SwiGLU sub-blocks with different sets of weights.
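Putting the pieces together, here is a compact sketch of such a sparse MoE block: a router picks K = 2 of 8 experts per token, each expert is a SwiGLU feed-forward network, and the outputs are combined with the softmaxed gate weights. Class names, dimensions, and the per-expert loop are my own choices, not Mistral’s reference code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """A standard SwiGLU feed-forward block: w2(silu(w1(x)) * w3(x))."""
    def __init__(self, d_model: int, d_ffn: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ffn, bias=False)
        self.w2 = nn.Linear(d_ffn, d_model, bias=False)
        self.w3 = nn.Linear(d_model, d_ffn, bias=False)

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

class SparseMoE(nn.Module):
    """Replaces the dense FFN sub-block: routes each token to K of N SwiGLU experts."""
    def __init__(self, d_model: int, d_ffn: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(SwiGLU(d_model, d_ffn) for _ in range(n_experts))

    def forward(self, x):                        # x: (tokens, d_model)
        weights, idx = self.gate(x).topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # softmax over the selected experts only
        y = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                sel = (idx[:, slot] == e)        # tokens whose slot-th choice is expert e
                if sel.any():
                    y[sel] += weights[sel, slot].unsqueeze(-1) * expert(x[sel])
        return y

# Example: 10 token vectors through a small MoE block.
out = SparseMoE(d_model=64, d_ffn=128)(torch.randn(10, 64))
```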
Comparison results
Not interesting…
Routing analysis
This part is a small analysis of expert selection by the router. In particular, the authors are interested in whether some experts specialized in specific domains during training (e.g. mathematics, biology, philosophy, etc.).
Conclusion - “Surprisingly, we do not observe obvious patterns in the assignment of experts based on the topic. For instance, at all layers, the distribution of expert assignment is very similar for ArXiv papers (written in Latex), for biology (PubMed Abstracts), and for Philosophy (PhilPapers) documents.”(c)
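A hedged sketch of how such an analysis could be reproduced: run documents from different domains through the model, record the router’s top-2 choices at every layer, and compare the per-expert selection frequencies. `collect_router_choices` below is a hypothetical helper (not part of any released Mixtral code), shown only to make the procedure explicit:

```python
from collections import Counter

def expert_histogram(router_choices):
    """router_choices: list of (layer, expert_id) pairs recorded during forward passes."""
    counts = Counter(router_choices)
    per_layer_total = Counter(layer for layer, _ in router_choices)
    # Fraction of routing decisions at each layer that went to each expert.
    return {(layer, e): c / per_layer_total[layer] for (layer, e), c in counts.items()}

# Hypothetical usage, assuming a helper that hooks the routers and logs their choices:
# hist_arxiv  = expert_histogram(collect_router_choices(model, arxiv_tokens))
# hist_pubmed = expert_histogram(collect_router_choices(model, pubmed_tokens))
# Comparing such histograms across domains is what leads to the "no obvious pattern" finding.
```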