Mistral 7B outperforms the best open 13B model (Llama 2) across all evaluated benchmarks, and the best released 34B model (Llama 1) in reasoning, mathematics, and code generation.
- Code: Mistral GitHub
- Webpage: Mistral News
Key features:
- Grouped-query attention (GQA) for faster inference.
- Sliding window attention (SWA) to effectively handle sequences of arbitrary length with reduced inference cost.
Architecture
Mistral 7B is based on the transformer architecture, combining grouped-query attention (GQA) with sliding window attention (SWA).
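GQA is only named above, so here is a minimal, hedged sketch of the idea (not Mistral's implementation): several query heads share one key/value head, which shrinks the KV cache by a factor of n_heads / n_kv_heads. The helper name and tensor shapes are assumptions for this example; the head counts in the usage lines match those reported for Mistral 7B (32 query heads, 8 KV heads).

```python
import torch

def grouped_query_attention(q, k, v, n_heads: int, n_kv_heads: int):
    """Illustrative GQA: q is (B, T, n_heads, D); k, v are (B, T, n_kv_heads, D).

    With n_kv_heads < n_heads, the KV cache is n_heads / n_kv_heads times smaller.
    """
    group = n_heads // n_kv_heads
    # Repeat each KV head so a group of query heads shares the same key/value head.
    k = k.repeat_interleave(group, dim=2)               # (B, T, n_heads, D)
    v = v.repeat_interleave(group, dim=2)
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))    # (B, n_heads, T, D)
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    causal = torch.tril(torch.ones(q.shape[-2], q.shape[-2], dtype=torch.bool))
    scores = scores.masked_fill(~causal, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v            # (B, n_heads, T, D)

# Usage with Mistral-7B-like head counts (32 query heads, 8 KV heads):
q = torch.randn(1, 16, 32, 128)
k = torch.randn(1, 16, 8, 128)
v = torch.randn(1, 16, 8, 128)
print(grouped_query_attention(q, k, v, 32, 8).shape)    # torch.Size([1, 32, 16, 128])
```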
Sliding Window Attention
The number of operations in vanilla attention is quadratic in the sequence length.
In SWA, each token can attend to at most W tokens from the previous layer (here, W = 3). Tokens outside the sliding window still influence next-word prediction.
At each attention layer, information can move forward by W tokens. Hence, after k attention layers, information can move forward by up to k × W tokens (figure 1).
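As a hedged illustration of the masking pattern, a sliding-window mask can be built as below; the helper name and shapes are assumptions for this example, not the paper's code (Mistral 7B itself uses W = 4096).

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask: query position i may attend to key positions i-window+1 .. i."""
    i = torch.arange(seq_len).unsqueeze(1)   # query positions (column vector)
    j = torch.arange(seq_len).unsqueeze(0)   # key positions (row vector)
    return (j <= i) & (j > i - window)       # causal AND within the window

print(sliding_window_mask(5, 3).long())
# tensor([[1, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0],
#         [1, 1, 1, 0, 0],
#         [0, 1, 1, 1, 0],
#         [0, 0, 1, 1, 1]])
```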
Rolling Buffer Cache
A fixed attention span means we can limit our cache size (a minimal sketch follows the list below).
- The cache has a fixed size of W (= 3 in the picture example).
- The keys and values for timestep i are stored in position i mod W of the cache.
- When the position i > W, past values in the cache are overwritten, and the size of the cache stops increasing.
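A minimal, hypothetical sketch of the rolling buffer idea; the class and variable names are invented for illustration, not Mistral's code.

```python
class RollingKVCache:
    """Fixed-size cache of W slots; the entry for timestep i lives in slot i mod W."""

    def __init__(self, window: int):
        self.window = window
        self.keys = [None] * window
        self.values = [None] * window

    def update(self, i: int, k, v):
        slot = i % self.window   # position i mod W
        self.keys[slot] = k      # once i >= W, older entries are overwritten
        self.values[slot] = v

# With W = 3 (as in the picture example), only the last 3 tokens' keys survive:
cache = RollingKVCache(window=3)
for i, tok in enumerate(["The", "cat", "sat", "on", "the"]):
    cache.update(i, k=f"k_{tok}", v=f"v_{tok}")
print(cache.keys)   # ['k_on', 'k_the', 'k_sat']
```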
System Prompt to Enforce Guardrails
The system prompt below is designed to guide the model in generating answers within specified guardrails, similar to the work done with Llama 2 (a usage sketch follows the list of principles).
Guiding Principles:
- Always assist with care, respect, and truth.
- Respond with utmost utility while ensuring security.
- Avoid harmful, unethical, prejudiced, or negative content.
- Ensure replies promote fairness and positivity.
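A hedged sketch of how this guardrail prompt might be prepended to a user query: the prompt text follows the paper's system prompt, but the [INST]-style wrapping and the helper function are assumptions for illustration, not Mistral's official API.

```python
# Hedged sketch: prepend the guardrail system prompt to a user query.
# The [INST] wrapping is an assumed instruct-style template.
SYSTEM_PROMPT = (
    "Always assist with care, respect, and truth. Respond with utmost utility "
    "yet securely. Avoid harmful, unethical, prejudiced, or negative content. "
    "Ensure replies promote fairness and positivity."
)

def build_guarded_prompt(user_message: str) -> str:
    # Fold the system prompt into the first instruction turn.
    return f"<s>[INST] {SYSTEM_PROMPT}\n\n{user_message} [/INST]"

print(build_guarded_prompt("How do I kill a Linux process?"))
```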