Mistral 7B outperforms the best open 13B model (Llama 2) across all evaluated benchmarks, and the best released 34B model (Llama 1) in reasoning, mathematics, and code generation.
- Code: Mistral GitHub
- Webpage: Mistral News
Key features:
- Grouped-query attention (GQA) for faster inference.
- Sliding window attention (SWA) to effectively handle sequences of arbitrary length with reduced inference cost.
Architecture
Mistral 7B is based on the transformer architecture, combining grouped-query attention (GQA) with sliding window attention (SWA).
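GQA is only named above, so here is a minimal, hedged sketch of the idea (not Mistral's implementation): several query heads share one key/value head, which shrinks the KV cache by a factor of n_heads / n_kv_heads. The helper name and tensor shapes are assumptions for this example; the head counts in the usage lines match those reported for Mistral 7B (32 query heads, 8 KV heads).

```python
import torch

def grouped_query_attention(q, k, v, n_heads: int, n_kv_heads: int):
    """Illustrative GQA: q is (B, T, n_heads, D); k, v are (B, T, n_kv_heads, D).

    With n_kv_heads < n_heads, the KV cache is n_heads / n_kv_heads times smaller.
    """
    group = n_heads // n_kv_heads
    # Repeat each KV head so a group of query heads shares the same key/value head.
    k = k.repeat_interleave(group, dim=2)               # (B, T, n_heads, D)
    v = v.repeat_interleave(group, dim=2)
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))    # (B, n_heads, T, D)
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    causal = torch.tril(torch.ones(q.shape[-2], q.shape[-2], dtype=torch.bool))
    scores = scores.masked_fill(~causal, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v            # (B, n_heads, T, D)

# Usage with Mistral-7B-like head counts (32 query heads, 8 KV heads):
q = torch.randn(1, 16, 32, 128)
k = torch.randn(1, 16, 8, 128)
v = torch.randn(1, 16, 8, 128)
print(grouped_query_attention(q, k, v, 32, 8).shape)    # torch.Size([1, 32, 16, 128])
```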
Sliding Window Attention
The number of operations in vanilla attention is quadratic in the sequence length.
In SWA, each token can attend to at most W tokens from the previous layer (here, W = 3). Tokens outside the sliding window still influence next-word prediction.
At each attention layer, information can move forward by W tokens. Hence, after k attention layers, information can move forward by up to k × W tokens (figure 1).
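As a hedged illustration of the masking pattern, a sliding-window mask can be built as below; the helper name and shapes are assumptions for this example, not the paper's code (Mistral 7B itself uses W = 4096).

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask: query position i may attend to key positions i-window+1 .. i."""
    i = torch.arange(seq_len).unsqueeze(1)   # query positions (column vector)
    j = torch.arange(seq_len).unsqueeze(0)   # key positions (row vector)
    return (j <= i) & (j > i - window)       # causal AND within the window

print(sliding_window_mask(5, 3).long())
# tensor([[1, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0],
#         [1, 1, 1, 0, 0],
#         [0, 1, 1, 1, 0],
#         [0, 0, 1, 1, 1]])
```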
Rolling Buffer Cache
A fixed attention span means we can limit our cache size (a minimal sketch follows the list below).
- The cache has a fixed size of W (= 3 in the picture example).
- The keys and values for timestep i are stored in position i mod W of the cache.
- When the position i > W, past values in the cache are overwritten, and the size of the cache stops increasing.
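A minimal, hypothetical sketch of the rolling buffer idea; the class and variable names are invented for illustration, not Mistral's code.

```python
class RollingKVCache:
    """Fixed-size cache of W slots; the entry for timestep i lives in slot i mod W."""

    def __init__(self, window: int):
        self.window = window
        self.keys = [None] * window
        self.values = [None] * window

    def update(self, i: int, k, v):
        slot = i % self.window   # position i mod W
        self.keys[slot] = k      # once i >= W, older entries are overwritten
        self.values[slot] = v

# With W = 3 (as in the picture example), only the last 3 tokens' keys survive:
cache = RollingKVCache(window=3)
for i, tok in enumerate(["The", "cat", "sat", "on", "the"]):
    cache.update(i, k=f"k_{tok}", v=f"v_{tok}")
print(cache.keys)   # ['k_on', 'k_the', 'k_sat']
```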
System Prompt to Enforce Guardrails
The system prompt below is designed to guide the model in generating answers within specified guardrails, similar to the work done with Llama 2 (a usage sketch follows the list of principles).
Guiding Principles:
- Always assist with care, respect, and truth.
- Respond with utmost utility while ensuring security.
- Avoid harmful, unethical, prejudiced, or negative content.
- Ensure replies promote fairness and positivity.
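A hedged sketch of how this guardrail prompt might be prepended to a user query: the prompt text follows the paper's system prompt, but the [INST]-style wrapping and the helper function are assumptions for illustration, not Mistral's official API.

```python
# Hedged sketch: prepend the guardrail system prompt to a user query.
# The [INST] wrapping is an assumed instruct-style template.
SYSTEM_PROMPT = (
    "Always assist with care, respect, and truth. Respond with utmost utility "
    "yet securely. Avoid harmful, unethical, prejudiced, or negative content. "
    "Ensure replies promote fairness and positivity."
)

def build_guarded_prompt(user_message: str) -> str:
    # Fold the system prompt into the first instruction turn.
    return f"<s>[INST] {SYSTEM_PROMPT}\n\n{user_message} [/INST]"

print(build_guarded_prompt("How do I kill a Linux process?"))
```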