Paper Review 4: Self-attention Does Not Need O(n^2) Memory

How can the memory efficiency of self-attention be increased?

By Beksultan Sagyndyk


In the paper “Self-attention Does Not Need O(n^2) Memory,” the Google team introduces simple algorithms for attention and self-attention that require only constant memory and logarithmic memory, respectively. At a sequence length of 16384, the approach can reduce the self-attention memory overhead by 59x for inference and by 32x for differentiation.

Intro

[Figure: the attention operation, from the paper]

Attention Algorithm:

  1. The attention operation on a single query produces a weighted sum of value vectors.
  2. The weights are determined by the softmax of the dot products between the query and the keys.
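
Written out in the paper's notation, for a single query q, keys k_i, and values v_i:

```latex
s_i = \mathrm{dot}(q, k_i), \qquad
s_i' = \frac{e^{s_i}}{\sum_j e^{s_j}}, \qquad
\operatorname{attention}(q, K, V) = \sum_i s_i' \, v_i
```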

In a straightforward implementation, we have to:

  • First compute and remember s_i for all i, which costs O(n) time and memory per query.
  • Transformers use self-attention, so there is one query for each element of the sequence, and the overall time and memory complexity becomes O(n^2) (see the sketch below).
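
A minimal NumPy sketch of this baseline (the function name is mine; numerical stability is ignored here and discussed later in the post):

```python
import numpy as np

def naive_self_attention(Q, K, V):
    """Standard self-attention, for illustration only.

    Q, K, V have shape (n, d). The score matrix S has shape (n, n),
    so memory grows as O(n^2) with the sequence length n.
    """
    S = Q @ K.T                            # all dot products at once: (n, n)
    W = np.exp(S)                          # exponentiate every score
    W = W / W.sum(axis=-1, keepdims=True)  # softmax over the keys
    return W @ V                           # weighted sum of values: (n, d)
```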

Main Algorithm

Single Attention Case

First Step:

[Figure: first step of the algorithm, from the paper]
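
The key observation (reconstructing the step from the paper) is that the division by the softmax denominator can be moved to the very end of the computation, so only a running numerator and denominator have to be kept:

```latex
v^* = \sum_i e^{s_i} v_i, \qquad
s^* = \sum_i e^{s_i}, \qquad
\operatorname{attention}(q, K, V) = \frac{v^*}{s^*}
```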

Second Step:

  1. Initialize a vector v* and a scalar s* with 0.
  2. Loop over the keys [k_1, k_2, …, k_n] and values [v_1, v_2, …, v_n]. For each pair k_i, v_i, compute the score s_i = dot(q, k_i).
  3. For each s_i, update the accumulators: v* += exp(s_i) · v_i and s* += exp(s_i).
  4. After the loop, divide v*/s* to get the final result (see the code sketch below).

[Figure: the incremental, constant-memory attention algorithm, from the paper]
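
A minimal sketch of the constant-memory computation for a single query, in plain NumPy rather than the paper's JAX (names are mine):

```python
import numpy as np

def single_query_attention(q, keys, values):
    """Memory-efficient attention for one query (illustrative sketch).

    q has shape (d,); keys and values have shape (n, d). Only the
    accumulators v_star (shape (d,)) and s_star (a scalar) are kept,
    so memory is constant in the sequence length n.
    NOTE: this version is not numerically stable; see below.
    """
    d = values.shape[-1]
    v_star = np.zeros(d)        # running sum of exp(s_i) * v_i
    s_star = 0.0                # running sum of exp(s_i)
    for k_i, v_i in zip(keys, values):
        s_i = np.dot(q, k_i)    # score for this key
        v_star += np.exp(s_i) * v_i
        s_star += np.exp(s_i)
    return v_star / s_star      # divide only at the very end
```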

Self-Attention Case

To extend this algorithm to self-attention, the results for all queries are simply computed sequentially. Beyond the constant-size accumulators, the only extra state is the index into the sequence, which needs O(log n) memory.
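
For illustration, a minimal wrapper reusing the single_query_attention sketch above (the function name is mine):

```python
import numpy as np

def memory_efficient_self_attention(Q, K, V):
    """Run the single-query routine for each query in turn.

    Q, K, V have shape (n, d). Beyond one query's accumulators, the only
    extra state is the loop index into the sequence, i.e. O(log n) memory.
    """
    return np.stack([single_query_attention(q, K, V) for q in Q])
```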

Numerical Stability Problem

Neither standard attention nor the new algorithm is numerically stable when implemented naively in floating-point arithmetic. For example, for scores ≥ 89, the exponentiation results in inf (for bfloat16 and float32).

To resolve this problem, the authors introduce an additional scalar m*:

  1. It keeps track of the maximum score that the incremental algorithm has seen so far.
  2. The running sums of exponentiated values are renormalized whenever a new maximum appears (see the sketch below).

[Figure: the numerically stable version of the algorithm, from the paper]
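
In code, the renormalization looks roughly like this (again a NumPy sketch of the update rule, not the paper's JAX implementation):

```python
import numpy as np

def stable_single_query_attention(q, keys, values):
    """Constant-memory attention for one query, numerically stable (sketch).

    m_star tracks the maximum score seen so far; whenever a larger score
    appears, the running sums are rescaled so every exponential stays <= 1.
    """
    d = values.shape[-1]
    v_star = np.zeros(d)
    s_star = 0.0
    m_star = -np.inf                          # maximum score seen so far
    for k_i, v_i in zip(keys, values):
        s_i = np.dot(q, k_i)
        m_new = max(m_star, s_i)
        correction = np.exp(m_star - m_new)   # rescale old sums if needed
        v_star = v_star * correction + np.exp(s_i - m_new) * v_i
        s_star = s_star * correction + np.exp(s_i - m_new)
        m_star = m_new
    return v_star / s_star
```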

Results

[Figure: memory and runtime results, from the paper]
