The paper was published in 2021 as a preliminary report, and in late 2023 it was updated to v5. The core idea is the same; the main change is that it now evaluates on English benchmarks instead of only Chinese data.

This is probably one of the worst-written papers, English-wise, that I’ve read. It’s kind of ironic: this is an LLM paper, yet no LLM was used to audit the writing.

Why relative

This is my understanding, not what’s in the paper. We don’t need the absolute position information that absolute position embedding provides. What we care about when doing attention is the relative distance. That means relative encodings can generalize better on longer sequences (position 5 attending to 8 is the same as 500 attending to 503).

Formulation

Say we generate the positional embedding via a function $f$:

$$q_m = f_q(x_m, m), \qquad k_n = f_k(x_n, n)$$

$q_m$ and $k_n$ both carry their position because that’s a way of understanding the weighted sum. And then they go through softmax like this:

$$a_{m,n} = \frac{\exp\!\left(\frac{q_m^\top k_n}{\sqrt{d}}\right)}{\sum_{j=1}^{N}\exp\!\left(\frac{q_m^\top k_j}{\sqrt{d}}\right)}$$

For the traditional absolute position embedding, that’s

$$f_{\{q,k\}}(x_i, i) = W_{\{q,k\}}(x_i + p_i)$$

where $p_i$ is the position vector. Since we want to capture the relative information in attention, it would be good if the score $q_m^\top k_n$ depended only on the relative position between $m$ and $n$. So the problem becomes: can we find such an $f$, such that

$$\langle f_q(x_m, m),\, f_k(x_n, n)\rangle = g(x_m, x_n, m-n)$$

One can find a solution to our formulation (in 2D) when $f$ is:

$$f_{\{q,k\}}(x_m, m) = (W_{\{q,k\}}\, x_m)\, e^{i m\theta}$$

which gives

$$g(x_m, x_n, m-n) = \operatorname{Re}\!\left[(W_q x_m)(W_k x_n)^{*}\, e^{i(m-n)\theta}\right]$$

where $\operatorname{Re}[\cdot]$ is the real part of a complex number and $(\cdot)^{*}$ represents the complex conjugate. $\theta$ is a preset non-zero constant. We can further write $f$ as a rotation-matrix multiplication:

$$f_{\{q,k\}}(x_m, m) = \begin{pmatrix}\cos m\theta & -\sin m\theta\\ \sin m\theta & \cos m\theta\end{pmatrix} W_{\{q,k\}}\, x_m$$

We can see intuitively why this works: it rotates the embedding, i.e. assigns it an angle in the 2D plane. A vector rotated by $m\theta$ and a vector rotated by $n\theta$, when doing a dot product, give us a value that depends on the positions only through the angle difference $(m-n)\theta$.
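A quick numeric check of this 2D picture (`rot2d`, `score`, and the sample vectors are my own illustrative names, not from the paper):

```python
import math

def rot2d(x, angle):
    """Rotate a 2D vector (x0, x1) by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * x[0] - s * x[1], s * x[0] + c * x[1])

theta = 0.3                      # preset non-zero constant
q, k = (1.0, 2.0), (0.5, -1.0)   # arbitrary 2D query/key vectors

def score(m, n):
    """Dot product of q rotated by m*theta with k rotated by n*theta."""
    qm, kn = rot2d(q, m * theta), rot2d(k, n * theta)
    return qm[0] * kn[0] + qm[1] * kn[1]

# Same relative offset -> same score, regardless of absolute position.
assert abs(score(5, 8) - score(500, 503)) < 1e-9
```

The assertion is exactly the "position 5 to 8 is the same as 500 to 503" claim from above.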

In order to generalize our results in 2D to any $d$ where $d$ is even, we divide the $d$-dimensional space into $d/2$ sub-spaces and combine them using the linearity of the inner product, turning $f$ into:

$$f_{\{q,k\}}(x_m, m) = R^{d}_{\Theta, m}\, W_{\{q,k\}}\, x_m$$

where

$$R^{d}_{\Theta,m} = \begin{pmatrix}
\cos m\theta_1 & -\sin m\theta_1 & \cdots & 0 & 0\\
\sin m\theta_1 & \cos m\theta_1 & \cdots & 0 & 0\\
\vdots & \vdots & \ddots & \vdots & \vdots\\
0 & 0 & \cdots & \cos m\theta_{d/2} & -\sin m\theta_{d/2}\\
0 & 0 & \cdots & \sin m\theta_{d/2} & \cos m\theta_{d/2}
\end{pmatrix}$$

is the rotary matrix with pre-defined parameters $\Theta = \{\theta_i = 10000^{-2(i-1)/d},\ i \in [1, \ldots, d/2]\}$. One can think about it as “encoding relative information for each pair of size 2 in the embedding”.

  • The $\theta_i$ frequencies are the same idea as in absolute position embedding: they are for differentiating dimensions. They could technically all be equal and still satisfy the RoPE property. They are still set as this decaying sequence since:
    • You want frequencies spread across many orders of magnitude so different heads/dimensions can attend to both local and long-range positional relationships.
    • You want the rotation to be slow enough in some dimensions that the model can generalize to longer sequences.
    • The predefined θ sequence is more of a prior / initialization of the frequency basis than a hard constraint. The model has a degree of freedom to work around it through the learned projections.
  • The $m$ factor is the token position.
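For intuition only, here is a sketch that materializes the full block-diagonal matrix (something real implementations avoid; `rotary_matrix` is my own helper, using the conventional $\theta_i$ schedule) and checks the relative-position property:

```python
import torch

def rotary_matrix(m: int, d: int, base: float = 10000.0) -> torch.Tensor:
    """Full d x d block-diagonal rotary matrix R_{Theta, m} (illustrative only)."""
    # theta_i = base^(-2i/d) for i = 0..d/2-1, one frequency per 2D pair.
    theta = base ** (-2 * torch.arange(d // 2, dtype=torch.float64) / d)
    blocks = []
    for a in m * theta:
        c, s = torch.cos(a), torch.sin(a)
        blocks.append(torch.stack([torch.stack([c, -s]), torch.stack([s, c])]))
    return torch.block_diag(*blocks)

# The key property: R_m^T R_n = R_{n-m}, so q_m . k_n only sees n - m.
d = 8
assert torch.allclose(rotary_matrix(5, d).T @ rotary_matrix(8, d),
                      rotary_matrix(3, d), atol=1e-12)
```

Each 2×2 block is an ordinary rotation, so composing two rotary matrices just adds angles per pair.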

Comparison

|                 | Absolute PE | RoPE |
|-----------------|-------------|------|
| What it encodes | Absolute position | Relative position |
| Where applied   | Input embeddings | Q and K, inside attention |
| Why there       | Enriches token identity before projection | Must survive to the dot product to get the $R_{j-i}$ factoring |
| Applied to V?   | Yes (implicitly, V gets the embedding) | No: V doesn’t participate in the position-sensitive dot product |

Properties of RoPE

  • Similar to the OG position embedding, it has a long-term decay property: the attention score’s upper bound shrinks as the relative distance grows (that’s because the “going down” part of the cosine is the dominant part)
  • It can be used easily with linear attention

Implementation

This part is generated from my conversation with Claude Sonnet 4.6.

Rather than constructing the full block-diagonal rotation matrix, we exploit the fact that each 2D rotation only mixes the adjacent pair $(x_{2i}, x_{2i+1})$. This lets us decompose the operation into two elementwise multiplications.

Derivation (d=6)

For $x = (x_0, x_1, x_2, x_3, x_4, x_5)$ with frequency angles $(\theta_0, \theta_1, \theta_2)$ (pair $(x_{2i}, x_{2i+1})$ rotated by $\theta_i$), the full rotation gives:

$$Rx = \begin{pmatrix} x_0\cos\theta_0 - x_1\sin\theta_0 \\ x_1\cos\theta_0 + x_0\sin\theta_0 \\ x_2\cos\theta_1 - x_3\sin\theta_1 \\ x_3\cos\theta_1 + x_2\sin\theta_1 \\ x_4\cos\theta_2 - x_5\sin\theta_2 \\ x_5\cos\theta_2 + x_4\sin\theta_2 \end{pmatrix}$$

Split into two terms:

$$Rx = \begin{pmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{pmatrix} \odot \begin{pmatrix} \cos\theta_0 \\ \cos\theta_0 \\ \cos\theta_1 \\ \cos\theta_1 \\ \cos\theta_2 \\ \cos\theta_2 \end{pmatrix} + \begin{pmatrix} -x_1 \\ x_0 \\ -x_3 \\ x_2 \\ -x_5 \\ x_4 \end{pmatrix} \odot \begin{pmatrix} \sin\theta_0 \\ \sin\theta_0 \\ \sin\theta_1 \\ \sin\theta_1 \\ \sin\theta_2 \\ \sin\theta_2 \end{pmatrix}$$

Why this split and not another?

We could have split differently — e.g. putting only even-indexed elements in term 1 and odd-indexed in term 2. But that would require strided slicing and interleaving back at the end, which are awkward tensor operations.

This split is chosen because both terms have the same “shape of access”: the full vector appears once in each term. That means term 2 only requires a cheap rearrangement of $x$, not a gather/scatter.

Principle: the math admits many equivalent decompositions. Choose the one that maps onto cheap tensor operations.
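A sanity check of the split for d=6, with my own variable names:

```python
import torch

torch.manual_seed(0)
d = 6
x = torch.randn(d, dtype=torch.float64)
theta = 10000.0 ** (-2 * torch.arange(d // 2, dtype=torch.float64) / d)
ang = 7 * theta                                     # position m = 7, one angle per pair

# Reference: apply each 2x2 rotation block explicitly.
ref = torch.empty_like(x)
for i in range(d // 2):
    c, s = torch.cos(ang[i]), torch.sin(ang[i])
    ref[2 * i] = c * x[2 * i] - s * x[2 * i + 1]
    ref[2 * i + 1] = s * x[2 * i] + c * x[2 * i + 1]

# Decomposed form: two elementwise multiplies, no matrix in sight.
cos = torch.repeat_interleave(torch.cos(ang), 2)    # [c0, c0, c1, c1, c2, c2]
sin = torch.repeat_interleave(torch.sin(ang), 2)
rot_half = torch.stack([-x[1::2], x[0::2]], dim=-1).reshape(d)  # (-x1, x0, -x3, x2, -x5, x4)
assert torch.allclose(cos * x + sin * rot_half, ref)
```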


Computing the Two Components

The cos/sin vectors go from shape `[seq_len, d//2]` to `[seq_len, d]` via `repeat_interleave(..., repeats=2, dim=-1)`.
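For example (a throwaway shape check, values mine):

```python
import torch

half = torch.tensor([[0.1, 0.2, 0.3]], dtype=torch.float64)  # [seq_len=1, d//2=3]
full = torch.repeat_interleave(half, repeats=2, dim=-1)      # [seq_len=1, d=6]
assert full.tolist() == [[0.1, 0.1, 0.2, 0.2, 0.3, 0.3]]
```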

The `rotate_half` operation transforms $(x_0, x_1, x_2, x_3, \ldots)$ into $(-x_1, x_0, -x_3, x_2, \ldots)$.


Final Formula

$$\mathrm{RoPE}(x) = \cos_{\mathrm{rep}} \odot x + \sin_{\mathrm{rep}} \odot \mathrm{rotate\_half}(x)$$

where $\cos_{\mathrm{rep}}$ and $\sin_{\mathrm{rep}}$ are the repeat-interleaved cos/sin tables indexed at the token’s position.

Intuitions

The "passengers" intuition

The rotation only acts on the embedding dimension $d$. Batch, heads, and seq are just passengers: they don’t participate in the logic. So you can derive everything by thinking about a single vector $x \in \mathbb{R}^d$, then broadcast freely over the other dimensions.

A tensor of shape [seq_len, d] for cos/sin already holds the answer for every position simultaneously. There is no loop — broadcasting is the loop.

Tensor thinking: work backwards from target shape

The key question when vectorizing is: “what arrangement of elements would make this a simple elementwise operation?”

For `rotate_half`, we wanted $(-x_1, x_0, -x_3, x_2, \ldots)$. Working backwards: that’s a flatten of a `[d//2, 2]` matrix where columns are swapped and column 0 is negated. Swap → flip. Negate column 0 → multiply by `[-1, 1]`. The loop is never written; it dissolves into shape manipulation.
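The whole recipe in that shape-manipulation style, on a toy vector (a sketch, not the module’s exact code):

```python
import torch

d = 6
x = torch.arange(1.0, d + 1)                        # [1, 2, 3, 4, 5, 6]
pairs = x.view(d // 2, 2)                           # [[1, 2], [3, 4], [5, 6]]
# Swap columns (flip), then negate column 0 (multiply by [-1, 1]), then flatten.
out = (pairs.flip(-1) * torch.tensor([-1.0, 1.0])).reshape(d)
assert out.tolist() == [-2.0, 1.0, -4.0, 3.0, -6.0, 5.0]
```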

My own implementation:

```python
import torch
import einx
from jaxtyping import Integer, Num
from torch import Tensor, nn


class RoPE(nn.Module):
    def __init__(self, theta: float, d_k: int, max_seq_len: int):
        super().__init__()
        # In the original paper they use m in place of pos, and i in place of k.
        pos_s = torch.arange(max_seq_len)
        k_half_s = torch.arange(d_k // 2)

        theta_p_k: Num[Tensor, "seq_len d_k_half"] = torch.outer(pos_s, theta ** (-2 * k_half_s / d_k))
        cos_theta_p_k: Num[Tensor, "seq_len d_k_half"] = torch.cos(theta_p_k)
        sin_theta_p_k: Num[Tensor, "seq_len d_k_half"] = torch.sin(theta_p_k)

        # The seq_len dimension is a passenger; the important stuff is the k dim.
        # Taking the d_k=6 example: for each pair we do a rotation, with the
        # cos/sin terms repeat-interleaved so there's no interleaving of x later.
        term_one: Num[Tensor, "seq_len d_k"] = torch.repeat_interleave(cos_theta_p_k, repeats=2, dim=-1)
        term_two: Num[Tensor, "seq_len d_k"] = torch.repeat_interleave(sin_theta_p_k, repeats=2, dim=-1)
        self.register_buffer("term_one", term_one, persistent=False)
        self.register_buffer("term_two", term_two, persistent=False)

    def forward(
        self, x: Num[Tensor, "... seq_len d_k"], token_positions: Integer[Tensor, "... seq_len"]
    ) -> Num[Tensor, "... seq_len d_k"]:
        # For non-packed data, token_positions is just an arange at training time.
        # The x1 part is easy; focusing on x2 (the rotate_half term)...
        xx: Num[Tensor, "... d_k_half 2"] = einx.rearrange("... (d_k_half two) -> ... d_k_half two", x, two=2)
        xx = xx * torch.tensor([1, -1], device=self.term_one.device)  # (x0, x1) -> (x0, -x1)
        xx = xx.flip([-1])                                            # (x0, -x1) -> (-x1, x0)
        # Can also just do this, may be more intuitive:
        # xx = torch.stack([-x[..., 1::2], x[..., 0::2]], dim=-1)
        xx = einx.rearrange("... d_k_half two -> ... (d_k_half two)", xx, two=2)
        return self.term_one[token_positions] * x + self.term_two[token_positions] * xx
```
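As an end-to-end check of the generalization claim from earlier (position 5→8 behaves like 500→503), here is a compact double-precision functional version of the same recipe; `apply_rope` and all shapes are my own choices:

```python
import torch

def apply_rope(x: torch.Tensor, pos: torch.Tensor, theta: float = 10000.0) -> torch.Tensor:
    # x: [seq_len, d], pos: [seq_len]
    d = x.shape[-1]
    freqs = theta ** (-2 * torch.arange(d // 2, dtype=torch.float64) / d)
    ang = pos[:, None].double() * freqs                        # [seq_len, d//2]
    cos = torch.repeat_interleave(torch.cos(ang), 2, dim=-1)   # [seq_len, d]
    sin = torch.repeat_interleave(torch.sin(ang), 2, dim=-1)
    x2 = torch.stack([-x[..., 1::2], x[..., 0::2]], dim=-1).flatten(-2)
    return cos * x + sin * x2

torch.manual_seed(0)
q = torch.randn(16, 64, dtype=torch.float64)
k = torch.randn(16, 64, dtype=torch.float64)
pos = torch.arange(16)

# Shifting all positions by a constant offset leaves attention scores unchanged.
a = apply_rope(q, pos) @ apply_rope(k, pos).T
b = apply_rope(q, pos + 495) @ apply_rope(k, pos + 495).T
assert torch.allclose(a, b)
```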
 

Appendix A: linearity of the inner product

Let’s consider two vectors, $x$ and $y$, in a $d$-dimensional space, where $d$ is an even number.

The standard inner product (dot product), denoted by $\langle x, y \rangle$, is defined as the sum of the element-wise products of their components:

$$\langle x, y \rangle = \sum_{i=1}^{d} x_i y_i$$

Now, RoPE treats each $d$-dimensional vector as a concatenation of $d/2$ smaller, 2-dimensional vectors. Let’s denote these sub-vectors with a prime symbol (′):

  • $x'_1 = (x_1, x_2)$, $x'_2 = (x_3, x_4)$, …, $x'_{d/2} = (x_{d-1}, x_d)$
  • $y'_1 = (y_1, y_2)$, $y'_2 = (y_3, y_4)$, …, $y'_{d/2} = (y_{d-1}, y_d)$

Because of the basic rules of addition, we can simply regroup the terms in the original inner product sum:

$$\langle x, y \rangle = (x_1 y_1 + x_2 y_2) + (x_3 y_3 + x_4 y_4) + \cdots + (x_{d-1} y_{d-1} + x_d y_d)$$

Notice that each term in parentheses is just the inner product of the corresponding 2D sub-vectors:

$$x_{2i-1}\, y_{2i-1} + x_{2i}\, y_{2i} = \langle x'_i, y'_i \rangle$$

This leads us to the core identity that RoPE exploits. The inner product in $d$ dimensions is precisely the sum of the inner products in the constituent 2D subspaces:

$$\langle x, y \rangle = \sum_{i=1}^{d/2} \langle x'_i, y'_i \rangle$$
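This identity is easy to confirm numerically (names are mine):

```python
import torch

torch.manual_seed(0)
d = 8
x = torch.randn(d, dtype=torch.float64)
y = torch.randn(d, dtype=torch.float64)

# Sum of the 2D sub-inner-products equals the full d-dimensional inner product.
sub = (x.view(d // 2, 2) * y.view(d // 2, 2)).sum(dim=-1)   # <x'_i, y'_i> for each i
assert torch.allclose(sub.sum(), x @ y)
```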