The paper was published in 2021 as a preliminary report, and in late 2023 it was updated to v5. The core idea is the same; the main change is that it now evaluates on English benchmarks instead of only Chinese data.

This is probably one of the worst-written papers, English-wise, that I’ve read. It’s kind of ironic: this is an LLM paper, yet no LLM was used to audit the writing.

Why relative

This is my understanding, not what’s in the paper. We don’t need the absolute position information that absolute position embedding provides. What we care about when doing attention is the relative distance. That means relative encodings can generalize better on longer sequences (position 5 attending to 8 is the same as 500 attending to 503).

Formulation

Say we generate the positional embedding via a function $f$:

$$q_m = f_q(x_m, m), \qquad k_n = f_k(x_n, n)$$

$q_m$ and $k_n$ both carry their position because that’s a way of understanding the weighted sum. And then they go through softmax like this:

$$a_{m,n} = \frac{\exp\!\left(\frac{q_m^\top k_n}{\sqrt{d}}\right)}{\sum_{j=1}^{N}\exp\!\left(\frac{q_m^\top k_j}{\sqrt{d}}\right)}$$

For the traditional absolute position embedding, that’s

$$f_{\{q,k\}}(x_i, i) = W_{\{q,k\}}(x_i + p_i)$$

where $p_i$ is the position vector. Since we want to capture the relative information in attention, it would be good if the score $q_m^\top k_n$ depended only on the relative position between $m$ and $n$. So the problem becomes: can we find such an $f$, such that

$$\langle f_q(x_m, m),\, f_k(x_n, n)\rangle = g(x_m, x_n, m-n)$$

One can find a solution to our formulation (in 2D) when $f$ is:

$$f_{\{q,k\}}(x_m, m) = (W_{\{q,k\}}\, x_m)\, e^{i m\theta}$$

which gives

$$g(x_m, x_n, m-n) = \operatorname{Re}\!\left[(W_q x_m)(W_k x_n)^{*}\, e^{i(m-n)\theta}\right]$$

where $\operatorname{Re}[\cdot]$ is the real part of a complex number and $(\cdot)^{*}$ represents the complex conjugate. $\theta$ is a preset non-zero constant. We can further write $f$ as a rotation-matrix multiplication:

$$f_{\{q,k\}}(x_m, m) = \begin{pmatrix}\cos m\theta & -\sin m\theta\\ \sin m\theta & \cos m\theta\end{pmatrix} W_{\{q,k\}}\, x_m$$

We can see intuitively why this works: it rotates the embedding, i.e. assigns it an angle in the 2D plane. A vector rotated by $m\theta$ and a vector rotated by $n\theta$, when doing a dot product, give us a value that depends on the positions only through the angle difference $(m-n)\theta$.
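A quick numeric check of this 2D picture (`rot2d`, `score`, and the sample vectors are my own illustrative names, not from the paper):

```python
import math

def rot2d(x, angle):
    """Rotate a 2D vector (x0, x1) by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * x[0] - s * x[1], s * x[0] + c * x[1])

theta = 0.3                      # preset non-zero constant
q, k = (1.0, 2.0), (0.5, -1.0)   # arbitrary 2D query/key vectors

def score(m, n):
    """Dot product of q rotated by m*theta with k rotated by n*theta."""
    qm, kn = rot2d(q, m * theta), rot2d(k, n * theta)
    return qm[0] * kn[0] + qm[1] * kn[1]

# Same relative offset -> same score, regardless of absolute position.
assert abs(score(5, 8) - score(500, 503)) < 1e-9
```

The assertion is exactly the "position 5 to 8 is the same as 500 to 503" claim from above.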

In order to generalize our results in 2D to any $d$ where $d$ is even, we divide the $d$-dimensional space into $d/2$ sub-spaces and combine them using the linearity of the inner product, turning $f$ into:

$$f_{\{q,k\}}(x_m, m) = R^{d}_{\Theta, m}\, W_{\{q,k\}}\, x_m$$

where

$$R^{d}_{\Theta,m} = \begin{pmatrix}
\cos m\theta_1 & -\sin m\theta_1 & \cdots & 0 & 0\\
\sin m\theta_1 & \cos m\theta_1 & \cdots & 0 & 0\\
\vdots & \vdots & \ddots & \vdots & \vdots\\
0 & 0 & \cdots & \cos m\theta_{d/2} & -\sin m\theta_{d/2}\\
0 & 0 & \cdots & \sin m\theta_{d/2} & \cos m\theta_{d/2}
\end{pmatrix}$$

is the rotary matrix with pre-defined parameters $\Theta = \{\theta_i = 10000^{-2(i-1)/d},\ i \in [1, \ldots, d/2]\}$. One can think about it as “encoding relative information for each pair of size 2 in the embedding”.

  • The $\theta_i$ frequencies are the same idea as in absolute position embedding: they are for differentiating dimensions. They could technically all be equal and still satisfy the RoPE property. They are still set as this decaying sequence since:
    • You want frequencies spread across many orders of magnitude so different heads/dimensions can attend to both local and long-range positional relationships.
    • You want the rotation to be slow enough in some dimensions that the model can generalize to longer sequences.
    • The predefined θ sequence is more of a prior / initialization of the frequency basis than a hard constraint. The model has a degree of freedom to work around it through the learned projections.
  • The $m$ factor is the token position.
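For intuition only, here is a sketch that materializes the full block-diagonal matrix (something real implementations avoid; `rotary_matrix` is my own helper, using the conventional $\theta_i$ schedule) and checks the relative-position property:

```python
import torch

def rotary_matrix(m: int, d: int, base: float = 10000.0) -> torch.Tensor:
    """Full d x d block-diagonal rotary matrix R_{Theta, m} (illustrative only)."""
    # theta_i = base^(-2i/d) for i = 0..d/2-1, one frequency per 2D pair.
    theta = base ** (-2 * torch.arange(d // 2, dtype=torch.float64) / d)
    blocks = []
    for a in m * theta:
        c, s = torch.cos(a), torch.sin(a)
        blocks.append(torch.stack([torch.stack([c, -s]), torch.stack([s, c])]))
    return torch.block_diag(*blocks)

# The key property: R_m^T R_n = R_{n-m}, so q_m . k_n only sees n - m.
d = 8
assert torch.allclose(rotary_matrix(5, d).T @ rotary_matrix(8, d),
                      rotary_matrix(3, d), atol=1e-12)
```

Each 2×2 block is an ordinary rotation, so composing two rotary matrices just adds angles per pair.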

Comparison

|                 | Absolute PE | RoPE |
|-----------------|-------------|------|
| What it encodes | Absolute position | Relative position |
| Where applied   | Input embeddings | Q and K, inside attention |
| Why there       | Enriches token identity before projection | Must survive to the dot product to get the $R_{j-i}$ factoring |
| Applied to V?   | Yes (implicitly, V gets the embedding) | No: V doesn’t participate in the position-sensitive dot product |

Properties of RoPE

  • Similar to the OG position embedding, it has a long-term decay property: the attention score’s upper bound shrinks as the relative distance grows (that’s because the “going down” part of the cosine is the dominant part)
  • It can be used easily with linear attention

Implementation

This part is generated from my conversation with Claude Sonnet 4.6.

Rather than constructing the full block-diagonal rotation matrix, we exploit the fact that each 2D rotation only mixes the adjacent pair $(x_{2i}, x_{2i+1})$. This lets us decompose the operation into two elementwise multiplications.

Derivation (d=6)

For $x = (x_0, x_1, x_2, x_3, x_4, x_5)$ with frequency angles $(\theta_0, \theta_1, \theta_2)$ (pair $(x_{2i}, x_{2i+1})$ rotated by $\theta_i$), the full rotation gives:

$$Rx = \begin{pmatrix} x_0\cos\theta_0 - x_1\sin\theta_0 \\ x_1\cos\theta_0 + x_0\sin\theta_0 \\ x_2\cos\theta_1 - x_3\sin\theta_1 \\ x_3\cos\theta_1 + x_2\sin\theta_1 \\ x_4\cos\theta_2 - x_5\sin\theta_2 \\ x_5\cos\theta_2 + x_4\sin\theta_2 \end{pmatrix}$$

Split into two terms:

$$Rx = \begin{pmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{pmatrix} \odot \begin{pmatrix} \cos\theta_0 \\ \cos\theta_0 \\ \cos\theta_1 \\ \cos\theta_1 \\ \cos\theta_2 \\ \cos\theta_2 \end{pmatrix} + \begin{pmatrix} -x_1 \\ x_0 \\ -x_3 \\ x_2 \\ -x_5 \\ x_4 \end{pmatrix} \odot \begin{pmatrix} \sin\theta_0 \\ \sin\theta_0 \\ \sin\theta_1 \\ \sin\theta_1 \\ \sin\theta_2 \\ \sin\theta_2 \end{pmatrix}$$

Why this split and not another?

We could have split differently — e.g. putting only even-indexed elements in term 1 and odd-indexed in term 2. But that would require strided slicing and interleaving back at the end, which are awkward tensor operations.

This split is chosen because both terms have the same “shape of access”: the full vector appears once in each term. That means term 2 only requires a cheap rearrangement of $x$, not a gather/scatter.

Principle: the math admits many equivalent decompositions. Choose the one that maps onto cheap tensor operations.
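A sanity check of the split for d=6, with my own variable names:

```python
import torch

torch.manual_seed(0)
d = 6
x = torch.randn(d, dtype=torch.float64)
theta = 10000.0 ** (-2 * torch.arange(d // 2, dtype=torch.float64) / d)
ang = 7 * theta                                     # position m = 7, one angle per pair

# Reference: apply each 2x2 rotation block explicitly.
ref = torch.empty_like(x)
for i in range(d // 2):
    c, s = torch.cos(ang[i]), torch.sin(ang[i])
    ref[2 * i] = c * x[2 * i] - s * x[2 * i + 1]
    ref[2 * i + 1] = s * x[2 * i] + c * x[2 * i + 1]

# Decomposed form: two elementwise multiplies, no matrix in sight.
cos = torch.repeat_interleave(torch.cos(ang), 2)    # [c0, c0, c1, c1, c2, c2]
sin = torch.repeat_interleave(torch.sin(ang), 2)
rot_half = torch.stack([-x[1::2], x[0::2]], dim=-1).reshape(d)  # (-x1, x0, -x3, x2, -x5, x4)
assert torch.allclose(cos * x + sin * rot_half, ref)
```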


Computing the Two Components

The cos/sin vectors go from shape `[seq_len, d//2]` to `[seq_len, d]` via `repeat_interleave(..., repeats=2, dim=-1)`.
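For example (a throwaway shape check, values mine):

```python
import torch

half = torch.tensor([[0.1, 0.2, 0.3]], dtype=torch.float64)  # [seq_len=1, d//2=3]
full = torch.repeat_interleave(half, repeats=2, dim=-1)      # [seq_len=1, d=6]
assert full.tolist() == [[0.1, 0.1, 0.2, 0.2, 0.3, 0.3]]
```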

The `rotate_half` operation transforms $(x_0, x_1, x_2, x_3, \ldots)$ into $(-x_1, x_0, -x_3, x_2, \ldots)$.


Final Formula

$$\mathrm{RoPE}(x) = \cos_{\mathrm{rep}} \odot x + \sin_{\mathrm{rep}} \odot \mathrm{rotate\_half}(x)$$

where $\cos_{\mathrm{rep}}$ and $\sin_{\mathrm{rep}}$ are the repeat-interleaved cos/sin tables indexed at the token’s position.

Intuitions

The "passengers" intuition

The rotation only acts on the embedding dimension $d$. Batch, heads, and seq are just passengers: they don’t participate in the logic. So you can derive everything by thinking about a single vector $x \in \mathbb{R}^d$, then broadcast freely over the other dimensions.

A tensor of shape [seq_len, d] for cos/sin already holds the answer for every position simultaneously. There is no loop — broadcasting is the loop.

Tensor thinking: work backwards from target shape

The key question when vectorizing is: “what arrangement of elements would make this a simple elementwise operation?”

For `rotate_half`, we wanted $(-x_1, x_0, -x_3, x_2, \ldots)$. Working backwards: that’s a flatten of a `[d//2, 2]` matrix where columns are swapped and column 0 is negated. Swap → flip. Negate column 0 → multiply by `[-1, 1]`. The loop is never written; it dissolves into shape manipulation.
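The whole recipe in that shape-manipulation style, on a toy vector (a sketch, not the module’s exact code):

```python
import torch

d = 6
x = torch.arange(1.0, d + 1)                        # [1, 2, 3, 4, 5, 6]
pairs = x.view(d // 2, 2)                           # [[1, 2], [3, 4], [5, 6]]
# Swap columns (flip), then negate column 0 (multiply by [-1, 1]), then flatten.
out = (pairs.flip(-1) * torch.tensor([-1.0, 1.0])).reshape(d)
assert out.tolist() == [-2.0, 1.0, -4.0, 3.0, -6.0, 5.0]
```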

My own implementation:

```python
import torch
import einx
from jaxtyping import Integer, Num
from torch import Tensor, nn


class RoPE(nn.Module):
    def __init__(self, theta: float, d_k: int, max_seq_len: int):
        super().__init__()
        # In the original paper they use m in place of pos, and i in place of k.
        pos_s = torch.arange(max_seq_len)
        k_half_s = torch.arange(d_k // 2)

        theta_p_k: Num[Tensor, "seq_len d_k_half"] = torch.outer(pos_s, theta ** (-2 * k_half_s / d_k))
        cos_theta_p_k: Num[Tensor, "seq_len d_k_half"] = torch.cos(theta_p_k)
        sin_theta_p_k: Num[Tensor, "seq_len d_k_half"] = torch.sin(theta_p_k)

        # The seq_len dimension is a passenger; the important stuff is the k dim.
        # Taking the d_k=6 example: for each pair we do a rotation, with the
        # cos/sin terms repeat-interleaved so there's no interleaving of x later.
        term_one: Num[Tensor, "seq_len d_k"] = torch.repeat_interleave(cos_theta_p_k, repeats=2, dim=-1)
        term_two: Num[Tensor, "seq_len d_k"] = torch.repeat_interleave(sin_theta_p_k, repeats=2, dim=-1)
        self.register_buffer("term_one", term_one, persistent=False)
        self.register_buffer("term_two", term_two, persistent=False)

    def forward(
        self, x: Num[Tensor, "... seq_len d_k"], token_positions: Integer[Tensor, "... seq_len"]
    ) -> Num[Tensor, "... seq_len d_k"]:
        # For non-packed data, token_positions is just an arange at training time.
        # The x1 part is easy; focusing on x2 (the rotate_half term)...
        xx: Num[Tensor, "... d_k_half 2"] = einx.rearrange("... (d_k_half two) -> ... d_k_half two", x, two=2)
        xx = xx * torch.tensor([1, -1], device=self.term_one.device)  # (x0, x1) -> (x0, -x1)
        xx = xx.flip([-1])                                            # (x0, -x1) -> (-x1, x0)
        # Can also just do this, may be more intuitive:
        # xx = torch.stack([-x[..., 1::2], x[..., 0::2]], dim=-1)
        xx = einx.rearrange("... d_k_half two -> ... (d_k_half two)", xx, two=2)
        return self.term_one[token_positions] * x + self.term_two[token_positions] * xx
```
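As an end-to-end check of the generalization claim from earlier (position 5→8 behaves like 500→503), here is a compact double-precision functional version of the same recipe; `apply_rope` and all shapes are my own choices:

```python
import torch

def apply_rope(x: torch.Tensor, pos: torch.Tensor, theta: float = 10000.0) -> torch.Tensor:
    # x: [seq_len, d], pos: [seq_len]
    d = x.shape[-1]
    freqs = theta ** (-2 * torch.arange(d // 2, dtype=torch.float64) / d)
    ang = pos[:, None].double() * freqs                        # [seq_len, d//2]
    cos = torch.repeat_interleave(torch.cos(ang), 2, dim=-1)   # [seq_len, d]
    sin = torch.repeat_interleave(torch.sin(ang), 2, dim=-1)
    x2 = torch.stack([-x[..., 1::2], x[..., 0::2]], dim=-1).flatten(-2)
    return cos * x + sin * x2

torch.manual_seed(0)
q = torch.randn(16, 64, dtype=torch.float64)
k = torch.randn(16, 64, dtype=torch.float64)
pos = torch.arange(16)

# Shifting all positions by a constant offset leaves attention scores unchanged.
a = apply_rope(q, pos) @ apply_rope(k, pos).T
b = apply_rope(q, pos + 495) @ apply_rope(k, pos + 495).T
assert torch.allclose(a, b)
```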
 

Appendix A: linearity of the inner product

Let’s consider two vectors, $x$ and $y$, in a $d$-dimensional space, where $d$ is an even number.

The standard inner product (dot product), denoted by $\langle x, y \rangle$, is defined as the sum of the element-wise products of their components:

$$\langle x, y \rangle = \sum_{i=1}^{d} x_i y_i$$

Now, RoPE treats each $d$-dimensional vector as a concatenation of $d/2$ smaller, 2-dimensional vectors. Let’s denote these sub-vectors with a prime symbol (′):

  • $x'_1 = (x_1, x_2)$, $x'_2 = (x_3, x_4)$, …, $x'_{d/2} = (x_{d-1}, x_d)$
  • $y'_1 = (y_1, y_2)$, $y'_2 = (y_3, y_4)$, …, $y'_{d/2} = (y_{d-1}, y_d)$

Because of the basic rules of addition, we can simply regroup the terms in the original inner product sum:

$$\langle x, y \rangle = (x_1 y_1 + x_2 y_2) + (x_3 y_3 + x_4 y_4) + \cdots + (x_{d-1} y_{d-1} + x_d y_d)$$

Notice that each term in parentheses is just the inner product of the corresponding 2D sub-vectors:

$$x_{2i-1}\, y_{2i-1} + x_{2i}\, y_{2i} = \langle x'_i, y'_i \rangle$$

This leads us to the core identity that RoPE exploits. The inner product in $d$ dimensions is precisely the sum of the inner products in the constituent 2D subspaces:

$$\langle x, y \rangle = \sum_{i=1}^{d/2} \langle x'_i, y'_i \rangle$$
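This identity is easy to confirm numerically (names are mine):

```python
import torch

torch.manual_seed(0)
d = 8
x = torch.randn(d, dtype=torch.float64)
y = torch.randn(d, dtype=torch.float64)

# Sum of the 2D sub-inner-products equals the full d-dimensional inner product.
sub = (x.view(d // 2, 2) * y.view(d // 2, 2)).sum(dim=-1)   # <x'_i, y'_i> for each i
assert torch.allclose(sub.sum(), x @ y)
```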