
Use real-valued instead of complex tensors in Wan2.1 RoPE #11649


Status: Open. mjkvaak-amd wants to merge 4 commits into main.
Conversation

@mjkvaak-amd (Contributor) commented on Jun 3, 2025

What does this PR do?

Avoids complex tensors in the Wan2.1 RoPE by using real-valued cosine and sine tables instead. This boosts the performance of compiled models (torch.compile with the inductor backend), where complex tensors are not supported.
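For illustration, here is a minimal standalone sketch (not part of the PR) of the identity being exploited: multiplying an (x1, x2) pair by the unit complex number cos(t) + i*sin(t) is the same rotation as the real-valued form (x1*cos - x2*sin, x1*sin + x2*cos), so the rope tables can be precomputed and applied without any complex dtypes:

import torch

# Minimal sketch (not from the PR): the complex product
# (x1 + i*x2) * (cos(t) + i*sin(t)) equals the real-valued rotation
# (x1*cos(t) - x2*sin(t), x1*sin(t) + x2*cos(t)), which inductor can compile.
x = torch.randn(4, 2)  # four (x1, x2) pairs
t = torch.rand(4)      # rotation angles
cos, sin = t.cos(), t.sin()

complex_out = torch.view_as_complex(x) * torch.polar(torch.ones(4), t)
real_out = torch.stack(
    [x[:, 0] * cos - x[:, 1] * sin, x[:, 0] * sin + x[:, 1] * cos], dim=-1
)
assert torch.allclose(torch.view_as_real(complex_out), real_out, atol=1e-6)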

Fixes # (issue)

Before submitting

To verify that the proposed RoPE and utils behave identically and stably compared to the original, I ran a 100-step training of Wan2.1 (image-to-video) with both the proposed (orange) and the original (blue) implementations; see the attached loss-curve screenshot. The losses lie exactly on top of each other, but the hover tooltip shows that there are indeed two identical curves.

Please also find below the standalone tests that check the equivalence:

import torch
from torch import nn
from typing import Tuple

from diffusers.models.embeddings import get_1d_rotary_pos_embed


class WanRotaryPosEmbed(nn.Module):
    def __init__(
        self,
        attention_head_dim: int,
        patch_size: Tuple[int, int, int],
        max_seq_len: int,
        theta: float = 10000.0,
    ):
        super().__init__()

        self.attention_head_dim = attention_head_dim
        self.patch_size = patch_size
        self.max_seq_len = max_seq_len

        # 3D RoPE: split the head dim across (t, h, w); h and w get equal,
        # even shares and t takes the remainder.
        h_dim = w_dim = 2 * (attention_head_dim // 6)
        t_dim = attention_head_dim - h_dim - w_dim
        freqs_dtype = (
            torch.float32 if torch.backends.mps.is_available() else torch.float64
        )

        freqs_cos = []
        freqs_sin = []

        for dim in [t_dim, h_dim, w_dim]:
            freq_cos, freq_sin = get_1d_rotary_pos_embed(
                dim,
                max_seq_len,
                theta,
                use_real=True,
                repeat_interleave_real=True,
                freqs_dtype=freqs_dtype,
            )
            freqs_cos.append(freq_cos)
            freqs_sin.append(freq_sin)

        self.freqs_cos = torch.cat(freqs_cos, dim=1)
        self.freqs_sin = torch.cat(freqs_sin, dim=1)

    def forward(self, hidden_states: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        batch_size, num_channels, num_frames, height, width = hidden_states.shape
        p_t, p_h, p_w = self.patch_size
        ppf, pph, ppw = num_frames // p_t, height // p_h, width // p_w

        # Move the precomputed tables to the input's device on first use.
        self.freqs_cos = self.freqs_cos.to(hidden_states.device)
        self.freqs_sin = self.freqs_sin.to(hidden_states.device)

        split_sizes = [
            self.attention_head_dim - 2 * (self.attention_head_dim // 3),
            self.attention_head_dim // 3,
            self.attention_head_dim // 3,
        ]

        freqs_cos = self.freqs_cos.split(split_sizes, dim=1)
        freqs_sin = self.freqs_sin.split(split_sizes, dim=1)

        freqs_cos_f = freqs_cos[0][:ppf].view(ppf, 1, 1, -1).expand(ppf, pph, ppw, -1)
        freqs_cos_h = freqs_cos[1][:pph].view(1, pph, 1, -1).expand(ppf, pph, ppw, -1)
        freqs_cos_w = freqs_cos[2][:ppw].view(1, 1, ppw, -1).expand(ppf, pph, ppw, -1)

        freqs_sin_f = freqs_sin[0][:ppf].view(ppf, 1, 1, -1).expand(ppf, pph, ppw, -1)
        freqs_sin_h = freqs_sin[1][:pph].view(1, pph, 1, -1).expand(ppf, pph, ppw, -1)
        freqs_sin_w = freqs_sin[2][:ppw].view(1, 1, ppw, -1).expand(ppf, pph, ppw, -1)

        freqs_cos = torch.cat([freqs_cos_f, freqs_cos_h, freqs_cos_w], dim=-1).reshape(
            1, 1, ppf * pph * ppw, -1
        )
        freqs_sin = torch.cat([freqs_sin_f, freqs_sin_h, freqs_sin_w], dim=-1).reshape(
            1, 1, ppf * pph * ppw, -1
        )

        return freqs_cos, freqs_sin


def apply_rotary_emb(
    hidden_states: torch.Tensor,
    freqs_cos: torch.Tensor,
    freqs_sin: torch.Tensor,
) -> torch.Tensor:
    dtype = torch.float32 if hidden_states.device.type == "mps" else torch.float64
    # Group the last dimension into interleaved (even, odd) rotation pairs.
    x = hidden_states.view(*hidden_states.shape[:-1], -1, 2).to(dtype)
    x1, x2 = x[..., 0], x[..., 1]
    # The tables are repeat-interleaved ([c0, c0, c1, c1, ...]), so strided
    # slicing recovers one cos/sin entry per pair.
    cos = freqs_cos[..., 0::2]
    sin = freqs_sin[..., 1::2]
    # Rotate each pair: (x1 + i*x2) * (cos + i*sin), written in real arithmetic.
    out = torch.empty_like(hidden_states)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out


class WanRotaryPosEmbedOriginal(nn.Module):
    def __init__(
        self,
        attention_head_dim: int,
        patch_size: Tuple[int, int, int],
        max_seq_len: int,
        theta: float = 10000.0,
    ):
        super().__init__()

        self.attention_head_dim = attention_head_dim
        self.patch_size = patch_size
        self.max_seq_len = max_seq_len

        h_dim = w_dim = 2 * (attention_head_dim // 6)
        t_dim = attention_head_dim - h_dim - w_dim

        freqs = []
        freqs_dtype = (
            torch.float32 if torch.backends.mps.is_available() else torch.float64
        )
        for dim in [t_dim, h_dim, w_dim]:
            freq = get_1d_rotary_pos_embed(
                dim,
                max_seq_len,
                theta,
                use_real=False,
                repeat_interleave_real=False,
                freqs_dtype=freqs_dtype,
            )
            freqs.append(freq)
        self.freqs = torch.cat(freqs, dim=1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        batch_size, num_channels, num_frames, height, width = hidden_states.shape
        p_t, p_h, p_w = self.patch_size
        ppf, pph, ppw = num_frames // p_t, height // p_h, width // p_w

        freqs = self.freqs.to(hidden_states.device)
        freqs = freqs.split_with_sizes(
            [
                self.attention_head_dim // 2 - 2 * (self.attention_head_dim // 6),
                self.attention_head_dim // 6,
                self.attention_head_dim // 6,
            ],
            dim=1,
        )

        freqs_f = freqs[0][:ppf].view(ppf, 1, 1, -1).expand(ppf, pph, ppw, -1)
        freqs_h = freqs[1][:pph].view(1, pph, 1, -1).expand(ppf, pph, ppw, -1)
        freqs_w = freqs[2][:ppw].view(1, 1, ppw, -1).expand(ppf, pph, ppw, -1)
        freqs = torch.cat([freqs_f, freqs_h, freqs_w], dim=-1).reshape(
            1, 1, ppf * pph * ppw, -1
        )
        return freqs


def apply_rotary_emb_original(hidden_states: torch.Tensor, freqs: torch.Tensor) -> torch.Tensor:
    dtype = torch.float32 if hidden_states.device.type == "mps" else torch.float64
    # Reinterpret interleaved pairs as complex numbers and rotate by the
    # unit-magnitude complex frequencies.
    x_rotated = torch.view_as_complex(hidden_states.to(dtype).unflatten(3, (-1, 2)))
    x_out = torch.view_as_real(x_rotated * freqs).flatten(3, 4)
    return x_out.type_as(hidden_states)


def test_rotary_pos_embed_value_equivalence():
    attention_head_dim = 12
    patch_size = (2, 2, 2)
    max_seq_len = 16
    batch, channels, frames, height, width = 1, attention_head_dim, 8, 8, 8
    hidden_states = torch.randn(batch, channels, frames, height, width)

    rope = WanRotaryPosEmbed(attention_head_dim, patch_size, max_seq_len)
    rope_orig = WanRotaryPosEmbedOriginal(attention_head_dim, patch_size, max_seq_len)

    # New returns (cos, sin), original returns complex
    cos, sin = rope(hidden_states)
    orig = rope_orig(hidden_states)  # complex, shape: (1, 1, N, D/2)

    # Remove batch dims for comparison
    cos = cos.squeeze(0).squeeze(0)  # (N, D)
    sin = sin.squeeze(0).squeeze(0)  # (N, D)
    orig = orig.squeeze(0).squeeze(0)  # (N, D/2), complex
    cos_real = cos[:, 0::2]
    sin_real = sin[:, 1::2]

    # Reconstruct complex tensor
    recon = cos_real + 1j * sin_real

    # Compare real and imaginary parts
    assert torch.allclose(recon.real.float(), orig.real.float(), atol=1e-5)
    assert torch.allclose(recon.imag.float(), orig.imag.float(), atol=1e-5)


def test_rotary_emb_equivalence():
    attention_head_dim = 12
    patch_size = (2, 2, 2)
    max_seq_len = 16
    batch, channels, frames, height, width = 1, attention_head_dim, 8, 8, 8
    hidden_states = torch.randn(batch, channels, frames, height, width)

    rope = WanRotaryPosEmbed(attention_head_dim, patch_size, max_seq_len)
    rope_orig = WanRotaryPosEmbedOriginal(attention_head_dim, patch_size, max_seq_len)

    # Get rotary embeddings
    cos, sin = rope(hidden_states)
    freqs = rope_orig(hidden_states)

    # Prepare a fake attention input (B, H, N, D)
    B, H, N, D = cos.shape
    x = torch.randn(B, H, N, D, dtype=torch.float32)

    # Apply both rotary embeddings
    out_orig = apply_rotary_emb_original(x, freqs)
    out_real = apply_rotary_emb(x, cos, sin)

    # Check equivalence
    assert torch.allclose(
        out_real, out_orig, atol=1e-5
    ), "Real-valued rotary embedding does not match original complex version"

@a-r-r-o-w (Member) left a comment:


Wow, awesome work @mjkvaak-amd and thank you! Coincidentally, I was working on refactoring some of the rope code as well this week for compile compatibility, but you beat me to it :)

The changes look good to me visually, but I'll quickly verify the numeric values on our end as well.

Returning a tuple from the rope layer might cause some issues for research repos that copy the transformer implementation from diffusers but import internal layers directly, or for folks using a custom attention processor that expects the complex rope tensor (once this change is in main and in the next release). I think it should be fine since it will arrive in a new release, but LMK your thoughts @DN6
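
If needed, such downstream code could reconstruct the old complex tensor from the new tuple, mirroring the reconstruction used in the equivalence test above. A minimal sketch, assuming the repeat-interleaved table layout (hypothetical shim, not part of the PR):

# Hypothetical shim: rebuild the complex rope tensor from the (cos, sin)
# tuple; relies on the repeat-interleaved layout ([c0, c0, c1, c1, ...]).
freqs_cos, freqs_sin = rope(hidden_states)  # new tuple-returning API
freqs = torch.complex(freqs_cos[..., 0::2], freqs_sin[..., 1::2])
# freqs now has shape (1, 1, N, D/2), matching the previous complex output.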

A Member commented on the line "self.freqs_cos = self.freqs_cos.to(hidden_states.device)" in forward:
I think doing it this way will cause a recompilation. With this refactor we could probably just store them as non-persistent buffers. The reason for not using a buffer before was that it was a complex tensor.
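
A minimal sketch of that suggestion, reusing the freqs_cos / freqs_sin lists built in __init__ above (the two .to(hidden_states.device) reassignments in forward would then be dropped):

# Sketch of the suggested refactor: non-persistent buffers move with
# module.to(device), so there is no call-time .to() and no recompilation,
# and the tables stay out of the state dict.
self.register_buffer("freqs_cos", torch.cat(freqs_cos, dim=1), persistent=False)
self.register_buffer("freqs_sin", torch.cat(freqs_sin, dim=1), persistent=False)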

@mjkvaak-amd (Contributor, author) replied:
Good thinking! I have added the proposed changes now.

@a-r-r-o-w (Member) commented:

On my end, I can confirm that the numerical outputs match on many arbitrary shapes. However, I do get different final results on full inference when comparing this branch to main.

[Video attachments: output.mp4, output2.mp4]

(left is this branch, right is diffusers:main; both use the example pipeline code with the same seed)

Trying to look into what could be the problem (possibly just something on my end)
