@adedespirlet commented Dec 5, 2025

This PR provides a minimal reproducer for an issue with the triple-buffered GEMM, where the backend inserts conservative `s_waitcnt vmcnt(0)` instructions. These waits block the pipeline until all outstanding global memory operations complete, preventing the intended overlap between compute and global loads. As a result, the triple-buffering schedule does not yield the expected reduction in execution time.

Context

In the triple-buffered GEMM (256x256x232) I generate, the prologue issues 6 global vector loads in this order: A0, B0, B0_2, A1, B1, B1_2.
The first three loads (A0, B0, B0_2) fill buffer i, used for iteration i.
The next three loads (A1, B1, B1_2) fill a different buffer, used only for iteration i+1.
During iteration i, compute depends only on the first three loads; the A1/B1/B1_2 loads should remain in flight while iteration i is computing.
Ideally, we would wait only for the first three global loads to complete before starting the first compute iteration, rather than for all six. In theory this could be encoded as `s_waitcnt vmcnt(3)`, which waits until at most 3 vector memory operations remain outstanding, as in the sketch below.
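
For concreteness, this is the intended shape of the prologue and the first wait. Register choices and load widths here are illustrative placeholders, not the actual kernel output:

```asm
; Hypothetical prologue: 6 global loads filling buffers i and i+1
global_load_dwordx4 v[0:3],   v[40:41], off   ; A0   -> buffer i
global_load_dwordx4 v[4:7],   v[42:43], off   ; B0   -> buffer i
global_load_dwordx4 v[8:11],  v[44:45], off   ; B0_2 -> buffer i
global_load_dwordx4 v[12:15], v[46:47], off   ; A1   -> buffer i+1
global_load_dwordx4 v[16:19], v[48:49], off   ; B1   -> buffer i+1
global_load_dwordx4 v[20:23], v[50:51], off   ; B1_2 -> buffer i+1

s_waitcnt vmcnt(3)   ; desired: block only until <= 3 loads remain in flight,
                     ; i.e. A0/B0/B0_2 are done; A1/B1/B1_2 may still be pending
; ... MFMA compute for iteration i overlaps with the 3 outstanding loads ...
```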

Problems:

  1. `s_waitcnt vmcnt(3)` only guarantees that all but 3 outstanding memory ops have finished, not that the first three issued have, since completion order can differ from issue order (cache hits/misses, etc.).
    However, in this specific GEMM the last three global loads touch addresses that are only a K×2-byte offset away in memory, so I would expect them to have very similar latency and to complete roughly in issue order. Under that assumption it should be safe to use `vmcnt(3)`.
  2. Even if I emit a relaxed `vmcnt(3)` (via `rocdl.s.waitcnt 16371`; see the encoding sketch after this list), the backend currently inserts a conservative `vmcnt(0)`, forcing a full barrier and preventing overlap, which defeats the purpose of the extra pipelining depth.
    This behavior is clearly visible in the attached rocprofv3 trace: all 6 loads are tied to the same `vmcnt(0)`, and compute stalls until the entire batch completes.
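
As a side note on where the magic number 16371 comes from: a minimal sketch (not part of the PR) that reproduces the immediate, assuming the gfx9/CDNA `s_waitcnt` SIMM16 layout in which the vmcnt field occupies bits [3:0] and [15:14] and every other bit is left set so that expcnt/lgkmcnt are not waited on:

```python
# Hypothetical helper, not from gemm_test.py: pack an s_waitcnt immediate that
# waits only on the vector-memory counter (gfx9/CDNA SIMM16 layout assumed).
VMCNT_MASK = 0x000F | 0xC000  # vmcnt[3:0] in bits [3:0], vmcnt[5:4] in bits [15:14]

def waitcnt_vmcnt_only(vmcnt: int) -> int:
    """Encode an immediate that waits until at most `vmcnt` vector memory
    operations remain outstanding, leaving expcnt/lgkmcnt at their no-wait
    maxima (all ones)."""
    low = vmcnt & 0xF          # vmcnt[3:0]
    high = (vmcnt >> 4) & 0x3  # vmcnt[5:4]
    return (0xFFFF & ~VMCNT_MASK) | low | (high << 14)

assert waitcnt_vmcnt_only(3) == 16371  # the immediate used in this reproducer
assert waitcnt_vmcnt_only(0) == 16368  # a vmcnt(0)-only wait, for comparison
```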

Questions / Possible Directions

  • In case we assume that global memory ops finish in the order they were issued:

Emit assembly directly from MLIR, bypassing the backend pass that conservatively inserts `vmcnt(0)` instructions, and emit `vmcnt(3)` instead (one untested idea is sketched after this list).

  • In case it is not safe to assume that global memory ops finish in the order they were issued:

Then I am trying to determine whether AMD GPUs provide any mechanism to wait on specific global vector memory ops, rather than the outstanding-count model of `vmcnt`.
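
For the first direction, one possible workaround, purely a sketch and untested, would be to hand the relaxed wait to the assembler as opaque inline assembly via the LLVM dialect rather than through the `rocdl.s.waitcnt` intrinsic. Whether the backend's waitcnt-insertion pass would still add its own conservative `vmcnt(0)` around the surrounding memory operations would need to be verified:

```mlir
// Untested sketch: the wait reaches the assembler as an opaque string, so
// instruction selection cannot rewrite it. This does not by itself prevent
// the backend from inserting additional waits elsewhere.
llvm.inline_asm has_side_effects "s_waitcnt vmcnt(3)", "" : () -> ()
```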

The triple-buffered GEMM kernel IR is located in gemm_test.py, under `asm_256x256x64_tile_64x32x64_pingpong_mfma_16x16x32_triple_buffer`.
