@adedespirlet commented Dec 5, 2025

This PR provides a minimal reproducer for an issue with the triple-buffered GEMM, where the backend inserts conservative `s_waitcnt vmcnt(0)` instructions. These waits block the pipeline until all outstanding global memory operations complete, preventing the intended overlap between compute and global loads. As a result, the triple-buffering schedule does not yield the expected reduction in execution time.

Context

In the triple-buffered GEMM (256x256x232) I generate, the prologue issues 6 global vector loads in this order: A0, B0, B0_2, A1, B1, B1_2.
The first three loads (A0, B0, B0_2) fill buffer i, used for iteration i.
The next three loads (A1, B1, B1_2) fill a different buffer, used only for iteration i+1.
During iteration i, compute depends only on the first three loads; the A1/B1/B1_2 loads should remain in flight while iteration i is computing.
Ideally, we would wait only for the first three global loads to complete before starting the first compute iteration, rather than for all six. In theory this could be encoded as `s_waitcnt vmcnt(3)`, which waits until at most 3 vector memory operations remain outstanding, as in the sketch below.
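
For concreteness, this is the intended shape of the prologue and the first wait. Register choices and load widths here are illustrative placeholders, not the actual kernel output:

```asm
; Hypothetical prologue: 6 global loads filling buffers i and i+1
global_load_dwordx4 v[0:3],   v[40:41], off   ; A0   -> buffer i
global_load_dwordx4 v[4:7],   v[42:43], off   ; B0   -> buffer i
global_load_dwordx4 v[8:11],  v[44:45], off   ; B0_2 -> buffer i
global_load_dwordx4 v[12:15], v[46:47], off   ; A1   -> buffer i+1
global_load_dwordx4 v[16:19], v[48:49], off   ; B1   -> buffer i+1
global_load_dwordx4 v[20:23], v[50:51], off   ; B1_2 -> buffer i+1

s_waitcnt vmcnt(3)   ; desired: block only until <= 3 loads remain in flight,
                     ; i.e. A0/B0/B0_2 are done; A1/B1/B1_2 may still be pending
; ... MFMA compute for iteration i overlaps with the 3 outstanding loads ...
```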

Problems:

  1. `s_waitcnt vmcnt(3)` only guarantees that all but 3 outstanding memory ops have finished, not that the first three issued have, since completion order can differ from issue order (cache hits/misses, etc.).
    However, in this specific GEMM the last three global loads touch addresses that are only a K×2-byte offset away in memory, so I would expect them to have very similar latency and to complete roughly in issue order. Under that assumption it should be safe to use `vmcnt(3)`.
  2. Even if I emit a relaxed `vmcnt(3)` (via `rocdl.s.waitcnt 16371`; see the encoding sketch after this list), the backend currently inserts a conservative `vmcnt(0)`, forcing a full barrier and preventing overlap, which defeats the purpose of the extra pipelining depth.
    This behavior is clearly visible in the attached rocprofv3 trace: all 6 loads are tied to the same `vmcnt(0)`, and compute stalls until the entire batch completes.
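
As a side note on where the magic number 16371 comes from: a minimal sketch (not part of the PR) that reproduces the immediate, assuming the gfx9/CDNA `s_waitcnt` SIMM16 layout in which the vmcnt field occupies bits [3:0] and [15:14] and every other bit is left set so that expcnt/lgkmcnt are not waited on:

```python
# Hypothetical helper, not from gemm_test.py: pack an s_waitcnt immediate that
# waits only on the vector-memory counter (gfx9/CDNA SIMM16 layout assumed).
VMCNT_MASK = 0x000F | 0xC000  # vmcnt[3:0] in bits [3:0], vmcnt[5:4] in bits [15:14]

def waitcnt_vmcnt_only(vmcnt: int) -> int:
    """Encode an immediate that waits until at most `vmcnt` vector memory
    operations remain outstanding, leaving expcnt/lgkmcnt at their no-wait
    maxima (all ones)."""
    low = vmcnt & 0xF          # vmcnt[3:0]
    high = (vmcnt >> 4) & 0x3  # vmcnt[5:4]
    return (0xFFFF & ~VMCNT_MASK) | low | (high << 14)

assert waitcnt_vmcnt_only(3) == 16371  # the immediate used in this reproducer
assert waitcnt_vmcnt_only(0) == 16368  # a vmcnt(0)-only wait, for comparison
```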

Questions / Possible Directions

  • In case we assume that global memory ops finish in the order they were issued:

Emit assembly directly from MLIR, bypassing the backend pass that conservatively inserts `vmcnt(0)` instructions, and emit `vmcnt(3)` instead (one untested idea is sketched after this list).

  • In case it is not safe to assume that global memory ops finish in the order they were issued:

Then I am trying to determine whether AMD GPUs provide any mechanism to wait on specific global vector memory ops, rather than the outstanding-count model of `vmcnt`.
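
For the first direction, one possible workaround, purely a sketch and untested, would be to hand the relaxed wait to the assembler as opaque inline assembly via the LLVM dialect rather than through the `rocdl.s.waitcnt` intrinsic. Whether the backend's waitcnt-insertion pass would still add its own conservative `vmcnt(0)` around the surrounding memory operations would need to be verified:

```mlir
// Untested sketch: the wait reaches the assembler as an opaque string, so
// instruction selection cannot rewrite it. This does not by itself prevent
// the backend from inserting additional waits elsewhere.
llvm.inline_asm has_side_effects "s_waitcnt vmcnt(3)", "" : () -> ()
```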

The triple-buffered GEMM kernel IR is located in gemm_test.py, under `asm_256x256x64_tile_64x32x64_pingpong_mfma_16x16x32_triple_buffer`.
