Triple-Buffered GEMM blocked by conservative vmcnt(0) insertion #534
This PR provides a minimal reproducer for an issue with the triple-buffered GEMM, where the backend inserts conservative s_waitcnt vmcnt(0) instructions. These waits block the pipeline until all outstanding global memory operations complete, preventing the intended overlap between compute and global loads. As a result, the triple-buffering schedule does not yield the expected reduction in execution time.
Context
In the triple-buffered GEMM (256x256x232) I generate, the prologue issues 6 global vector loads in this order: A0, B0, B0_2, A1, B1, B1_2
The first three loads (A0, B0, B0_2) fill buffer i, used for iteration i.
The next three loads (A1, B1, B1_2) fill a different buffer, used only for iteration i+1.
During iteration i, compute depends only on the first three loads. The A1/B1/B1_2 loads should remain in flight while iteration i is computing.
Ideally, we want to wait only for the first three global loads to complete before starting the first compute iteration, rather than for all six. In theory this could be encoded as s_waitcnt vmcnt(3), which waits until at most 3 memory operations remain outstanding.
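The counting semantics of the wait can be illustrated with a toy model. This is a minimal Python sketch, not the hardware's or the backend's implementation: it assumes loads retire strictly in issue order, and the names `VmcntModel`, `issue`, `wait`, and `prologue` are hypothetical.

```python
from collections import deque


class VmcntModel:
    """Toy model of an outstanding-load counter (assumes in-order retirement)."""

    def __init__(self):
        self.outstanding = deque()  # loads issued but not yet returned
        self.completed = []         # loads whose data has returned

    def issue(self, name):
        self.outstanding.append(name)

    def wait(self, n):
        """Model s_waitcnt vmcnt(n): block until at most n loads are outstanding."""
        while len(self.outstanding) > n:
            # Assumption: the oldest outstanding load always returns first.
            self.completed.append(self.outstanding.popleft())
        return list(self.completed)


def prologue(wait_count):
    """Issue the six prologue loads, then wait with the given vmcnt value."""
    model = VmcntModel()
    for load in ["A0", "B0", "B0_2", "A1", "B1", "B1_2"]:
        model.issue(load)
    done = model.wait(wait_count)
    return done, list(model.outstanding)
```

In this model, `prologue(3)` returns once A0, B0, and B0_2 have landed and leaves A1, B1, and B1_2 in flight, while `prologue(0)` drains all six before compute can start.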
Problems:
However, in this specific GEMM case the last three global loads touch addresses that are only a K×2-byte offset away in memory, so I'd expect them to have very similar latency and to complete roughly in issue order. Under that assumption it should be safe to use vmcnt(3).
This behavior can be seen clearly in the attached rocprofv3 trace: all 6 loads are tied to the same vmcnt(0), and compute stalls until the entire batch completes.
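To make the cost of the conservative wait concrete, the release time of a wait can be estimated from per-load return times. The sketch below is a simplified model under stated assumptions: the counter retires loads in issue order (a load cannot retire before the ones issued ahead of it), the function name `wait_release_time` is hypothetical, and the cycle counts in the usage note are invented for illustration rather than taken from the rocprofv3 trace.

```python
from itertools import accumulate


def wait_release_time(return_times, n):
    """Time at which vmcnt(n) releases.

    return_times: data-return time of each load, listed in issue order.
    Assumes in-order retirement, so load k retires at the running maximum
    of the return times of loads 0..k.
    """
    if n >= len(return_times):
        return 0  # nothing to wait for
    retire = list(accumulate(return_times, max))
    # vmcnt(n) releases when the load at index len - n - 1 has retired.
    return retire[len(return_times) - n - 1]
```

With illustrative return times of `[500, 510, 520, 530, 540, 550]` cycles, `wait_release_time(..., 3)` gives 520 and `wait_release_time(..., 0)` gives 550. The gap is small for a single prologue, but in this model the vmcnt(0) variant also leaves no loads in flight to overlap with compute.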
Questions / Possible Directions
Emit assembly directly from MLIR, bypassing the backend that conservatively inserts vmcnt(0), and emit vmcnt(3) instead.
I'm also trying to determine whether AMD GPUs provide any mechanism to wait on specific global vector memory operations, rather than relying on the outstanding-count model of vmcnt.
The triple-buffered GEMM kernel IR is located in gemm_test.py, under asm_256x256x64_tile_64x32x64_pingpong_mfma_16x16x32_triple_buffer.