[Hybrid]: Decouple Logical Block Size from Physical Page Size #24486
Conversation
Code Review
This pull request introduces a hybrid cache architecture to decouple logical and physical block sizes, which is a significant enhancement for memory management. The changes span configuration, platform-specific code, and the core block table management. The implementation in block_table.py appears solid. However, I've identified some critical issues in the tests intended to validate this new functionality. The tests are flawed and do not correctly verify the hybrid block logic, which could mask bugs. Additionally, there is a piece of logic in the GPUModelRunner that could be made more robust. My review focuses on fixing these test and implementation issues to ensure the new feature is reliable and well-tested.
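For orientation, here is a minimal sketch (not the PR's actual code) of the mapping this decoupling implies: one allocator-level physical page spans several smaller kernel-level blocks. The function name, the example sizes, and the assumption that the page size is an integer multiple of the kernel block size are illustrative.

```python
def expand_to_kernel_blocks(physical_block_ids: list[int],
                            physical_page_size: int,
                            kernel_block_size: int) -> list[int]:
    """Map allocator-level block IDs to kernel-level block IDs.

    Assumes the physical page size is an integer multiple of the kernel
    block size, so each physical block covers `ratio` kernel blocks.
    """
    assert physical_page_size % kernel_block_size == 0
    ratio = physical_page_size // kernel_block_size
    kernel_ids: list[int] = []
    for pid in physical_block_ids:
        # Physical block `pid` owns kernel blocks [pid * ratio, (pid + 1) * ratio).
        kernel_ids.extend(pid * ratio + i for i in range(ratio))
    return kernel_ids


# Example: 256-token pages exposed to a kernel that works on 16-token blocks.
print(expand_to_kernel_blocks([3, 7], 256, 16)[:4])  # [48, 49, 50, 51]
```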
Also CC @tdoublep
Discussed with @zhiyuan1i offline. Two major concerns:
- I prefer to calculate kernel block size for each attention backend in gpu_model_runner
- would be great if BlockTable.block_table and BlockTable.physical_block_table can be merged into one tensor
@heheda12345 Thanks for the prompt feedback! I've addressed suggestion 2 and merged BlockTable.block_table and BlockTable.physical_block_table into a single tensor as recommended. :)
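To make the single-tensor approach concrete, below is a hedged sketch of how a kernel-granularity view could be derived on demand from one physical block-table tensor instead of storing a second tensor; the shapes, sizes, and variable names are assumptions, not the PR's implementation.

```python
import torch

physical_page_size = 256   # allocator page size in tokens (assumed)
kernel_block_size = 16     # attention kernel block size (assumed)
ratio = physical_page_size // kernel_block_size

# One tensor of shape [num_reqs, max_num_blocks] holding physical block IDs.
block_table = torch.tensor([[3, 7, 2]], dtype=torch.int32)

# Derived kernel-level view: each physical ID expands to `ratio` consecutive IDs.
offsets = torch.arange(ratio, dtype=torch.int32)
kernel_view = (block_table.unsqueeze(-1) * ratio + offsets).flatten(start_dim=1)
print(kernel_view.shape)   # torch.Size([1, 48])
print(kernel_view[0, :4])  # tensor([48, 49, 50, 51], dtype=torch.int32)
```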
CC @gshtras @hongxiayang as this also affects ROCm
This pull request has merge conflicts that must be resolved before it can be merged.
Purpose

This PR introduces a hybrid cache architecture that separates the logical kernel block size from the physical page size, enabling more flexible memory management. Key changes include:

For hybrid models, this decoupling enables independent development of high-performance operators without being constrained by linear attention mechanisms like Mamba, addressing the performance bottlenecks discussed in issues #24280 and #23161.
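As a rough, non-authoritative illustration of the decoupling described above: in a hybrid model, a state-space layer such as Mamba may require a comparatively large physical page, while attention kernels prefer smaller block sizes, so one possible reconciliation is to pick a physical page size that every backend's kernel block size divides. The helper name and the numbers below are hypothetical, not taken from this PR.

```python
from math import lcm


def pick_physical_page_size(kernel_block_sizes: list[int],
                            min_page_tokens: int) -> int:
    """Return a page size (in tokens) that is divisible by every kernel block
    size and is at least `min_page_tokens` (e.g., what a Mamba layer needs)."""
    base = lcm(*kernel_block_sizes)
    num_pages = -(-min_page_tokens // base)  # ceiling division
    return base * num_pages


# Example: 16- and 32-token attention blocks alongside a layer that needs
# at least 528 tokens per page.
print(pick_physical_page_size([16, 32], 528))  # 544
```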
Test Plan
Added comprehensive tests in tests/v1/worker/test_gpu_model_runner.py to verify the hybrid block logic.
Test Result
pytest tests/v1/worker/test_gpu_model_runner.py: 20 tests pass.
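For illustration only, a hedged sketch of the kind of invariant such tests could assert: the token slot address computed through a derived kernel-level block view must equal the address computed directly from the physical page. The test name, sizes, and layout assumptions are mine, not the PR's.

```python
import pytest


@pytest.mark.parametrize("page_size,kernel_size", [(256, 16), (128, 64)])
def test_kernel_and_physical_views_agree(page_size, kernel_size):
    ratio = page_size // kernel_size
    physical_ids = [3, 7, 2]  # allocator-level block table row (illustrative)
    for token_pos in range(len(physical_ids) * page_size):
        # Slot address via the physical view.
        phys_block = physical_ids[token_pos // page_size]
        phys_addr = phys_block * page_size + token_pos % page_size
        # Slot address via the derived kernel-level view.
        kernel_idx = token_pos // kernel_size
        kernel_block = physical_ids[kernel_idx // ratio] * ratio + kernel_idx % ratio
        kernel_addr = kernel_block * kernel_size + token_pos % kernel_size
        assert phys_addr == kernel_addr
```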