-
-
Notifications
You must be signed in to change notification settings - Fork 10.4k
[torch.compile]: Add VLLM_DEBUG_DUMP_PATH environment variable #25651
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: zjy0516 <[email protected]>
This pull request has merge conflicts that must be resolved before it can be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces an environment variable VLLM_DEBUG_DUMP_PATH
to allow users to specify a directory for dumping debug information, which can be helpful for debugging and analysis. The changes involve adding the environment variable definition in vllm/envs.py
and utilizing it in vllm/config/__init__.py
to set the debug_dump_path
in the compilation configuration.
Signed-off-by: zjy0516 <[email protected]>
Signed-off-by: zjy0516 <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the work, please append rank to path in config init, also add overriding behavior to config docstring and change type of property to Path
self.scheduler_config.disable_hybrid_kv_cache_manager = True | ||
|
||
if envs.VLLM_DEBUG_DUMP_PATH is not None: | ||
self.compilation_config.debug_dump_path = \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we should warn if the compilation config property is set already (because it means we're overriding)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, after the override, we should append rank to the path here so that we don't have to append it everywhere else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We cannot get rank here because process group has not been initialized
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case, could you add a helper function VllmConfig.compile_debug_dump_path()
that computes the path using the rank?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dont think it's appropriate to put compile_debug_dump_path()
in VllmConfig
. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's better than recomputing it in this many places. You can also add a method to CompilationConfig
that accepts vllm_config as a param if that seems better to you
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does Pydantic not auto cast str
to Path
on initialisation here? The cast to Path
might be redundant
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does Pydantic not auto cast
str
toPath
on initialisation here? The cast toPath
might be redundant
You are right. Thanks for your reminding.
Co-authored-by: Luka Govedič <[email protected]> Signed-off-by: Jiangyun Zhu <[email protected]>
Signed-off-by: zjy0516 <[email protected]>
Currently, |
Why not an |
Signed-off-by: zjy0516 <[email protected]>
Signed-off-by: zjy0516 <[email protected]>
I think this is ready. PTAL @ProExpertProg |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am still more of a fan of VllmConfig.compile_debug_dump_path()
- that way we can simplify this code tremendously which is the whole goal. There is no place where we only have access to CompilationConfig
and not VllmConfig
so I think we should go with that approach
vllm/compilation/monitor.py
Outdated
path = os.path.join(compilation_config.debug_dump_path, | ||
f"rank_{vllm_config.parallel_config.rank}") | ||
rank = torch.distributed.get_rank() | ||
path = compilation_config.compile_debug_dump_path(rank) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're passing rank anyway, wy not use ParallelConfig.rank
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because ParallelConfig.rank
are always zero at that time
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But this is successfully used in main
, I think ParallelConfig.rank
is initialized by this point
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we use ParallelConfig.rank
like main
, dp mode will always get 0
vllm serve /data/datasets/models-hf/Qwen3-4B/ -dp 2 -O.debug_dump_path "./compile_dump"
ls compile_dump/
rank_0/
vllm serve /data/datasets/models-hf/Qwen3-4B/ -tp 2 -O.debug_dump_path "./compile_dump"
ls compile_dump/
rank_0/ rank_1/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, nice find. In think case could we just append dp rank separately?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean like rank_0_dp_0
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep! But only if dp world size is > 1
vllm/config/compilation.py
Outdated
logger.warning_once("Op '%s' %s, %s with '%s' has no effect", | ||
op_name, missing_str, enable_str, op) | ||
|
||
def compile_debug_dump_path(self, rank: int) -> Path: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should return Optional[Path]
and return None if the path is not set rather than assert so that it can be used in places we have to check whether a path exists or not
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, this only gets called if debug_dump_path
is not None
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah but that makes the code more complex
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Current code consistently checks if debug_dump_path
is set (not None
) before proceeding. We skip the operation if it's None.
if not debug_dump_path:
return
debug_dump_path = compile_debug_dump_path(
rank)
debug_dump_path.mkdir(parents=True, exist_ok=True)
if compilation_config.level == CompilationLevel.PIECEWISE and \
compilation_config.debug_dump_path:
import depyf
rank = torch.distributed.get_rank()
path = vllm_config.compile_debug_dump_path(rank)
path.mkdir(parents=True, exist_ok=True)
Returning Optional[Path]
contradicts this pattern. If we did, the type checker would rightly flag that mkdir
is not a valid method for a None
object whenever we use the path without a prior null check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes but this means that we first check debug_dump_path
, and if not none we call compile_debug_dump_path
. That means users need to know about both. Instead users should just call compile_debug_dump_path
and THEN do the None check, and never interact with debug_dump_path directly.
I'm okay with this change, but I'm concerned that adding too many functions directly to VllmConfig could lead to a bloated class that's difficult to maintain. |
If that happens we can always move the function out of VllmConfig and pass VllmConfig as a parameter. For now, I think it makes sense because it's basically a "computed property". |
Signed-off-by: zjy0516 <[email protected]>
I have moved it to |
Signed-off-by: zjy0516 <[email protected]>
Signed-off-by: zjy0516 <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few minor notes, thanks for this cleanup!
Co-authored-by: Luka Govedič <[email protected]> Signed-off-by: Jiangyun Zhu <[email protected]>
Co-authored-by: Luka Govedič <[email protected]> Signed-off-by: Jiangyun Zhu <[email protected]>
Co-authored-by: Luka Govedič <[email protected]> Signed-off-by: Jiangyun Zhu <[email protected]>
Signed-off-by: zjy0516 <[email protected]>
Purpose
FIX #25612
CC @ProExpertProg
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md
andexamples
for a new model.