
Conversation

ZJY0516
Contributor

@ZJY0516 ZJY0516 commented Sep 25, 2025

Purpose

FIX #25612
CC @ProExpertProg

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: zjy0516 <[email protected]>

mergify bot commented Sep 25, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ZJY0516.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 25, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces an environment variable VLLM_DEBUG_DUMP_PATH to allow users to specify a directory for dumping debug information, which can be helpful for debugging and analysis. The changes involve adding the environment variable definition in vllm/envs.py and utilizing it in vllm/config/__init__.py to set the debug_dump_path in the compilation configuration.
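As a rough illustration of the mechanism described above (a minimal sketch, not the actual diff — the helper name is hypothetical): an environment variable is read once and, when set, populates the compilation config's dump directory.

```python
import os

def resolve_debug_dump_path(configured, env=None):
    """Let VLLM_DEBUG_DUMP_PATH override a configured dump directory.

    `configured` is whatever the user set on the compilation config;
    the environment variable, when present, takes precedence.
    """
    env = os.environ if env is None else env
    env_path = env.get("VLLM_DEBUG_DUMP_PATH")
    return env_path if env_path is not None else configured
```

Passing `env` explicitly keeps the sketch testable without mutating the real process environment.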

@mergify mergify bot removed the needs-rebase label Sep 25, 2025
Signed-off-by: zjy0516 <[email protected]>
@ZJY0516 ZJY0516 requested a review from zou3519 as a code owner September 25, 2025 08:40
Collaborator

@ProExpertProg ProExpertProg left a comment

Thanks for the work. Please append the rank to the path in config init, document the overriding behavior in the config docstring, and change the type of the property to Path.

self.scheduler_config.disable_hybrid_kv_cache_manager = True

if envs.VLLM_DEBUG_DUMP_PATH is not None:
    self.compilation_config.debug_dump_path = \
Collaborator

Here we should warn if the compilation config property is set already (because it means we're overriding)
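The suggested warning could look roughly like this (an illustrative sketch under the assumption that the override happens in config init; the function name is made up):

```python
import logging

logger = logging.getLogger("vllm.config")

def set_debug_dump_path(compilation_config, env_path):
    # Apply the env-var override, warning when it clobbers a value the
    # user already set explicitly on the compilation config.
    if env_path is None:
        return
    if compilation_config.debug_dump_path:
        logger.warning(
            "debug_dump_path %r is overridden by VLLM_DEBUG_DUMP_PATH=%r",
            compilation_config.debug_dump_path, env_path)
    compilation_config.debug_dump_path = env_path
```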

Collaborator

Also, after the override, we should append rank to the path here so that we don't have to append it everywhere else.

Contributor Author

We cannot get the rank here because the process group has not been initialized yet.

Collaborator

In that case, could you add a helper function VllmConfig.compile_debug_dump_path() that computes the path using the rank?

Contributor Author

I don't think it's appropriate to put compile_debug_dump_path() in VllmConfig. What do you think?

Collaborator

I think it's better than recomputing it in this many places. You can also add a method to CompilationConfig that accepts vllm_config as a param if that seems better to you

Member

Does Pydantic not auto cast str to Path on initialisation here? The cast to Path might be redundant

Contributor Author

Does Pydantic not auto cast str to Path on initialisation here? The cast to Path might be redundant

You're right. Thanks for the reminder.

Co-authored-by: Luka Govedič <[email protected]>
Signed-off-by: Jiangyun Zhu <[email protected]>
Signed-off-by: zjy0516 <[email protected]>
@ZJY0516
Contributor Author

ZJY0516 commented Sep 25, 2025

Currently, debug_dump_path is typed and initialized as a str. I'm hesitant to convert it to a pathlib.Path within __post_init__

@ProExpertProg
Collaborator

Currently, debug_dump_path is typed and initialized as a str. I'm hesitant to convert it to a pathlib.Path within __post_init__

Why not an Optional[pathlib.Path]?

@ProExpertProg ProExpertProg moved this from In progress to In review in torch.compile integration Sep 25, 2025
@ZJY0516
Contributor Author

ZJY0516 commented Sep 26, 2025

I think this is ready. PTAL @ProExpertProg

Collaborator

@ProExpertProg ProExpertProg left a comment

I am still more of a fan of VllmConfig.compile_debug_dump_path() - that way we can simplify this code tremendously which is the whole goal. There is no place where we only have access to CompilationConfig and not VllmConfig so I think we should go with that approach

path = os.path.join(compilation_config.debug_dump_path,
                    f"rank_{vllm_config.parallel_config.rank}")
rank = torch.distributed.get_rank()
path = compilation_config.compile_debug_dump_path(rank)
Collaborator

If we're passing rank anyway, why not use ParallelConfig.rank?

Contributor Author

Because ParallelConfig.rank is always zero at that point.

Collaborator

But this is successfully used in main, I think ParallelConfig.rank is initialized by this point

Contributor Author

If we use ParallelConfig.rank like main, dp mode will always get 0

vllm serve /data/datasets/models-hf/Qwen3-4B/ -dp 2 -O.debug_dump_path "./compile_dump"

ls compile_dump/
rank_0/

vllm serve /data/datasets/models-hf/Qwen3-4B/ -tp 2 -O.debug_dump_path "./compile_dump"

ls compile_dump/
rank_0/  rank_1/

Collaborator

I see, nice find. In that case, could we just append the dp rank separately?

Contributor Author

You mean like rank_0_dp_0 ?

Collaborator

Yep! But only if dp world size is > 1
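The naming agreed on above could be sketched like this (an illustrative helper, not the merged code; the function name is made up):

```python
from pathlib import Path

def rank_subdir(base: str, rank: int, dp_rank: int, dp_size: int) -> Path:
    # Append the dp rank only when data parallelism is in use
    # (dp world size > 1), matching the "rank_0_dp_0" naming above.
    name = f"rank_{rank}"
    if dp_size > 1:
        name = f"{name}_dp_{dp_rank}"
    return Path(base) / name
```

With tp only, directories stay `rank_0`, `rank_1`, …; with dp enabled they become `rank_0_dp_0`, `rank_0_dp_1`, and so on.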

logger.warning_once("Op '%s' %s, %s with '%s' has no effect",
                    op_name, missing_str, enable_str, op)

def compile_debug_dump_path(self, rank: int) -> Path:
Collaborator

This should return Optional[Path] and return None if the path is not set rather than assert so that it can be used in places we have to check whether a path exists or not

Contributor Author

Actually, this only gets called if debug_dump_path is not None.

Collaborator

Yeah but that makes the code more complex

Contributor Author

Current code consistently checks if debug_dump_path is set (not None) before proceeding. We skip the operation if it's None.

if not debug_dump_path:
    return
debug_dump_path = compile_debug_dump_path(rank)
debug_dump_path.mkdir(parents=True, exist_ok=True)

if compilation_config.level == CompilationLevel.PIECEWISE and \
        compilation_config.debug_dump_path:
    import depyf
    rank = torch.distributed.get_rank()
    path = vllm_config.compile_debug_dump_path(rank)
    path.mkdir(parents=True, exist_ok=True)

Returning Optional[Path] contradicts this pattern. If we did, the type checker would rightly flag that mkdir is not a valid method on a None object wherever we use the path without a prior None check.

Collaborator

Yes, but this means that we first check debug_dump_path and, only if it is not None, call compile_debug_dump_path. That means users need to know about both. Instead, users should just call compile_debug_dump_path and THEN do the None check, and never interact with debug_dump_path directly.
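The accessor pattern being argued for could be sketched as follows (a toy stand-in for the real VllmConfig, under the assumption that the rank is passed in by the caller):

```python
from pathlib import Path
from typing import Optional

class VllmConfig:
    """Toy stand-in illustrating the Optional[Path] accessor pattern."""

    def __init__(self, debug_dump_path: Optional[Path] = None):
        self.debug_dump_path = debug_dump_path

    def compile_debug_dump_path(self, rank: int) -> Optional[Path]:
        # Single entry point: callers do one None check on the return
        # value and never read debug_dump_path directly.
        if self.debug_dump_path is None:
            return None
        return self.debug_dump_path / f"rank_{rank}"

# Caller pattern: one call, one None check.
path = VllmConfig(Path("compile_dump")).compile_debug_dump_path(1)
if path is not None:
    pass  # e.g. path.mkdir(parents=True, exist_ok=True)
```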

@ZJY0516
Contributor Author

ZJY0516 commented Sep 26, 2025

I am still more of a fan of VllmConfig.compile_debug_dump_path() - that way we can simplify this code tremendously which is the whole goal. There is no place where we only have access to CompilationConfig and not VllmConfig so I think we should go with that approach

I'm okay with this change, but I'm concerned that adding too many functions directly to VllmConfig could lead to a bloated class that's difficult to maintain.

@ProExpertProg
Collaborator

ProExpertProg commented Sep 26, 2025

I'm okay with this change, but I'm concerned that adding too many functions directly to VllmConfig could lead to a bloated class that's difficult to maintain.

If that happens we can always move the function out of VllmConfig and pass VllmConfig as a parameter. For now, I think it makes sense because it's basically a "computed property".

@ZJY0516
Contributor Author

ZJY0516 commented Sep 27, 2025

I'm okay with this change, but I'm concerned that adding too many functions directly to VllmConfig could lead to a bloated class that's difficult to maintain.

If that happens we can always move the function out of VllmConfig and pass VllmConfig as a parameter. For now, I think it makes sense because it's basically a "computed property".

I have moved it to VllmConfig

Collaborator

@ProExpertProg ProExpertProg left a comment

A few minor notes, thanks for this cleanup!

ZJY0516 and others added 4 commits September 27, 2025 22:11
Co-authored-by: Luka Govedič <[email protected]>
Signed-off-by: Jiangyun Zhu <[email protected]>
Co-authored-by: Luka Govedič <[email protected]>
Signed-off-by: Jiangyun Zhu <[email protected]>
Co-authored-by: Luka Govedič <[email protected]>
Signed-off-by: Jiangyun Zhu <[email protected]>
Signed-off-by: zjy0516 <[email protected]>
@ProExpertProg ProExpertProg enabled auto-merge (squash) September 27, 2025 14:24
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 27, 2025
@ProExpertProg ProExpertProg merged commit c0ec818 into vllm-project:main Sep 27, 2025
42 checks passed
@github-project-automation github-project-automation bot moved this from In review to Done in torch.compile integration Sep 27, 2025
Labels
ready ONLY add when PR is ready to merge/full CI is needed torch.compile
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[Feature][torch.compile]: Add VLLM_DEBUG_DUMP_PATH environment variable
3 participants