Cpu set v40x #7034
Examples of what this feature wants to enable:
% mpirun -np 2 --report-bindings \
--bind-to hwthread --map-by hwthread --cpu-set 6,7 \
hostname
% mpirun -np 2 --report-bindings \
--bind-to hwthread --map-by ppr:2:node:pe=2 --cpu-set 6,7,12,13 \
hostname
I'll also provide here an example of some "legacy" behavior that I'm
trying not to break with this checkin. Based on recent talks the legacy
behavior might go away at some point, but this checkin is leaving that
part alone:
% mpirun -np 2 --report-bindings --use-hwthread-cpus \
--bind-to cpulist:ordered --map-by hwthread --cpu-list 6,7 \
hostname
which just round robins over the --cpu-list list.
Example output which seems fine to me:
> MCW rank 0: [..../..B./..../..../][..../..../..../..../]
> MCW rank 1: [..../...B/..../..../][..../..../..../..../]
The first category of issue is overly aggressive error checking that
rejects the top two command lines above.
The first command above fails with:
> Conflicting directives for mapping policy are causing the policy
> to be redefined:
> New policy: RANK_FILE
> Prior policy: BYHWTHREAD
I think the error check in orte_rmaps_rank_file_open() is too aggressive.
The intent seems to be that any option like "--map-by whatever" will
check to see if a rankfile is in use, and report that mapping via rmaps
while also using an explicit rankfile is a conflict.
But at some point in the past the check was expanded to not just check
NULL != orte_rankfile
but also errors out if
(NULL != opal_hwloc_base_cpu_list &&
!OPAL_BIND_ORDERED_REQUESTED(opal_hwloc_binding_policy))
which seems to consider -cpu-list only as a simple round-robin
binding option, ignoring the possibility of -cpu-set being used
as a cgroup analog (i.e., it only considers the "legacy" behavior
from my examples).
I've changed the
NULL != opal_hwloc_base_cpu_list
to a more targeted
OPAL_BIND_TO_CPUSET == OPAL_GET_BINDING_POLICY(
opal_hwloc_binding_policy)
which I believe will only cause this to error out if -cpu-list is being
used as a simple round-robin binding method.
Related to this is hwloc_base_frame.c where it has
/* did the user provide a slot list? */
if (NULL != opal_hwloc_base_cpu_list) {
OPAL_SET_BINDING_POLICY(opal_hwloc_binding_policy,
OPAL_BIND_TO_CPUSET);
}
which seems to make --cpu-set exclusively have the "legacy" meaning. I
think it makes more sense to use the former behavior where we only set
that if
!OPAL_BINDING_POLICY_IS_SET(opal_hwloc_binding_policy)
These two changes allow --cpu-set to be used in the "new" cgroup-like
manner without triggering error detection about allegedly conflicting
settings.
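The effect of the two changes can be sketched in isolation. This is a minimal model, not the Open MPI source: the macro names follow the ones quoted above, but their bit layout and values are assumptions made for the sketch, and the helper functions (`conflicts_old`, `conflicts_new`, `apply_cpu_list_default`) are hypothetical. The real check also requires that a mapping policy was given explicitly, which is left out here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in definitions. The names mirror the macros quoted in the
 * description; the values and bit layout are ASSUMED for this sketch. */
typedef uint16_t binding_policy_t;
#define OPAL_BIND_GIVEN        0x4000
#define OPAL_BIND_ORDERED      0x8000
#define OPAL_BIND_TO_HWTHREAD  9
#define OPAL_BIND_TO_CPUSET    10
#define OPAL_GET_BINDING_POLICY(p)     ((p) & 0x0fff)
#define OPAL_BINDING_POLICY_IS_SET(p)  (((p) & OPAL_BIND_GIVEN) != 0)
#define OPAL_BIND_ORDERED_REQUESTED(p) (((p) & OPAL_BIND_ORDERED) != 0)
#define OPAL_SET_BINDING_POLICY(t, p) \
    ((t) = (binding_policy_t)((p) | ((t) & 0xf000) | OPAL_BIND_GIVEN))

/* Old check (hypothetical helper): any cpu list without ":ordered"
 * is treated as conflicting with an explicit --map-by. */
bool conflicts_old(const char *cpu_list, const char *rankfile,
                   binding_policy_t policy)
{
    return (NULL != cpu_list && !OPAL_BIND_ORDERED_REQUESTED(policy))
           || NULL != rankfile;
}

/* New check (hypothetical helper): only a binding policy of CPUSET
 * without ":ordered" conflicts, so --bind-to hwthread plus a cpu set
 * passes. */
bool conflicts_new(const char *rankfile, binding_policy_t policy)
{
    return (OPAL_BIND_TO_CPUSET == OPAL_GET_BINDING_POLICY(policy)
            && !OPAL_BIND_ORDERED_REQUESTED(policy))
           || NULL != rankfile;
}

/* Frame-time default (hypothetical helper): a cpu list selects the
 * CPUSET binding policy only when the user gave no --bind-to. */
void apply_cpu_list_default(const char *cpu_list, binding_policy_t *policy)
{
    if (NULL != cpu_list && !OPAL_BINDING_POLICY_IS_SET(*policy)) {
        OPAL_SET_BINDING_POLICY(*policy, OPAL_BIND_TO_CPUSET);
    }
}
```

Under these assumptions, a policy set by `--bind-to hwthread` is left alone by `apply_cpu_list_default()` and passes `conflicts_new()`, while the same inputs trip `conflicts_old()`.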
That brings us past the error detection and into the real functionality.
And the easiest way to implement this I think is to make it take the
same path as cgroups do.
The --cpu-set option is logically similar to running under a cgroup
just without the OS-level enforcement that comes with a cgroup. For
cgroups the 4.x code loads the topology without the WHOLE_SYSTEM
flag so the tree only contains what's in the cgroup. We can do the
same thing with a hwloc_topology_restrict() call to constrain the
topology to whatever --cpu-set the user enters.
This is done now in opal/mca/hwloc/base/hwloc_base_util.c at
the bottom of opal_hwloc_base_filter_cpus().
Other examples of commands that demonstrate this functionality:
hardware: [..../..../..../....] numbered sequentially 0-15
% mpirun -np 2 --report-bindings --bind-to hwthread \
--map-by hwthread --cpu-set 6,7 hostname
> MCW rank 0 [..../..B./..../....]
> MCW rank 1 [..../...B/..../....]
% mpirun -np 2 --report-bindings --bind-to hwthread \
--map-by ppr:2:node:pe=2 --cpu-set 6,7,12,13 hostname
> MCW rank 0 [..../..BB/..../....]
> MCW rank 1 [..../..../..../BB..]
% mpirun -np 2 --report-bindings --bind-to hwthread \
--map-by ppr:2:node:pe=3 --cpu-set 4,5,9,11,14,15 ./x
> MCW rank 0 [..../BB../.B../....]
> MCW rank 1 [..../..../...B/..BB]
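The outputs above are consistent with a simple model: once the topology is restricted to the --cpu-set, each rank takes the next `pe` hardware threads from what remains. The sketch below is a toy illustration of that model only, not the Open MPI mapper. The function `render_binding` is hypothetical, and the 16-thread layout with four threads per `/`-separated group is an assumption taken from the sample display.

```c
#include <string.h>

#define NUM_PUS      16  /* hardware threads 0-15, as in the examples   */
#define PUS_PER_CORE 4   /* ASSUMED grouping behind the [..../....] display */

/* Write the --report-bindings style mask for one rank into `out`.
 *   allowed[]  PU indices left after restricting to --cpu-set, ascending
 *   n_allowed  number of entries in allowed[]
 *   rank, pe   rank index and PUs per rank (the pe=N modifier)
 * Returns 0 on success, -1 if the arguments are invalid or the
 * restricted set has too few PUs for this rank. */
int render_binding(const int *allowed, int n_allowed, int rank, int pe,
                   char *out, size_t out_len)
{
    char bound[NUM_PUS];
    int first = rank * pe;
    size_t pos = 0;

    /* '[' + 16 PUs + 3 separators + ']' + NUL = 22 bytes */
    if (rank < 0 || pe <= 0 || first + pe > n_allowed || out_len < 22) {
        return -1;
    }

    memset(bound, 0, sizeof(bound));
    for (int i = first; i < first + pe; i++) {
        if (allowed[i] < 0 || allowed[i] >= NUM_PUS) {
            return -1;
        }
        bound[allowed[i]] = 1;   /* rank is bound to this PU */
    }

    out[pos++] = '[';
    for (int pu = 0; pu < NUM_PUS; pu++) {
        if (pu > 0 && pu % PUS_PER_CORE == 0) {
            out[pos++] = '/';
        }
        out[pos++] = bound[pu] ? 'B' : '.';
    }
    out[pos++] = ']';
    out[pos] = '\0';
    return 0;
}
```

Calling it with the allowed list {4, 5, 9, 11, 14, 15} and pe=3 for ranks 0 and 1 gives the same two masks as the last example; asking for a third rank returns -1 because the restricted set is exhausted.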
Signed-off-by: Mark Allen <[email protected]>
These came from Ralph Castain attached to a previous version of my -cpu-set fix/enhancement. Signed-off-by: Mark Allen <[email protected]>
Unfortunately this is too late for v4.0.2.
gpaulsen
left a comment
Please PR to master before cherry-picking back.
@markalle Is this more of a bugfix or a new feature?
In our weekly Web-ex 12/17 we decided to focus our energy on releasing v5.0, rather than trying to release a v4.1 to get this enhancement released.
This is a redo of #6755 just for 4.x.
From our phone call about affinity my understanding is we want to keep the --cpu-set option where it functions like a lightweight cgroup. And I'd prefer to consider this a bugfix that restores the old feature rather than an enhancement that's adding a new feature.
I admittedly had to go back to a pretty old OMPI tree (2.1.6) to find one that had that feature working.
The timeline was roughly:
2.x : --cpu-set could be used like a cgroup. The implementation was complex and kept the whole hwloc tree present, but every mapping option that iterated over the tree elements had extra logic to skip elements that weren't inside the allowed mask.
3.x : the core functionality of skipping over disallowed hwloc tree elements was still in place, but the error checking was made too aggressive, so any command line that tried to use --cpu-set alongside other binding/mapping options was rejected.
4.x : with the switch to hwloc2 a bunch of code was simplified and the core functionality of skipping over disallowed elements was removed. It was unnecessary for cases like cgroups, because the hwloc tree itself was loaded without WHOLE_SYSTEM and so already contained only the right elements.
This PR changes the error checking back so that --cpu-set is allowed with the other mapping options, and uses hwloc_topology_restrict() to make the --cpu-set mask get used the same way that a cgroup would: the hwloc tree gets whittled to only contain elements in the allowed mask.