-
Notifications
You must be signed in to change notification settings - Fork 939
make topology signature include a compressed list of available PUs #7036
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| npu = 0; | ||
| while ((obj = hwloc_get_next_obj_by_type(topo, HWLOC_OBJ_PU, obj)) != NULL) { | ||
| avail_pus[npu++] = obj->os_index; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're basically duplicating the topology entire cpuset into your own array of indexes here. The hwloc bitmap likely has everything needed to replace this. For instance, you may iterate over PU physical indexes with:
hwloc_bitmap_foreach_begin(i, hwloc_topology_get_topology_cpuset(topo))
{ /* use physical index i */
} hwloc_bitmap_foreach_end();
and there are functions to find the first, last, next bits, the number of bits, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, we had a discussion on a telecon about this - we don't want to intro a new regex method into the system, especially a release branch. We just want to use the hwloc print methods and attach the result to the existing signature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't I need the os_index for each "i" though? So either I'd have to construct those into a mask or do a lookup on each logical "i" for its os_index. That seems quite a bit heavier than "a[i]" to access the desired data.
Plus the hwloc-awareness would have to extend all the way into both "mycompress" functions which are currently written generically for an integer array.
When mpirun collects topologies it first asks for signatures and then
only asks a host to send its topologif the signature is something new.
But with heterogeneous cgroups the signature doesn't have enough data
to notice differences that matter.
An example of the previous signature on one of our machines is
2N:2S:22L3:22L2:44L1:44C:176H:ppc64le:le
which is just a count of how many numa nodes, sockets, cores, PUs etc
the host has. If I create a small cgroup with just 4 PUs from one of
the cores, an example of the new signature is
1N:1S:1L3:1L2:1L1:1C:4H:ppc64le:le
But if I did that on two different machines and used different cores
on each, the signature would look the same. One way to distinguish
such signatures is to also list the available PUs by os_index (I might
be okay with logical indexes if it was logical under a WHOLE-SYSTEM
topology, but in the non-WHOLE-SYSTEM topologies we're using I don't
think the logical indexes would be good enough).
I think we should include a compression routine that recognizes
more patterns, but since this is a release branch I'm just using
hwloc_bitmap_list_asprintf()
I've left in some commented out code I'd propose for future
consideration for string compression. Patterns like filling the
second half of each socket might make hex masks like f0f0f0f0f0f0f0f0
which the above wouldn't compress by much, that would still be a
longish list of 5-8,13-16,21-24,...
signature strings end up things like
2N:2S:22L3:22L2:44L1:44C:176H:H0-175:ppc64le:le
here "0-175" is the hwloc_bitmap_list_asprintf output.
Signed-off-by: Mark Allen <[email protected]>
f90e048 to
c12f1c9
Compare
|
Okay, I switched it to hwloc_bitmap_list_asprintf() since it's a release branch. I'd like to have a bit more compression than that but I guess it's good enough |
gpaulsen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please PR to master before cherry-picking back.
|
In our weekly Web-ex 12/17 we decided to focus our energy on releasing v5.0, rather than trying to release a v4.1 to get this enhancement released. |
From the OMPI phone call about affinity my understanding is Ralph has a different solution in mind for 5.x. But for 4.x I think this solution fits reasonably within the current design.
Note: I don't actually have a concrete case where the original code fails. It's easy to make two very different cgroups that have the same signature, and have mpirun in orte_plm_base_daemon_callback() incorrectly set
daemon->node->topology
for hostB to a copy of hostA's topology rather than requesting a full topology from hostB. However I haven't actually succeeded in making a --map-by something --bind-to something produce wrong results despite having
But even though I don't have a concrete bug to point at, I believe this is still considered a real issue in a cluster with heterogeneous cgroups
When mpirun collects topologies it first asks for signatures and then
only asks a host to send its topologif the signature is something new.
But with heterogeneous cgroups the signature doesn't have enough data
to notice differences that matter.
An example of the previous signature on one of our machines is
2N:2S:22L3:22L2:44L1:44C:176H:ppc64le:le
which is just a count of how many numa nodes, sockets, cores, PUs etc
the host has. If I create a small cgroup with just 4 PUs from one of
the cores, an example of the new signature is
1N:1S:1L3:1L2:1L1:1C:4H:ppc64le:le
But if I did that on two different machines and used different cores
on each, the signature would look the same. One way to distinguish
such signatures is to also list the available PUs by os_index (I might
be okay with logical indexes if it was logical under a WHOLE-SYSTEM
topology, but in the non-WHOLE-SYSTEM topologies we're using I don't
think the logical indexes would be good enough).
This list could be large so I've included a couple different
compressions of it and use whichever comes out shorter.
The first is a comma separated list that compresses simple ranges like
10-15 and also detects patterns analogous to MPI_Type_vector(), eg
{16,17,18,19,20,21,22,23} = 16-23
{1,2,3,4,5, 11,12,13,14,15, 21,22,23,24,25} = 1-5+103
{2,3, 10,11, 18,19, 26,27,
102,103, 110,111, 118,119,
200,201,202,203,204,205,206,207,208 } = 2-3+84,102-103+8*3,200-208
{1,3,6,7,9,11,12} = 1,3,6,7,9,11,12
The second compression is a hex string containing the indexes, and
further shrunk by noticing if the same char is repeated 5+ times, so
{16,17,18,19,20,21,22,23} = 0000ff
{1,2,3,4,5, 11,12,13,14,15, 21,22,23,24,25} = 7c1f07c
{2,3, 10,11, 18,19, 26,27,
102,103, 110,111, 118,119,
200,201,202,203,204,205,206,207,208 } = 3030303018,30303020,ff8
{1,3,6,7,9,11,12} = 5358
So final signature strings end up things like
example with PUs 0-175:
2N:2S:22L3:22L2:44L1:44C:176H:HXf44:ppc64le:le
here "f44" won the compression contest against "0-175"
example with PUs 12-15:
1N:1S:1L3:1L2:1L1:1C:4H:HX000f:ppc64le:le
here "000f" won against "12-15"
example with every other block of 4 PUs:
2N:2S:22L3:22L2:22L1:22C:88H:HL0-3+822:ppc64le:le
here "0-3+822" won against "f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f"
Signed-off-by: Mark Allen [email protected]