Skip to content

Conversation

@markalle
Copy link
Contributor

@markalle markalle commented Oct 4, 2019

From the OMPI phone call about affinity my understanding is Ralph has a different solution in mind for 5.x. But for 4.x I think this solution fits reasonably within the current design.

Note: I don't actually have a concrete case where the original code fails. It's easy to make two very different cgroups that have the same signature, and have mpirun in orte_plm_base_daemon_callback() incorrectly set
daemon->node->topology
for hostB to a copy of hostA's topology rather than requesting a full topology from hostB. However I haven't actually succeeded in making a --map-by something --bind-to something produce wrong results despite having

But even though I don't have a concrete bug to point at, I believe this is still considered a real issue in a cluster with heterogeneous cgroups


When mpirun collects topologies it first asks for signatures and then
only asks a host to send its topologif the signature is something new.
But with heterogeneous cgroups the signature doesn't have enough data
to notice differences that matter.

An example of the previous signature on one of our machines is
2N:2S:22L3:22L2:44L1:44C:176H:ppc64le:le
which is just a count of how many numa nodes, sockets, cores, PUs etc
the host has. If I create a small cgroup with just 4 PUs from one of
the cores, an example of the new signature is
1N:1S:1L3:1L2:1L1:1C:4H:ppc64le:le

But if I did that on two different machines and used different cores
on each, the signature would look the same. One way to distinguish
such signatures is to also list the available PUs by os_index (I might
be okay with logical indexes if it was logical under a WHOLE-SYSTEM
topology, but in the non-WHOLE-SYSTEM topologies we're using I don't
think the logical indexes would be good enough).

This list could be large so I've included a couple different
compressions of it and use whichever comes out shorter.

The first is a comma separated list that compresses simple ranges like
10-15 and also detects patterns analogous to MPI_Type_vector(), eg
{16,17,18,19,20,21,22,23} = 16-23
{1,2,3,4,5, 11,12,13,14,15, 21,22,23,24,25} = 1-5+103
{2,3, 10,11, 18,19, 26,27,
102,103, 110,111, 118,119,
200,201,202,203,204,205,206,207,208 } = 2-3+8
4,102-103+8*3,200-208
{1,3,6,7,9,11,12} = 1,3,6,7,9,11,12

The second compression is a hex string containing the indexes, and
further shrunk by noticing if the same char is repeated 5+ times, so
{16,17,18,19,20,21,22,23} = 0000ff
{1,2,3,4,5, 11,12,13,14,15, 21,22,23,24,25} = 7c1f07c
{2,3, 10,11, 18,19, 26,27,
102,103, 110,111, 118,119,
200,201,202,203,204,205,206,207,208 } = 3030303018,30303020,ff8
{1,3,6,7,9,11,12} = 5358

So final signature strings end up things like

example with PUs 0-175:
2N:2S:22L3:22L2:44L1:44C:176H:HXf44:ppc64le:le
here "f
44" won the compression contest against "0-175"

example with PUs 12-15:
1N:1S:1L3:1L2:1L1:1C:4H:HX000f:ppc64le:le
here "000f" won against "12-15"

example with every other block of 4 PUs:
2N:2S:22L3:22L2:22L1:22C:88H:HL0-3+822:ppc64le:le
here "0-3+8
22" won against "f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f"

Signed-off-by: Mark Allen [email protected]

npu = 0;
while ((obj = hwloc_get_next_obj_by_type(topo, HWLOC_OBJ_PU, obj)) != NULL) {
avail_pus[npu++] = obj->os_index;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're basically duplicating the topology entire cpuset into your own array of indexes here. The hwloc bitmap likely has everything needed to replace this. For instance, you may iterate over PU physical indexes with:

hwloc_bitmap_foreach_begin(i, hwloc_topology_get_topology_cpuset(topo))
{ /* use physical index i */
} hwloc_bitmap_foreach_end();

and there are functions to find the first, last, next bits, the number of bits, etc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we had a discussion on a telecon about this - we don't want to intro a new regex method into the system, especially a release branch. We just want to use the hwloc print methods and attach the result to the existing signature.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't I need the os_index for each "i" though? So either I'd have to construct those into a mask or do a lookup on each logical "i" for its os_index. That seems quite a bit heavier than "a[i]" to access the desired data.

Plus the hwloc-awareness would have to extend all the way into both "mycompress" functions which are currently written generically for an integer array.

@gpaulsen gpaulsen added this to the v4.0.3 milestone Oct 7, 2019
When mpirun collects topologies it first asks for signatures and then
only asks a host to send its topologif the signature is something new.
But with heterogeneous cgroups the signature doesn't have enough data
to notice differences that matter.

An example of the previous signature on one of our machines is
    2N:2S:22L3:22L2:44L1:44C:176H:ppc64le:le
which is just a count of how many numa nodes, sockets, cores, PUs etc
the host has.  If I create a small cgroup with just 4 PUs from one of
the cores, an example of the new signature is
    1N:1S:1L3:1L2:1L1:1C:4H:ppc64le:le

But if I did that on two different machines and used different cores
on each, the signature would look the same. One way to distinguish
such signatures is to also list the available PUs by os_index (I might
be okay with logical indexes if it was logical under a WHOLE-SYSTEM
topology, but in the non-WHOLE-SYSTEM topologies we're using  I don't
think the logical indexes would be good enough).

I think we should include a compression routine that recognizes
more patterns, but since this is a release branch I'm just using
    hwloc_bitmap_list_asprintf()
I've left in some commented out code I'd propose for future
consideration for string compression.  Patterns like filling the
second half of each socket might make hex masks like f0f0f0f0f0f0f0f0
which the above wouldn't compress by much, that would still be a
longish list of 5-8,13-16,21-24,...

signature strings end up things like
    2N:2S:22L3:22L2:44L1:44C:176H:H0-175:ppc64le:le
here "0-175" is the hwloc_bitmap_list_asprintf output.

Signed-off-by: Mark Allen <[email protected]>
@markalle markalle force-pushed the topo_signature_collision_v40x branch from f90e048 to c12f1c9 Compare October 11, 2019 21:14
@markalle
Copy link
Contributor Author

Okay, I switched it to hwloc_bitmap_list_asprintf() since it's a release branch. I'd like to have a bit more compression than that but I guess it's good enough

Copy link
Member

@gpaulsen gpaulsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please PR to master before cherry-picking back.

@gpaulsen
Copy link
Member

In our weekly Web-ex 12/17 we decided to focus our energy on releasing v5.0, rather than trying to release a v4.1 to get this enhancement released.
We will close this PR soon.
Please port this PR to pmix/PRRTE which will be the default launcher for Open MPI v5.0

@gpaulsen gpaulsen closed this Dec 20, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants