Commit 9c30b7c

XbaoWu authored and JackyTYang committed
Add sra policy for ResourceStrategyFit Plugin
Signed-off-by: wuxiaobao <[email protected]>
1 parent c311d30 commit 9c30b7c

10 files changed: +958 -182 lines changed


docs/design/proportional.md

Lines changed: 11 additions & 9 deletions
@@ -29,14 +29,16 @@ Firstly set the proportion binding in volcano-scheduler.conf:
 actions: "enqueue, allocate, backfill"
 tiers:
 - plugins:
-  - name: predicates
+  - name: resource-strategy-fit
     arguments:
-      predicate.ProportionalEnable: true
-      predicate.resources: nvidia.com/gpu,nvidia.com/v100-sxm2-16gb
-      predicate.resources.nvidia.com/gpu.cpu: 8
-      predicate.resources.nvidia.com/gpu.memory: 8
-      predicate.resources.nvidia.com/v100-sxm2-16gb.cpu: 16
-      predicate.resources.nvidia.com/v100-sxm2-16gb.memory: 16
+      proportional:
+        enable: true
+        resources: nvidia.com/gpu,nvidia.com/v100-sxm2-16gb
+        resourceProportion:
+          nvidia.com/gpu.cpu: 8
+          nvidia.com/gpu.memory: 8
+          nvidia.com/v100-sxm2-16gb.cpu: 16
+          nvidia.com/v100-sxm2-16gb.memory: 16
 ```
 
 The proportion is GPU:CPU:MEMORY = 1:8:8; take the test scenario just as above:
@@ -50,8 +52,8 @@ Job | Pod | Resource | Node | NodeAllocatable | NodeIdle
 default/single-1000-0 | single-1000-0 | cpu 8, memory 8G, nvidia.com/gpu 0 | nodeC0-0 | cpu 74, memory 128G, nvidia.com/gpu 8 | cpu 66, memory 120G, nvidia.com/gpu 8 |
 default/single-1000-1 | single-1000-1 | cpu 8, memory 8G, nvidia.com/gpu 0 | - | - | - |
 
-After job single-1000-0 is scheduled, the Idel resouce is 8GPUs, 66CPUs, 120G memory. During the predicate phase, this plugin caculates the resource left if job single-1000-1 is scheduled`(node.Idel.CPU - task.Resreq.CPU < node.Idel.GPU * cpuRatio ||
-node.Idel.Memory - task.Resreq.Memory < node.Idel.GPU * memoryRatio)`; the result is 8GPUs, 58CPUs, 112G memory, that unsatisfies the 1:8:8 proportion. Therefore nodeC0-0 is removed from the predicateNodes, and NodeIdle remains 8GPUs, 66CPUs, 120G memory.
+After job single-1000-0 is scheduled, the idle resource is 8 GPUs, 66 CPUs, 120G memory. During the predicate phase, this plugin calculates the resources that would be left if job single-1000-1 were scheduled `(node.Idle.CPU - task.Resreq.CPU < node.Idle.GPU * cpuRatio ||
+node.Idle.Memory - task.Resreq.Memory < node.Idle.GPU * memoryRatio)`; the result is 8 GPUs, 58 CPUs, 112G memory, which does not satisfy the 1:8:8 proportion. Therefore, nodeC0-0 is removed from the predicateNodes, and NodeIdle remains 8 GPUs, 66 CPUs, 120G memory.
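As an illustration of the predicate described above, here is a minimal Go sketch of the proportional check, using plain numbers rather than the plugin's internal resource types (the function and field names below are illustrative, not the actual plugin API):

```go
package main

import "fmt"

// proportionalRatio mirrors a resourceProportion entry such as
// nvidia.com/gpu.cpu: 8 / nvidia.com/gpu.memory: 8 (values are illustrative).
type proportionalRatio struct {
	CPURatio    float64
	MemoryRatio float64
}

// fitsProportion returns false when scheduling the task would leave the node with
// fewer idle CPUs or memory than the pre-set proportion requires for its idle GPUs:
//   node.Idle.CPU - task.Resreq.CPU < node.Idle.GPU * cpuRatio ||
//   node.Idle.Memory - task.Resreq.Memory < node.Idle.GPU * memoryRatio
func fitsProportion(idleCPU, idleMem, idleGPU, reqCPU, reqMem float64, r proportionalRatio) bool {
	if idleCPU-reqCPU < idleGPU*r.CPURatio {
		return false
	}
	if idleMem-reqMem < idleGPU*r.MemoryRatio {
		return false
	}
	return true
}

func main() {
	// Numbers from the scenario above: nodeC0-0 has 66 idle CPUs, 120G memory and
	// 8 idle GPUs; single-1000-1 requests 8 CPUs and 8G memory.
	ok := fitsProportion(66, 120, 8, 8, 8, proportionalRatio{CPURatio: 8, MemoryRatio: 8})
	fmt.Println(ok) // false: 66-8=58 < 8*8=64, so the node is filtered out
}
```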

docs/design/resource-strategy-fit-scheduling.md

Lines changed: 164 additions & 20 deletions
@@ -2,25 +2,36 @@
 
 ## Summary
 
-The native k8s ResourceStrategyFit plug-in can only adopt one type of strategy for all resources, such as MostRequestedPriority and LeastRequestedPriority. However, in industrial practice, this design is not applicable in some scenarios. For example: in AI scenarios, we usually disperse CPU tasks in CPU machine groups to reduce hot spots. GPU tasks are gathered in GPU machine groups to reduce GPU fragmentation. Therefore, we need to expand a scheduling strategy to meet the needs of this scenario.
+At present, there are few scheduling strategies oriented to resource types. When users want to schedule according to resource type in some special scenarios, there is a lack of refined strategies to choose from. This plugin improves this part of the strategy space to cover more application scenarios for users.
 
 ## Motivation
 
-- Different resource types can be configured with different aggregation or dispersion strategies, and weights can be used to distinguish priorities
+- **resourceStrategyFit**: The native k8s NodeResourcesFit plug-in can only adopt one type of strategy for all resources, such as MostRequestedPriority and LeastRequestedPriority. However, in industrial practice, this design is not applicable in some scenarios. For example, in AI scenarios we usually disperse CPU tasks across CPU machine groups to reduce hot spots, and gather GPU tasks in GPU machine groups to reduce GPU fragmentation. Therefore, we need to extend the scheduling strategy to meet the needs of this scenario.
+
+- **sra**: In a cluster where GPU nodes and CPU nodes are deployed together, a task that does not require GPU resources may nevertheless be scheduled to a GPU node. This can lead to inefficient resource utilization: a job that requires both CPU and GPU resources may stay pending because the GPU node no longer has enough CPU. This is not ideal, as it results in underutilization of resources and potential delays in job execution. Therefore, sra is proposed to keep jobs that do not need critical resources (such as GPUs) away from nodes that have them, improving overall resource utilization.
+
+- **proportional**: In a cluster where CPU tasks and GPU tasks are mixed, CPU tasks are sometimes scheduled to GPU nodes, leaving some GPU tasks in a long-term Pending state due to a lack of CPU resources. To address this, we may specify a 'primary' resource (e.g., GPU in deep learning) and preserve the amount of associated 'secondary' resources according to a pre-set proportion. This policy takes effect in the predicate stage and ensures that a node's idle secondary resources remain sufficient for its scarce resources according to that proportion.
 
 ### Goals
 
-- Different types of resources can be configured with different strategies to prioritize them in the form of weights
+- Different types of resources can be configured with different strategies to prioritize them in the form of weights.
+
+- Different resources can be configured with different weights, which are used to indicate the scarcity of resources.
+
+- Different resources can set a proportion of secondary resources to ensure that the idle secondary resources on the node are sufficient.
 
 ### Non-Goals
 
 - None.
 
 ## Proposal
 
-Extend one plug-ins to meet the above needs
+Extend one plugin (ResourceStrategyFit) to meet the above needs:
+- **ResourceStrategyFit**: Provide users with the **MostAllocated** and **LeastAllocated** strategies so that they can decide whether a task should be dispersed or aggregated according to its needs.
+
+- **sra**: Provide a per-resource weight so that tasks which do not need a scarce resource avoid nodes that have it, thereby improving the utilization of scarce resources.
 
-- ResourceStrategyFit
+- **proportional**: Provide a proportion setting that prevents common tasks from being scheduled to nodes with scarce resources, thereby improving the utilization of scarce resources.
 
 ## User Story
 
@@ -35,20 +46,20 @@ Extend one plug-ins to meet the above needs
 ### ResourceStrategyFit
 
 config:
-```
-actions: "enqueue, allocate, backfill, reclaim, preempt"
-tiers:
-- plugins:
-  - name: resource-strategy-fit
-    arguments:
-      resourceStrategyFitWeight: 10
-      resources:
-        nvidia.com/gpu:
-          type: MostAllocated
-          weight: 2
-        cpu:
-          type: LeastAllocated
-          weight: 1
+```yaml
+actions: "enqueue, allocate, backfill, reclaim, preempt"
+tiers:
+- plugins:
+  - name: resource-strategy-fit
+    arguments:
+      resourceStrategyFitWeight: 10
+      resources:
+        nvidia.com/gpu:
+          type: MostAllocated
+          weight: 2
+        cpu:
+          type: LeastAllocated
+          weight: 1
 ```
 config description:
 
@@ -79,6 +90,112 @@ node score:
 ```
 finalScoreNode = [(weight1 * resource1) + (weight2 * resource2) + … + (weightN * resourceN)] / (weight1 + weight2 + … + weightN)
 ```
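For illustration, a minimal Go sketch of this weighted combination follows; the per-resource scores and names are made up for the example and are not the plugin's actual code:

```go
package main

import "fmt"

// resourceScore holds one resource's normalized score and its configured weight,
// e.g. nvidia.com/gpu scored by MostAllocated with weight 2.
type resourceScore struct {
	score  float64
	weight float64
}

// finalScoreNode = [(weight1*resource1) + ... + (weightN*resourceN)] / (weight1 + ... + weightN)
func finalScoreNode(scores []resourceScore) float64 {
	var weighted, weightSum float64
	for _, s := range scores {
		weighted += s.weight * s.score
		weightSum += s.weight
	}
	if weightSum == 0 {
		return 0
	}
	return weighted / weightSum
}

func main() {
	// Illustrative values: gpu (MostAllocated, weight 2) scores 80,
	// cpu (LeastAllocated, weight 1) scores 40.
	fmt.Println(finalScoreNode([]resourceScore{{80, 2}, {40, 1}})) // (2*80 + 1*40) / 3 ≈ 66.67
}
```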
+### Scarce Resource Avoidance (SRA)
+#### Policy Description
+- `sra`: Give each scarce resource a weight. A higher weight means the resource is more scarce. Jobs that do not need a scarce resource try to stay off nodes that have it, so the scarce resource can be retained.
+> **Notes**: The `sra` policy differs from the `proportional` policy: `proportional` is a "**hard**" policy that sets a proportion of secondary resources and filters out nodes that do not meet it, while `sra` is a "**soft**" policy that assigns weights to important resources and tries to steer tasks toward nodes that do not have scarce resources.
+
+#### Solution
+1. In the sra policy, the argument `sra.resources` is provided to configure the important resources in the cluster.
+2. Based on the significance of different resources, `sra.resourceWeight.[ResourceName]` can assign varying weights. A higher weight signifies greater importance, and tasks that do not require this resource will, to the extent possible, be scheduled away from nodes that have it.
+3. For all tasks, the user can set the `sra.resources` and `sra.resourceWeight.[ResourceName]` fields in the `resource-strategy-fit` arguments via `volcano-scheduler-configmap` in the following format:
+
+```yaml
+actions: "enqueue, reclaim, allocate, backfill, preempt"
+tiers:
+- plugins:
+  - name: resource-strategy-fit
+    arguments:
+      sra:
+        enable: true
+        resources: nvidia.com/t4, nvidia.com/a10
+        weight: 2
+        resourceWeight:
+          nvidia.com/t4: 1
+          nvidia.com/a10: 1
+```
+
+4. The `sra` policy affects the node ordering score. The higher the node score, the higher its priority. The sra score is calculated as follows:
+```
+if a resource requested by the pod is not on the node, sraScore = 0
+otherwise, sraScore = MaxNodeScore * sraWeight * (weight_1 + weight_2 + ··· + weight_n) / weightSum
+```
+> `sraScore`: the node score of the sra policy. \
+> `MaxNodeScore`: the maximum node score, 100 by default. \
+> `sraWeight`: the weight of the sra policy. \
+> `weightSum`: the sum of all sra resource weights. \
+> `weight_x`: the weight of the xth sra resource that does not exist on the node.
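For illustration, a minimal Go sketch of this scoring rule follows; the function signature and maps are assumptions made for the example, not the plugin's actual API:

```go
package main

import "fmt"

const maxNodeScore = 100.0

// sraScore implements the rule above:
//   if a scarce resource requested by the pod is not on the node -> 0
//   otherwise -> MaxNodeScore * sraWeight * (sum of weights of sra resources absent from the node) / weightSum
func sraScore(podRequests, nodeResources map[string]bool, resourceWeight map[string]float64, sraWeight float64) float64 {
	var weightSum, absentWeight float64
	for name, w := range resourceWeight {
		weightSum += w
		if !nodeResources[name] {
			// The pod needs this scarce resource but the node lacks it: the node cannot serve it.
			if podRequests[name] {
				return 0
			}
			// Scarce resource absent from the node: it contributes to the score.
			absentWeight += w
		}
	}
	if weightSum == 0 {
		return 0
	}
	return maxNodeScore * sraWeight * absentWeight / weightSum
}

func main() {
	weights := map[string]float64{"nvidia.com/t4": 1, "nvidia.com/a10": 1}
	// cpu-task-0 (no scarce resources requested) on node2 (has t4, lacks a10):
	// 100 * 2 * 1 / 2 = 100, matching the score table below.
	fmt.Println(sraScore(map[string]bool{}, map[string]bool{"nvidia.com/t4": true}, weights, 2))
}
```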
+
+#### Score Calculation
+1. Now, we have the following tasks:
+
+| Task Name  | Task request resource                                        |
+|------------|--------------------------------------------------------------|
+| cpu-task-0 | `{cpu: 2, memory: 4Gi}`                                      |
+| gpu-task-0 | `{cpu: 2, memory: 4Gi, nvidia.com/t4: 2}`                    |
+| gpu-task-1 | `{cpu: 2, memory: 4Gi, nvidia.com/t4: 1, nvidia.com/a10: 2}` |
+
+2. Suppose there are 3 nodes available in the cluster:
+
+| Node Name | Resource capacity on node                                       |
+|-----------|-----------------------------------------------------------------|
+| node1     | `{cpu: 32, memory: 64Gi}`                                       |
+| node2     | `{cpu: 16, memory: 32Gi, nvidia.com/t4: 10}`                    |
+| node3     | `{cpu: 16, memory: 32Gi, nvidia.com/t4: 5, nvidia.com/a10: 10}` |
+
+3. Through the sra policy we get the following results:
+
+| Task       | Node  | Score result (sra) | Notes                                                          |
+|------------|-------|--------------------|----------------------------------------------------------------|
+| cpu-task-0 | node1 | 200                | node resources meet the task request and no scarce resources   |
+| cpu-task-0 | node2 | 100                | node resources meet the task request but the node has t4       |
+| cpu-task-0 | node3 | 0                  | node resources meet the task request but the node has t4, a10  |
+| gpu-task-0 | node1 | 0                  | node resources don't meet the task request                     |
+| gpu-task-0 | node2 | 100                | node resources meet the task request and the node has t4       |
+| gpu-task-0 | node3 | 0                  | node resources meet the task request but the node has a10      |
+| gpu-task-1 | node1 | 0                  | node resources don't meet the task request                     |
+| gpu-task-1 | node2 | 0                  | node resources don't meet the task request                     |
+| gpu-task-1 | node3 | 0                  | node resources meet the task request and the node has t4, a10  |
+
+### proportional
+#### Policy Description
+- `proportional`: Specify 'primary' scarce resources (e.g., GPU in deep learning) and preserve the amount of associated 'secondary' resources according to a pre-set proportion.
+
+![](./images/proportional-diagram.png)
+#### Solution
+Firstly set the proportion binding in `volcano-scheduler.conf`:
+
+```yaml
+actions: "enqueue, reclaim, allocate, backfill, preempt"
+tiers:
+- plugins:
+  - name: resource-strategy-fit
+    arguments:
+      proportional:
+        enable: true
+        resources: nvidia.com/gpu,nvidia.com/v100-sxm2-16gb
+        resourceProportion:
+          nvidia.com/gpu.cpu: 8
+          nvidia.com/gpu.memory: 8
+          nvidia.com/v100-sxm2-16gb.cpu: 16
+          nvidia.com/v100-sxm2-16gb.memory: 16
+```
+
+The proportion is GPU:CPU:MEMORY = 1:8:8; take the test scenario just as above:
+
+| Node     | NodeAllocatable                       | NodeIdle                              |
+|----------|---------------------------------------|---------------------------------------|
+| nodeC0-0 | cpu 74, memory 128G, nvidia.com/gpu 8 | cpu 74, memory 128G, nvidia.com/gpu 8 |
+
+| Job                   | Pod           | Resource                           | Node     | NodeAllocatable                       | NodeIdle                              |
+|-----------------------|---------------|------------------------------------|----------|---------------------------------------|---------------------------------------|
+| default/single-1000-0 | single-1000-0 | cpu 8, memory 8G, nvidia.com/gpu 0 | nodeC0-0 | cpu 74, memory 128G, nvidia.com/gpu 8 | cpu 66, memory 120G, nvidia.com/gpu 8 |
+| default/single-1000-1 | single-1000-1 | cpu 8, memory 8G, nvidia.com/gpu 0 | -        | -                                     | -                                     |
+
+After job single-1000-0 is scheduled, the idle resource is 8 GPUs, 66 CPUs, 120G memory. During the predicate phase, this plugin calculates the resources that would be left if job single-1000-1 were scheduled `(node.Idle.CPU - task.Resreq.CPU < node.Idle.GPU * cpuRatio ||
+node.Idle.Memory - task.Resreq.Memory < node.Idle.GPU * memoryRatio)`; the result is 8 GPUs, 58 CPUs, 112G memory, which does not satisfy the 1:8:8 proportion. Therefore, nodeC0-0 is removed from the predicateNodes, and NodeIdle remains 8 GPUs, 66 CPUs, 120G memory.
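A quick check of the numbers above: for single-1000-1, node.Idle.CPU - task.Resreq.CPU = 66 - 8 = 58, while node.Idle.GPU * cpuRatio = 8 * 8 = 64; since 58 < 64, the CPU branch of the condition fires and the node is filtered out (memory alone would have passed, since 120 - 8 = 112 is not less than 8 * 8 = 64).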
+
+For more details, please refer to: [proportional design](./proportional.md)
 
 ## Syntax Rules
 ### Wildcard Syntax Support
@@ -115,6 +232,7 @@ resources:
 - **Configuration-time validation**: Invalid wildcard patterns are filtered during plugin initialization with warning logs
 - **Runtime matching**: Uses an O(n) prefix matching algorithm with an exact-match optimization (see the sketch after this list)
 - **Backward compatibility**: Existing exact match configurations continue to work unchanged
+- **Support Policy**: Wildcards are only supported for `ResourceStrategyFit`; `proportional` and `sra` do not support them.
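As a rough, hypothetical sketch of that matching step, assuming patterns are either exact resource names or prefix wildcards such as `nvidia.com/*` (the plugin's actual syntax and data structures may differ):

```go
package main

import (
	"fmt"
	"strings"
)

// matchResource returns the first configured pattern that matches resourceName.
// Exact matches are checked first (the "exact match optimization"); otherwise each
// pattern is tried in turn, so matching is linear in the number of patterns.
func matchResource(resourceName string, exact map[string]struct{}, prefixPatterns []string) (string, bool) {
	if _, ok := exact[resourceName]; ok {
		return resourceName, true
	}
	for _, p := range prefixPatterns {
		// "nvidia.com/*" matches any resource name that starts with "nvidia.com/".
		prefix := strings.TrimSuffix(p, "*")
		if strings.HasPrefix(resourceName, prefix) {
			return p, true
		}
	}
	return "", false
}

func main() {
	exact := map[string]struct{}{"cpu": {}}
	patterns := []string{"nvidia.com/*"}
	fmt.Println(matchResource("nvidia.com/t4", exact, patterns)) // nvidia.com/* true
	fmt.Println(matchResource("cpu", exact, patterns))           // cpu true
	fmt.Println(matchResource("memory", exact, patterns))        // false
}
```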

 ## Performance Considerations

@@ -139,4 +257,30 @@ resources:
 ## Alternatives
 
 ### Binpack VS ResourceStrategyFit
-If you want to use the clustering strategy for all resource types, you can choose the Binpack plugin. If you need to configure different clustering or scattering strategies for different resource types, you can choose the ResourceStrategyFit plugin. ResourceStrategyFit can also achieve the same results as Binpack by adjusting configuration parameters.
+If you want to use an aggregation strategy for all resource types, you can choose the Binpack plugin. If you need to configure different aggregation or dispersion strategies for different resource types, you can choose the ResourceStrategyFit plugin. ResourceStrategyFit can also achieve the same results as Binpack by adjusting configuration parameters.
+
+## Best Practices
+### AI scenario
+In some AI scenarios, CPU tasks are usually dispersed across CPU machine groups to reduce hot spots, while GPU tasks are gathered in GPU machine groups to reduce GPU fragmentation. At the same time, it is necessary to avoid CPU tasks being assigned to GPU nodes, which would leave GPU tasks waiting for a long time because those nodes run short of CPU or memory. In this scenario, we can combine **resourceStrategyFit** and the **sra policy**; an example configuration is shown below:
+
+```yaml
+actions: "enqueue, allocate, backfill, reclaim, preempt"
+tiers:
+- plugins:
+  - name: resource-strategy-fit
+    arguments:
+      resourceStrategyFitWeight: 10
+      resources:
+        nvidia.com/gpu:
+          type: MostAllocated
+          weight: 2
+        cpu:
+          type: LeastAllocated
+          weight: 1
+      sra:
+        enable: true
+        resources: nvidia.com/gpu
+        weight: 10
+        resourceWeight:
+          nvidia.com/gpu: 1
+```
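A quick check against the sra formula above: with this configuration, a pod that requests no nvidia.com/gpu gets sraScore = 100 * 10 * 1 / 1 = 1000 on a node without nvidia.com/gpu and sraScore = 0 on a GPU node, so CPU-only work is steered toward the CPU machine group, while GPU tasks are packed onto GPU nodes by the MostAllocated strategy.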
