At present, there are few scheduling strategies for resource types. When users want to schedule by resource type in some special scenarios, there is a lack of refined strategies to choose from. This plugin attempts to improve this part of the strategy set so as to cover more application scenarios for users.
## Motivation
- **resourceStrategyFit**: The native k8s NodeResourcesFit plugin can only adopt a single strategy for all resources, such as MostRequestedPriority or LeastRequestedPriority. However, in industrial practice, this design is not applicable in some scenarios. For example, in AI scenarios we usually disperse CPU tasks across CPU machine groups to reduce hot spots, and gather GPU tasks in GPU machine groups to reduce GPU fragmentation. Therefore, we need an extended scheduling strategy to meet the needs of this scenario.
- **sra**: In a cluster where GPU nodes and CPU nodes are deployed together, a task that does not require GPU resources may be scheduled to a GPU node. This can lead to inefficient resource utilization: a job requiring both CPU and GPU resources may stay pending because the GPU node has run out of CPU. This scenario is not ideal, as it results in underutilization of resources and potential delays in job execution. Therefore, sra is proposed to keep CPU-only jobs away from nodes with critical resources (such as GPUs) and to improve overall resource utilization.
- **proportional**: In a cluster where CPU tasks and GPU tasks are mixed, CPU tasks are sometimes scheduled to GPU nodes, leaving some GPU tasks in a long-term Pending state for lack of CPU resources. To address this, we can designate a 'primary' resource (e.g., GPU in deep learning) and preserve the amount of associated 'secondary' resources by a pre-set proportion. This policy takes effect in the predicate stage and ensures that a node's idle secondary resources stay sufficient, in the configured proportion, for the node's scarce resources.
### Goals
- Different types of resources can be configured with different strategies to prioritize them in the form of weights.
- Different resources can be configured with different weights, which are used to indicate the scarcity of resources.
- Different resources can set the proportion of secondary resources to ensure that the idle secondary resources on the node are sufficient.
### Non-Goals
- None.
## Proposal
Extend one plugin (ResourceStrategyFit) to meet the above needs:
- **ResourceStrategyFit**: Provide users with two scoring strategies, **MostAllocated** and **LeastAllocated**, so that they can decide whether tasks should be aggregated or dispersed according to the needs of each task.
- **sra**: Provide users with a per-resource weighting mechanism that avoids scheduling common tasks to nodes with scarce resources, thereby improving the utilization of scarce resources.
- **proportional**: Provide users with a proportion-based mechanism that prevents common tasks from being scheduled to nodes holding scarce resources, thereby improving the utilization of scarce resources.
## User Story
### sra

#### Policy Description
- `sra`: Give each scarce resource a weight. Higher weight means more scarce. Jobs that don’t need the scarce resource will try to stay off nodes that have it, so the scarce resource can be retained.
> **Notes**: The `sra` policy differs from the `proportional` policy: `proportional` is a "**hard**" policy that sets a proportion of secondary resources and prevents tasks from being scheduled to nodes that do not meet that proportion, while `sra` is a "**soft**" policy that assigns weights to important resources to try to guide task scheduling toward nodes that do not hold scarce resources.
#### Solution
1. In the sra policy, the argument `sra.resources` is provided to configure the important resources in the cluster.
2. Based on the significance of different resources, `sra.resourceWeight.[ResourceName]` can assign a different weight to each resource. A higher weight signifies greater importance, and tasks that do not require such a resource will, to the extent possible, be scheduled away from nodes that hold it.
3. For all tasks, users can set the `sra.resources` and `sra.resourceWeight.[ResourceName]` fields in the `resource-strategy-fit` arguments via `volcano-scheduler-configmap` in the following format:
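   A sketch of the expected configmap format is shown below. The resource names (`nvidia.com/t4`, `nvidia.com/a10`) and the exact argument keys (in particular `sra.weight` for the policy weight) are illustrative assumptions, not confirmed names:

   ```yaml
   kind: ConfigMap
   apiVersion: v1
   metadata:
     name: volcano-scheduler-configmap
     namespace: volcano-system
   data:
     volcano-scheduler.conf: |
       actions: "enqueue, allocate, backfill"
       tiers:
       - plugins:
         - name: resource-strategy-fit
           arguments:
             sra.weight: 10                                # assumed key for sraWeight, the weight of the sra policy
             sra.resources: nvidia.com/t4,nvidia.com/a10   # scarce resources to protect
             sra.resourceWeight.nvidia.com/t4: 1           # higher weight means more scarce
             sra.resourceWeight.nvidia.com/a10: 2
   ```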
4. The `sra` policy affects the node order score: the higher the node score, the higher the node's priority. The sra score is calculated as follows:
```
the resource requested by the pod is not on the node:  sraScore = 0
the resource requested by the pod is on the node:      sraScore = MaxNodeScore * sraWeight * (weight_1 + weight_2 + ··· + weight_n) / weightSum
```
> `sraScore`: the node score of the sra policy. \
> `MaxNodeScore`: the maximum score of a node; the default is 100. \
> `sraWeight`: the weight of the sra policy. \
> `weightSum`: the sum of all sra resource weights. \
> `weight_x`: the weight of the x-th sra resource that does not exist on the node.
| Task | Node | sraScore | Notes |
| --- | --- | --- | --- |
| gpu-task-1 | node3 | 0 | node resources meet the task request and have t4, a10 |
### proportional
#### Policy Description
- `proportional`: Designate 'primary' scarce resources (e.g., GPU in deep learning) and preserve the amount of associated 'secondary' resources by a pre-set proportion.
#### Solution
First, set the proportion binding in `volcano-scheduler.conf`:
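For the 1:8:8 GPU:CPU:memory proportion used in the example below, the binding might look like the following sketch. The argument keys are borrowed from the standalone proportional plugin and may differ in this plugin:

```yaml
actions: "enqueue, allocate, backfill"
tiers:
- plugins:
  - name: proportional                                  # or the proportional policy of resource-strategy-fit
    arguments:
      proportional.resources: nvidia.com/gpu            # the 'primary' scarce resource
      proportional.resources.nvidia.com/gpu.cpu: 8      # cpuRatio: keep 8 idle CPUs per idle GPU
      proportional.resources.nvidia.com/gpu.memory: 8   # memoryRatio: keep 8G idle memory per idle GPU
```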
After job single-1000-0 is scheduled, the idle resource is 8 GPUs, 66 CPUs, 120G memory. During the predicate phase, this plugin calculates the resources left if job single-1000-1 were scheduled `(node.Idle.CPU - task.Resreq.CPU < node.Idle.GPU * cpuRatio || node.Idle.Memory - task.Resreq.Memory < node.Idle.GPU * memoryRatio)`; the result is 8 GPUs, 58 CPUs, 112G memory, which does not satisfy the 1:8:8 proportion. Therefore, nodeC0-0 is removed from the predicateNodes, and NodeIdle remains 8 GPUs, 66 CPUs, 120G memory.
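Concretely, with cpuRatio = 8 and memoryRatio = 8, and assuming job single-1000-1 requests 0 GPUs, 8 CPUs, and 8G memory (the difference between the two idle snapshots above), the predicate evaluates as:

```
node.Idle = 8 GPUs, 66 CPUs, 120G     task.Resreq = 0 GPUs, 8 CPUs, 8G

CPU check:    66 - 8  = 58  <  8 * 8 = 64   -> violated
memory check: 120 - 8 = 112 >= 8 * 8 = 64   -> satisfied
```

One violated check is enough, so nodeC0-0 fails the predicate for this task.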
For more details, please refer to: [proportional design](./proportional.md)
## Syntax Rules
### Wildcard Syntax Support
- **Configuration-time validation**: Invalid wildcard patterns are filtered during plugin initialization with warning logs
- **Runtime matching**: Uses an O(n) prefix-matching algorithm with an exact-match optimization
- **Backward compatibility**: Existing exact match configurations continue to work unchanged
- **Support Policy**: Only `ResourceStrategyFit` is supported; `proportional` and `sra` are not.
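For instance, assuming prefix wildcards of the form `nvidia.com/*` (a sketch against the `resources` map of this plugin):

```yaml
resources:
  nvidia.com/*:          # prefix wildcard: matches nvidia.com/t4, nvidia.com/a10, ...
    type: MostAllocated
    weight: 2
  cpu:                   # exact matches continue to work unchanged
    type: LeastAllocated
    weight: 1
```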
## Performance Considerations
## Alternatives
### Binpack VS ResourceStrategyFit
If you want an aggregation strategy for all resource types, you can choose the Binpack plugin. If you need to configure different aggregation or dispersion strategies for different resource types, you can choose the ResourceStrategyFit plugin. ResourceStrategyFit can also achieve the same results as Binpack by adjusting configuration parameters.
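For example, configuring **MostAllocated** for every resource type approximates Binpack's behavior. This is a sketch; the `resourceStrategyFitWeight` key and the weights are illustrative:

```yaml
- plugins:
  - name: resource-strategy-fit
    arguments:
      resourceStrategyFitWeight: 10
      resources:
        nvidia.com/gpu:
          type: MostAllocated   # aggregate GPU workloads, like Binpack
          weight: 2
        cpu:
          type: MostAllocated   # aggregate CPU workloads as well
          weight: 1
```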
## Best Practices
### AI scenario
In some AI scenarios, CPU tasks are usually dispersed across CPU machine groups to reduce hot spots, while GPU tasks are gathered in GPU machine groups to reduce GPU fragmentation. At the same time, CPU tasks must be kept from occupying GPU nodes, where they can leave GPU tasks waiting for a long time because the node runs out of CPU or memory. In this scenario, we can combine **resourceStrategyFit** and the **sra policy**. The corresponding example configuration is as follows:
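The following is a sketch that combines both; the argument keys are assumed from the sections above and may differ in the implementation:

```yaml
tiers:
- plugins:
  - name: resource-strategy-fit
    arguments:
      resourceStrategyFitWeight: 10
      resources:
        cpu:
          type: LeastAllocated              # disperse CPU tasks to reduce hot spots
          weight: 1
        nvidia.com/gpu:
          type: MostAllocated               # aggregate GPU tasks to reduce fragmentation
          weight: 2
      sra.weight: 10                        # assumed key for the sra policy weight
      sra.resources: nvidia.com/gpu         # treat GPU as the scarce resource
      sra.resourceWeight.nvidia.com/gpu: 1  # steer CPU-only tasks away from GPU nodes
```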