feat: add kep md #845
@@ -0,0 +1,114 @@
# Node Resource Fit plus Scheduling

<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
- [Design Consideration](#design-consideration)
- [Goals](#goals)
- [Non-Goals](#non-goals)
- [Proposal](#proposal)
- [Design Details](#design-details)
- [NodeResourcesFitPlus](#noderesourcesfitplus)
- [ScarceResourceAvoidance](#scarceresourceavoidance)
- [Test Plan](#test-plan)
- [Graduation Criteria](#graduation-criteria)
- [Alpha](#alpha)
- [Beta](#beta)
- [Implementation History](#implementation-history)
<!-- /toc -->

## Summary

The native Kubernetes NodeResourcesFit plug-in can only apply a single scoring strategy, such as MostRequestedPriority or LeastRequestedPriority, to all resources. In industrial practice this design does not fit some scenarios. For example, in AI clusters, workloads that request GPUs prefer to fill an entire GPU machine first to prevent GPU fragmentation, while workloads that request only CPU and memory should be spread across non-GPU machines first; otherwise CPU and memory on GPU machines are consumed, and tasks that actually request GPUs end up pending due to insufficient non-GPU resources.
> For the scenario of GPU nodes, which hold scarce resources, shouldn't we filter the GPU nodes out directly rather than just lowering their score? In addition, IIUC, GPU nodes (or other devices) are labeled (based on gpu-operator or NFD) and are generally filtered that way.

> Affinity strategies or nodeSelector require labeling nodes in advance, which is costly for cluster maintainers. The advantage of this strategy is that it removes that maintenance work.

> I think quite the opposite: we should provide feature labels for device-specific nodes (e.g. nvidia.com/gpu.xxx). 🤔

> Understood, labels can indeed be applied to distinguish machine types, and the same effect can be achieved with an affinity strategy. But my point is that this process has a cost at the level of industrial practice: 100 heterogeneous resource types mean maintaining 100 sets of labels.

> @googs1025 If you think that maintenance cost is not something Kubernetes needs to consider, then the second extension strategy indeed does not need to be incorporated.

> @googs1025 Does the ScarceResourceAvoidance strategy have a clear conclusion? Accepted or not?

> This is not for me to decide and can be left to the other maintainers to suggest.

> @googs1025 Thank you for your feedback. Could you help get other reviewers to look at it?
Therefore, two plug-ins are extended to address this common problem.
> AFAICT it's uncommon for a single KEP to introduce two different concepts. If the concepts are closely coupled, can they be handled by the same plugin?
## Motivation

Cases:

- GPU tasks should preferentially pack onto whole GPU machines.
- CPU & memory tasks should be spread across CPU-only machines first.
Comment on lines +26 to +29:

> Are these use cases covered somehow by the DRA feature (https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/)?
## Design Consideration

- The solution is general purpose: it is not limited to AI clusters or CPU clusters, nor to native CPU resources or extended GPU resources.

- Different resource policies can be configured for different cluster types and prioritized in the form of weights.

- Easy to extend.
Comment on lines +33 to +37:

> These look like pros of your approach rather than the rationale for it, which is the topic of this section, where we usually explain the design decisions and motivations.
### Goals

- Different types of resources can be configured with different strategies and prioritized in the form of weights.

- Prevent pods that do not request scarce resources from being scheduled onto nodes that hold those scarce resources.
> What is the use case beyond GPUs? Above you mention CPU/MEM (commodity) and GPU (scarce resource?).
### Non-Goals

- None.

## Proposal

Extend two plug-ins to meet the above needs:

- NodeResourcesFitPlus
- ScarceResourceAvoidance
## Design Details

### NodeResourcesFitPlus

config:
```
resources:
  nvidia.com/gpu:
    type: MostAllocated
    weight: 2
  cpu:
    type: LeastAllocated
    weight: 1
  memory:
    type: LeastAllocated
    weight: 1
```
config description:
<p align="center"><img src="images/img1.png" title="Key components" width="600" class="center"/></p>

node score:
```
finalScoreNode = [(weight1 * resource1) + (weight2 * resource2) + … + (weightN * resourceN)] / (weight1 + weight2 + … + weightN)
```
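To make the weighted formula concrete, here is a minimal Go sketch with a small worked example. It is an illustration, not the plugin's actual implementation: the `ResourcePolicy` struct, helper names, and the request/allocatable numbers are assumptions; only the MostAllocated/LeastAllocated ratios and the weighted average mirror the formula above.

```go
package main

import "fmt"

const maxNodeScore = 100 // corresponds to framework.MaxNodeScore

// ResourcePolicy is a hypothetical stand-in for one entry of the config above.
type ResourcePolicy struct {
	Type   string // "MostAllocated" or "LeastAllocated"
	Weight int64
}

// resourceScore scores a single resource on a node: MostAllocated favors packing
// (higher score the fuller the node), LeastAllocated favors spreading.
func resourceScore(p ResourcePolicy, requested, allocatable int64) int64 {
	if allocatable <= 0 {
		return 0
	}
	switch p.Type {
	case "MostAllocated":
		return requested * maxNodeScore / allocatable
	case "LeastAllocated":
		return (allocatable - requested) * maxNodeScore / allocatable
	default:
		return 0
	}
}

// nodeScore applies finalScoreNode = sum(weight_i * score_i) / sum(weight_i).
func nodeScore(policies map[string]ResourcePolicy, requested, allocatable map[string]int64) int64 {
	var weighted, weights int64
	for name, p := range policies {
		weighted += p.Weight * resourceScore(p, requested[name], allocatable[name])
		weights += p.Weight
	}
	if weights == 0 {
		return 0
	}
	return weighted / weights
}

func main() {
	policies := map[string]ResourcePolicy{
		"nvidia.com/gpu": {Type: "MostAllocated", Weight: 2},
		"cpu":            {Type: "LeastAllocated", Weight: 1},
		"memory":         {Type: "LeastAllocated", Weight: 1},
	}
	// Illustrative numbers only; requested values already include the incoming pod.
	requested := map[string]int64{"nvidia.com/gpu": 4, "cpu": 16, "memory": 64}
	allocatable := map[string]int64{"nvidia.com/gpu": 8, "cpu": 64, "memory": 256}
	// GPU: 4/8 -> 50 (MostAllocated); CPU: 48/64 -> 75; memory: 192/256 -> 75 (LeastAllocated)
	// finalScoreNode = (2*50 + 1*75 + 1*75) / (2+1+1) = 62
	fmt.Println(nodeScore(policies, requested, allocatable))
}
```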
Comment on lines +76 to +79:

> Can we have a few user stories and/or examples to see how this would translate into practice in various usage scenarios?
### ScarceResourceAvoidance

config:
```
resources:
- nvidia.com/gpu
```
config description:
<p align="center"><img src="images/img2.png" title="Key components" width="600" class="center"/></p>

node score:
```
finalScoreNode = (allocatablesResourcesNum - requestsResourcesNum) * framework.MaxNodeScore / allocatablesResourcesNum
```
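For completeness, a literal Go sketch of the score computation above. The KEP text does not spell out how `allocatablesResourcesNum` and `requestsResourcesNum` are counted from the pod spec and node allocatable, so the sketch deliberately takes them as plain inputs (an assumption) and only shows the arithmetic and clamping into the framework's score range.

```go
package main

import "fmt"

const maxNodeScore = 100 // corresponds to framework.MaxNodeScore

// scarceResourceAvoidanceScore applies the KEP formula verbatim:
//   finalScoreNode = (allocatablesResourcesNum - requestsResourcesNum) * MaxNodeScore / allocatablesResourcesNum
// How the two counts are derived is left to the implementation; they are inputs here.
func scarceResourceAvoidanceScore(allocatablesResourcesNum, requestsResourcesNum int64) int64 {
	if allocatablesResourcesNum <= 0 {
		return 0
	}
	score := (allocatablesResourcesNum - requestsResourcesNum) * maxNodeScore / allocatablesResourcesNum
	// Clamp into the [0, MaxNodeScore] range expected by the scheduling framework.
	if score < 0 {
		return 0
	}
	if score > maxNodeScore {
		return maxNodeScore
	}
	return score
}

func main() {
	// Illustrative counts only: 3 resource types counted on the node, 2 counted from the pod.
	fmt.Println(scarceResourceAvoidanceScore(3, 2)) // (3-2)*100/3 = 33
}
```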
Comment on lines +91 to +93:

> Ditto.
### Test Plan

Comprehensive unit tests will be added to ensure that each functionality works as expected.

### Graduation Criteria

#### Alpha

- Implement the NodeResourcesFitPlus and ScarceResourceAvoidance scheduler plugins
- Provide a reference implementation of NodeResourcesFitPlus and ScarceResourceAvoidance
- Unit tests and integration tests from the [Test Plan](#test-plan)

#### Beta

- Add E2E tests.
- Provide beta-level documentation.

## Implementation History

- 2024-12-23: KEP created
> It is not recommended to use screenshots of tables and pictures.

> OK.

> I'm going to make adjustments.
@@ -0,0 +1,7 @@
title: Node Resource Fit plus Scheduling
kep-number: 624
authors:
  - "@LY-today"
owning-sig: sig-scheduling
creation-date: 2024-12-23
last-updated: 2024-12-23
> This seems very similar to an old plugin. Can you help tell the difference, or integrate it?
> FYI: https://github.com/kubernetes-sigs/scheduler-plugins/tree/master/kep/48-node-resources-allocatable-scoring

> @googs1025 The older policy can only apply one strategy to all resources, which is not suitable for complex resource scenarios such as AI.

> @googs1025 In an AI cluster, the hope is that GPU tasks are packed onto as few GPU machines as possible while CPU tasks are spread across CPU machines. The old policy does not support using different strategies for the two kinds of resources.

> As I mentioned, it seems very similar to the previous nodeResourcesAllocatable, and I don't think it needs to be extended with a new plugin. If possible, can it be integrated into the original plugin? 🤔

> @googs1025 Do you mean that you agree with the design of the NodeResourcesFitPlus strategy, but want it implemented by modifying the original nodeResourcesAllocatable strategy?

> @googs1025 Do I understand correctly?

> +1 to exploring extension of existing plugins before introducing a "plus" variant. In addition, I think the plugin name should convey its purpose more explicitly, so let's try to find a better name rather than appending "Plus" :)