gpud-v0.9.0-alpha.27
Pre-release
What's Changed
- [LEP-3205] fix(components/nvidia): handle "GPU requires reset" NVML error to set suggested actions by @gyuho in #1133
- [LEP-3196] feat(nvml): mark H100/H200 fabric state as supported by @gyuho in #1131
- [LEP-2916] feat(cmd/gpud): add control plane flag overrides by @gyuho in #1138
- [LEP-3107] feat(host): replace "uptime -s" with boottime syscall for reboot store by @gyuho in #1122
- [LEP-2852] feat(pkg/asn): implement fallback for AS lookup by @gyuho in #1112
- [LEP-1995] feat(eventstore, metadata): drop unused table references since v0.5 (safe to drop) by @gyuho in #1123
- nits(pkg/host): remove unnecessary cpu name fallback in favor of upstream gopsutil by @gyuho in #1121
- [LEP-3269] fix(nvidia/fabric-manager): consider GPU device count for fabric state checks for H100/H200/GB200 by @gyuho in #1140
- [LEP-3114] feat(custom-plugins): drop unused/deprecated plugin "type" field from Spec by @gyuho in #1124
- [LEP-2892] feat(nvidia/xid): extract sub code/XID for XID 144-150 by @gyuho in #1120
- [LEP-3109] fix(pkg/disk): handle raid/md devices, recursive lsblk parse for more depth by @gyuho in #1111
- [LEP-3108] feat(infiniband): move code to components/infiniband package (no more under nvidia-query package) by @gyuho in #1128
- [LEP-3195] [LEP-3196] feat(cmd/gpud): add --gpu-uuids-with-gpu-lost, --gpu-uuids-with-gpu-requires-reset, --gpu-uuids-with-fabric-state-health-summary-unhealthy flags for testing by @gyuho in #1139
- [LEP-3112] feat(cmd/gpud): remove "gpud login" command in favor of "gpud up", remove/move unused code by @gyuho in #1125
- nits(tailscale/derpmap): sync more derp servers by @gyuho in #1132
- dep(go.mod): pin docker to v26.1.5, and update docker/containerd to latest stable in Dockerfile by @gyuho in #1135
- [LEP-2947] feat(nvlink): support global nvlink states thresholds by @gyuho in #1087
- [LEP-2905] feat(nvidia/infiniband): make ib port drop event more persistent, rename more packages to nvidia/ by @gyuho in #1110
- [LEP-1876] [LEP-3195] feat(nvidia/hw-slowdown): retain threshold exceeded events until set healthy by @gyuho in #1057
- [LEP-3113] feat(nvidia/xid,sxid): stop persisting SetHealthy events by @gyuho in #1126
- [LEP-3115] fix(pkg/host): do not attempt to create table every time we load reboot events store by @gyuho in #1127
- [LEP-2836] feat(gossip/machine-info): include tailscale/containerd versions by @gyuho in #1143
- [LEP-3363] fix(nvidia/fabric-manager): mark H100 PCIe products as not supporting fabric manager by @gyuho in #1147
- [LEP-3336] feat(session/login): show login failures in "gpud status" by @gyuho in #1145
- [LEP-2916] feat(gpud run): perform login when the optional --token flag value is set by @gyuho in #1148
- [LEP-3329] fix(nvidia/fabric-manager): do not skip activeness check when fabric state is supported by @gyuho in #1151
- fix(components/containerd): remove "ok" prefix when unhealthy by @gyuho in #1152
- chore(deps): bump golang.org/x/crypto from 0.36.0 to 0.45.0 by @dependabot[bot] in #1150
- doc(pkg/login): document success/failure cases with error code by @gyuho in #1157
- [LEP-3348] feat(gpud): optionally set data dir for testing (vs. /var/lib/gpud) by @gyuho in #1153
- fix(gpud/run): do not overwrite session token in metadata by @gyuho in #1159
- [LEP-3422] feat(nvidia/nvlink): log when threshold is exceeded by @gyuho in #1158
- Fix note on install script architecture support by @martbhell in #1161
- feat(*): add SECURITY.md and THIRD-PARTY.txt by @gyuho in #1166
- feat(nvidia/xid): set health state Unhealthy (not Degraded) for first occurrences of xid 65 + 94 by @gyuho in #1165
- [LEP-3499] [LEP-3539] [LEP-3540] fix(nvidia/xid): separate sub-code fatality for XID 144-150, better reason message by @gyuho in #1160
- [LEP-3519] [LEP-3422] feat(gpud/run): add "--skip-session-update-config" flag by @gyuho in #1162
- [LEP-3306] feat(disk, memory): log at warn level when threshold is exceeded in case of node pressure eviction by @gyuho in #1163
- fix(ci): bump up golangci-lint to latest by @gyuho in #1168
- ci: create helm dedicated workflow by @giuliocalzo in #1169
- ci: remove external actions by @giuliocalzo in #1173
- remove: golangci lint binary (unused) by @gyuho in #1176
New Contributors
- @martbhell made their first contribution in #1161
- @giuliocalzo made their first contribution in #1169
Full Changelog: v0.8.0...v0.9.0-alpha.27