gpud-v0.9.0-alpha.27

Pre-release

github-actions released this 10 Dec 09:53
· 5 commits to main since this release

What's Changed

  • [LEP-3205] fix(components/nvidia): handle "GPU requires reset" NVML error to set suggested actions by @gyuho in #1133
  • [LEP-3196] feat(nvml): mark h100/h200 fabric state as supported by @gyuho in #1131
  • [LEP-2916] feat(cmd/gpud): add control plane flag overrides by @gyuho in #1138
  • [LEP-3107] feat(host): replace "uptime -s" with boottime syscall for reboot store by @gyuho in #1122
  • [LEP-2852] feat(pkg/asn): implement fallback for as lookup by @gyuho in #1112
  • [LEP-1995] feat(eventstore, metadata): drop unused table references since v0.5 (safe to drop) by @gyuho in #1123
  • nits(pkg/host): remove unnecessary cpu name fallback in favor of upstream gopsutil by @gyuho in #1121
  • [LEP-3269] fix(nvidia/fabric-manager): consider GPU device count for fabric state checks for H100/H200/GB200 by @gyuho in #1140
  • [LEP-3114] feat(custom-plugins): drop unused/deprecated plugin "type" field from Spec by @gyuho in #1124
  • [LEP-2892] feat(nvidia/xid): extract sub code/XID for XID 144-150 by @gyuho in #1120
  • [LEP-3109] fix(pkg/disk): handle raid/md devices, recursive lsblk parse for more depth by @gyuho in #1111
  • [LEP-3108] feat(infiniband): move code to components/infiniband package (no more under nvidia-query package) by @gyuho in #1128
  • [LEP-3195] [LEP-3196] feat(cmd/gpud): add --gpu-uuids-with-gpu-lost, --gpu-uuids-with-gpu-requires-reset, --gpu-uuids-with-fabric-state-health-summary-unhealthy flags for testing by @gyuho in #1139
  • [LEP-3112] feat(cmd/gpud): remove "gpud login" command in favor of "gpud up", remove/move unused code by @gyuho in #1125
  • nits(tailscale/derpmap): sync more derp servers by @gyuho in #1132
  • dep(go.mod): pin docker to v26.1.5, and update docker/containerd to latest stable in Dockerfile by @gyuho in #1135
  • [LEP-2947] feat(nvlink): support global nvlink states thresholds by @gyuho in #1087
  • [LEP-2905] feat(nvidia/infiniband): make ib port drop event more persistent, rename more packages to nvidia/ by @gyuho in #1110
  • [LEP-1876] [LEP-3195] feat(nvidia/hw-slowdown): retain threshold exceeded events until set healthy by @gyuho in #1057
  • [LEP-3113] feat(nvidia/xid,sxid): stop persisting SetHealthy events by @gyuho in #1126
  • [LEP-3115] fix(pkg/host): do not attempt to create table everytime we load reboot events store by @gyuho in #1127
  • [LEP-2836] feat(gossip/machine-info): include tailscale/containerd versions by @gyuho in #1143
  • [LEP-3363] fix(nvidia/fabric-manager): mark H100 with PCIe product not supporting fabric manager by @gyuho in #1147
  • [LEP-3336] feat(session/login): show login failures in "gpud status" by @gyuho in #1145
  • [LEP-2916] feat(gpud run): perform login when --token flag value (optional) by @gyuho in #1148
  • [LEP-3329] fix(nvidia/fabric-manager): do not skip activeness check when fabric state is supported by @gyuho in #1151
  • fix(components/containerd): remove "ok" prefix when unhealthy by @gyuho in #1152
  • chore(deps): bump golang.org/x/crypto from 0.36.0 to 0.45.0 by @dependabot[bot] in #1150
  • doc(pkg/login): document success/failure cases with error code by @gyuho in #1157
  • [LEP-3348] feat(gpud): optionally set data dir for testing (vs. /var/lib/gpud) by @gyuho in #1153
  • fix(gpud/run): do not overwrite session token in metadata by @gyuho in #1159
  • [LEP-3422] feat(nvidia/nvlink): log when threshold is exceeded by @gyuho in #1158
  • Fix note on install script architecture support by @martbhell in #1161
  • feat(*): add SECURITY.md and THIRD-PARTY.txt by @gyuho in #1166
  • feat(nvidia/xid): set health state Unhealthy (not Degraded) for first occurrences of xid 65 + 94 by @gyuho in #1165
  • [LEP-3499] [LEP-3539] [LEP-3540] fix(nvidia/xid): separate sub-code fatality for XID 144-150, better reason message by @gyuho in #1160
  • [LEP-3519] [LEP-3422] feat(gpud/run): add "--skip-session-update-config" flag by @gyuho in #1162
  • [LEP-3306] feat(disk, memory): log warn level when threshold exceeds in case of node pressure eviction by @gyuho in #1163
  • fix(ci): bump up golangci-lint to latest by @gyuho in #1168
  • ci: create helm dedicated workflow by @giuliocalzo in #1169
  • ci: remove external actions by @giuliocalzo in #1173
  • remove: golangci lint binary (unused) by @gyuho in #1176

Full Changelog: v0.8.0...v0.9.0-alpha.27