Conversation

@edipascale
Contributor

No description provided.

Frostman previously approved these changes Dec 10, 2025
@edipascale edipascale marked this pull request as ready for review December 11, 2025 16:24
@edipascale edipascale requested review from a team as code owners December 11, 2025 16:24
@edipascale edipascale force-pushed the ema/populate-th5-vlans branch from 6022bb9 to 1e0ef59 Compare December 12, 2025 07:18
@edipascale
Contributor Author

@Frostman @pau-hedgehog there have been a couple of spurious CI failures, but beyond those, the upgrade jobs from 25.04 (and only those; upgrading from 25.05 appears to work just fine) always time out at 60 minutes. They always seem to get stuck at:

09:59:45 INF upgrade(control-1): Dec 12 09:59:45.323 INF Installing fabricator
09:59:45 INF upgrade(control-1): Dec 12 09:59:45.520 DBG Enforced kind=HelmChart name=fabricator-api result=updated
09:59:45 INF upgrade(control-1): Dec 12 09:59:45.524 DBG Enforced kind=HelmChart name=fabricator result=updated
09:59:45 INF upgrade(control-1): Dec 12 09:59:45.524 DBG Expected fabricator-ctrl image=172.30.0.1:31000/githedgehog/fabricator/fabricator:v0.43.1-60-g97acf358-f70950
09:59:45 INF upgrade(control-1): Dec 12 09:59:45.524 DBG Waiting for ready kind=Deployment name=fabricator-ctrl
09:59:50 INF upgrade(control-1): Dec 12 09:59:50.737 INF Waiting for fabricator applied
09:59:50 INF upgrade(control-1): Dec 12 09:59:50.738 DBG Waiting for ready kind=Fabricator name=default   <--- Last command before the job gets cancelled

I've tried looking in the artifacts for those jobs but I couldn't spot anything obvious, probably also because I am not quite sure what to look for. And since this is something happening consistently when the CI doesn't randomly implode, I don't feel comfortable just disabling the CI upgrade jobs for the sake of getting a green light. Any idea?

@pau-hedgehog
Contributor

  Dec 12 08:45:32.966 INF Log: NOS:   inflating: db_migrator.py serial=VM-1234567890 mac=0c:20:12:ff:01:00 level=INFO
  Dec 12 08:45:32.966 INF Log: NOS:   inflating: boot/vmlinuz-5.10.0-21-amd64 serial=VM-1234567890 mac=0c:20:12:ff:01:00 level=INFO
  Dec 12 08:45:32.966 INF Log: NOS:   inflating: boot/config-5.10.0-21-amd64 serial=VM-1234567890 mac=0c:20:12:ff:01:00 level=INFO
  Dec 12 08:45:33.129 INF Log: NOS: tar: write error: No space left on device serial=VM-1234567890 mac=0c:20:12:ff:01:00 level=INFO
  Dec 12 08:45:34.080 INF Log: Enforcing ONIE default boot entry serial=VM-1234567890 mac=0c:20:12:ff:01:00 level=WARN
  Dec 12 08:45:34.080 INF Log: Error during installation serial=VM-1234567890 mac=0c:20:12:ff:01:00 level=ERROR
  Dec 12 08:46:04.340 INF NOS install rid=control-1/wtNfcOnv0b-000371 platform=x86_64-kvm_x86_64-r0 serial=VM-1234567890 mac=0c:20:12:ff:01:00
  Dec 12 08:46:32.387 INF Log: NOS installer completed serial=VM-1234567890 mac=0c:20:12:ff:01:00 level=INFO

All 7 switches encountered "No space left on device" errors during the initial NOS installation (#1015). Despite the installation errors, all switches eventually booted successfully, but the lost time caused the workflow to time out with: "The job has exceeded the maximum execution time of 1h0m0s".
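
For anyone unsure what to look for in these artifacts, here is a minimal sketch of a filter over log lines in the format pasted above. It assumes each line ends with `key=value` fields (`serial=`, `mac=`, `level=`) as in the excerpt; the function names and the choice of `mac` as the per-switch key are illustrative assumptions, not part of any existing tooling. Note that the "No space left on device" line is logged at `level=INFO`, so filtering on level alone would miss it.

```python
import re

# Matches key=value fields such as mac=0c:20:12:ff:01:00 or level=ERROR.
FIELD_RE = re.compile(r"(\w+)=(\S+)")


def parse_fields(line):
    """Return the key=value fields of one log line as a dict."""
    return dict(FIELD_RE.findall(line))


def find_suspicious(lines, needles=("No space left on device",),
                    levels=("WARN", "ERROR")):
    """Return lines with a worrying level= field or a known error string."""
    hits = []
    for line in lines:
        level = parse_fields(line).get("level")
        if level in levels or any(needle in line for needle in needles):
            hits.append(line)
    return hits


def affected_macs(lines, needle="No space left on device"):
    """Return the set of mac= values on lines that contain `needle`.

    Using mac= as the per-switch key is an assumption of this sketch.
    """
    macs = set()
    for line in lines:
        fields = parse_fields(line)
        if needle in line and "mac" in fields:
            macs.add(fields["mac"])
    return macs


# Lines copied from the excerpt above.
SAMPLE = [
    "Dec 12 08:45:32.966 INF Log: NOS:   inflating: boot/config-5.10.0-21-amd64 serial=VM-1234567890 mac=0c:20:12:ff:01:00 level=INFO",
    "Dec 12 08:45:33.129 INF Log: NOS: tar: write error: No space left on device serial=VM-1234567890 mac=0c:20:12:ff:01:00 level=INFO",
    "Dec 12 08:45:34.080 INF Log: Enforcing ONIE default boot entry serial=VM-1234567890 mac=0c:20:12:ff:01:00 level=WARN",
    "Dec 12 08:45:34.080 INF Log: Error during installation serial=VM-1234567890 mac=0c:20:12:ff:01:00 level=ERROR",
]

if __name__ == "__main__":
    for hit in find_suspicious(SAMPLE):
        print(hit)
    print(sorted(affected_macs(SAMPLE)))
```

Run against a full log, `affected_macs` gives a quick count of how many distinct switches hit the error.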

@edipascale
Contributor Author

OK, that is a relief, but if this is a bug in 25.04, I still do not understand why it always happens with this PR and not with other PRs that have been opened since. 🤔

In any case, if this is happening regularly, should we disable the upgrade jobs from 25.04? What's the way forward?

@pau-hedgehog
Contributor

Good question. I can't explain it yet; I'll keep investigating.

@pau-hedgehog
Contributor

You were right. My analysis was wrong, because a 25.04 bug wouldn't affect the upgrade step in the workflow. After adding a commit to extend the timeout and gather show-tech, I could see what's really happening:

Dec 12 20:41:17.347 ERR Reconciler error controller=Fabricator controllerGroup=fabricator.githedgehog.com controllerKind=Fabricator Fabricator.name=default Fabricator.namespace=fab namespace=fab name=default reconcileID=017da6ff-5067-46c0-813f-39ec90dc8c3d err="enforcing fabricator and control install defaults: retrying create or update Fabricator/default: creating or updating Fabricator default: creating or updating object: admission webhook \"vfabricator.kb.io\" denied the request: TH5 workaround VLANs are required"

...

Dec 12 21:14:20.515 ERR Reconciler error controller=Fabricator controllerGroup=fabricator.githedgehog.com controllerKind=Fabricator Fabricator.name=default Fabricator.namespace=fab namespace=fab name=default reconcileID=b10590f4-aca8-4fc5-977c-1169b62076f7 err="enforcing fabricator and control install defaults: retrying create or update Fabricator/default: creating or updating Fabricator default: creating or updating object: admission webhook \"vfabricator.kb.io\" denied the request: TH5 workaround VLANs are required"

Timeline of Failure:

  1. 20:39:06: K8s successfully upgraded from v1.33.2 to v1.34.2
  2. 20:41:00: New fabricator-ctrl (v0.43.1) deployed with new validation webhook
  3. 20:41:17: Reconciliation loop starts failing immediately due to webhook validation
  4. 20:41:17 - 21:14:2: ~34 minutes of continuous reconciliation failures
  5. 21:14:18: Timeout - upgrade process cancelled
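
The length of the failure window can be checked directly from the two reconciler error lines quoted above. The sketch below is only an illustration: the helper name is made up, the sample lines are abbreviated copies of the real ones, and it assumes all timestamps fall on the same day.

```python
import re
from datetime import datetime

# "Dec 12 20:41:17.347 ERR message..." -> time of day, level, rest of line.
LINE_RE = re.compile(r"^\w{3}\s+\d+\s+(\d{2}:\d{2}:\d{2}\.\d+)\s+(\w+)\s+(.*)$")


def failure_window(lines, needle):
    """Return the time between the first and last ERR line containing `needle`.

    Only the time of day is parsed, so this assumes every line is from the
    same day. Returns None when no line matches.
    """
    stamps = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group(2) == "ERR" and needle in match.group(3):
            stamps.append(datetime.strptime(match.group(1), "%H:%M:%S.%f"))
    if not stamps:
        return None
    return max(stamps) - min(stamps)


# Abbreviated copies of the first and last error lines quoted above.
SAMPLE = [
    'Dec 12 20:41:17.347 ERR Reconciler error controller=Fabricator '
    'err="... denied the request: TH5 workaround VLANs are required"',
    'Dec 12 21:14:20.515 ERR Reconciler error controller=Fabricator '
    'err="... denied the request: TH5 workaround VLANs are required"',
]

if __name__ == "__main__":
    print(failure_window(SAMPLE, "TH5 workaround VLANs are required"))
```

For these two lines the window comes to a little over 33 minutes, in line with the rough figure in the timeline.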

The new validation webhook requires TH5 workaround VLANs to be configured in the Fabricator CR, but it seems to block any update to a Fabricator CR that lacks them, so every reconciliation attempt is denied.
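
The loop described above can be modelled in a few lines. This is a toy sketch, not the fabricator code: `FabricatorCR`, its `th5_vlans` field, and both validator functions are hypothetical names made up for illustration. The only things taken from the logs are the denial message and the fact that the reconciler's update was rejected for lacking TH5 VLANs.

```python
from dataclasses import dataclass, field


@dataclass
class FabricatorCR:
    """Hypothetical stand-in for the Fabricator CR; field names are assumptions."""
    name: str
    th5_vlans: list = field(default_factory=list)


def validate_strict(obj):
    """Model of the new check: deny any object without TH5 VLANs."""
    if not obj.th5_vlans:
        return "TH5 workaround VLANs are required"
    return None


def validate_disabled(obj):
    """Model of the check being commented out: admit everything."""
    return None


def reconcile(stored, validate, max_attempts=5):
    """Retry an update built from the stored object.

    Returns (succeeded, attempts). The update copies th5_vlans from the stored
    object, so a CR created before the field existed is submitted without them.
    """
    for attempt in range(1, max_attempts + 1):
        desired = FabricatorCR(name=stored.name, th5_vlans=list(stored.th5_vlans))
        if validate(desired) is None:
            return True, attempt
    return False, max_attempts


if __name__ == "__main__":
    legacy = FabricatorCR(name="default")  # CR without TH5 VLANs
    print(reconcile(legacy, validate_strict))
    print(reconcile(legacy, validate_disabled))
```

With the strict validator the update for the legacy object is denied on every attempt, which matches the repeated reconciler errors; with the check disabled the first attempt is admitted.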

@edipascale edipascale force-pushed the ema/populate-th5-vlans branch from 6fb1f36 to db7d4eb Compare December 14, 2025 14:03
@edipascale
Contributor Author

Thanks @pau-hedgehog. I wonder why this is not a problem when upgrading from 25.05?
I've removed the increase in CI times and added a commit which comments out the offending check. @Frostman we can revert it once we deprecate 25.04, unless you have a better solution?

@edipascale edipascale requested a review from Frostman December 14, 2025 19:31
@Frostman Frostman force-pushed the ema/populate-th5-vlans branch from db7d4eb to cec0d59 Compare December 14, 2025 22:10
@Frostman Frostman merged commit 8253693 into master Dec 14, 2025
31 checks passed
@Frostman Frostman deleted the ema/populate-th5-vlans branch December 14, 2025 23:32