Description
This was identified during the investigation of #4289.
The containerVM in question is not setting the session.started flag, so the property collector times out. Why the flag is not being set is out of scope for this issue.
The problem is how we handle that failure: we Unbind the network config, but we do not power down the container. This means that if the container process didn't start successfully, we are left with a useless containerVM consuming resources. If the cause was a network hiccup, VC queuing, or some other control plane issue, then we've just disconnected a functioning containerVM from the network it requires.
Solutions:
- Do not unbind on the error path after power on has been confirmed
- Ensure that the VM is powered down if it fails to report status by the deadline
In either case, the Unbind can be triggered by the VM power off instead of being called explicitly.
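A minimal sketch of the second option, with the Unbind hanging off power off rather than the error path. Everything here is hypothetical and for illustration only: `containerVM`, `powerOn`, `powerOff`, `unbind`, `start`, and `errStartTimeout` are stand-ins, not the actual VIC types or functions, and the 50ms deadline only keeps the demo short.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// containerVM is a hypothetical stand-in for the real cVM handle.
type containerVM struct {
	poweredOn bool
	bound     bool
	// started is closed when the guest reports session.started.
	started chan struct{}
}

// powerOn marks the VM as running with its network config bound.
func (vm *containerVM) powerOn() {
	vm.poweredOn = true
	vm.bound = true
}

// unbind releases the network config.
func (vm *containerVM) unbind() { vm.bound = false }

// powerOff is the only caller of unbind, so a VM is never left
// running without its network or stopped while still holding it.
func (vm *containerVM) powerOff() {
	vm.poweredOn = false
	vm.unbind()
}

var errStartTimeout = errors.New("cVM did not report session.started by the deadline")

// start powers the VM on and waits for session.started. If the deadline
// channel fires first, the VM is powered down, which in turn unbinds it.
func start(vm *containerVM, deadline <-chan struct{}) error {
	vm.powerOn()
	select {
	case <-vm.started:
		return nil
	case <-deadline:
		vm.powerOff()
		return errStartTimeout
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()

	// This VM never reports session.started, so the deadline wins.
	vm := &containerVM{started: make(chan struct{})}
	err := start(vm, ctx.Done())
	fmt.Println("error:", err)
	fmt.Println("poweredOn:", vm.poweredOn, "bound:", vm.bound)
}
```

Because `unbind` is only reachable through `powerOff`, the two failure modes described above (a dead VM still consuming resources, or a live VM cut off from its network) cannot occur in this sketch.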
Notes:
This impacts cVMs that start after the current 3 minute timeout has expired. The timeout for cVM start was added to address failure scenarios, and because docker run -it inherits the awkward blocking behaviour of the standard docker client when attach is used (interception of Ctrl-C, et al), meaning it cannot be easily escaped. The correct solution to this is:
- address the problem behaviour in docker client that results in timeouts being used
- ensure that cVMs either come up cleanly or report failure and shut themselves down
- only unbind network addresses on power off
This likely means changing the power state operations to be async and then waiting on events (either the expected status change or an error). I've updated the estimate to encompass a possible shift to async power operations, but it does not include raising a PR for the signal forwarding behaviour of docker client.
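As a rough illustration of what waiting on events could look like, here is a sketch in which the start path blocks until it sees either the expected status change or an error, and the address is released only by a power-off event. The event kinds, `awaitStart`, `bindings`, the cVM id, and the address are all hypothetical names invented for this example; the real event plumbing would differ.

```go
package main

import (
	"errors"
	"fmt"
)

// eventKind enumerates the hypothetical lifecycle events a cVM can emit.
type eventKind int

const (
	sessionStarted eventKind = iota
	startFailed
	poweredOff
)

type event struct {
	kind eventKind
	err  error
}

var (
	errPoweredOff   = errors.New("cVM powered off before reporting session.started")
	errStreamClosed = errors.New("event stream closed before start was reported")
)

// awaitStart blocks until the cVM reports the expected status change
// or an error, instead of relying on a fixed timeout.
func awaitStart(events <-chan event) error {
	for ev := range events {
		switch ev.kind {
		case sessionStarted:
			return nil
		case startFailed:
			return ev.err
		case poweredOff:
			return errPoweredOff
		}
	}
	return errStreamClosed
}

// bindings maps a cVM id to its bound network address.
type bindings map[string]string

// handle releases an address only in response to a power-off event.
func (b bindings) handle(id string, ev event) {
	if ev.kind == poweredOff {
		delete(b, id)
	}
}

func main() {
	b := bindings{"cvm-1": "172.16.0.2"}

	// Simulate a guest whose process fails to launch and which then
	// shuts itself down.
	events := make(chan event, 2)
	events <- event{kind: startFailed, err: errors.New("process failed to launch")}
	events <- event{kind: poweredOff}
	close(events)

	fmt.Println("start result:", awaitStart(events))

	// The power-off event, not the start error, releases the address.
	for ev := range events {
		b.handle("cvm-1", ev)
	}
	fmt.Println("bindings left:", len(b))
}
```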