Proposal: release a Supervisor + Collector Contrib container image #948
Generally, I support this, but did you think about an alternative where we just provide a good base Dockerfile that can be used as a basis?
@mowies I had the same thought about just providing a good base image instead, but it didn't make a lot of sense to me, because:
Making this process and the new artifact part of our release saves everyone a good amount of time setting up and maintaining their own isolated copies of it. Additionally, it fosters collaboration between the people working on the Supervisor so that we can all benefit from everyone's experience and expertise.
sounds good, I'm convinced 😄
I'm building a small PoC at #957 by copying the Supervisor binary over from its own container image; a rough sketch of that approach follows below. Even though this is the fastest way I could find to bring a PoC up, my preferred approach would be to build it from source or grab it from its latest build. I didn't go deeper into that approach for this PoC for two reasons:
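For illustration, here is a minimal multi-stage Dockerfile sketch of the copy approach. The image names, tags, and binary paths are assumptions for the sake of the example, not the actual contents of #957:

```dockerfile
# Sketch only: image names, tags, and paths below are assumptions.
# Stage 1: take the Supervisor binary from its own release image.
FROM ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-opampsupervisor:latest AS supervisor

# Stage 2: layer the Supervisor on top of the Contrib Collector image.
FROM otel/opentelemetry-collector-contrib:latest
COPY --from=supervisor /usr/local/bin/opampsupervisor /usr/local/bin/opampsupervisor
COPY supervisor.yaml /etc/otelcol-contrib/supervisor.yaml

# The Supervisor runs as PID 1 and starts/manages the Collector process.
ENTRYPOINT ["/usr/local/bin/opampsupervisor"]
CMD ["--config", "/etc/otelcol-contrib/supervisor.yaml"]
```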
We should probably have a Supervisor + Collector image for any distro that includes the opampextension.
@TylerHelmuth to me it doesn't make much sense to include the Supervisor in a distro other than Contrib, because it's part of the Contrib repository. But no strong opinion on this.
I would argue the k8s distro is a better Supervisor/Collector combo, since it is a production-ready distro (we don't recommend using Contrib in production) and because k8s is an image-centric environment.
I see. But the Supervisor might also be very useful for people who are not using k8s, so I think it makes sense to have it in other distros too, as you initially suggested.

This is a bit of a side topic, but I'm curious to know why Contrib is not recommended in production. The arguments currently in the distribution's README are basically size and security, but to me that reasoning is weak. I've never seen a scenario where Collector size (I guess binary size?) was a problem (maybe some tight IoT scenarios?), and security always depends a lot on your configuration (I can have a very safe Contrib config with the same amount of effort it takes to have a very safe k8s distro config). I wonder if the reality is exactly the opposite of the recommendation, with more people using the Contrib distro even in cases where the k8s distro is recommended, because Contrib is the most complete and maybe ultimately users care more about that. Not saying we should change this or anything, just thinking out loud.
We discussed this issue in the OpAMP Management SIG and wanted to post some follow-up from it. Some concerns were expressed regarding how Kubernetes would manage multiple processes and what that setup would look like. At Bindplane we've been pushing users to use the OpenTelemetry Operator instead of the Supervisor when dealing with container environments. In the case of Kubernetes, this is the preferred way to manage a process, instead of having a second binary (i.e. the Supervisor) do it. The Operator is built for managing OTel Collectors in Kubernetes and supports OpAMP (or support is at least planned).
For Kubernetes specifically, I believe the OpenTelemetry Operator will deploy the OpAMP-managed configuration as a ConfigMap. This allows the OpAMP server to have downtime without preventing the Operator from deploying new containers using the last known configuration. Without the Operator, you need to consider how the configuration will be persisted between container restarts, and be okay with new containers operating with some no-op configuration until the OpAMP server is available. We (Bindplane) have been operating OpAMP-based collectors in Kubernetes for several years now and have had to deal with these challenges while we work toward migrating to Operator-managed collectors. I just wanted to share our K8s perspective. Of course, not all containers run in a Kubernetes environment, but a lot of the same challenges will exist.
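To make the persistence concern concrete, here is a hedged sketch of one way to keep Supervisor state across container restarts without the Operator. The image name, storage path, and volume choice are assumptions for illustration only:

```yaml
# Illustrative pod spec; image name and paths are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: supervised-collector
spec:
  containers:
    - name: supervisor
      image: example.registry/supervised-contrib:latest  # hypothetical image
      volumeMounts:
        - name: supervisor-state
          # Assumed Supervisor storage directory (last remote config,
          # instance ID, downloaded packages).
          mountPath: /var/lib/otelcol/supervisor
  volumes:
    - name: supervisor-state
      # emptyDir survives container restarts within the pod; use a
      # PersistentVolumeClaim instead if state must outlive the pod.
      emptyDir: {}
```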
We always aim to run a single process per container, although I believe everyone has already had some experience with a container running multiple processes. It's not super complicated, particularly in this case, because the Supervisor is very lightweight. I will check the meeting notes later though; maybe there are valid concerns I'm not aware of.
Thanks for sharing this info, @dpaasman00. I also believe the OTel Operator is a good way forward. But we cannot leave behind all those who don't want to use the Operator or who aren't using k8s at all. So unless everyone agrees that we should stop the efforts toward the Collector Supervisor, I think we should keep working on it. To clarify, I'm proposing that we release the "supervised" images in addition to the normal ones.
You are 100% correct about these concerns, @jsirianni. To help mitigate this kind of problem, I wrote a proposal and an implementation for a Supervisor feature that lets you use a given configuration file as a first layer of configuration (lowest priority), in addition to the top-layer configuration file (highest priority) that we already have today. So you could keep a basic configuration in a ConfigMap that is the minimum for a Collector to be useful until it can receive the remote configuration from the OpAMP backend. It would look like this (with the changes of my implementation):

```yaml
agent:
  config_files:
    - $OPAMP_EXTENSION_CONFIG
    - $OWN_METRICS_CONFIG
    - $BUILTIN_CONFIG
    - base_config.yaml
    - $REMOTE_CONFIG
```

But if the remote configuration drifts significantly from the base configuration over time, it might not even be worth having a Collector run without it, because it could mess up how data looks in certain queries/dashboards. It's a tradeoff that some might decide to make and others won't.

Failure to talk to the OpAMP backend to get the latest remote configuration is a tricky subject, though. I don't think the OTel Operator can completely solve it if we take into consideration that certain remote configurations might only apply to certain managed Collectors based on their attributes (identifying and non-identifying attributes, for example) or their instance IDs, etc. It's complicated.
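As an illustration, the base_config.yaml in that list could be a minimal pipeline like the following sketch (the receiver and exporter choices, and the backend URL, are placeholders, not a recommendation):

```yaml
# Hypothetical minimal base configuration the Collector can run with
# until the remote configuration arrives from the OpAMP backend.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
exporters:
  otlphttp:
    endpoint: https://backend.example.com  # placeholder backend
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlphttp]
```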
My first thought here is that we should probably avoid publishing official images with both the Supervisor and the Collector in them, and for containerized environments instead work toward a "Supervisorless" model where the Collector receives configuration directly through an OpAMP extension. I have a few concerns in particular about running the Supervisor in containerized environments, particularly in Kubernetes:
Concerns 3 and 4 apply more generally to any environment the Supervisor runs in, but I think they become more complex when you run it in a containerized environment. Overall, I don't think there's any fundamental issue with running the Supervisor in a containerized environment if that's the desired deployment model, but shipping the Supervisor in its own container image forces users to answer a lot of these questions themselves for their own environment. I'm mostly concerned that we need to make recommendations to users for best practices when running their Collectors, and I'm not totally sure we can make a blanket suggestion for most users to run bundled Supervisor+Collector images like this. I'd be open to input from others on whether they have run images like this in a number of environments and have found it to be a straightforward deployment model.
They take information from whatever you want them to, and only if you want them to. We could, for instance, develop a health endpoint for the Supervisor that replies based on the health updates it gets from the Collector.
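If such an endpoint existed, wiring it into Kubernetes would be a standard probe. Everything in this sketch (the path and the port) is hypothetical, since no such Supervisor endpoint exists today:

```yaml
# Fragment of a container spec. Hypothetical: assumes a future Supervisor
# health endpoint; neither the path nor the port below is implemented today.
containers:
  - name: supervisor
    image: example.registry/supervised-contrib:latest  # hypothetical image
    livenessProbe:
      httpGet:
        path: /health
        port: 13100
      initialDelaySeconds: 5
      periodSeconds: 10
```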
Regarding "Upgrades of the Collector would have to be redone every time their container restarts": this is covered by the OpAMP spec, which says that an agent should download any package offered by the server if it doesn't already have that binary (see https://opentelemetry.io/docs/specs/opamp/#step-2). When a restart happens, if the binary download destination is ephemeral, the package is redownloaded; if it's persistent, no redownload happens. It's also worth noting that package management is not only for Collector upgrades: the spec considers the top-level package to be the Collector itself, but there's also the concept of sub-packages.
The persistent layer that stores these things doesn't need to live forever, though. I think it's totally fine to let all this information live within the lifetime of a container/pod, just like it might live for the lifetime of an EC2 instance. I would flip the question the other way around: what could the problems be? I personally don't foresee any.
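As a sketch of what tying that state to a chosen lifetime might look like in the Supervisor configuration (the endpoint and paths are placeholders, and the exact schema may differ from the current Supervisor release):

```yaml
# Sketch of a Supervisor configuration; endpoint and paths are placeholders.
server:
  endpoint: wss://opamp.example.com/v1/opamp
agent:
  executable: /usr/local/bin/otelcol-contrib
storage:
  # Last received remote config, instance ID, and downloaded packages
  # live here; back this path with a volume only if the state should
  # outlive the container.
  directory: /var/lib/otelcol/supervisor
```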
The top-level orchestrator becomes responsible for starting the Supervisor, and the Supervisor becomes responsible for managing the Collector. I think the separation of responsibilities is clear.
OTel releasing something does not mean it's a recommendation or best practice to run it blindly, as-is. For instance, there are many "alpha" or "development" stage components in the Collector (Contrib), and that doesn't mean we recommend users run them; at the same time, users might still run them if they want. Another example: we already release the Supervisor container image, but it's not meant to be run by users as-is, and it arguably goes against recommendations because of what it does. I believe that if we make it clear that the Supervisor is alpha and that a Supervisor+Collector container image is a two-process container with its own complications and tradeoffs, we should be fine releasing it. Potential users can then decide if they want to run it, accepting the complications it entails, just like deciding whether to run an alpha or beta component of the Collector.

Now, leaving the technical/documentation discussion aside a bit... given the conversations that are happening around the Supervisor, I have a few "meta" questions:
These are probably interesting questions for the OpAMP or Collector SIG calls.
Answering some of my own questions, according to what I see in the original Supervisor design doc and the issue proposing one:
Since we already release a container image with the Collector Supervisor (see #858), I think it makes sense to start releasing an image with the Supervisor + Collector Contrib.
First of all, the Supervisor itself is currently not useful without a Collector binary to start, as it cannot automatically (and safely) download a Collector to get started.
Second, I propose including Collector Contrib with it because the Supervisor is currently developed under the Contrib distribution.
Effectively, we could create a new distribution, maybe called `supervised-contrib`, for this purpose.