
Proposal: release a Supervisor + Collector Contrib container image #948


Open
douglascamata opened this issue May 8, 2025 · 14 comments

@douglascamata
Member

douglascamata commented May 8, 2025

Since we already release a container image with the Collector Supervisor (see #858), I think it makes sense to start releasing an image with the Supervisor + Collector Contrib.

First of all, the Supervisor itself is currently not useful without a Collector binary to start, as it cannot automatically (and safely) download a Collector on its own.

Second, I propose including a Collector Contrib binary with it because the Supervisor is currently developed under the Contrib distribution.

Effectively, we could create a new distribution, maybe called supervised-contrib, for this purpose.

@mowies
Member

mowies commented May 12, 2025

Generally, I support this, but did you think about an alternative where we just provide a good base Dockerfile that can be used as a basis?
Say, you have the Dockerfile with everything ready, and it just copies in a user-provided Collector binary. Would that be more useful to people?

@douglascamata
Member Author

douglascamata commented May 12, 2025

@mowies I had the same thought about just providing a good base image instead, but it didn't make a lot of sense to me, because:

  1. Everyone will have to do the exact same thing: copy a Collector (most likely Contrib) binary into the container image.
  2. Everyone will have to build a pipeline that does what is mentioned in 1, pushes the image to a container registry, and pulls it from there.
  3. Setting up this pipeline might be easy for some and not so easy for others, but it'll mean a lot of duplicated/disconnected effort for everyone wanting to use the Supervisor.

Making this process and the new artifact part of our release saves everyone a good amount of time setting up and maintaining their own isolated copies of it. Additionally, it fosters collaboration between the people working on the Supervisor so that we can all benefit from everyone's experience and expertise.
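
For illustration, here's a rough sketch of the kind of pipeline each user would otherwise have to maintain on their own (assuming a GitHub Actions setup; the workflow name, registry, image name, and trigger are purely illustrative):

name: build-supervised-collector   # illustrative workflow name
on:
  push:
    tags: ["v*"]
jobs:
  build-and-push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      # The Dockerfile would copy a Collector (Contrib) binary next to the Supervisor
      - uses: docker/build-push-action@v6
        with:
          context: .
          push: true
          tags: ghcr.io/example-org/supervised-contrib:${{ github.ref_name }}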

@mowies
Member

mowies commented May 12, 2025

sounds good, I'm convinced 😄

@douglascamata
Member Author

I'm building a small PoC at #957 by copying the Supervisor binary over from its own container image.

Even though this is the fastest way I could find of bringing a PoC up, my favorite approach would be to build it from source or grab it from its latest build. I didn't go deep into this approach for this PoC for a few reasons:

  1. Everything I could come up with sounded pretty hacky.
  2. It's still not clear to me what the best way forward is. There's an extra complication: it would be the first distribution with another binary besides the Collector.
  3. The Dockerfile solution sounds simple enough, which means it could become the de facto solution.

@TylerHelmuth
Member

We should probably have a supervisor + collector for any distro that includes the opampextension.

@douglascamata
Member Author

@TylerHelmuth to me it doesn't make much sense to include the Supervisor in a distro that's not the Contrib one, because it's part of the Contrib repository. But no strong opinions on this.

@TylerHelmuth
Member

I would argue the k8s distro is a better supervisor/collector combo since it is a production-ready distro (we don't recommend using contrib in production) and because k8s is an image-centric environment.

@douglascamata
Member Author

I would argue the k8s distro is a better supervisor/collector combo since it is a production-ready distro (we don't recommend using contrib in production) and because k8s is an image-centric environment.

I see. But the Supervisor might also be very useful for people that are not using k8s, so I think it makes sense to have it in other distros too as you initially suggested.

This is a bit of a side topic, but I'm curious to know why Contrib is not recommended in production. The arguments currently in the distribution's README are basically size and security, but to me the reasoning behind this is weak. I've never seen a scenario where Collector size (I guess binary size?) was a problem (maybe some tight IoT scenarios?), and security always depends a lot on your configuration (I can have a very safe Contrib config with the same amount of effort it takes to have a very safe k8s distro config).

I wonder if in reality it's exactly the opposite of the recommendation, with more people using the Contrib distro even in cases where the k8s distro is recommended, because Contrib is the most complete and maybe ultimately that's what users care most about.

Not saying we should change this or anything, just thinking out loud.

@dpaasman00

We discussed this issue in the OpAMP Management SIG and wanted to post some follow-up from it. Some concerns were expressed regarding how Kubernetes would manage multiple processes and what that setup would look like.

At Bindplane we've been pushing users to use the OpenTelemetry Operator instead of the Supervisor when dealing with container environments. In the case of Kubernetes, this is the preferred way to manage the process instead of having a second binary (i.e. the Supervisor) do it. The Operator is built for managing OTel Collectors in Kubernetes and supports OpAMP (or support is at least planned).

@jsirianni
Member

For Kubernetes specifically, I believe the OpenTelemetry Operator will deploy the OpAMP managed configuration as a configmap. This will allow the OpAMP server to have downtime without preventing the operator from deploying new containers utilizing the last known configuration. Without the Operator, you need to consider how the configuration will be persisted between container restarts, and be okay with new containers operating with some nop configuration until the OpAMP server is available. We (Bindplane) have been operating OpAMP based collectors in Kubernetes for several years now and have had to deal with these challenges while we work toward migrating to Operator managed collectors.

I just wanted to share our K8s perspective. Of course, not all containers are operating in a Kubernetes environment. A lot of the same challenges will exist, however.

@douglascamata
Member Author

douglascamata commented May 28, 2025

We discussed this issue in the OpAMP Management SIG and wanted to post some follow-up from it. Some concerns were expressed regarding how Kubernetes would manage multiple processes and what that setup would look like.

We always aim to run a single process per container, although I believe everyone has already had some experience with a container running multiple processes. It's not super complicated, particularly in this case, because the Supervisor is very lightweight. I will check the meeting notes later, though; maybe there are valid concerns I'm not aware of.

At Bindplane we've been pushing users to use the OpenTelemetry Operator instead of the Supervisor when dealing with container environments. In the case of Kubernetes, this is the preferred way to manage the process instead of having a second binary (i.e. the Supervisor) do it. The Operator is built for managing OTel Collectors in Kubernetes and supports OpAMP (or support is at least planned).

Thanks for sharing this info, @dpaasman00. I also believe the OTel Operator is a good way forward. But we cannot leave behind all those who don't want to use the Operator or who aren't using k8s at all. So unless everyone agrees that we should stop the efforts towards the Collector Supervisor, I think we should keep working on it. To clarify, I'm proposing that we release the "supervised" images in addition to the normal ones.

For Kubernetes specifically, I believe the OpenTelemetry Operator will deploy the OpAMP managed configuration as a configmap. This will allow the OpAMP server to have downtime without preventing the operator from deploying new containers utilizing the last known configuration. Without the Operator, you need to consider how the configuration will be persisted between container restarts, and be okay with new containers operating with some nop configuration until the OpAMP server is available. We (Bindplane) have been operating OpAMP based collectors in Kubernetes for several years now and have had to deal with these challenges while we work toward migrating to Operator managed collectors.

You are 100% correct on these concerns, @jsirianni. To help mitigate this kind of problem, I wrote a proposal and an implementation for a Supervisor feature that lets you use a given configuration file as a first layer of configuration (lowest priority), in addition to the top-layer configuration file (highest priority) that we already have today. So you could have a basic configuration in a ConfigMap that is the minimum for a Collector to start being useful until it can receive the remote configuration from the OpAMP backend. It would look like this (with the changes from my implementation):

agent:
  config_files:
  - $OPAMP_EXTENSION_CONFIG
  - $OWN_METRICS_CONFIG
  - $BUILTIN_CONFIG
  - base_config.yaml
  - $REMOTE_CONFIG
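
For illustration, base_config.yaml could be something as small as this (just a sketch; the components, endpoint, and pipeline are placeholders, not a recommendation):

receivers:
  otlp:
    protocols:
      grpc: {}
exporters:
  otlphttp:
    endpoint: https://backend.example.com   # placeholder backend
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlphttp]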

But if the remote configuration drifts significantly from the base configuration over time, it might not even be worth having a Collector running without it, because it could mess up how data looks in certain queries/dashboards. It's a tradeoff that some might decide to make and others won't.

Failure to talk to the OpAMP backend to get the latest remote configuration is a tricky subject, though. I don't think the OTel Operator can completely solve it if we take into consideration that certain remote configurations might only apply to certain managed Collectors based on their attributes (identifying and non-identifying attributes, for example), their instance IDs, etc. It's complicated.

@evan-bradley
Contributor

My first thought here is that we should probably avoid publishing official images with both the Supervisor and the Collector in them, and for containerized environments instead work toward a "Supervisorless" model where the Collector receives configuration through an OpAMP confmap.Provider for remote configuration. The Operator will still likely be the best tool for Kubernetes environments, but for situations where this isn't possible or desirable, I think allowing the Collector itself to speak OpAMP will be the best approach.

I have a few concerns in particular about running the Supervisor in containerized environments, particularly in Kubernetes:

  1. In Kubernetes, how do the liveness/readiness probes work? Do they take information from the Collector, or from the Supervisor?
  2. Most containerized environments expect that containers are immutable; version upgrades are done by upgrading the container image version. How does this interact with the package management facilities offered by the Supervisor? For example, a user who upgrades their Collector with the Supervisor will see the Collector version reverted when the container restarts.
  3. The Supervisor expects a persistence layer for storing configs, IDs, etc. Can we make it easy and obvious for users who want an ephemeral environment to take advantage of these features another way? Does the Supervisor run predictably in these environments without persistence or any adaptations?
  4. Kubernetes, Systemd, and other container orchestration tools fill a very similar role to the Supervisor in that they do process/container management. Can we clearly delineate the responsibilities of each when we run the Supervisor under one of these systems?

3 and 4 apply more generally to any environment the Supervisor runs in, but I think become more complex when you run it in a containerized environment.

Overall, I don't think there's any fundamental issue with running the Supervisor in a containerized environment if that's the desired deployment model, and shipping the Supervisor in its own container image forces users to answer a lot of these questions themselves for their own environment. I'm mostly concerned that we need to make recommendations to users for best practices when running their Collectors, and I'm not totally sure we can make a blanket suggestion for most users to run bundled Supervisor+Collector images like this.

I'd be open to input from others on whether they have run images like this in a number of environments and have found it to be a straightforward deployment model.

@douglascamata
Member Author

  • In Kubernetes, how do the liveness/readiness probes work? Do they take information from the Collector, or from the Supervisor?

They take information from whatever you want them to, and only if you want them to at all. We could, for instance, develop a health endpoint for the Supervisor that replies based on the health updates it gets from the Collector.
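
As a sketch of what that could look like today, the probes could simply target the Collector's health_check extension (assuming it's enabled and listening on its default port 13133; the Supervisor-side health endpoint mentioned above doesn't exist yet):

livenessProbe:
  httpGet:
    path: /
    port: 13133   # health_check extension default port
readinessProbe:
  httpGet:
    path: /
    port: 13133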

  • Most containerized environments expect that containers are immutable; version upgrades are done by upgrading the container image version. How does this interact with the package management facilities offered by the Supervisor? For example, a user who upgrades their Collector with the Supervisor will see the Collector version reverted when the container restarts.

Upgrades of the Collector would have to be redone every time the container restarts. This is covered by the OpAMP spec, which says that an agent should download any package offered by the server if it doesn't already have that binary (see https://opentelemetry.io/docs/specs/opamp/#step-2). When a restart happens, if the binary download destination is ephemeral the package is redownloaded; if it's persistent, no redownload happens.

It's also worth noting that package management is not necessarily only for Collector upgrades. The spec considers the top-level package to be the Collector itself, but there's also the concept of sub-packages.

  • The Supervisor expects a persistence layer for storing configs, IDs, etc. Can we make it easy and obvious for users who want an ephemeral environment to take advantage of these features another way? Does the Supervisor run predictably in these environments without persistence or any adaptations?

The persistence layer for storing these things doesn't need to live forever, though. I think it's totally fine to let all this information live within the lifetime of a container/pod, just like it might live for the lifetime of an EC2 instance.

I would flip the question the other way around: what could be the problems? I personally don't foresee any.
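
As a sketch, assuming the Supervisor's storage directory is pointed at a pod-local volume (the image name and mount path here are illustrative), this could be as simple as an emptyDir that lives and dies with the pod:

containers:
  - name: supervised-collector
    image: example.registry/supervised-contrib   # illustrative image name
    volumeMounts:
      - name: supervisor-storage
        mountPath: /var/lib/otelcol/supervisor   # wherever the Supervisor's storage directory points
volumes:
  - name: supervisor-storage
    emptyDir: {}   # persists across container restarts, discarded with the pod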

  • Kubernetes, Systemd, and other container orchestration tools fill a very similar role to the Supervisor in that they do process/container management. Can we clearly delineate the responsibilities of each when we run the Supervisor under one of these systems?

The top-level orchestrator becomes responsible for starting the Supervisor, and the Supervisor becomes responsible for managing the Collector. I think the responsibility separation is clear.

I'm mostly concerned that we need to make recommendations to users for best practices when running their Collectors, and I'm not totally sure we can make a blanket suggestion for most users to run bundled Supervisor+Collector images like this.

OTel releasing something does not mean it's a recommendation or best practice to run it blindly, as-is. For instance, there are many "alpha" or "dev" stage components of the Collector (Contrib) out there, and that doesn't mean we recommend that users use them. At the same time, users might still run them if they want. Another example: we release the Supervisor container image, but it's not meant to be run by users as-is, and it already goes against recommendations because of what it does.

I believe that if we make it clear that the Supervisor is alpha and that a Supervisor+Collector container image is a two-process container with its own complications and tradeoffs, we should be fine releasing it. Potential users will then decide whether they want to run it, accepting the complications it entails, just like they analyze whether they should run an alpha or beta component of the Collector.

Now, leaving the technical/documentation discussion behind a bit... given the conversations that are happening around the Supervisor, I have a few "meta" questions:

  • If there are so many concerns regarding the Supervisor and its interactions with other supervisor layers or orchestration systems, why was the project started and added into contrib? Didn't we think about this before starting to work on it, adding it to the contrib repo, and starting to release it?
  • Should a discussion be started to decide what's the future of the Supervisor? Should we even keep adding new features to it or should we focus on adding what we can to the Collector itself?

These are probably interesting questions for the OpAMP or Collector SIG calls.

@douglascamata
Member Author

Answering some of my own questions, according to what I see in the original Supervisor design doc and the issue proposing one:

  • It doesn't contain any discussion on how the Supervisor would interact with container orchestration and other supervisors (e.g. systemd). This might've been talked about, but without a written record.
  • One of the initial goals, having the Supervisor also be available as a Collector extension, was never achieved. It has always been available only in the "external" mode. Maybe what was initially planned as an "extension" became the opamp-go library?
