Conversation

@balamanova
Contributor

@balamanova balamanova commented Dec 9, 2025

Description

Added OpenTelemetry metrics for X.509 certificate refresh events in the cert_refresher library.

Implementation follows SIA OTel pattern

This implementation mirrors the existing SIA (Go) OTel metrics pattern from libs/go/sia/otel/metricset.go:

Metrics (aligned with SIA)

This PR (Java) | SIA (Go)
athenz_cert_refresher.refresh.result_total{function,result} | sia.agent_command.result_total{function,result}
athenz_cert_refresher.service_cert.validity.remaining_secs{name} | sia.service_cert.validity.remaining_secs{cname}
athenz_cert_refresher.refresh.result_last_timestamp{function,result} | New: tracks when the context was last updated
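
As a rough illustration of what a validity.remaining_secs value represents, the sketch below computes the seconds left until a certificate's notAfter time using only the JDK. The class and method names (CertValiditySketch, remainingSecs) are hypothetical and are not taken from the PR; the actual emitter may compute the value differently.

```java
import java.security.cert.X509Certificate;
import java.time.Duration;
import java.time.Instant;

final class CertValiditySketch {

    // Seconds from 'now' until 'notAfter'; negative once the certificate has expired.
    static long remainingSecs(Instant notAfter, Instant now) {
        return Duration.between(now, notAfter).getSeconds();
    }

    // Convenience overload for a loaded certificate
    // (X509Certificate.getNotAfter() returns a java.util.Date).
    static long remainingSecs(X509Certificate cert, Instant now) {
        return remainingSecs(cert.getNotAfter().toInstant(), now);
    }

    public static void main(String[] args) {
        // Invented instants, exactly one day apart.
        Instant now = Instant.parse("2025-12-09T00:00:00Z");
        Instant notAfter = Instant.parse("2025-12-10T00:00:00Z");
        System.out.println(remainingSecs(notAfter, now)); // prints 86400
    }
}
```

Passing the current time in as a parameter, rather than calling Instant.now() inside the method, keeps the calculation easy to unit-test.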

Attributes (same as SIA)

  • function - identifies the operation (e.g., "cert_refresh")
  • result - "success" or "failure" (same values as SIA)
  • name - certificate subject name (similar to SIA's cname)

Configuration

  • Disable with: -Dathenz.cert_refresher.otel_disabled=true
  • Component name: -Dathenz.cert_refresher.otel_component_name=my-service

Contribution Checklist:

  • The pull request does not introduce any breaking changes
  • I have read the contribution guidelines.
  • Create an issue and link to the pull request.

Attach Screenshots (Optional)

@gemini-code-assist
Contributor

Summary of Changes

Hello @balamanova, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the cert_refresher library by integrating OpenTelemetry for comprehensive monitoring of X.509 certificate refresh operations. The changes introduce new metrics to track the outcomes of certificate reloads (success or failure) and provide real-time insights into the remaining validity and expiry timestamps of the currently loaded certificates, thereby improving the operational visibility and reliability of certificate management within Athenz.

Highlights

  • OpenTelemetry Integration: Introduces OpenTelemetry (OTel) metrics to the cert_refresher library for improved observability of certificate management.
  • New Metric Emitter: Adds OpenTelemetryCertReloadEventEmitter to manage and emit various certificate refresh and validity metrics.
  • Configurable Metrics: Allows disabling OTel metrics via a system property (athenz.cert_refresher.otel_disabled) and configuring the component name.
  • Enhanced Certificate Reload Logic: Modifies KeyRefresher to record successful/failed certificate reloads and export certificate validity metrics using the new OTel emitter.
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces OpenTelemetry metrics for X.509 certificate refresh events by adding a new OpenTelemetryCertReloadEventEmitter class and integrating it into the KeyRefresher. The implementation is solid, adding valuable observability. My review includes several suggestions for the new OpenTelemetryCertReloadEventEmitter class to improve metric consistency, code clarity, and the utility of the emitted metrics. These changes will make the new telemetry data more robust and easier to consume.

@balamanova balamanova force-pushed the ATHENS-8722-x509_otel branch 2 times, most recently from bc60e29 to c9e32ba on December 9, 2025 17:46
ATHENS-8722 adding cert refresh metrics

Signed-off-by: abalamanova <[email protected]>
@balamanova balamanova force-pushed the ATHENS-8722-x509_otel branch from c9e32ba to 522931a on December 9, 2025 17:46
@balamanova balamanova requested a review from psasidhar December 11, 2025 18:07
        .build();

    refreshResultCounter.add(1, attrs);
    resultLastTimestampGauge.set(timestamp, attrs);
Contributor

Do we need to send the timestamp explicitly? Won't it automatically be available with refreshResultCounter?

Contributor Author

@balamanova balamanova Dec 12, 2025

I'm using the OTel Metrics API (counter + gauge), not the OTel Events/Logs API, and Prometheus is a pull-based time-series database.

Prometheus stores only scrape timestamps, not event timestamps. I can detect that a refresh happened using increase(), but the time is approximate (accurate only to within one scrape interval).

The timestamp gauge records the exact time at which the refresh occurred.
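
To make the scrape-interval point concrete, here is a minimal sketch under simplifying assumptions: scrapes are assumed to land on exact multiples of the interval (real Prometheus scrapes are offset per target), and the event time and interval are invented numbers. The class and method names are hypothetical. It contrasts when a counter increment first becomes visible with the value a timestamp gauge would carry.

```java
final class ScrapeTimestampSketch {

    // Earliest scrape time at or after the event, assuming scrapes at 0, interval, 2*interval, ...
    static long firstScrapeAtOrAfter(long eventSecs, long scrapeIntervalSecs) {
        long remainder = Math.floorMod(eventSecs, scrapeIntervalSecs);
        return remainder == 0 ? eventSecs : eventSecs + (scrapeIntervalSecs - remainder);
    }

    public static void main(String[] args) {
        long refreshAt = 1017L; // invented event time, in seconds
        long interval = 15L;    // invented scrape interval, in seconds

        // What a counter reveals: the first scrape that shows the increment.
        long counterSeenAt = firstScrapeAtOrAfter(refreshAt, interval);
        // What a timestamp gauge reveals: the value that was set, i.e. the event time itself.
        long gaugeValue = refreshAt;

        System.out.println("counter first visible at " + counterSeenAt); // 1020
        System.out.println("gauge reports " + gaugeValue);               // 1017
    }
}
```

In this model the counter-derived time is late by at most one scrape interval; whether that precision matters for alerting is the question the thread is debating.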

Collaborator

There is no need to generate a timestamp attribute. The metric already tells you that there is a failure in the given time period; the exact timestamp is not really needed and will not be used.

@balamanova balamanova changed the title from "Athens 8722 x509 otel" to "x509 otel certificate refresh events in cert_refresher library." Dec 12, 2025
@balamanova balamanova force-pushed the ATHENS-8722-x509_otel branch from 9b54ae6 to 8adce7d on December 12, 2025 20:38
     */
    private OpenTelemetryCertReloadEventEmitter initOtelMetrics() {
        String otelDisabledProp = System.getProperty(PROP_OTEL_DISABLED);
        boolean otelDisabled = "true".equalsIgnoreCase(otelDisabledProp);
Collaborator

The default here should be false, i.e. the metrics should be off unless explicitly enabled. We can't introduce new functionality that could cause problems and generate more metrics if nobody is looking at them.

Collaborator

Instead of using the literal "true" (or "false"), please define a static string constant and use that. Alternatively, you can convert the value to a Boolean and compare against Boolean.TRUE/Boolean.FALSE.
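
One way the two comments above could be addressed together is sketched below, using only the JDK. The opt-in property name athenz.cert_refresher.otel_enabled and the constant and class names are assumptions for illustration; the PR as submitted defines athenz.cert_refresher.otel_disabled, and the final naming is up to the author.

```java
final class OtelToggleSketch {

    // Hypothetical opt-in property; the PR itself defines athenz.cert_refresher.otel_disabled.
    static final String PROP_OTEL_ENABLED = "athenz.cert_refresher.otel_enabled";
    // Static default instead of an inline "false" literal.
    static final String PROP_OTEL_ENABLED_DEFAULT = "false";

    // Metrics stay off unless the property is explicitly set to "true" (case-insensitive).
    static boolean isOtelEnabled() {
        return Boolean.parseBoolean(
                System.getProperty(PROP_OTEL_ENABLED, PROP_OTEL_ENABLED_DEFAULT));
    }

    public static void main(String[] args) {
        // Run with -Dathenz.cert_refresher.otel_enabled=true to switch the result.
        System.out.println("otel enabled: " + isOtelEnabled());
    }
}
```

Boolean.parseBoolean treats null and any value other than "true" (ignoring case) as false, so an unset or mistyped property leaves metrics disabled.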

    //Signal key change event
    if (keyRefresherListener != null) {
        keyRefresherListener.onKeyChangeEvent();
        try {
Collaborator

Why did we add a new try/catch block here? We already have one in place where we log any errors.

<groupId>io.opentelemetry</groupId>
<artifactId>opentelemetry-api</artifactId>
<version>${opentelemetry.version}</version>
<optional>true</optional>
Collaborator

Have we verified that if we enable OTel metrics and the jars are not available, the code works as expected without throwing any errors?
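
A common way to guard an optional dependency is to probe for one of its classes before touching any code that references it. The sketch below is an assumption about how such a guard could look, not what the PR does; the class and method names are hypothetical. io.opentelemetry.api.OpenTelemetry is the entry-point interface of the opentelemetry-api artifact.

```java
final class OptionalDependencySketch {

    // True if the named class can be loaded; false when the jar is absent or fails to link.
    static boolean isClassPresent(String className) {
        try {
            // 'false' avoids running static initializers during the probe.
            Class.forName(className, false, OptionalDependencySketch.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        boolean otelAvailable = isClassPresent("io.opentelemetry.api.OpenTelemetry");
        // The caller would construct the emitter only when the API is present.
        System.out.println(otelAvailable ? "emit OTel metrics" : "fall back to no-op");
    }
}
```

Catching LinkageError as well as ClassNotFoundException covers NoClassDefFoundError, which is what the JVM throws when a class is found but one of its own dependencies is missing.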

3 participants