
fix(azuremonitorreceiver): Azure Monitor receiver should not produce gaps in data points for PT1M time grains #37342


Merged

Conversation

@an-mmx (Contributor) commented Jan 20, 2025

Description

The current implementation of last-metrics-fetch tracking is highly time-sensitive because it relies on time.Now(). That sensitivity causes data points to be lost due to jitter in 1-minute intervals. This change implements a more robust comparison based on the minute of the last metrics fetch.
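
Roughly, the idea is something like the following sketch (illustrative names only, not the receiver's actual code), assuming the decision is based on which time-grain window the timestamps fall into rather than on the exact elapsed wall-clock time:

```go
package main

import (
	"fmt"
	"time"
)

// shouldFetch reports whether a new fetch is due for the given time grain.
// Comparing grain-truncated timestamps is robust to jitter: the decision
// depends only on which grain window "now" falls into, not on the exact
// number of seconds elapsed since the last fetch.
func shouldFetch(lastFetch, now time.Time, grain time.Duration) bool {
	return now.Truncate(grain).After(lastFetch.Truncate(grain))
}

func main() {
	grain := time.Minute // PT1M

	last := time.Date(2025, 1, 20, 14, 56, 59, 900_000_000, time.UTC) // 14:56:59.9
	now := time.Date(2025, 1, 20, 14, 57, 59, 800_000_000, time.UTC)  // 14:57:59.8

	// Only 59.9s have elapsed, so a naive "now.Sub(last) >= grain" check would
	// skip this scrape and leave a one-minute gap; the window comparison does not.
	fmt.Println(now.Sub(last) >= grain)        // false
	fmt.Println(shouldFetch(last, now, grain)) // true
}
```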

Link to tracking issue

Fixes #37337

Testing

New coverage was introduced in the unit tests.

Documentation

No documentation added

@an-mmx an-mmx requested a review from a team as a code owner January 20, 2025 14:56
@an-mmx an-mmx requested a review from bogdandrutu January 20, 2025 14:56
@github-actions github-actions bot requested a review from nslaughter January 20, 2025 14:56
@an-mmx an-mmx force-pushed the fix/azure-monitor-receiver-time-grain branch 6 times, most recently from ee3a095 to 9783d0e on January 22, 2025 19:10

github-actions bot commented Feb 6, 2025

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Feb 6, 2025
@an-mmx (Contributor, Author) commented Feb 6, 2025

/ping

@celian-garcia (Member) commented

That's very interesting. Did you observe the same thing with other time grains?
I feel like your fix would only address PT1M and not the others.

We have a fork (using the batch metrics API instead of the metrics API in order to work around a rate-limiting problem, anyway..) and in that fork we did something different: we go back 4 times the duration of the time grain.
So for PT1M, that's 4 * 60 = 240 seconds into the past. This is to ensure that we receive some data, because if we only go back 60 seconds we sometimes get nothing, and nothing, and nothing. Of course it's not exactly the same API, but maybe the problem is the same.
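
Roughly, that workaround looks like the sketch below (illustrative only; grainDurations and queryWindow are assumed helper names, not code from this PR or from the fork):

```go
package main

import (
	"fmt"
	"time"
)

// grainDurations maps a few ISO 8601 time grains used by Azure Monitor to Go
// durations; listed here only for illustration.
var grainDurations = map[string]time.Duration{
	"PT1M": time.Minute,
	"PT5M": 5 * time.Minute,
	"PT1H": time.Hour,
}

// queryWindow returns the [start, end] interval to request, going back 4x the
// time grain so that some data points are available even when the most recent
// ones have not been published yet.
func queryWindow(now time.Time, grain string) (start, end time.Time) {
	d := grainDurations[grain]
	return now.Add(-4 * d), now
}

func main() {
	start, end := queryWindow(time.Now().UTC(), "PT1M") // 240 seconds back for PT1M
	fmt.Printf("querying from %s to %s\n", start.Format(time.RFC3339), end.Format(time.RFC3339))
}
```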

So in the end, we're very interested in seeing some before/after figures. And I think I will try your implementation with our fork.

@github-actions github-actions bot removed the Stale label Feb 7, 2025
@an-mmx (Contributor, Author) commented Feb 10, 2025

> .. Did you observe the same thing with other time grains? I feel like your fix would only address PT1M and not the others.

Following the scraper logic and our changes, it should work well for other time grains too. The root cause of the issue was time drift, caused by various factors (such as request duration and processing time), combined with the way it was compared. This change mitigates the impact of that drift.
Could you please share your concerns regarding other time grains?
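
To illustrate why the grain-window comparison is not specific to PT1M, here is a small standalone sketch (illustrative only, not the receiver's code): a few hundred milliseconds of drift make a naive elapsed-time check skip a window for any grain, while comparing grain-truncated timestamps still triggers the fetch.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	grains := map[string]time.Duration{
		"PT1M": time.Minute,
		"PT5M": 5 * time.Minute,
		"PT1H": time.Hour,
	}
	for name, grain := range grains {
		last := time.Date(2025, 2, 10, 12, 0, 0, 500_000_000, time.UTC) // 12:00:00.5
		now := last.Add(grain - 200*time.Millisecond)                   // next scrape, slightly early

		naive := now.Sub(last) >= grain                              // false: window skipped
		windowed := now.Truncate(grain).After(last.Truncate(grain))  // true: fetch still happens

		fmt.Printf("%s: naive=%v windowed=%v\n", name, naive, windowed)
	}
}
```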

> We have a fork (using the batch metrics API instead of the metrics API in order to work around a rate-limiting problem, anyway..) and in that fork we did something different: we go back 4 times the duration of the time grain. So for PT1M, that's 4 * 60 = 240 seconds into the past. This is to ensure that we receive some data, because if we only go back 60 seconds we sometimes get nothing, and nothing, and nothing. Of course it's not exactly the same API, but maybe the problem is the same.

Have you checked whether these gaps also appear in Azure Application Insights? Azure doesn't store null-value data points, so that might be the case: https://learn.microsoft.com/en-us/azure/azure-monitor/essentials/metrics-troubleshoot#chart-shows-dashed-line

> So in the end, we're very interested in seeing some before/after figures. And I think I will try your implementation with our fork.

Sure.
We collect the Azure Event Hub IncomingMessages metric. Here you can see a line representing the count of data points collected per hour before the fix, filtered by a single unique dimension. We can observe that we collect fewer than 60 data points per hour, which isn't what we expect.
[Screenshot 2025-02-10 at 16:23:04: data points collected per hour before the fix]

The next screenshot was taken after the fix and is built on the same query. We can observe that we now reach 60 data points per hour. We can also observe some "spikes"; these are the result of otel-collector restarts, but they aren't a problem for us.
[Screenshot 2025-02-10 at 16:23:17: data points collected per hour after the fix]

@celian-garcia (Member) commented

Well, OK, I checked twice and our fork using the getBatch metrics API is completely different in terms of interface. We make a batch request giving a start and end time, whereas the classic metrics API doesn't even have this notion; it just takes the metrics at a time T. So what you did is perfect and indeed seems to largely mitigate the issue.

LGTM in the end. I would like to approve, but I've only been an owner of this code for a week and I'm not sure how I can approve it :D

@an-mmx an-mmx force-pushed the fix/azure-monitor-receiver-time-grain branch from 9783d0e to 58fe5a0 on February 14, 2025 13:29
@an-mmx (Contributor, Author) commented Feb 14, 2025

Ugh, I've rebased onto main and lots of tests failed :/

@celian-garcia (Member) left a comment

Hoping that it has some weight here, I'm approving these changes. 🙏
I was added to CODEOWNERS after this PR was created :)

@an-mmx an-mmx force-pushed the fix/azure-monitor-receiver-time-grain branch from 58fe5a0 to 562fe13 on February 14, 2025 15:14
@crobert-1 (Member) commented

> Hoping that it has some weight here, I'm approving these changes.

@celian-garcia: You're a code owner now, so it definitely counts. Really appreciate your input here 🙂

@crobert-1 crobert-1 added the "ready to merge" label (Code review completed; ready to merge by maintainers) Feb 17, 2025
@songy23 songy23 merged commit 848f0c0 into open-telemetry:main Feb 18, 2025
173 checks passed
@github-actions github-actions bot added this to the next release milestone Feb 18, 2025
yiquanzhou added a commit to dash0hq/opentelemetry-collector-contrib that referenced this pull request Feb 18, 2025
* main: (111 commits)
  fix(azuremonitorreceiver): Azure Monitor receiver should not produce gaps in data points for PT1M time grains (open-telemetry#37342)
  fix(deps): update module sigs.k8s.io/controller-runtime to v0.20.2 (open-telemetry#37996)
  fix(deps): update module github.com/hashicorp/consul/api to v1.31.2 (open-telemetry#38031)
  [processor/resourcedetection] add instructions for recommended use of the dynatrace detector (open-telemetry#37962)
  fix(deps): update module github.com/go-sql-driver/mysql to v1.9.0 (open-telemetry#38007)
  fix(deps): update module google.golang.org/api to v0.221.0 (open-telemetry#38027)
  prometheusreceiver: deprecate start time adjustment (open-telemetry#37879)
  fix(deps): update module github.com/tencentcloud/tencentcloud-sdk-go/tencentcloud/common to v1.0.1100 (open-telemetry#37995)
  chore(deps): update golang docker tag to v1.24 (open-telemetry#37997)
  fix(deps): update all github.com/aws packages (open-telemetry#37983)
  chore(deps): update prom/prometheus docker tag to v3.2.0 (open-telemetry#37998)
  fix(deps): update kubernetes packages to v0.32.2 (open-telemetry#38004)
  fix(deps): update module github.com/clickhouse/clickhouse-go/v2 to v2.32.1 (open-telemetry#38006)
  fix(deps): update module github.com/google/go-github/v69 to v69.2.0 (open-telemetry#38014)
  fix(deps): update module github.com/sap/go-hdb to v1.13.3 (open-telemetry#38021)
  fix(deps): update module go.etcd.io/bbolt to v1.4.0 (open-telemetry#38024)
  [exporter/stefexporter] Fix a context cancellation bug in STEF exporter (open-telemetry#37944)
  fix(deps): update module github.com/spf13/cobra to v1.9.1 (open-telemetry#38023)
  fix(deps): update module github.com/envoyproxy/go-control-plane/envoy to v1.32.4 (open-telemetry#37990)
  fix(deps): update module github.com/hashicorp/consul/api to v1.31.1 (open-telemetry#37991)
  ...
@an-mmx an-mmx deleted the fix/azure-monitor-receiver-time-grain branch February 18, 2025 19:48
yiquanzhou added a commit to dash0hq/opentelemetry-collector-contrib that referenced this pull request Feb 19, 2025
Labels: ready to merge, receiver/azuremonitor
Successfully merging this pull request may close: Azure Monitor receiver doesn't honor time grain (#37337)