-
Notifications
You must be signed in to change notification settings - Fork 12.8k
chore: adds metrics for failed app reasons #37133
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
|
Looks like this PR is not ready to merge, because of the following issues:
Please fix the issues and try again If you have any trouble, please check the PR guidelines |
|
|
Note Other AI code review bot(s) detectedCodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review. WalkthroughAdds per-app failure reporting: a new Changes
Sequence Diagram(s)sequenceDiagram
autonumber
actor Scheduler as Scheduler/Cron
participant Collector as collectMetrics
participant Stats as getAppsStatistics
participant Prom as prom-client
Scheduler->>Collector: setPrometheusData()
Collector->>Stats: _getAppsStatistics()
Stats-->>Collector: { appsFailed: [{name,id,reason},...], totalInstalled, totalActive, ... }
rect rgba(200,235,255,0.25)
note right of Collector: update Prometheus metrics
Collector->>Prom: totalFailed.set(appsFailed.length)
Collector->>Prom: appsFailedReason.reset()
loop per failed app
Collector->>Prom: appsFailedReason.labels(name,id,reason).set(1)
end
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Potential focus areas:
Suggested labels
Suggested reviewers
Poem
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
📜 Recent review detailsConfiguration used: CodeRabbit UI Review profile: CHILL Plan: Pro Disabled knowledge base sources:
📒 Files selected for processing (1)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
🔇 Additional comments (1)
Tip 📝 Customizable high-level summaries are now available in beta!You can now customize how CodeRabbit generates the high-level summary in your pull requests — including its content, structure, tone, and formatting.
Example instruction:
Note: This feature is currently in beta for Pro-tier users, and pricing will be announced later. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## develop #37133 +/- ##
===========================================
- Coverage 68.97% 67.07% -1.91%
===========================================
Files 3359 3458 +99
Lines 114214 114874 +660
Branches 20535 21097 +562
===========================================
- Hits 78784 77054 -1730
- Misses 33335 35771 +2436
+ Partials 2095 2049 -46
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
|
@d-gubert do you think there is some other info we'd like to include in the metrics? Might be a good time now since we're adding this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (1)
packages/core-typings/src/IStats.ts (1)
159-160: Consider removingtotalFailedto avoid redundancy.The interface now has both
totalFailed(line 159) andappsFailed(line 160). According to the AI summary,totalFailedis derived fromappsFailed.lengthin other parts of the codebase. Having both fields in the type definition creates redundancy and potential for data inconsistency if they're not kept in sync.Consider one of the following approaches:
- Remove
totalFailedand derive it where needed fromappsFailed.length- Make the relationship explicit by documenting that
totalFailedmust equalappsFailed.length- If
totalFailedserves a different purpose (e.g.,falseindicates a specific state), document this distinction clearly
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Jira integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
packages/core-typings/src/IStats.ts(2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: 📦 Build Packages
- GitHub Check: CodeQL-Build
- GitHub Check: CodeQL-Build
6e56a81 to
e0e724e
Compare
e0e724e to
b840490
Compare
b840490 to
62a2841
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
apps/meteor/app/metrics/server/lib/collectMetrics.ts (1)
50-52: Consider explicit label assignment for clarity.The spread operator
{ ...app }works correctly here since the object shape matches the expected labels. However, for better maintainability and explicit intent, consider destructuring or explicitly passing the labels.Apply this diff to make the label assignment more explicit:
- for (const app of appsFailed) { - metrics.appsFailedReason.set({ ...app }, 1); + for (const { name, error } of appsFailed) { + metrics.appsFailedReason.set({ name, error }, 1); }This makes it immediately clear which properties are being used as labels and prevents unintended properties from being spread if the object structure changes.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Jira integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (4)
apps/meteor/app/metrics/server/lib/collectMetrics.ts(1 hunks)apps/meteor/app/metrics/server/lib/metrics.ts(1 hunks)apps/meteor/app/statistics/server/lib/getAppsStatistics.ts(7 hunks)packages/core-typings/src/IStats.ts(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- packages/core-typings/src/IStats.ts
🧰 Additional context used
🧬 Code graph analysis (2)
apps/meteor/app/metrics/server/lib/collectMetrics.ts (2)
apps/meteor/app/statistics/server/lib/getAppsStatistics.ts (1)
getAppsStatistics(89-89)apps/meteor/app/metrics/server/lib/metrics.ts (1)
metrics(5-248)
apps/meteor/app/statistics/server/lib/getAppsStatistics.ts (1)
packages/core-typings/src/IStats.ts (1)
IStats(22-277)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: 📦 Build Packages
- GitHub Check: CodeQL-Build
- GitHub Check: CodeQL-Build
🔇 Additional comments (4)
apps/meteor/app/metrics/server/lib/metrics.ts (1)
202-206: LGTM! Gauge definition follows conventions.The new
appsFailedReasongauge is correctly defined with appropriate label names (name,error) to capture per-app failure details. The implementation aligns well with the existing metrics structure.apps/meteor/app/metrics/server/lib/collectMetrics.ts (1)
49-49: Good practice: resetting gauge before collecting metrics.Resetting the gauge prevents stale metrics from persisting when apps recover from failure states.
apps/meteor/app/statistics/server/lib/getAppsStatistics.ts (2)
38-38: LGTM! Correctly aggregates failed apps with detailed information.The implementation properly:
- Collects failed app details (name and status) into the
appsFailedarray- Excludes manually disabled apps from the failure list (line 59 condition)
- Derives
totalFailedfromappsFailed.lengthfor consistencyThe change maintains backward compatibility while enabling per-app failure metrics.
Also applies to: 60-60, 69-70
89-89: Note: Memoization impact on metric freshness.The 60-second memoization cache means app failure metrics may lag by up to a minute. Since
setPrometheusDataruns every 5 seconds, the same app statistics will be reported across multiple collection cycles. This is acceptable for app statistics, which don't change frequently, but it's worth being aware of the potential delay in reflecting app state changes.
dad5171 to
640910e
Compare
feat(metrics): Update typings for failed apps statistics feat(metrics): Collect detailed failed apps in statistics feat(metrics): Add Prometheus gauge for failed apps reasons feat(metrics): Populate Prometheus gauge with failed app reasons styling fix how to reset metrics fix rollback file rollback file linting chore: review keep old payload to stats collector linting fix add the id as a label fix linting
640910e to
251f8bf
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
apps/meteor/app/metrics/server/lib/metrics.ts (1)
202-206: Update help text to mention all labels.The gauge is correctly defined with appropriate labels for tracking per-app failure details. However, the help text mentions "name and reason" but omits "id", which is also a label.
Apply this diff to make the help text complete:
appsFailedReason: new client.Gauge({ name: 'rocketchat_apps_failed_reason', labelNames: ['name', 'id', 'reason'], - help: 'name and reason for the apps that failed to load', + help: 'name, id, and reason for the apps that failed to load', }),
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Jira integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (4)
apps/meteor/app/metrics/server/lib/collectMetrics.ts(1 hunks)apps/meteor/app/metrics/server/lib/metrics.ts(1 hunks)apps/meteor/app/statistics/server/lib/getAppsStatistics.ts(7 hunks)packages/core-typings/src/IStats.ts(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- packages/core-typings/src/IStats.ts
🧰 Additional context used
🧬 Code graph analysis (2)
apps/meteor/app/statistics/server/lib/getAppsStatistics.ts (1)
packages/core-typings/src/IStats.ts (1)
IStats(22-277)
apps/meteor/app/metrics/server/lib/collectMetrics.ts (2)
apps/meteor/app/statistics/server/lib/getAppsStatistics.ts (1)
getAppsStatistics(89-89)apps/meteor/app/metrics/server/lib/metrics.ts (1)
metrics(5-248)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: 📦 Build Packages
- GitHub Check: Builds matrix rust bindings against alpine
- GitHub Check: CodeQL-Build
- GitHub Check: CodeQL-Build
🔇 Additional comments (7)
apps/meteor/app/metrics/server/lib/collectMetrics.ts (2)
43-43: LGTM! Integration with updated getAppsStatistics.The destructuring correctly includes the new
appsFailedfield returned bygetAppsStatistics().
49-52: LGTM! Proper gauge reset pattern for labeled metrics.The implementation correctly:
- Resets the gauge before collecting new metrics to prevent stale label combinations from persisting
- Iterates over failed apps and sets a value of 1 for each unique label combination (name, id, reason)
- Uses the spread operator to pass labels, which works correctly with the gauge's
setmethodThis pattern is appropriate for gauges with dynamic label sets.
apps/meteor/app/statistics/server/lib/getAppsStatistics.ts (5)
4-4: LGTM! Proper type import for consistency.Importing
IStatsensures type consistency with the core typings package.
15-15: LGTM! Type-safe reference to shared type definition.Using
IStats['apps']['appsFailed']ensures theAppsStatisticstype remains in sync with the core type definition.
27-27: LGTM! Consistent initialization of appsFailed.Both the early return path (when Apps is not initialized) and the error path correctly initialize
appsFailedas an empty array, maintaining consistency with the successful code path.Also applies to: 81-81
38-38: LGTM! Proper tracking of failed apps with correct filtering.The implementation:
- Declares
appsFailedwith explicit typing- Correctly identifies apps that failed (not enabled AND not manually disabled)
- Captures the necessary metadata (name, id, reason) for each failed app
The conditional logic at line 59-61 appropriately excludes manually disabled apps from the failure tracking, focusing only on apps that failed due to errors.
Also applies to: 60-60
69-70: LGTM! Deriving totalFailed from appsFailed.length maintains consistency.Deriving
totalFailedfrom the array length is good practice, as it eliminates the need for a separate counter and ensures the two values remain consistent.
d-gubert
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you show an example of how we would visualize this in a Grafana panel? I'm trying to understand the Gauge choice
| totalActive++; | ||
| } else if (status !== AppStatus.MANUALLY_DISABLED) { | ||
| totalFailed++; | ||
| appsFailed.push({ name: app.getName(), id: app.getID(), reason: status }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would change the payload sent to the stats collector, which is undesirable. Please extract this logic to affect only metrics


Adds a Prometheus gauge to track the reasons for app failures. This provides more granular insights into app loading issues, aiding in diagnosis and resolution. It includes the name and error code of each failed app.
The changes include:
getAppsStatisticsfunction to return a list of failed apps with their reasonsrocketchat_apps_failed_reasonto expose the failed app informationappsFailedReasongauge before collecting new metrics to avoid stale dataSummary by CodeRabbit
New Features
Improvements