@gabrielpetry gabrielpetry commented Oct 3, 2025

Adds a Prometheus gauge to track the reasons for app failures. This provides more granular insights into app loading issues, aiding in diagnosis and resolution. It includes the name and error code of each failed app.

The changes include:

  • Updating the getAppsStatistics function to return a list of failed apps with their reasons
  • Adding a new Prometheus gauge rocketchat_apps_failed_reason to expose the failed app information
  • Resetting the appsFailedReason gauge before collecting new metrics to avoid stale data

Summary by CodeRabbit

  • New Features

    • Records per‑app failures (name, id, reason) and exposes a per‑app failure metric for observability.
  • Improvements

    • Derives total failed apps from the per‑app failure list for consistency.
    • Per‑app failure metrics are reset and refreshed each collection cycle to reflect current state.

@dionisio-bot
Contributor

dionisio-bot bot commented Oct 3, 2025

Looks like this PR is not ready to merge, because of the following issues:

  • This PR is missing the 'stat: QA assured' label
  • This PR is targeting the wrong base branch. It should target 7.14.0, but it targets 7.13.0

Please fix the issues and try again

If you have any trouble, please check the PR guidelines

@changeset-bot

changeset-bot bot commented Oct 3, 2025

⚠️ No Changeset found

Latest commit: 5459bbb

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

@CLAassistant

CLAassistant commented Oct 3, 2025

CLA assistant check
All committers have signed the CLA.

@coderabbitai
Contributor

coderabbitai bot commented Oct 3, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Adds per-app failure reporting: a new appsFailed array in app stats and typings, a Prometheus gauge rocketchat_apps_failed_reason (labels: name, id, reason), and updates metrics collection to derive totalFailed from appsFailed.length and emit per-app failure metrics.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Metrics collection logic: apps/meteor/app/metrics/server/lib/collectMetrics.ts | Read appsFailed from stats; derive totalFailed = appsFailed.length; reset and populate new per-app gauge via appsFailedReason.labels(name, id, reason).set(1). |
| Metrics definitions: apps/meteor/app/metrics/server/lib/metrics.ts | Add appsFailedReason gauge: rocketchat_apps_failed_reason with labelNames: ['name','id','reason'] and help 'name and reason for the apps that failed to load'. |
| Statistics computation: apps/meteor/app/statistics/server/lib/getAppsStatistics.ts | Add appsFailed: Array<{ name: string; id: string; reason: AppStatus }> to returned stats; accumulate per-app failures into appsFailed; derive totalFailed = appsFailed.length; error paths return appsFailed: []. |
| Core typings: packages/core-typings/src/IStats.ts | Add appsFailed: Array<{ name: string; id: string; reason: AppStatus }> to IStats.apps and import AppStatus; totalFailed remains available (now derived from appsFailed.length). |

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Scheduler as Scheduler/Cron
  participant Collector as collectMetrics
  participant Stats as getAppsStatistics
  participant Prom as prom-client

  Scheduler->>Collector: setPrometheusData()
  Collector->>Stats: _getAppsStatistics()
  Stats-->>Collector: { appsFailed: [{name,id,reason},...], totalInstalled, totalActive, ... }

  rect rgba(200,235,255,0.25)
    note right of Collector: update Prometheus metrics
    Collector->>Prom: totalFailed.set(appsFailed.length)
    Collector->>Prom: appsFailedReason.reset()
    loop per failed app
      Collector->>Prom: appsFailedReason.labels(name,id,reason).set(1)
    end
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Potential focus areas:

  • getAppsStatistics.ts — correctness of status-to-reason mapping and error paths.
  • collectMetrics.ts — label ordering and reset/setting semantics for the new gauge.
  • IStats.ts — type import and compatibility with existing consumers.

Suggested labels

stat: ready to merge, stat: QA assured

Suggested reviewers

  • sampaiodiego

Poem

I hop through metrics, soft and spry,
Counting names and ids that sigh.
Reasons lined in tidy rows,
Each small failure clearly shows.
A carrot for insight, watch them fly 🥕

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: adding Prometheus metrics to track reasons for failed app loading. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Jira integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between c4feb56 and 5459bbb.

📒 Files selected for processing (1)
  • apps/meteor/app/metrics/server/lib/metrics.ts (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: 📦 Build Packages
  • GitHub Check: CodeQL-Build
  • GitHub Check: CodeQL-Build
🔇 Additional comments (1)
apps/meteor/app/metrics/server/lib/metrics.ts (1)

202-206: LGTM! Well-implemented failure tracking metric.

The new gauge follows the established patterns in the codebase and appropriately uses labels to provide granular insights into app failures. The choice of a Gauge (rather than a Counter) is correct since failed apps can be fixed or uninstalled, allowing the metric to decrease. The three labels ('name', 'id', 'reason') provide both human-readable context and unique identification for monitoring and alerting.

@gabrielpetry gabrielpetry changed the title from "Adds metrics for failed app reasons" to "chore: Adds metrics for failed app reasons" Oct 3, 2025
@gabrielpetry gabrielpetry changed the title from "chore: Adds metrics for failed app reasons" to "chore: adds metrics for failed app reasons" Oct 3, 2025
@codecov

codecov bot commented Oct 3, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 67.07%. Comparing base (035e0c7) to head (5459bbb).

Additional details and impacted files

@@             Coverage Diff             @@
##           develop   #37133      +/-   ##
===========================================
- Coverage    68.97%   67.07%   -1.91%     
===========================================
  Files         3359     3458      +99     
  Lines       114214   114874     +660     
  Branches     20535    21097     +562     
===========================================
- Hits         78784    77054    -1730     
- Misses       33335    35771    +2436     
+ Partials      2095     2049      -46     
| Flag | Coverage Δ |
| --- | --- |
| e2e | 57.46% <ø> (+0.01%) ⬆️ |
| e2e-api | 42.85% <ø> (ø) |

Flags with carried forward coverage won't be shown.

@gabrielpetry
Author

gabrielpetry commented Oct 7, 2025

Worked!

image

Once the issues are fixed, the app is no longer reported by the metric, which is the intended behavior.
image

@gabrielpetry gabrielpetry marked this pull request as ready for review October 7, 2025 12:55
@gabrielpetry gabrielpetry requested a review from a team as a code owner October 7, 2025 12:55
debdutdeb previously approved these changes Oct 7, 2025
@gabriellsh
Member

@d-gubert do you think there is some other info we'd like to include in the metrics? Might be a good time now since we're adding this.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
packages/core-typings/src/IStats.ts (1)

159-160: Consider removing totalFailed to avoid redundancy.

The interface now has both totalFailed (line 159) and appsFailed (line 160). According to the AI summary, totalFailed is derived from appsFailed.length in other parts of the codebase. Having both fields in the type definition creates redundancy and potential for data inconsistency if they're not kept in sync.

Consider one of the following approaches:

  1. Remove totalFailed and derive it where needed from appsFailed.length
  2. Make the relationship explicit by documenting that totalFailed must equal appsFailed.length
  3. If totalFailed serves a different purpose (e.g., false indicates a specific state), document this distinction clearly
📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 19fd160 and 6e56a81.

📒 Files selected for processing (1)
  • packages/core-typings/src/IStats.ts (2 hunks)

@debdutdeb debdutdeb force-pushed the chore/failed-apps-reasons-metric branch from 6e56a81 to e0e724e October 7, 2025 18:15
@debdutdeb debdutdeb requested review from a team as code owners October 7, 2025 18:15
@debdutdeb debdutdeb changed the base branch from develop to release-7.11.0 October 7, 2025 18:15
Copilot AI review requested due to automatic review settings October 7, 2025 18:17
@debdutdeb debdutdeb force-pushed the chore/failed-apps-reasons-metric branch from e0e724e to b840490 October 7, 2025 18:17

This comment was marked as outdated.

@debdutdeb debdutdeb force-pushed the chore/failed-apps-reasons-metric branch from b840490 to 62a2841 October 7, 2025 18:18
@debdutdeb debdutdeb removed request for a team October 7, 2025 18:19
@debdutdeb debdutdeb added this to the 7.11.0 milestone Oct 7, 2025
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
apps/meteor/app/metrics/server/lib/collectMetrics.ts (1)

50-52: Consider explicit label assignment for clarity.

The spread operator { ...app } works correctly here since the object shape matches the expected labels. However, for better maintainability and explicit intent, consider destructuring or explicitly passing the labels.

Apply this diff to make the label assignment more explicit:

-	for (const app of appsFailed) {
-		metrics.appsFailedReason.set({ ...app }, 1);
+	for (const { name, error } of appsFailed) {
+		metrics.appsFailedReason.set({ name, error }, 1);
	}

This makes it immediately clear which properties are being used as labels and prevents unintended properties from being spread if the object structure changes.

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 6e56a81 and 62a2841.

📒 Files selected for processing (4)
  • apps/meteor/app/metrics/server/lib/collectMetrics.ts (1 hunks)
  • apps/meteor/app/metrics/server/lib/metrics.ts (1 hunks)
  • apps/meteor/app/statistics/server/lib/getAppsStatistics.ts (7 hunks)
  • packages/core-typings/src/IStats.ts (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • packages/core-typings/src/IStats.ts
🧰 Additional context used
🧬 Code graph analysis (2)
apps/meteor/app/metrics/server/lib/collectMetrics.ts (2)
apps/meteor/app/statistics/server/lib/getAppsStatistics.ts (1)
  • getAppsStatistics (89-89)
apps/meteor/app/metrics/server/lib/metrics.ts (1)
  • metrics (5-248)
apps/meteor/app/statistics/server/lib/getAppsStatistics.ts (1)
packages/core-typings/src/IStats.ts (1)
  • IStats (22-277)
🔇 Additional comments (4)
apps/meteor/app/metrics/server/lib/metrics.ts (1)

202-206: LGTM! Gauge definition follows conventions.

The new appsFailedReason gauge is correctly defined with appropriate label names (name, error) to capture per-app failure details. The implementation aligns well with the existing metrics structure.

apps/meteor/app/metrics/server/lib/collectMetrics.ts (1)

49-49: Good practice: resetting gauge before collecting metrics.

Resetting the gauge prevents stale metrics from persisting when apps recover from failure states.

apps/meteor/app/statistics/server/lib/getAppsStatistics.ts (2)

38-38: LGTM! Correctly aggregates failed apps with detailed information.

The implementation properly:

  • Collects failed app details (name and status) into the appsFailed array
  • Excludes manually disabled apps from the failure list (line 59 condition)
  • Derives totalFailed from appsFailed.length for consistency

The change maintains backward compatibility while enabling per-app failure metrics.

Also applies to: 60-60, 69-70


89-89: Note: Memoization impact on metric freshness.

The 60-second memoization cache means app failure metrics may lag by up to a minute. Since setPrometheusData runs every 5 seconds, the same app statistics will be reported across multiple collection cycles. This is acceptable for app statistics, which don't change frequently, but it's worth being aware of the potential delay in reflecting app state changes.
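To illustrate the lag described above, here is a minimal sketch. It assumes a hypothetical `memoizeWithTtl` helper with an injectable clock; the memoization mechanism actually used by the PR is not shown in this thread, and `fakeNow`, `failedCount` and the polling loop are made up for the simulation.

```typescript
// Hypothetical TTL memoizer with an injectable clock, so the example needs no real waiting.
function memoizeWithTtl<T>(fn: () => T, maxAgeMs: number, now: () => number): () => T {
  let cached: { value: T; at: number } | undefined;
  return () => {
    const t = now();
    if (cached !== undefined && t - cached.at < maxAgeMs) {
      return cached.value; // still fresh: reuse without recomputing
    }
    const value = fn();
    cached = { value, at: t };
    return value;
  };
}

let fakeNow = 0; // simulated clock, in milliseconds
let computeCalls = 0;
let failedCount = 2; // pretend two apps are failing at t = 0

const getFailedCount = memoizeWithTtl(
  () => {
    computeCalls++;
    return failedCount;
  },
  60000,
  () => fakeNow,
);

// Poll every 5 s from t = 0 to t = 60 s; the apps get fixed just before the second poll.
const seen: number[] = [];
for (fakeNow = 0; fakeNow <= 60000; fakeNow += 5000) {
  if (fakeNow === 5000) {
    failedCount = 0;
  }
  seen.push(getFailedCount());
}
```

In this simulation the first twelve polls keep reporting 2 and only the poll at t = 60 s sees 0, with the underlying function called twice in total.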

@debdutdeb debdutdeb force-pushed the chore/failed-apps-reasons-metric branch from dad5171 to 640910e October 8, 2025 13:27
feat(metrics): Update typings for failed apps statistics

feat(metrics): Collect detailed failed apps in statistics

feat(metrics): Add Prometheus gauge for failed apps reasons

feat(metrics): Populate Prometheus gauge with failed app reasons

styling

fix how to reset metrics

fix

rollback file

rollback file

linting

chore: review

keep old payload to stats collector

linting

fix

add the id as a label

fix linting
@debdutdeb debdutdeb force-pushed the chore/failed-apps-reasons-metric branch from 640910e to 251f8bf October 8, 2025 13:30
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
apps/meteor/app/metrics/server/lib/metrics.ts (1)

202-206: Update help text to mention all labels.

The gauge is correctly defined with appropriate labels for tracking per-app failure details. However, the help text mentions "name and reason" but omits "id", which is also a label.

Apply this diff to make the help text complete:

 appsFailedReason: new client.Gauge({
   name: 'rocketchat_apps_failed_reason',
   labelNames: ['name', 'id', 'reason'],
-  help: 'name and reason for the apps that failed to load',
+  help: 'name, id, and reason for the apps that failed to load',
 }),
📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 62a2841 and c4feb56.

📒 Files selected for processing (4)
  • apps/meteor/app/metrics/server/lib/collectMetrics.ts (1 hunks)
  • apps/meteor/app/metrics/server/lib/metrics.ts (1 hunks)
  • apps/meteor/app/statistics/server/lib/getAppsStatistics.ts (7 hunks)
  • packages/core-typings/src/IStats.ts (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • packages/core-typings/src/IStats.ts
🔇 Additional comments (7)
apps/meteor/app/metrics/server/lib/collectMetrics.ts (2)

43-43: LGTM! Integration with updated getAppsStatistics.

The destructuring correctly includes the new appsFailed field returned by getAppsStatistics().


49-52: LGTM! Proper gauge reset pattern for labeled metrics.

The implementation correctly:

  • Resets the gauge before collecting new metrics to prevent stale label combinations from persisting
  • Iterates over failed apps and sets a value of 1 for each unique label combination (name, id, reason)
  • Uses the spread operator to pass labels, which works correctly with the gauge's set method

This pattern is appropriate for gauges with dynamic label sets.
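The reset-then-set cycle can be sketched as follows. This is only an illustration: `LabeledGauge` is a hypothetical in-memory stand-in (not prom-client itself) that mirrors the `reset()` and `set(labels, value)` calls used in the PR, and the app names, ids and reasons are made up.

```typescript
// Hypothetical in-memory stand-in for a labeled gauge (not prom-client itself).
type GaugeLabels = { name: string; id: string; reason: string };

class LabeledGauge {
  private series: Record<string, number> = {};

  // Drop every label combination, so fixed apps stop being reported.
  reset(): void {
    this.series = {};
  }

  // Record a value for one label combination.
  set(labels: GaugeLabels, value: number): void {
    this.series[JSON.stringify([labels.name, labels.id, labels.reason])] = value;
  }

  // Number of label combinations currently exposed.
  seriesCount(): number {
    return Object.keys(this.series).length;
  }
}

// One collection cycle: clear stale series, then emit 1 per failed app.
function reportFailedApps(gauge: LabeledGauge, appsFailed: GaugeLabels[]): void {
  gauge.reset();
  for (const { name, id, reason } of appsFailed) {
    gauge.set({ name, id, reason }, 1);
  }
}

const gauge = new LabeledGauge();

// Cycle 1: two apps are failing.
reportFailedApps(gauge, [
  { name: 'app-a', id: 'id-a', reason: 'error_disabled' },
  { name: 'app-b', id: 'id-b', reason: 'invalid_license_disabled' },
]);
const seriesAfterFirstCycle = gauge.seriesCount();

// Cycle 2: app-a was fixed, so only app-b is reported.
reportFailedApps(gauge, [{ name: 'app-b', id: 'id-b', reason: 'invalid_license_disabled' }]);
const seriesAfterSecondCycle = gauge.seriesCount();
```

Without the `reset()` call, the series for app-a would still be exposed after the second cycle even though the app had recovered.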

apps/meteor/app/statistics/server/lib/getAppsStatistics.ts (5)

4-4: LGTM! Proper type import for consistency.

Importing IStats ensures type consistency with the core typings package.


15-15: LGTM! Type-safe reference to shared type definition.

Using IStats['apps']['appsFailed'] ensures the AppsStatistics type remains in sync with the core type definition.


27-27: LGTM! Consistent initialization of appsFailed.

Both the early return path (when Apps is not initialized) and the error path correctly initialize appsFailed as an empty array, maintaining consistency with the successful code path.

Also applies to: 81-81


38-38: LGTM! Proper tracking of failed apps with correct filtering.

The implementation:

  • Declares appsFailed with explicit typing
  • Correctly identifies apps that failed (not enabled AND not manually disabled)
  • Captures the necessary metadata (name, id, reason) for each failed app

The conditional logic at line 59-61 appropriately excludes manually disabled apps from the failure tracking, focusing only on apps that failed due to errors.

Also applies to: 60-60


69-70: LGTM! Deriving totalFailed from appsFailed.length maintains consistency.

Deriving totalFailed from the array length is good practice, as it eliminates the need for a separate counter and ensures the two values remain consistent.

Base automatically changed from release-7.11.0 to master October 17, 2025 18:29
@geekgonecrazy geekgonecrazy changed the base branch from master to develop October 29, 2025 14:07
@geekgonecrazy geekgonecrazy modified the milestones: 7.11.0, 7.13.0 Oct 29, 2025
@github-actions
Contributor

📦 Docker Image Size Report

📈 Changes

| Service | Current | Baseline | Change | Percent |
| --- | --- | --- | --- | --- |
| sum of all images | 1.2GiB | 1.2GiB | +12MiB | |
| rocketchat | 367MiB | 355MiB | +12MiB | |
| omnichannel-transcript-service | 141MiB | 141MiB | +786B | |
| queue-worker-service | 141MiB | 141MiB | +443B | |
| ddp-streamer-service | 127MiB | 127MiB | +264B | |
| account-service | 114MiB | 114MiB | -2.8KiB | |
| stream-hub-service | 111MiB | 111MiB | +187B | |
| authorization-service | 111MiB | 111MiB | +175B | |
| presence-service | 111MiB | 111MiB | -1.1KiB | |

📊 Historical Trend

---
config:
  theme: "dark"
  xyChart:
    width: 900
    height: 400
---
xychart
  title "Image Size Evolution by Service (Last 30 Days + This PR)"
  x-axis ["11/15 22:28", "11/16 01:28", "11/17 23:50", "11/18 22:53", "11/19 19:17", "11/19 20:16 (PR)"]
  y-axis "Size (GB)" 0 --> 0.5
  line "account-service" [0.11, 0.11, 0.11, 0.11, 0.11, 0.11]
  line "authorization-service" [0.11, 0.11, 0.11, 0.11, 0.11, 0.11]
  line "ddp-streamer-service" [0.12, 0.12, 0.12, 0.12, 0.12, 0.12]
  line "omnichannel-transcript-service" [0.14, 0.14, 0.14, 0.14, 0.14, 0.14]
  line "presence-service" [0.11, 0.11, 0.11, 0.11, 0.11, 0.11]
  line "queue-worker-service" [0.14, 0.14, 0.14, 0.14, 0.14, 0.14]
  line "rocketchat" [0.36, 0.36, 0.35, 0.35, 0.35, 0.36]
  line "stream-hub-service" [0.11, 0.11, 0.11, 0.11, 0.11, 0.11]

Statistics (last 5 days):

  • 📊 Average: 1.4GiB
  • ⬇️ Minimum: 1.2GiB
  • ⬆️ Maximum: 1.6GiB
  • 🎯 Current PR: 1.2GiB
ℹ️ About this report

This report compares Docker image sizes from this build against the develop baseline.

  • Tag: pr-37133
  • Baseline: develop
  • Timestamp: 2025-11-19 20:16:24 UTC
  • Historical data points: 5

Updated: Wed, 19 Nov 2025 20:16:24 GMT

Member

@d-gubert d-gubert left a comment

Can you show an example of how we would visualize this in a Grafana panel? I'm trying to understand the Gauge choice.

		totalActive++;
	} else if (status !== AppStatus.MANUALLY_DISABLED) {
		totalFailed++;
		appsFailed.push({ name: app.getName(), id: app.getID(), reason: status });
Member

This would change the payload sent to the stats collector, which is undesirable. Please extract this logic so that it affects only the metrics.
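One possible shape for that separation is sketched below, under stated assumptions: `buildStatsPayload` and `listFailedAppsForMetrics` are hypothetical names, and apps are reduced to plain objects with simplified status strings, whereas the real code works with app instances and AppStatus.

```typescript
// Hypothetical shapes and status strings, for illustration only.
interface AppInfo { name: string; id: string; status: string }
interface AppsStatsPayload { totalInstalled: number; totalActive: number; totalFailed: number }
interface FailedAppLabels { name: string; id: string; reason: string }

function isActive(status: string): boolean {
  return status === 'auto_enabled' || status === 'manually_enabled';
}

// Failed means: not enabled, and not disabled on purpose by an admin.
function isFailed(status: string): boolean {
  return !isActive(status) && status !== 'manually_disabled';
}

// Payload for the stats collector: counters only, no per-app list.
function buildStatsPayload(apps: AppInfo[]): AppsStatsPayload {
  return {
    totalInstalled: apps.length,
    totalActive: apps.filter((app) => isActive(app.status)).length,
    totalFailed: apps.filter((app) => isFailed(app.status)).length,
  };
}

// Metrics-only view: per-app labels that never reach the stats payload.
function listFailedAppsForMetrics(apps: AppInfo[]): FailedAppLabels[] {
  return apps
    .filter((app) => isFailed(app.status))
    .map((app) => ({ name: app.name, id: app.id, reason: app.status }));
}

const installedApps: AppInfo[] = [
  { name: 'app-a', id: 'id-a', status: 'manually_enabled' },
  { name: 'app-b', id: 'id-b', status: 'manually_disabled' },
  { name: 'app-c', id: 'id-c', status: 'error_disabled' },
];

const statsPayload = buildStatsPayload(installedApps);
const failedForMetrics = listFailedAppsForMetrics(installedApps);
```

Both functions share the same `isFailed` predicate, so the counter in the payload and the per-app list used for the gauge cannot drift apart, while the payload keeps its existing fields.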

@AliNunes AliNunes modified the milestones: 7.13.0, 7.14.0 Nov 26, 2025