
DOC-5338: RDI enhance observability page with more metrics information #1701


Open · wants to merge 2 commits into base: main

Conversation

ZdravkoDonev-redis
Collaborator

I used some AI magic and a few sources: example metrics from a running RDI instance, the codebase, the Debezium docs, etc.

I think the format is good, but I'm not sure if the alerting recommendations are correct.


@Copilot Copilot AI left a comment


Pull Request Overview

This PR enhances the observability documentation page for RDI by adding detailed metrics tables and alerting recommendations.

  • Added a collector metrics table with descriptions and alerting guidelines.
  • Introduced a second table covering stream processor metrics with detailed contextual notes.
  • Updated the recommended alerting strategy section for critical and informational monitoring.
Comments suppressed due to low confidence (3)

content/integrate/redis-data-integration/observability.md:110

  • [nitpick] Consider clarifying which specific states for 'rdi_engine_state' should trigger a critical alert to eliminate ambiguity for users.
| `rdi_engine_state` | Gauge | Current state of the RDI engine with labels for `state` (e.g., STARTED, RUNNING) and `sync_mode` (e.g., SNAPSHOT, STREAMING) | **Critical Alert**: Alert if state indicates failure or error condition |

content/integrate/redis-data-integration/observability.md:106

  • [nitpick] Consider expanding the description for 'incoming_records_created' to explain its purpose and usage, since reporting a timestamp as a gauge might be confusing for some users.
| `incoming_records_created` | Gauge | Timestamp when the incoming records counter was created | Informational - no alerting needed |

content/integrate/redis-data-integration/observability.md:55

  • [nitpick] Metric naming conventions differ between the first table (CamelCase) and the second table (snake_case). Consider aligning these conventions to avoid potential confusion.
| **ChangesApplied** | Counter | Total number of schema changes applied during recovery and runtime | Informational - monitor for trends |


@andy-stark-redis andy-stark-redis left a comment


A few minor suggestions and questions, but they're easy to fix, so I'll approve. Great addition to the info in this page :-)

| MilliSecondsSinceLastRecoveredChange | Gauge | Number of milliseconds since the last change was recovered from the history store | Informational - monitor for trends |
| RecoveryStartTime | Gauge | Time in epoch milliseconds when recovery started (-1 if not applicable) | Informational - monitor for trends |
| **Connection and State Metrics** | | | |
| Connected | Gauge | Whether the connector is currently connected to the database (1=connected, 0=disconnected) | **Critical Alert**: Alert if value = 0 (disconnected) |

Several items in the table refer to the "connector". Should this be "collector"?

| Metric | Type | Description | Alerting Recommendations |
|:--|:--|:--|:--|
| **Schema History Metrics** | | | |
| ChangesApplied | Counter | Total number of schema changes applied during recovery and runtime | Informational - monitor for trends |

The metric names are in normal style here, but in code style in the second table. Probably best to have them both the same.

Comment on lines +118 to +122
**Additional information about stream processor metrics:**

- The `rdi_` prefix comes from the Kubernetes namespace where RDI is installed. For VM install it is always this value.
- Metrics with `_created` suffix are automatically generated by Prometheus for counters and gauges to track when they were first created.
- The `rdi_incoming_entries` metric provides detailed breakdown by operation type for each data source.

Suggested change
**Additional information about stream processor metrics:**
- The `rdi_` prefix comes from the Kubernetes namespace where RDI is installed. For VM install it is always this value.
- Metrics with `_created` suffix are automatically generated by Prometheus for counters and gauges to track when they were first created.
- The `rdi_incoming_entries` metric provides detailed breakdown by operation type for each data source.
- Where the metric name has the `rdi_` prefix, this will be replaced by the Kubernetes namespace name if you supplied a custom name during installation. The prefix is always `rdi_` for VM installations.
- Metrics with the `_created` suffix are automatically generated by Prometheus for counters and gauges to track when they were first created.
- The `rdi_incoming_entries` metric provides a detailed breakdown for each data source by operation type.
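
As a possible follow-up, the per-source, per-operation breakdown of `rdi_incoming_entries` lends itself to aggregation in Prometheus. Below is a minimal recording-rule sketch; the `data_source` and `operation` label names are assumptions for illustration, since the actual label names aren't quoted in this PR:

```yaml
# Sketch only: aggregate incoming entries per data source and operation type.
# The label names (data_source, operation) are assumed, not taken from the docs.
groups:
  - name: rdi-incoming-breakdown
    rules:
      - record: rdi:incoming_entries:by_source_and_operation
        expr: sum by (data_source, operation) (rdi_incoming_entries)
        # If the metric is exposed as a counter, wrap it in rate(), e.g.
        # sum by (data_source, operation) (rate(rdi_incoming_entries[5m]))
```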


## Recommended alerting strategy

The following alerting strategy focuses on system failures and data integrity issues that require immediate attention. Most metrics are informational and should be monitored for trends rather than triggering alerts.

Suggested change
The following alerting strategy focuses on system failures and data integrity issues that require immediate attention. Most metrics are informational and should be monitored for trends rather than triggering alerts.
The alerting strategy described in the sections below focuses on system failures and data integrity issues that require immediate attention. Most other metrics are informational, so you should monitor them for trends rather than trigger alerts.


### Critical alerts (immediate response required)

These are the only alerts that should wake someone up or require immediate action:

Suggested change
These are the only alerts that should wake someone up or require immediate action:
These are the only alerts that require immediate action:

Comment on lines +134 to +138
- **`Connected = 0`**: Database connectivity lost - RDI cannot function without database connection
- **`NumberOfErroneousEvents > 0`**: Data processing errors occurring - indicates data corruption or processing failures
- **`rejected_records_total > 0`**: Records being rejected - indicates data quality issues or processing failures
- **`SnapshotAborted = 1`**: Snapshot process failed - initial sync is incomplete
- **`rdi_engine_state`**: Alert only if the state indicates a clear failure condition (not just "not running")

Suggested change
- **`Connected = 0`**: Database connectivity lost - RDI cannot function without database connection
- **`NumberOfErroneousEvents > 0`**: Data processing errors occurring - indicates data corruption or processing failures
- **`rejected_records_total > 0`**: Records being rejected - indicates data quality issues or processing failures
- **`SnapshotAborted = 1`**: Snapshot process failed - initial sync is incomplete
- **`rdi_engine_state`**: Alert only if the state indicates a clear failure condition (not just "not running")
- **`Connected = 0`**: Database connectivity has been lost. RDI cannot function without a database connection.
- **`NumberOfErroneousEvents > 0`**: Errors are occurring during data processing. This indicates data corruption or processing failures.
- **`rejected_records_total > 0`**: Records are being rejected. This indicates data quality issues or processing failures.
- **`SnapshotAborted = 1`**: The snapshot process has failed, so the initial sync is incomplete.
- **`rdi_engine_state`**: Only raise an alert if the state indicates a clear failure condition (not just "not running").
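
To make the critical conditions above concrete, here is a minimal Prometheus alerting-rule sketch. The metric names come from the tables in this PR, but the group name, `for` duration, and severity labels are illustrative, and the collector (Debezium) metrics may be exposed with a different prefix or casing depending on the exporter configuration:

```yaml
# Sketch of alerting rules for the critical conditions listed above.
# Rule names, durations, and labels are illustrative, not prescriptive.
groups:
  - name: rdi-critical-alerts
    rules:
      - alert: RDICollectorDisconnected
        expr: Connected == 0              # collector lost its source database connection
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "RDI collector is disconnected from the source database"
      - alert: RDIRecordsRejected
        expr: increase(rejected_records_total[5m]) > 0   # stream processor is rejecting records
        labels:
          severity: critical
        annotations:
          summary: "RDI is rejecting records"
      - alert: RDISnapshotAborted
        expr: SnapshotAborted == 1        # initial snapshot failed
        labels:
          severity: critical
        annotations:
          summary: "RDI initial snapshot was aborted"
```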


### Important monitoring (but not alerts)

These metrics should be monitored on dashboards and reviewed regularly, but do not require automated alerts:

Suggested change
These metrics should be monitored on dashboards and reviewed regularly, but do not require automated alerts:
You should monitor these metrics on dashboards and review them regularly, but they don't require automated alerts:

Comment on lines +144 to +148
- **Queue metrics**: Queue utilization can vary widely and hitting 0% or 100% capacity may be normal during certain operations
- **Latency metrics**: Lag and processing times depend heavily on business requirements and normal operational patterns
- **Event counters**: Event rates naturally vary based on application usage patterns
- **Snapshot progress**: Snapshot duration and progress depend on data size and are typically monitored manually
- **Schema changes**: Schema change frequency is highly application-dependent

Suggested change
- **Queue metrics**: Queue utilization can vary widely and hitting 0% or 100% capacity may be normal during certain operations
- **Latency metrics**: Lag and processing times depend heavily on business requirements and normal operational patterns
- **Event counters**: Event rates naturally vary based on application usage patterns
- **Snapshot progress**: Snapshot duration and progress depend on data size and are typically monitored manually
- **Schema changes**: Schema change frequency is highly application-dependent
- **Queue metrics**: Queue utilization can vary widely and hitting 0% or 100% capacity may be normal during certain operations.
- **Latency metrics**: Lag and processing times depend heavily on business requirements and normal operational patterns.
- **Event counters**: Event rates naturally vary based on application usage patterns.
- **Snapshot progress**: Snapshot duration and progress depend on data size, so you should typically monitor them manually.
- **Schema changes**: Schema change frequency is highly application-dependent.

Comment on lines +152 to +156
1. **Alert on failures, not performance**: Focus alerts on system failures rather than performance degradation
2. **Business context matters**: Latency and throughput requirements vary significantly between organizations
3. **Establish baselines first**: Monitor metrics for weeks before setting any threshold-based alerts
4. **Avoid alert fatigue**: Too many alerts reduce response to truly critical issues
5. **Use dashboards for trends**: Most metrics are better suited for dashboard monitoring than alerting

I think a bullet list works better here (numbers tend to suggest a sequence or a priority order, and also the lists in the other sections are bulleted).

Suggested change
1. **Alert on failures, not performance**: Focus alerts on system failures rather than performance degradation
2. **Business context matters**: Latency and throughput requirements vary significantly between organizations
3. **Establish baselines first**: Monitor metrics for weeks before setting any threshold-based alerts
4. **Avoid alert fatigue**: Too many alerts reduce response to truly critical issues
5. **Use dashboards for trends**: Most metrics are better suited for dashboard monitoring than alerting
- **Alert on failures, not performance**: Focus alerts on system failures rather than performance degradation.
- **Business context matters**: Latency and throughput requirements vary significantly between organizations.
- **Establish baselines first**: Monitor metrics for weeks before you set any threshold-based alerts.
- **Avoid alert fatigue**: If you see too many non-critical alerts, you are less likely to take truly critical issues seriously.
- **Use dashboards for trends**: Most metrics are better suited for dashboard monitoring than alerting.

Comment on lines +160 to +163
- **Dashboard-first approach**: Use Grafana dashboards to visualize trends and patterns
- **Baseline establishment**: Monitor your specific workload for 2-4 weeks before considering additional alerts
- **Business SLA alignment**: Only create alerts for metrics that directly impact your business SLA requirements
- **Manual review**: Regularly review metric trends during business reviews rather than automated alerting

Suggested change
- **Dashboard-first approach**: Use Grafana dashboards to visualize trends and patterns
- **Baseline establishment**: Monitor your specific workload for 2-4 weeks before considering additional alerts
- **Business SLA alignment**: Only create alerts for metrics that directly impact your business SLA requirements
- **Manual review**: Regularly review metric trends during business reviews rather than automated alerting
- **Dashboard-first approach**: Use Grafana dashboards to visualize trends and patterns.
- **Baseline establishment**: Monitor your specific workload for 2-4 weeks before you consider adding more alerts.
- **Business SLA alignment**: Only create alerts for metrics that directly impact your business SLA requirements.
- **Manual review**: Don't use automated alerts to review metric trends. Instead, schedule regular business reviews to check them manually.
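
For the dashboard-first approach, one option (a sketch, not something this page prescribes) is to pre-compute the trend series you want to baseline as Prometheus recording rules and point Grafana panels at them, using only metrics quoted in the tables above:

```yaml
# Sketch: trend series for Grafana dashboards rather than alerts.
groups:
  - name: rdi-dashboard-trends
    rules:
      # Hourly rate of schema changes (ChangesApplied is documented above as a counter).
      - record: rdi:schema_changes_applied:rate1h
        expr: rate(ChangesApplied[1h])
      # Smoothed recovery staleness for trend panels (a gauge, so averaged over time).
      - record: rdi:ms_since_last_recovered_change:avg1h
        expr: avg_over_time(MilliSecondsSinceLastRecoveredChange[1h])
```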
