@remo-lab
Contributor

Description

This PR adds a proper readiness check to Hyperledger Fabric peers while keeping the existing liveness (/healthz) behavior unchanged.

Today, /healthz mainly reports whether the process is running. In Kubernetes and production setups, it is often useful to know whether a peer is actually ready to receive traffic (ledger open, gossip initialized, etc.). This change separates liveness from readiness in a safe, backward-compatible way.

What changed

  1. Added a new /readyz endpoint for readiness probing
  2. Kept /healthz exactly as it is (liveness only)
  3. Introduced a small readiness handler (without touching vendored code)
  4. Added optional readiness checkers:
  • Gossip: checks that the gossip service is initialized and (optionally) connected to peers
  • Ledger: checks that ledgers are available and readable
  • Orderer: optional connectivity check (disabled by default)
  5. Added an optional /healthz/detailed endpoint with component-level status
  • Disabled by default
  • Protected by operations TLS / client auth
  • Uses normalized states: OK, DEGRADED, UNAVAILABLE

All new checks are opt-in and use conservative defaults to avoid false negatives.
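
To illustrate the split between liveness and readiness described above, here is a minimal Go sketch. It is not the PR's code: the `ReadinessChecker` interface, `funcChecker`, `firstFailure`, `newMux`, and `probe` are hypothetical names introduced for illustration, and the real handler may be structured differently. It assumes /healthz always answers 200 while the process can serve HTTP, and /readyz answers 503 as soon as one enabled checker reports an error.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// ReadinessChecker is a hypothetical interface for this sketch; the PR's
// actual checker types (GossipChecker, LedgerChecker, ...) may look different.
type ReadinessChecker interface {
	Name() string
	Ready() error // nil means the component is ready
}

// funcChecker adapts a plain function to ReadinessChecker.
type funcChecker struct {
	name string
	fn   func() error
}

func (c funcChecker) Name() string { return c.name }
func (c funcChecker) Ready() error { return c.fn() }

// errNotReady is a placeholder error used by the example checkers.
var errNotReady = errors.New("not ready")

// firstFailure returns the name and error of the first failing checker,
// or "" and nil when every checker passes.
func firstFailure(checkers []ReadinessChecker) (string, error) {
	for _, c := range checkers {
		if err := c.Ready(); err != nil {
			return c.Name(), err
		}
	}
	return "", nil
}

// newMux wires liveness and readiness to separate paths.
func newMux(checkers []ReadinessChecker) *http.ServeMux {
	mux := http.NewServeMux()
	// Liveness: answers as long as the process can serve HTTP.
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	// Readiness: 503 if any registered checker reports an error.
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if name, err := firstFailure(checkers); err != nil {
			http.Error(w, fmt.Sprintf("%s: %v", name, err), http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	return mux
}

// probe issues an in-memory GET against the handler and returns the status code.
func probe(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	mux := newMux([]ReadinessChecker{
		funcChecker{name: "ledger", fn: func() error { return nil }},
		funcChecker{name: "gossip", fn: func() error { return errNotReady }},
	})
	fmt.Println("/healthz:", probe(mux, "/healthz")) // 200
	fmt.Println("/readyz:", probe(mux, "/readyz"))   // 503
}
```

Keeping the liveness handler free of component checks means a peer whose gossip or ledger is still starting is taken out of rotation rather than restarted.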

Why this is useful

  • Works cleanly with Kubernetes liveness and readiness probes
  • Avoids sending traffic to peers that aren’t ready yet
  • Makes partial failures easier to observe without forcing restarts
  • Improves operational visibility without changing core behavior

Backward compatibility

  • No behavior change to existing /healthz
  • No breaking config changes
  • Readiness checks can fail only if they are explicitly enabled
  • Ledger lag and orderer connectivity checks are disabled by default

Configuration (example)

operations:
  healthCheck:
    gossip:
      enabled: true
      minPeers: 0
    ledger:
      enabled: true
      failOnLag: false
    orderer:
      enabled: false
    detailedHealth:
      enabled: false
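
As a rough illustration of how these settings could gate the checks, the Go sketch below mirrors the example values in a struct. The struct, its field names, and the helper functions are assumptions made for illustration and are not taken from the PR. In particular, it assumes that minPeers is a lower bound on connected gossip peers and that ledger lag affects readiness only when failOnLag is true.

```go
package main

import "fmt"

// HealthCheckConfig mirrors the example YAML. The struct and field names are
// assumptions for illustration, not the PR's actual types.
type HealthCheckConfig struct {
	GossipEnabled   bool
	GossipMinPeers  int
	LedgerEnabled   bool
	LedgerFailOnLag bool
	OrdererEnabled  bool
	DetailedEnabled bool
}

// exampleConfig returns the values shown in the example configuration.
func exampleConfig() HealthCheckConfig {
	return HealthCheckConfig{
		GossipEnabled:   true,
		GossipMinPeers:  0,
		LedgerEnabled:   true,
		LedgerFailOnLag: false,
		OrdererEnabled:  false,
		DetailedEnabled: false,
	}
}

// enabledCheckers lists the checkers that would take part in readiness.
func enabledCheckers(cfg HealthCheckConfig) []string {
	var names []string
	if cfg.GossipEnabled {
		names = append(names, "gossip")
	}
	if cfg.LedgerEnabled {
		names = append(names, "ledger")
	}
	if cfg.OrdererEnabled {
		names = append(names, "orderer")
	}
	return names
}

// gossipReady passes when the check is disabled, or when gossip is
// initialized and at least GossipMinPeers peers are connected.
func gossipReady(cfg HealthCheckConfig, initialized bool, connectedPeers int) bool {
	if !cfg.GossipEnabled {
		return true
	}
	return initialized && connectedPeers >= cfg.GossipMinPeers
}

// ledgerReady passes when the check is disabled, or when ledgers are readable;
// lag only counts against readiness if LedgerFailOnLag is set (assumed semantics).
func ledgerReady(cfg HealthCheckConfig, readable, lagging bool) bool {
	if !cfg.LedgerEnabled {
		return true
	}
	if !readable {
		return false
	}
	return !(cfg.LedgerFailOnLag && lagging)
}

func main() {
	cfg := exampleConfig()
	fmt.Println(enabledCheckers(cfg))        // [gossip ledger]
	fmt.Println(gossipReady(cfg, true, 0))   // true: minPeers is 0
	fmt.Println(ledgerReady(cfg, true, true)) // true: failOnLag is false
}
```

Under these assumptions, a freshly started peer with no gossip connections and a lagging ledger is still reported ready with the example settings.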

Tests

  • Unit tests for readiness handler and individual checkers
  • Coverage for readiness vs liveness behavior
  • Basic validation for failure and success cases

Screenshot of the terminal after the feature is implemented:

image

Notes

This is intentionally scoped to readiness and safety.
Stricter checks (ledger lag enforcement, channel-level readiness, etc.) can be added incrementally later if needed.
Feedback welcome — happy to adjust based on review.

@remo-lab remo-lab requested review from a team as code owners December 25, 2025 13:16
changgesi and others added 2 commits December 26, 2025 10:01
   - Add /readyz endpoint for Kubernetes readiness probes
   - Add /healthz/detailed endpoint for component-level status
   - Implement GossipChecker, LedgerChecker, and OrdererChecker
   - Support OK/DEGRADED/UNAVAILABLE status semantics
   - DEGRADED components don't fail readiness (only UNAVAILABLE does)
   - Safe defaults: minPeers=0, failOnLag=false, orderer disabled
   - Add comprehensive unit and integration tests
   - Update configuration and documentation

Signed-off-by: remo-lab <[email protected]>
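
The status semantics in the commit message above (DEGRADED does not fail readiness, only UNAVAILABLE does) can be sketched in Go as follows. The type, constant, and function names are hypothetical, and reporting the worst component status as the overall status is an assumption of this sketch, not something the PR states.

```go
package main

import (
	"fmt"
	"net/http"
)

// Status is a normalized component state. The names are illustrative.
type Status string

const (
	StatusOK          Status = "OK"
	StatusDegraded    Status = "DEGRADED"
	StatusUnavailable Status = "UNAVAILABLE"
)

// overall returns the worst status among the components (assumed aggregation
// rule): UNAVAILABLE beats DEGRADED, which beats OK.
func overall(components map[string]Status) Status {
	worst := StatusOK
	for _, s := range components {
		switch s {
		case StatusUnavailable:
			return StatusUnavailable
		case StatusDegraded:
			worst = StatusDegraded
		}
	}
	return worst
}

// readyzCode maps component states to a readiness HTTP status code:
// only UNAVAILABLE fails readiness, DEGRADED still returns 200.
func readyzCode(components map[string]Status) int {
	if overall(components) == StatusUnavailable {
		return http.StatusServiceUnavailable
	}
	return http.StatusOK
}

func main() {
	degraded := map[string]Status{"gossip": StatusDegraded, "ledger": StatusOK}
	down := map[string]Status{"gossip": StatusDegraded, "ledger": StatusUnavailable}

	fmt.Println(overall(degraded), readyzCode(degraded)) // DEGRADED 200
	fmt.Println(overall(down), readyzCode(down))         // UNAVAILABLE 503
}
```

Treating DEGRADED as ready keeps a partially impaired peer in service while still surfacing the problem through the detailed status.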
@remo-lab remo-lab force-pushed the feat/operations-readiness-healthchecks branch from 122d2bb to f1ab5e2 Compare December 26, 2025 04:33
@remo-lab remo-lab closed this Dec 26, 2025
