JetStream recovery latency increases sub-linearly with backlog size for stalled durable consumers #7681
Abubakarsidiq01 started this conversation in General
Reply: When restarting the server, make sure to stop it with the USR2 signal so that it shuts down gracefully. Otherwise, on each restart it will read all blocks to repair some state, which is what I think is happening here (you would see a WRN in the logs on the stream).
Description
While testing JetStream recovery behavior with file-backed storage and a stalled durable pull consumer, I observed that recovery latency after a server restart increases as publish backlog grows, though the scaling appears sub-linear. Recovery correctness is preserved, with no redelivery and accurate backlog tracking, but restart latency becomes noticeably longer with larger backlogs.
This may be expected behavior, but I wanted to share measured results since it could affect restart predictability for streams with large backlogs.
Environment
Reproduction Steps
Recovery time was measured as the time from server restart to successful delivery of the first message using Measure-Command { nats consumer next … }.
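For anyone reproducing this outside PowerShell, the timing step could be sketched roughly as below. This is only an illustration, not the exact procedure used above: the helper name `time_until_success` and the retry loop are assumptions, and the stream and consumer names in the usage comment are hypothetical placeholders, since the original command's arguments are elided.

```python
import subprocess
import time


def time_until_success(cmd, timeout_s=60.0, retry_interval_s=0.1):
    """Re-run `cmd` until it exits with status 0 and return the elapsed seconds.

    Hypothetical helper: retrying is an assumption made here so the timer can
    be started right after the server restart, before it accepts connections.
    Raises TimeoutError if the command never succeeds within `timeout_s`.
    """
    start = time.perf_counter()
    while True:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        elapsed = time.perf_counter() - start
        if result.returncode == 0:
            return elapsed
        if elapsed >= timeout_s:
            raise TimeoutError(f"{cmd!r} did not succeed within {timeout_s} s")
        time.sleep(retry_interval_s)


# Usage (placeholder stream/consumer names, not taken from the original post):
#   seconds = time_until_success(["nats", "consumer", "next", "MY_STREAM", "MY_CONSUMER"])
#   print(f"first message delivered after {seconds:.2f} s")
```

Note that the elapsed time includes process start-up for each attempt, so it is best read as an upper bound on the server-side recovery time.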
Observed Results
With approximately 100 published messages, recovery took about 2.77 seconds, with 98 messages left unprocessed.
With approximately 5,000 published messages, recovery took about 4.80 seconds, with 4,998 messages left unprocessed.
In both cases, recovery completed successfully, no redelivered messages were observed, and unprocessed message counts matched the expected backlog. Recovery latency increased with backlog size, but not linearly.
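To make the "not linearly" observation concrete, here is a rough back-of-the-envelope sketch using only the two measurements above. The function name `summarize` and the choice of a two-point linear fit and a power-law exponent are my own framing; with just two data points these numbers are illustrative, not a statistically meaningful model.

```python
import math


def summarize(small, large):
    """Compare two (backlog_messages, recovery_seconds) measurements."""
    (n1, t1), (n2, t2) = small, large
    backlog_ratio = n2 / n1
    latency_ratio = t2 / t1
    # Two-point linear fit: t = fixed_cost + seconds_per_message * n
    seconds_per_message = (t2 - t1) / (n2 - n1)
    fixed_cost_seconds = t1 - seconds_per_message * n1
    # Exponent k in t ~ n**k; k < 1 means sub-linear growth.
    power_law_exponent = math.log(latency_ratio) / math.log(backlog_ratio)
    return {
        "backlog_ratio": backlog_ratio,
        "latency_ratio": latency_ratio,
        "seconds_per_message": seconds_per_message,
        "fixed_cost_seconds": fixed_cost_seconds,
        "power_law_exponent": power_law_exponent,
    }


if __name__ == "__main__":
    # (unprocessed backlog, recovery seconds) as reported above
    stats = summarize((98, 2.77), (4998, 4.80))
    for name, value in stats.items():
        print(f"{name}: {value:.6g}")
```

Under these assumptions the backlog grew 51x while latency grew roughly 1.7x, which works out to a marginal cost of about 0.4 ms per backlog message on top of a fixed cost of roughly 2.7 seconds.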
Expected Behavior
Recovery should remain correct and deterministic, which it does. Ideally, restart latency would remain bounded or at least predictable as backlog size grows.
Notes
These results suggest that recovery work scales with backlog size, though sub-linearly. I’m not sure whether this behavior is expected or already well understood, but I wanted to share the measured data in case it’s useful for understanding JetStream restart characteristics. I’m happy to dig further or to help validate improvements or documentation if that would be helpful.