JetStream recovery latency increases sub-linearly with backlog size for stalled durable consumers #7681

Abubakarsidiq01 · 2025-12-27T15:29:22Z

Description
While testing JetStream recovery behavior with file-backed storage and a stalled durable pull consumer, I observed that recovery latency after a server restart increases as publish backlog grows, though the scaling appears sub-linear. Recovery correctness is preserved, with no redelivery and accurate backlog tracking, but restart latency becomes noticeably longer with larger backlogs.
This may be expected behavior, but I wanted to share measured results since it could affect restart predictability for streams with large backlogs.

Environment

NATS Server: 2.14.0-dev (current main)
JetStream: enabled with file storage
OS: Windows (PowerShell), reproducible locally
Consumer type: durable pull consumer with explicit acknowledgments

Reproduction Steps

1. Start nats-server -js.
2. Create a file-backed stream and a durable pull consumer with explicit acks.
3. Publish messages while stalling the consumer by consuming one message and then stopping.
4. Terminate the server during active publish.
5. Restart the server.
6. Measure time to first delivery after restart.

Recovery time was measured as the time from server restart to successful delivery of the first message using Measure-Command { nats consumer next … }.
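For anyone who wants to repeat the measurement outside PowerShell, here is a minimal sketch of the same idea in Python: start a clock at restart and stop it at the first successful fetch. The helper names `cli_attempt` and `time_to_first_success` are hypothetical, and the timeout and polling interval are arbitrary assumptions. Unlike a single Measure-Command call, this version retries until the command succeeds, so it also covers the window in which the server is not yet accepting connections.

```python
import subprocess
import time
from typing import Callable, Sequence


def cli_attempt(cmd: Sequence[str]) -> Callable[[], bool]:
    """Wrap a CLI command as an attempt that succeeds when it exits with code 0."""
    def attempt() -> bool:
        try:
            done = subprocess.run(cmd, capture_output=True, timeout=30)
        except (OSError, subprocess.TimeoutExpired):
            # Binary missing, not executable, or the call hung: count as a failed attempt.
            return False
        return done.returncode == 0
    return attempt


def time_to_first_success(attempt: Callable[[], bool],
                          timeout: float = 60.0,
                          interval: float = 0.05) -> float:
    """Call `attempt` repeatedly and return the seconds elapsed until it first succeeds."""
    start = time.perf_counter()
    while True:
        if attempt():
            return time.perf_counter() - start
        if time.perf_counter() - start >= timeout:
            raise TimeoutError(f"no successful attempt within {timeout} s")
        time.sleep(interval)
```

A call might look like `time_to_first_success(cli_attempt(["nats", "consumer", "next", "<stream>", "<consumer>"]))`, issued immediately after restarting the server, where `<stream>` and `<consumer>` are placeholders for whatever names the test actually uses.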

Observed Results
With approximately 100 published messages, recovery took about 2.77 seconds and resulted in 98 unprocessed messages.
With approximately 5,000 published messages, recovery took about 4.80 seconds and resulted in 4,998 unprocessed messages.

In both cases, recovery completed successfully, no redelivered messages were observed, and unprocessed message counts matched the expected backlog. Recovery latency increased with backlog size, but not linearly: a roughly 50-fold larger backlog raised latency by a factor of about 1.7.
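To put a rough number on "not linearly", the two data points can be read in two ways: as a power law (latency proportional to backlog raised to some exponent) or as a fixed startup cost plus a per-message cost. The sketch below computes both from the figures above. The function names are hypothetical, the message counts are approximate, and two points cannot distinguish between the two models, so treat the output as an illustration rather than a fit.

```python
import math


def scaling_exponent(n1: float, t1: float, n2: float, t2: float) -> float:
    """Exponent k such that t2 / t1 == (n2 / n1) ** k (power-law reading)."""
    return math.log(t2 / t1) / math.log(n2 / n1)


def linear_fit(n1: float, t1: float, n2: float, t2: float) -> tuple:
    """Intercept and slope of the line through both points (fixed + per-message reading)."""
    slope = (t2 - t1) / (n2 - n1)
    intercept = t1 - slope * n1
    return intercept, slope


# (approximate published messages, measured recovery seconds) from the results above
small = (100, 2.77)
large = (5000, 4.80)

k = scaling_exponent(*small, *large)
base, per_msg = linear_fit(*small, *large)

print(f"power-law exponent: {k:.2f}")
print(f"fixed cost: {base:.2f} s, per-message cost: {per_msg * 1000:.2f} ms")
```

With these inputs the exponent comes out at roughly 0.14, and the linear reading gives about 2.7 s of fixed cost plus roughly 0.4 ms per message. Additional backlog sizes would be needed to tell which description is closer.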

Expected Behavior
Recovery should remain correct and deterministic, which it does. Ideally, restart latency would remain bounded or at least predictable as backlog size grows.

Notes
These results suggest that recovery work scales with backlog size, though efficiently. I’m not sure if this behavior is expected or already well understood, but I wanted to share measured data in case it’s useful for understanding JetStream restart characteristics. I’m happy to dig further or help validate improvements or documentation if helpful.

wallyqs · 2025-12-27T15:58:31Z

wallyqs
Dec 27, 2025
Collaborator

When restarting the server, make sure to stop it with the USR2 signal so that it shuts down gracefully. Otherwise, on each restart it will read all blocks to repair some state, which is what I think is happening here (you would see a WRN in the logs for the stream).

1 reply

Abubakarsidiq01 Dec 27, 2025
Author

Thanks for the clarification. That makes sense. I was intentionally using an abrupt stop to observe crash-recovery behavior rather than graceful shutdown, but it’s helpful to understand that the repair path involves reading all blocks, which explains the scaling I observed.
I can re-run the tests using USR2 to compare graceful vs crash restart behavior if that would be useful, or help document the expected differences between the two paths.
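For the graceful side of that comparison, one way to script the stop is sketched below. It assumes that `nats-server --signal ldm` is the CLI route to the same lame duck shutdown that USR2 triggers, which is my understanding of the server's signal handling. On Windows that route, as far as I know, goes through the service control manager, so it may only work when the server runs as a service rather than from a console. The function names are hypothetical.

```python
import subprocess
from typing import List, Optional


def lame_duck_command(pid: Optional[int] = None,
                      binary: str = "nats-server") -> List[str]:
    """Build the CLI call that asks a running server to enter lame duck mode.

    Assumption: `--signal ldm` (optionally `ldm=<pid>`) corresponds to USR2.
    """
    target = "ldm" if pid is None else f"ldm={pid}"
    return [binary, "--signal", target]


def request_graceful_stop(pid: Optional[int] = None) -> bool:
    """Run the command and report whether the CLI call itself succeeded."""
    try:
        done = subprocess.run(lame_duck_command(pid), capture_output=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        # Binary not found or the call hung: treat as not delivered.
        return False
    return done.returncode == 0
```

A `True` result only means the request was delivered. The server still needs time to finish shutting down before it is restarted, and the presence or absence of the WRN line in the stream logs is the more direct check of which recovery path ran.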
