Description
Note: Creating this ticket for tracking. Some existing solutions might already suffice, but ideally I want to find a path forward where we don't have to turn on any config or tune any knob to get resilience to basic server operations like restarts.
While benchmarking MSE Lite Mode's resilience to server restarts, I found that even with Lite Mode, where only the leaf stages run on the servers and everything else runs on the brokers, we were seeing a high query failure rate during restarts. The load used for the test was around 50 QPS with query latencies of ~50ms.
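For reference, the Lite Mode runs were driven via a per-query option. The option name and query below are illustrative assumptions on my part (I still need to double-check the exact option name for the build under test):

```sql
-- Assumed per-query toggle for MSE Lite Mode; the option name "liteMode" is a
-- guess and should be verified against the Pinot build under test.
-- The query shape is just an example of the kind of ~50ms aggregation used in the load test.
SET liteMode = true;
SELECT playerName, COUNT(*) AS cnt
FROM baseballStats
GROUP BY playerName
ORDER BY cnt DESC
LIMIT 10;
```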
You can see the attached graph below: the first wave of queries uses the regular MSE, and the second wave uses Lite Mode. While Lite Mode has a much lower failure rate than the regular MSE, the failure rate is still quite high, and ideally we want zero failures due to server restarts.
This experiment was run without any failure detector configured; see the "On Failure Detector" section below.

Errors Seen
All the errors I saw were connection-refused errors coming from the QueryDispatcher:
Query processing exceptions: {200=QueryExecutionError: Error dispatching query: 1048745684000196581 to server: some-host@{port1,port2}
  org.apache.pinot.query.service.dispatch.QueryDispatcher.execute(QueryDispatcher.java:312)
  org.apache.pinot.query.service.dispatch.QueryDispatcher.submit(QueryDispatcher.java:212)
  org.apache.pinot.query.service.dispatch.QueryDispatcher.submitAndReduce(QueryDispatcher.java:149)
  org.apache.pinot.broker.requesthandler.MultiStageBrokerRequestHandler.query(MultiStageBrokerRequestHandler.java:427)
UNAVAILABLE: io exception
  io.grpc.Status.asRuntimeException(Status.java:535)
  io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:487)
  io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:563)
  io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70)
finishConnect(..) failed: Connection refused: some-host/111111:12345
finishConnect(..) failed: Connection refused
  io.grpc.netty.shaded.io.netty.channel.unix.Errors.newConnectException0(Errors.java:155)
  io.grpc.netty.shaded.io.netty.channel.unix.Errors.handleConnectErrno(Errors.java:128)
  io.grpc.netty.shaded.io.netty.channel.unix.Socket.finishConnect(Socket.java:321)
  io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:710)
finishConnect(..) failed: Connection refused
  io.grpc.netty.shaded.io.netty.channel.unix.Errors.newConnectException0(Errors.java:155)
  io.grpc.netty.shaded.io.netty.channel.unix.Errors.handleConnectErrno(Errors.java:128)
  io.grpc.netty.shaded.io.netty.channel.unix.Socket.finishConnect(Socket.java:321)
  io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:710)
}
On Failure Detector
Recent work by @yashmayya to add the Failure Detector seems aimed specifically at this scenario (I suppose): exceptions from failed query dispatches are caught and fed into the routing manager, which temporarily excludes the servers failing dispatch, with configurable backoff and retry strategies.
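For illustration, the failure above surfaces as a gRPC UNAVAILABLE status on dispatch. A minimal sketch of the kind of handling a failure detector provides might look like the following; the class and method names are illustrative, not Pinot's actual API:

```java
import io.grpc.Status;
import io.grpc.StatusRuntimeException;

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Illustrative sketch (not Pinot's actual classes): classify a gRPC dispatch
 * failure and record the server as unhealthy so routing can temporarily
 * exclude it, which is the behaviour the failure detector is meant to provide.
 */
public class DispatchFailureSketch {
  // serverId -> timestamp (ms) when the server last failed a dispatch
  private final Map<String, Long> unhealthyServers = new ConcurrentHashMap<>();

  /** Returns true if the failure should temporarily exclude the server from routing. */
  public boolean onDispatchFailure(String serverId, Throwable t) {
    if (t instanceof StatusRuntimeException) {
      Status.Code code = ((StatusRuntimeException) t).getStatus().getCode();
      // "Connection refused" during a server restart surfaces as UNAVAILABLE here.
      if (code == Status.Code.UNAVAILABLE) {
        unhealthyServers.put(serverId, System.currentTimeMillis());
        return true;
      }
    }
    return false;
  }

  /** Routing would consult this before picking servers for the leaf stages. */
  public boolean isHealthy(String serverId, long quarantineMs) {
    Long failedAt = unhealthyServers.get(serverId);
    return failedAt == null || System.currentTimeMillis() - failedAt > quarantineMs;
  }
}
```

As I understand it, the actual Failure Detector goes further and actively re-probes an excluded server with backoff, so it returns to the routing table once it is reachable again.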
We'll retry the experiment with the connection-based failure detector soon, but given that this feature seems crucial for basic reliability guarantees from the MSE, I think we should make it the default as soon as possible.
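For reference, this is the broker config I plan to use to enable it; the exact property key is written from memory and should be double-checked against the release we're running:

```properties
# Switch from the default (no-op) failure detector to the connection-based one.
# Key name written from memory; verify against the Pinot release in use.
pinot.broker.failure.detector.type=CONNECTION
# There are additional knobs for the retry/backoff behaviour when re-probing an
# unhealthy server; I'll confirm their exact names before the next run.
```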
Related
I had assumed that I wouldn't need anything like the Failure Detector to get restart resilience. I'll look into the server start-up/shutdown process again to see whether we have misconfigured something on our end.
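In particular, I want to re-verify the graceful start-up/shutdown settings on the servers. The keys below are the ones I plan to check; they're written from memory and the exact names/defaults should be confirmed against our version:

```properties
# Server start-up: wait until the server is actually ready to serve queries
# before it reports healthy (key names from memory; verify before relying on them).
pinot.server.startup.enableServiceStatusCheck=true
pinot.server.startup.timeoutMs=600000

# Server shutdown: drain in-flight queries before stopping the query servers.
pinot.server.shutdown.enableQueryCheck=true
pinot.server.shutdown.noQueryThresholdMs=10000
pinot.server.shutdown.timeoutMs=600000
```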