Description
Note: Creating this ticket for tracking. Some existing solutions might already suffice, but ideally I want to find a path forward where we don't have to turn on any config or tune any knob to get resilience to basic server operations like restarts.
While benchmarking MSE Lite Mode's resilience to server restarts, I found that even with Lite Mode, where only the leaf stages run on the servers and everything else runs on the brokers, we were seeing a high query failure rate during restarts. The load used for the test was around 50 QPS with query latencies of ~50ms.
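For reference, the Lite Mode runs were driven via a per-query option. The option name and query below are illustrative assumptions on my part (I still need to double-check the exact option name for the build under test):

```sql
-- Assumed per-query toggle for MSE Lite Mode; the option name "liteMode" is a
-- guess and should be verified against the Pinot build under test.
-- The query shape is just an example of the kind of ~50ms aggregation used in the load test.
SET liteMode = true;
SELECT playerName, COUNT(*) AS cnt
FROM baseballStats
GROUP BY playerName
ORDER BY cnt DESC
LIMIT 10;
```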
You can see the attached graph below: the first wave of queries uses the regular MSE, and the second wave uses Lite Mode. While Lite Mode has a much lower failure rate than the regular MSE, the failure rate is still quite high, and ideally we want zero failures due to server restarts.
This experiment was run without any failure detector configured; see the "On Failure Detector" section below.

Errors Seen
All the errors I saw were connection-refused errors coming from the QueryDispatcher:
Query processing exceptions: {200=QueryExecutionError: Error dispatching query: 1048745684000196581 to server: some-host@{port1,port2}
  org.apache.pinot.query.service.dispatch.QueryDispatcher.execute(QueryDispatcher.java:312)
  org.apache.pinot.query.service.dispatch.QueryDispatcher.submit(QueryDispatcher.java:212)
  org.apache.pinot.query.service.dispatch.QueryDispatcher.submitAndReduce(QueryDispatcher.java:149)
  org.apache.pinot.broker.requesthandler.MultiStageBrokerRequestHandler.query(MultiStageBrokerRequestHandler.java:427)
UNAVAILABLE: io exception
  io.grpc.Status.asRuntimeException(Status.java:535)
  io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:487)
  io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:563)
  io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70)
finishConnect(..) failed: Connection refused: some-host/111111:12345
finishConnect(..) failed: Connection refused
  io.grpc.netty.shaded.io.netty.channel.unix.Errors.newConnectException0(Errors.java:155)
  io.grpc.netty.shaded.io.netty.channel.unix.Errors.handleConnectErrno(Errors.java:128)
  io.grpc.netty.shaded.io.netty.channel.unix.Socket.finishConnect(Socket.java:321)
  io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:710)
finishConnect(..) failed: Connection refused
  io.grpc.netty.shaded.io.netty.channel.unix.Errors.newConnectException0(Errors.java:155)
  io.grpc.netty.shaded.io.netty.channel.unix.Errors.handleConnectErrno(Errors.java:128)
  io.grpc.netty.shaded.io.netty.channel.unix.Socket.finishConnect(Socket.java:321)
  io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:710)
}
On Failure Detector
Recent work by @yashmayya to add the Failure Detector seems aimed specifically at this scenario (I suppose): exceptions from failed query dispatches are caught and fed into the routing manager, which temporarily excludes the servers failing dispatch, with configurable backoff and retry strategies.
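For illustration, the failure above surfaces as a gRPC UNAVAILABLE status on dispatch. A minimal sketch of the kind of handling a failure detector provides might look like the following; the class and method names are illustrative, not Pinot's actual API:

```java
import io.grpc.Status;
import io.grpc.StatusRuntimeException;

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Illustrative sketch (not Pinot's actual classes): classify a gRPC dispatch
 * failure and record the server as unhealthy so routing can temporarily
 * exclude it, which is the behaviour the failure detector is meant to provide.
 */
public class DispatchFailureSketch {
  // serverId -> timestamp (ms) when the server last failed a dispatch
  private final Map<String, Long> unhealthyServers = new ConcurrentHashMap<>();

  /** Returns true if the failure should temporarily exclude the server from routing. */
  public boolean onDispatchFailure(String serverId, Throwable t) {
    if (t instanceof StatusRuntimeException) {
      Status.Code code = ((StatusRuntimeException) t).getStatus().getCode();
      // "Connection refused" during a server restart surfaces as UNAVAILABLE here.
      if (code == Status.Code.UNAVAILABLE) {
        unhealthyServers.put(serverId, System.currentTimeMillis());
        return true;
      }
    }
    return false;
  }

  /** Routing would consult this before picking servers for the leaf stages. */
  public boolean isHealthy(String serverId, long quarantineMs) {
    Long failedAt = unhealthyServers.get(serverId);
    return failedAt == null || System.currentTimeMillis() - failedAt > quarantineMs;
  }
}
```

As I understand it, the actual Failure Detector goes further and actively re-probes an excluded server with backoff, so it returns to the routing table once it is reachable again.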
We'll retry the experiment with the connection-based failure detector soon, but given that this feature seems crucial for basic reliability guarantees from the MSE, I think we should make it the default as soon as possible.
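For reference, this is the broker config I plan to use to enable it; the exact property key is written from memory and should be double-checked against the release we're running:

```properties
# Switch from the default (no-op) failure detector to the connection-based one.
# Key name written from memory; verify against the Pinot release in use.
pinot.broker.failure.detector.type=CONNECTION
# There are additional knobs for the retry/backoff behaviour when re-probing an
# unhealthy server; I'll confirm their exact names before the next run.
```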
Related
I had assumed that I wouldn't need anything like the Failure Detector to get restart resilience. I'll look into the server start-up/shutdown process again to see whether we have misconfigured something on our end.
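In particular, I want to re-verify the graceful start-up/shutdown settings on the servers. The keys below are the ones I plan to check; they're written from memory and the exact names/defaults should be confirmed against our version:

```properties
# Server start-up: wait until the server is actually ready to serve queries
# before it reports healthy (key names from memory; verify before relying on them).
pinot.server.startup.enableServiceStatusCheck=true
pinot.server.startup.timeoutMs=600000

# Server shutdown: drain in-flight queries before stopping the query servers.
pinot.server.shutdown.enableQueryCheck=true
pinot.server.shutdown.noQueryThresholdMs=10000
pinot.server.shutdown.timeoutMs=600000
```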