[multistage] Handling Server Restarts Gracefully #15997

Open
@ankitsultana

Description

Note: I'm creating this ticket for tracking. Some existing solutions might already suffice, but ideally I want to find a path forward where we don't have to toggle any config or tune any knob to get resilience to basic server operations like restarts.

While benchmarking MSE Lite Mode's resilience to server restarts, I found that even with Lite Mode, where only the leaf stages run on the servers and everything else runs on the brokers, we were seeing a high query failure rate during restarts. The test load was around 50 QPS with query latencies of ~50ms.

You can see the attached graph below. The first wave of queries uses the regular MSE, and the second wave uses Lite Mode. While Lite Mode has a much lower failure rate than regular MSE, the rate is still quite high, and ideally we want zero failures due to server restarts.

This experiment was run without any failure detector configured; see the "On Failure Detector" section below.

[Graph: query failure rate during server restarts, regular MSE (first wave) vs. Lite Mode (second wave)]

Errors Seen

All the errors I saw were connection refused errors coming from the QueryDispatcher.

Query processing exceptions: {200=QueryExecutionError: Error dispatching query: 1048745684000196581 to server: some-host@{port1,port2}
  org.apache.pinot.query.service.dispatch.QueryDispatcher.execute(QueryDispatcher.java:312)
  org.apache.pinot.query.service.dispatch.QueryDispatcher.submit(QueryDispatcher.java:212)
  org.apache.pinot.query.service.dispatch.QueryDispatcher.submitAndReduce(QueryDispatcher.java:149)
  org.apache.pinot.broker.requesthandler.MultiStageBrokerRequestHandler.query(MultiStageBrokerRequestHandler.java:427)
UNAVAILABLE: io exception
  io.grpc.Status.asRuntimeException(Status.java:535)
  io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:487)
  io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:563)
  io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70)
finishConnect(..) failed: Connection refused: some-host/111111:12345
finishConnect(..) failed: Connection refused
  io.grpc.netty.shaded.io.netty.channel.unix.Errors.newConnectException0(Errors.java:155)
  io.grpc.netty.shaded.io.netty.channel.unix.Errors.handleConnectErrno(Errors.java:128)
  io.grpc.netty.shaded.io.netty.channel.unix.Socket.finishConnect(Socket.java:321)
  io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:710)
finishConnect(..) failed: Connection refused
  io.grpc.netty.shaded.io.netty.channel.unix.Errors.newConnectException0(Errors.java:155)
  io.grpc.netty.shaded.io.netty.channel.unix.Errors.handleConnectErrno(Errors.java:128)
  io.grpc.netty.shaded.io.netty.channel.unix.Socket.finishConnect(Socket.java:321)
  io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:710)
}
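For context on what a broker-side mitigation would key on: the failure surfaces as a gRPC StatusRuntimeException with code UNAVAILABLE wrapping Netty's "Connection refused". A minimal, hypothetical classifier for this case could look like the sketch below (the class and method names are mine, not Pinot's):

```java
import io.grpc.Status;

/**
 * Illustrative only: classifies a dispatch failure as "server unreachable",
 * e.g. the server is restarting and refusing connections.
 */
public final class DispatchErrorClassifier {
  private DispatchErrorClassifier() {
  }

  public static boolean isServerUnreachable(Throwable t) {
    // Status.fromThrowable walks the cause chain and extracts the gRPC status,
    // falling back to UNKNOWN if no status is attached.
    return Status.fromThrowable(t).getCode() == Status.Code.UNAVAILABLE;
  }
}
```

A failure detector or dispatch-level retry would key off exactly this signal instead of failing the whole query on the first refused connection.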

On Failure Detector

Recent work by @yashmayya to add a Failure Detector appears to be aimed specifically at this scenario: exceptions from failed query dispatches are caught and fed into the routing manager, which temporarily excludes the failing servers using configurable backoff and retry strategies. A rough sketch of that flow is below.
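The sketch uses hypothetical names; the real logic lives in the broker's failure detector and routing manager, so this only illustrates the exclude-and-retry idea, not the actual API:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Illustrative only: tracks servers whose query dispatch recently failed and
 * excludes them from routing until a backoff window has elapsed.
 */
public class UnhealthyServerTracker {
  private final Map<String, Instant> _retryAfter = new ConcurrentHashMap<>();
  private final Duration _backoff;

  public UnhealthyServerTracker(Duration backoff) {
    _backoff = backoff;
  }

  /** Called by the dispatcher when a dispatch to a server fails (e.g. connection refused). */
  public void markUnhealthy(String serverInstanceId) {
    _retryAfter.put(serverInstanceId, Instant.now().plus(_backoff));
  }

  /** Called by the failure detector once the server is reachable again. */
  public void markHealthy(String serverInstanceId) {
    _retryAfter.remove(serverInstanceId);
  }

  /** Routing consults this to drop temporarily unhealthy servers from candidate sets. */
  public boolean isRoutable(String serverInstanceId) {
    Instant retryAfter = _retryAfter.get(serverInstanceId);
    if (retryAfter == null) {
      return true;
    }
    if (Instant.now().isAfter(retryAfter)) {
      _retryAfter.remove(serverInstanceId);
      return true;
    }
    return false;
  }
}
```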

We'll retry the experiment with the connection-based failure detector soon, but given that this feature seems crucial for basic reliability guarantees in the MSE, I think we should make it the default as soon as possible.
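For reference, enabling it should roughly amount to a broker config override along the lines below. I'm quoting the key and value from memory of the failure detector work, so treat them as assumptions and verify against the constants in the codebase:

```java
import java.util.HashMap;
import java.util.Map;

public class FailureDetectorConfigExample {
  /** Illustrative broker config overrides; key and value assumed, not verified here. */
  public static Map<String, Object> brokerConfOverrides() {
    Map<String, Object> conf = new HashMap<>();
    // Assumed: switches the broker from the default no-op failure detector to the
    // connection-based one, so failed dispatches temporarily exclude a server from routing.
    conf.put("pinot.broker.failure.detector.type", "CONNECTION");
    return conf;
  }
}
```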

Related

I had assumed that I wouldn't need anything like the Failure Detector to get restart resilience. I'll look into the server start-up/shutdown process again to see if we have misconfigured something on our end.

Labels

multi-stage (Related to the multi-stage query engine)
