perf: async flushing & buffering + lock contention #173
Conversation
Rewrites the management of the aggregation buffer to enable asynchronous flushing of batched writes, while continuing to build up the next buffer of writes in parallel.

A behavioural change of note: previously there was (synchronous) back-pressure applied when a batch was being flushed, and there could only ever be at most 1 batch of writes outstanding. After this commit this back-pressure is removed as a side effect of async flushing, and no upper bound on the number of in-flight batch writes is enforced. Back-pressure can easily be placed back into rskafka if it is deemed the responsibility of rskafka, rather than the caller, to limit the number of outstanding `produce()` calls.

This commit also eliminates lock contention between the callers attempting to add to the batch and those waiting for their write result, and minimises the spurious task wakeups caused by all callers waiting on the linger timer, before (likely) moving on to wait on the async mutex, before finally being able to read their write result over the broadcast channel (also a mutex). Now one caller performs the linger dance, while all the others wait exclusively on their broadcast handle for the results, avoiding having to wait on the aggregation mutex.
Prior to this commit, all the readers (holders of BroadcastOnceReceiver) would contend with each other to read the value that has been broadcast. The BroadcastOnce type restricts usage to one writer, with many readers, making the case of >1 reader contending to read the result common by design. A RwLock eliminates the reader contention, preventing the readers from queuing up to read their value - it is now a pair of cheap atomic inc/sub to read the value, in parallel, for each reader.
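To make the shape of that change concrete, here is a minimal, hypothetical sketch of a write-once / read-many slot guarded by a reader-writer lock, in the spirit of the `BroadcastOnce` described above. The names, the use of `std::sync::RwLock`, and `tokio::sync::Notify` are illustrative assumptions rather than rskafka's actual implementation:

```rust
use std::sync::{Arc, RwLock};
use tokio::sync::Notify;

/// Illustrative write-once, read-many slot: one writer publishes a value,
/// many readers clone it under a shared read lock instead of queuing on an
/// exclusive mutex.
#[derive(Clone)]
struct BroadcastSlot<T: Clone> {
    slot: Arc<RwLock<Option<T>>>,
    notify: Arc<Notify>,
}

impl<T: Clone> BroadcastSlot<T> {
    fn new() -> Self {
        Self {
            slot: Arc::new(RwLock::new(None)),
            notify: Arc::new(Notify::new()),
        }
    }

    /// The single writer publishes the value and wakes all waiting readers.
    fn broadcast(&self, value: T) {
        *self.slot.write().expect("lock poisoned") = Some(value);
        self.notify.notify_waiters();
    }

    /// Readers wait for the value; once it is set, reads happen under a
    /// shared lock and can proceed in parallel.
    async fn receive(&self) -> T {
        loop {
            let notified = self.notify.notified();
            tokio::pin!(notified);
            // Register for wakeups *before* checking the slot so a broadcast
            // racing with this check is not missed.
            notified.as_mut().enable();
            if let Some(v) = self.slot.read().expect("lock poisoned").as_ref() {
                return v.clone();
            }
            notified.await;
        }
    }
}
```

In this shape the writer takes the only exclusive lock ever held; each reader thereafter takes a shared read lock, which is roughly the cheap atomic acquire/release per reader that the commit message refers to.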
I like the split between "aggregate" and "send batch" that you perform here. I have a couple of questions and remarks.
// Remove the batch, temporarily swapping it for a None until a new
// batch is built.
This is NOT panic-safe. If `batch.background_flush` (which calls `aggregator.flush`) panics, then the entire aggregator is poisoned forever.¹ I think we should allow the aggregator to panic and only error the affected inputs instead of all future actions.

Footnotes
1. This BTW is the reason why the stdlib mutexes implement poisoning.
That's right - I did consider panics here, but took the view that a panicking aggregator is an exceptional event (i.e. not a recoverable error) and should be propagated up.

We're effectively implementing poisoning here by having the next call `take()` a `None` and exploding ourselves - I think it's reasonable to expect panics to be exceptional / fatal, but open to ideas.
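As an illustration of the `take()`-a-`None` poisoning described above, here is a hypothetical sketch; the names and the `do_flush` closure do not match rskafka's internals:

```rust
/// Illustrative only: the aggregator lives in an Option and is removed
/// while a flush is in progress.
struct BatchBuilderSketch<A> {
    aggregator: Option<A>,
}

impl<A> BatchBuilderSketch<A> {
    fn flush(&mut self, do_flush: impl FnOnce(&mut A)) {
        // If a previous flush panicked, the aggregator was never put back,
        // so this take() acts like mutex poisoning and panics for all
        // subsequent callers.
        let mut agg = self
            .aggregator
            .take()
            .expect("aggregator poisoned by an earlier panic during flush");

        do_flush(&mut agg); // may panic, leaving `self.aggregator` as None

        // Only restored if the flush did not panic.
        self.aggregator = Some(agg);
    }
}
```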
// batch, verify this BatchBuilder is the batch it is intending to
// flush.
if let Some(token) = flusher_token {
    if token != self.flush_clock {
Can this only happen when there are manual flushes involved or are there other cases as well?
This block of code intentionally doesn't care why the tokens don't match!
But to answer properly, this can happen two ways:

1. A manual call to `BatchProducer::flush()` causes the buffer to be flushed early. The caller doing the linger either:
   - Observes the result of the flush early, before the linger expires, or
   - Sees its linger fire after the manual `flush()`, but before the result is broadcast. In this case, this conditional flush becomes a NOP.
2. At least two writes to an aggregator are made; the first caller starts lingering, and the second caller observes a full aggregator and flushes the batch before the linger expires. The subsequent flush attempt by the linger is a NOP (which may also take the same early flush path as above).
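A hypothetical sketch of that token check; `flush_clock` and `flusher_token` mirror the snippet above, everything else is illustrative:

```rust
/// Illustrative batch state: `flush_clock` advances every time the batch is
/// rotated, so a stale flusher can detect it has nothing left to do.
struct BatchState {
    flush_clock: u64,
    buffered: Vec<Vec<u8>>,
}

impl BatchState {
    /// `flusher_token` is the clock value captured when the linger started.
    /// If another path (manual flush, full aggregator) already rotated the
    /// batch, the tokens differ and this call is a NOP.
    fn try_flush(&mut self, flusher_token: Option<u64>) -> Option<Vec<Vec<u8>>> {
        if let Some(token) = flusher_token {
            if token != self.flush_clock {
                return None; // someone else already flushed this batch
            }
        }
        self.flush_clock = self.flush_clock.wrapping_add(1);
        Some(std::mem::take(&mut self.buffered))
    }
}
```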
    return FlushResult::Ok(Self::new(self.aggregator), None);
}

let handle = tokio::spawn({
Why do we need yet another task here? Isn't there already one that drives the linger and the flushing? Or is this so we also have cancellation safety if there are manual flushes?
This decouples the flushing I/O (in the task), and the instantiation of the new batch using the aggregator (in the return below, with the task handle). This lets all new `produce()` calls progress. The task handle is stored, and later used to wait for all batches to complete flushing when calling the manual `BatchProducer::flush()`.

We need the buffer to be ready immediately to begin accruing payloads, but do not wish to wait for the flush to complete. Unfortunately having the aggregator internal to the client makes this dance tricky.
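In rough terms, the decoupling looks something like the following sketch; the type and field names are illustrative rather than rskafka's actual ones:

```rust
use tokio::task::JoinHandle;

/// Illustrative batch builder: flushing hands the filled buffer to a
/// spawned task and immediately returns a fresh builder plus the handle,
/// so new produce() calls never wait on the flush I/O.
struct BatchBuilderSketch {
    payloads: Vec<Vec<u8>>,
}

impl BatchBuilderSketch {
    fn new() -> Self {
        Self { payloads: Vec::new() }
    }

    fn flush(self) -> (Self, JoinHandle<()>) {
        let batch = self.payloads;
        let handle = tokio::spawn(async move {
            // ... serialise `batch` and perform the Kafka write here ...
            let _ = batch;
        });
        // The caller installs the new (empty) builder right away and stores
        // the handle; a later manual flush() can await it to ensure all
        // in-flight batches have completed.
        (Self::new(), handle)
    }
}
```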
//! ```
use std::sync::Arc;
use std::time::Duration;
There is a flow-chart at the top of this file (GitHub doesn't let me comment there). It only shows the user-visible data flow, and that is all good. From an rskafka developer PoV, I would welcome it if the interplay between the producer, the aggregator, and the batches (which now contain their own background task) were illustrated somewhere.
I'll knock one up this afternoon and PR it - I think it'll take me a bit of digging around (you're right, the interplay is complex!)
When dropping the `ProducerInner` (likely caused by dropping the producer itself), terminate all background flush tasks to prevent them from outliving the producer.
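A minimal sketch of that drop behaviour, assuming the background flush tasks are tracked as `JoinHandle`s (the struct and field names are illustrative):

```rust
use tokio::task::JoinHandle;

/// Illustrative inner producer state that owns the background flush tasks.
struct ProducerInnerSketch {
    flush_tasks: Vec<JoinHandle<()>>,
}

impl Drop for ProducerInnerSketch {
    fn drop(&mut self) {
        // Abort any still-running flush tasks so they cannot outlive the
        // producer; abort() requests cancellation at the next await point.
        for task in self.flush_tasks.drain(..) {
            task.abort();
        }
    }
}
```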
This is a big one that I couldn't break up into anything smaller - sorry!
Happy to talk through the changes if needed, if not, grab a cuppa! 🍵 Large portions of the change are comments, actual LoC change is a small percentage.
The main aims of this PR are outlined below, in order of priority:
Make aggregating batches and writing already-full batches asynchronous w.r.t. each other
Currently this is not the case. A call to `produce()` with space in the aggregator goes like this:

Because the lock is held while the flush happens, all callers to `produce()` begin queuing up, even though their payloads would be buffered. This means they needlessly incur the latency of the protocol overhead/serialisation/network I/O, only to then have to try and stuff as many payloads as possible into the buffer before the next 5ms linger timer blocks aggregation again.

In an environment where writing to Kafka incurs an RTT of ~50ms, we spend ~5ms aggregating payloads before blocking for ~50ms, repeated indefinitely.
While most Kafka writes are quick, we observe long tail latencies.
After the changes, flushing happens asynchronously, without having to hold a lock over the aggregator state. This allows ~5ms of payloads to be batched up and dispatched, and immediately the next batch begins being built up, allowing near-continuous buffer building. This should reduce the overall latency of calls to `produce()` and increase the throughput of messages to Kafka.

I've been careful to construct the buffering / flushing such that there is no need for an async mutex - given that the mutex is in the buffering hot path, replacing it with a non-async `parking_lot` mutex (which has far lower overhead) reduces the latency of calls that simply add the payload to an in-memory buffer. This should slightly increase the number of messages we can buffer in the ~5ms linger period.

Flush result readers do not contend with buffering produce() callers

Previously all calls to `produce()` would lead the task to wait for either the linger to fire, or the flush to complete:

rskafka/src/client/producer.rs, lines 435 to 438 in 26af770
However, given the low default linger time (5ms) relative to the average RTT latency of a flush (especially under high load), it's probable that most, if not all, waiters will observe their linger timer expiring before the flush completes and the result is broadcast to the `result_slot`. I've observed many `Already flushed` logs in our prod, with no `Insufficient capacity in aggregator - flushing` to cause the flush in the first place, indicating this happens quite often.

Once the linger timer fires, all the callers proceed to queue up to acquire the aggregator mutex again, unnecessarily serialising readers when their result is available and adding further contention with callers of `produce()` hoping to buffer their payloads in parallel.

After this PR, only one caller will attempt to acquire the aggregator mutex to initiate a flush, and the rest of the callers can wait on their result broadcast only.
Removing this lock contention allows for more `produce()` calls to buffer their payloads within the 5ms linger, increasing the message sizes over the wire.

Reading the broadcast result has also been made cheaper by reducing contention on the `BroadcastOnce` primitive with many readers, switching an exclusive mutex to a many-readers lock - they now no longer have to queue up to sequentially read (+ clone) the results, which should marginally reduce the per-call latency of `produce()` too.

Useless Wakeups

In the above linger-expired-acquire-lock situation, the readers all `.await` on the timer, then `.await` on the aggregator mutex, only to then queue up on the mutex synchronising the `peek()` call. This causes many useless wakeups only to immediately hit the next `Pending` await point, as well as unnecessary thread blocking queuing on the non-async results mutex.

Reducing the number of awaits in this path should help reduce the (small) tokio overhead incurred when parking/waking tasks.
Behavioural Changes
None of these changes necessitate a breaking external API change 🎉
Back-pressure
Previously rskafka would allow one single buffer flush to be in-flight. This (very!) effectively applied back-pressure to clients, slowing them down.
After this change, there is no back-pressure and limiting the number of outstanding `produce()` calls is the responsibility of the caller. Apps can throw as much as they want at rskafka, and it'll do the best it can to keep up, with `produce()` latencies increasing as load increases.

It is, however, relatively simple to add back-pressure to rskafka using a semaphore if desired.
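For example, a caller-side limiter along those lines might look like the following sketch; `send_with_backpressure`, `send_one`, and the permit count are illustrative, and this is not something rskafka provides itself:

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

/// Allow at most N produce() calls to be in flight at once; additional
/// callers wait on the semaphore instead of piling up unbounded flushes.
async fn send_with_backpressure<Fut>(
    limiter: Arc<Semaphore>,
    send_one: impl FnOnce() -> Fut,
) -> Fut::Output
where
    Fut: std::future::Future,
{
    // Wait for a free slot; the permit is held for the duration of the call
    // and released on drop.
    let _permit = limiter.acquire().await.expect("semaphore closed");
    send_one().await
}

// e.g. created once alongside the producer:
// let limiter = Arc::new(Semaphore::new(64));
```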
Tokio

This requires the `is_finished()` method on the task `JoinHandle`, which was added in 1.19 and required a minimum tokio version bump.

perf: async batch flushing & lock contention (9e9d848)
perf: use RwLock for BroadcastOnce (1d5be60)