Describe the bug
This is a 95% overlap with the problem described in Issue 2129, where the chain of causes is:
- Blocks are not being cleared from Tempo's PVC because
- Blocks are not being flushed because
- One of the ingester's flush queues stops processing flush operations because
- A flush operation in the queue hangs indefinitely because
In the case of 2129, the ultimate source of the issue was "a racy deadlock in the azure-storage-blob-go SDK we use".
In our case, this appears to be a problem entirely within the Tempo ingester itself, whereby it's creating parquet blocks that take hundreds to thousands of times longer to process than normal. This prevents flushes, which means the blocks stick around forever when running in a Kubernetes context where the ingesters are part of a StatefulSet.
We've observed a few things with this bug:
- There appears to be a correlation between an external (i.e., user or Kubernetes) call to /flush or /shutdown and the creation of a bad block
- There may also be a correlation between a large number of TRACE_TOO_LARGE errors with respect to a block and the creation of a bad block
  - this may just be a red herring though
- No matter what precipitated these bad blocks, the result is always the same: the rows within the parquet file are sorted-interleaved
By sorted-interleaved, I mean what we see in the following image:
As the annotations note, the block contains two sets of interleaved rows, with each sorted by their respective trace IDs.
It essentially looks like two blocks were sorted individually and then folded into one another like two decks of cards.
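One way to spot these blocks programmatically (rather than eyeballing the parquet rows) would be to count how often the trace ID ordering resets from one row to the next: a globally sorted block has zero resets, while the interleaved blocks reset on roughly every other row. A minimal, hypothetical sketch of that check; the trace IDs would come from dumping the block's rows, which isn't shown here:

```go
package main

import (
	"bytes"
	"fmt"
)

// countSortResets returns how many times the trace ID ordering "resets",
// i.e. a row's trace ID compares lower than the previous row's. A block
// whose rows are globally sorted should return 0.
func countSortResets(traceIDs [][]byte) int {
	resets := 0
	for i := 1; i < len(traceIDs); i++ {
		if bytes.Compare(traceIDs[i], traceIDs[i-1]) < 0 {
			resets++
		}
	}
	return resets
}

func main() {
	// Hypothetical row order resembling the screenshot above: two
	// individually sorted runs folded into one another.
	rows := [][]byte{
		{0x01}, {0x0a},
		{0x02}, {0x0b},
		{0x03}, {0x0c},
	}
	fmt.Println("resets:", countSortResets(rows)) // prints "resets: 2"
}
```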
To Reproduce
One process that we've had good results with in reproducing this problem is to simply hit the /flush endpoint over and over, e.g.,
while true ; do curl localhost:3100/flush; done
or
while true ; do curl localhost:3100/flush; sleep 9; done
or
while true ; do curl localhost:3100/flush; sleep 29; done
Eventually some of these flushes will trigger the interleaving, and the resulting blocks take upwards of 30 minutes to process.
Expected behavior
All blocks go through the usual process of
- completing
- completed
- flushing
- flushed
- removed
Environment:
- Infrastructure: Kubernetes, distributed Tempo
- Deployment tool: helm
Additional Context
We've got a few working theories, but our primary one is related to a potential for simultaneous flush operations (a toy illustration follows the list):
- one caused by the natural operational cycle of an instance within the ingester
- one caused by some external entity shutting down the ingester (/flush or /shutdown)
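To make that theory concrete, here's a toy sketch (not Tempo code) of the failure shape we're hypothesizing: two writers each hold an individually sorted batch and append rows to a shared output with no final global sort, and the result is exactly the "two decks of cards" pattern shown above.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	// Two individually sorted batches, standing in for two overlapping
	// cut/flush operations. (Toy data; real rows would be trace IDs.)
	batchA := []int{1, 2, 3, 4}
	batchB := []int{10, 11, 12, 13}

	var (
		mu  sync.Mutex
		out []int
		wg  sync.WaitGroup
	)

	// Each writer appends its own (already sorted) rows one at a time. The
	// appends are individually safe, but nothing enforces a single global
	// sort pass over the combined output.
	appendRows := func(rows []int) {
		defer wg.Done()
		for _, r := range rows {
			mu.Lock()
			out = append(out, r)
			mu.Unlock()
			time.Sleep(time.Millisecond) // model each row taking some time, so the writers interleave
		}
	}

	wg.Add(2)
	go appendRows(batchA)
	go appendRows(batchB)
	wg.Wait()

	fmt.Println(out) // e.g. [1 10 2 11 3 12 4 13] -- two sorted runs, interleaved
}
```

The real code paths are obviously far more involved than this, but if two flush-adjacent operations ever sort and emit rows independently into the same block, this is the output you'd expect.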
Some notes from our attempts at understanding this (note that the code references are for 2.7.1 but we're still seeing this in 2.8.2):
When a trace comes in, it goes through a bit of processing; the important part is that it eventually hits the following (a simplified sketch of these structures follows the list):
- trace := i.getOrCreateTrace(id, tkn)
- https://github.com/grafana/tempo/blob/v2.7.1/modules/ingester/instance.go#L218
- trace is fetched from the (i *instance).traces[] map
- trace.Push (ctx, i.instanceID, traceBytes)
- https://github.com/grafana/tempo/blob/v2.7.1/modules/ingester/instance.go#L220
- Bytes are appended to the trace (these usually come from the different spans that compose a trace)
- t.batches = append(t.batches, trace)
- https://github.com/grafana/tempo/blob/v2.7.1/modules/ingester/trace.go#L36
- Under (t *liveTrace).batches, this means the incoming traceBytes effectively end up as the entries of (i *instance).traces[].batches[]
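For reference, here's a stripped-down sketch of those in-memory structures as we understand them; the type and field names approximate the linked instance.go / trace.go and are not copies of the real code:

```go
package ingester

import "sync"

// Approximate view of the in-memory layout described above; see the linked
// instance.go and trace.go for the real definitions.
type liveTrace struct {
	traceID []byte
	batches [][]byte // appended to on every Push(); one entry per incoming batch of spans
}

type instance struct {
	tracesMtx sync.Mutex
	traces    map[uint32]*liveTrace // keyed by a token derived from the trace ID
}

// Roughly what getOrCreateTrace + Push amount to: look the trace up by its
// token, create it if missing, then append the serialized bytes. Everything
// stays memory-only until CutCompleteTraces moves it toward a head block.
func (i *instance) push(id []byte, tkn uint32, traceBytes []byte) {
	i.tracesMtx.Lock()
	defer i.tracesMtx.Unlock()

	t, ok := i.traces[tkn]
	if !ok {
		t = &liveTrace{traceID: id}
		i.traces[tkn] = t
	}
	t.batches = append(t.batches, traceBytes)
}
```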
At this point the traces are only in memory. The path for these traces to be transferred to disk is through the following (a rough sketch of the entry points follows this list):
- (i *instance) CutCompleteTraces(cutoff time.Duration, immediate bool)
- https://github.com/grafana/tempo/blob/v2.7.1/modules/ingester/instance.go#L238
- called in (i *Ingester) Flush()
- https://github.com/grafana/tempo/blob/v2.7.1/modules/ingester/flush.go#L82
- this loops through all instances (related to tenant head blocks) and cuts complete traces as part of a shutdown
- Called in (i *Ingester) sweepInstance(instance *instance, immediate bool)
- https://github.com/grafana/tempo/blob/v2.7.1/modules/ingester/flush.go#L163
- sweepInstance is called in different situations:
  - under (i *Ingester) FlushHandler(w http.ResponseWriter, r *http.Request), triggered by the /flush endpoint
  - and by sweepAllInstances, which is called:
    - by the flush handler (same as above) if no instance is provided in the HTTP call (flush all)
    - by flushRemaining, which is called by the shutdown handler (/shutdown)
    - by (i *Ingester) stopping, a slightly different but similar condition; it is invoked by the parent service
    - finally by the (i *Ingester) loop, which is the main/normal operation. I believe that under normal conditions, with no call to /flush, there should be only one, non-concurrent call to the function.
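Putting those entry points together, our mental model looks roughly like the sketch below. The names mirror Tempo's functions but the bodies are placeholders for discussion only; the point is that the periodic loop, the /flush handler, and the shutdown paths can all reach the cut/sweep path at around the same time.

```go
package ingester

import (
	"net/http"
	"time"
)

// Placeholder types/bodies only; names mirror the Tempo code discussed above.
type Ingester struct {
	cfg struct{ FlushCheckPeriod time.Duration }
}

func (i *Ingester) sweepAllInstances(immediate bool) {
	// ...calls sweepInstance -> CutCompleteTraces for every tenant instance...
}

// 1. Periodic loop: normal operation, fires every FlushCheckPeriod (10s for us).
func (i *Ingester) loop(stop <-chan struct{}) {
	ticker := time.NewTicker(i.cfg.FlushCheckPeriod)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			i.sweepAllInstances(false)
		case <-stop:
			return
		}
	}
}

// 2. On command: the /flush handler sweeps immediately, regardless of where
//    the periodic loop happens to be.
func (i *Ingester) FlushHandler(w http.ResponseWriter, r *http.Request) {
	i.sweepAllInstances(true)
	w.WriteHeader(http.StatusNoContent)
}

// 3. On shutdown: /shutdown (flushRemaining) and the parent service's
//    stopping hook both cut and flush whatever is left.
func (i *Ingester) stopping() {
	i.sweepAllInstances(true)
}
```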
In summary, we write from memory to disk under the following conditions:
- Periodic loop, every cfg.FlushCheckPeriod (10s in our case)
- On command, if the ingester receives a /flush http request
- On shutdown, potentially through both /shutdown and the parent process, from what I can tell via the context or some other channel.