Describe the bug
This is a 95% overlap with the problem described in Issue 2129, where the chain of causes is:
- Blocks are not being cleared from Tempo's PVC because
- Blocks are not being flushed because
- One of the ingester's flush queues stops processing flush operations because
- A flush operation in the queue hangs indefinitely because
In the case of 2129, the ultimate source of the issue was "a racy deadlock in the azure-storage-blob-go SDK we use".
In our case, this appears to be a problem entirely within the Tempo ingester itself, whereby it's creating parquet blocks that take hundreds to thousands of times longer to process than normal. This prevents flushes, which means the blocks stick around forever when running in a Kubernetes context where the ingesters are part of a StatefulSet.
We've observed a few things with this bug:
- There appears to be a correlation between an external (i.e., user or Kubernetes) call to /flush or /shutdown and the creation of a bad block
- There may also be a correlation between a large number of TRACE_TOO_LARGE errors with respect to a block and the creation of a bad block
  - this may just be a red herring though
- No matter what precipitated these bad blocks, the result is always the same: the rows within the parquet file are sorted-interleaved
By sorted-interleaved, I mean what we see in the following image:
As the annotations note, the block contains two sets of interleaved rows, with each sorted by their respective trace IDs.
It essentially looks like two blocks were sorted individually and then folded into one another like two decks of cards.
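One way to spot these blocks programmatically (rather than eyeballing the parquet rows) would be to count how often the trace ID ordering resets from one row to the next: a globally sorted block has zero resets, while the interleaved blocks reset on roughly every other row. A minimal, hypothetical sketch of that check; the trace IDs would come from dumping the block's rows, which isn't shown here:

```go
package main

import (
	"bytes"
	"fmt"
)

// countSortResets returns how many times the trace ID ordering "resets",
// i.e. a row's trace ID compares lower than the previous row's. A block
// whose rows are globally sorted should return 0.
func countSortResets(traceIDs [][]byte) int {
	resets := 0
	for i := 1; i < len(traceIDs); i++ {
		if bytes.Compare(traceIDs[i], traceIDs[i-1]) < 0 {
			resets++
		}
	}
	return resets
}

func main() {
	// Hypothetical row order resembling the screenshot above: two
	// individually sorted runs folded into one another.
	rows := [][]byte{
		{0x01}, {0x0a},
		{0x02}, {0x0b},
		{0x03}, {0x0c},
	}
	fmt.Println("resets:", countSortResets(rows)) // prints "resets: 2"
}
```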
To Reproduce
One process that we've had good results with in reproducing this problem is to simply hit the /flush endpoint over and over, e.g.,
while true ; do curl localhost:3100/flush; done
or
while true ; do curl localhost:3100/flush; sleep 9; done
or
while true ; do curl localhost:3100/flush; sleep 29; done
Eventually some of these flushes will trigger the interleaving, and the resulting blocks take upwards of 30 minutes to process.
Expected behavior
All blocks go through the usual process of
- completing
- completed
- flushing
- flushed
- removed
Environment:
- Infrastructure: Kubernetes, distributed Tempo
- Deployment tool: helm
Additional Context
We've got a few working theories, but our primary one is related to a potential for simultaneous flush operations (a toy illustration follows the list):
- one caused by the natural operational cycle of an instance within the ingester
- one caused by some external entity shutting down the ingester (/flush or /shutdown)
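To make that theory concrete, here's a toy sketch (not Tempo code) of the failure shape we're hypothesizing: two writers each hold an individually sorted batch and append rows to a shared output with no final global sort, and the result is exactly the "two decks of cards" pattern shown above.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	// Two individually sorted batches, standing in for two overlapping
	// cut/flush operations. (Toy data; real rows would be trace IDs.)
	batchA := []int{1, 2, 3, 4}
	batchB := []int{10, 11, 12, 13}

	var (
		mu  sync.Mutex
		out []int
		wg  sync.WaitGroup
	)

	// Each writer appends its own (already sorted) rows one at a time. The
	// appends are individually safe, but nothing enforces a single global
	// sort pass over the combined output.
	appendRows := func(rows []int) {
		defer wg.Done()
		for _, r := range rows {
			mu.Lock()
			out = append(out, r)
			mu.Unlock()
			time.Sleep(time.Millisecond) // model each row taking some time, so the writers interleave
		}
	}

	wg.Add(2)
	go appendRows(batchA)
	go appendRows(batchB)
	wg.Wait()

	fmt.Println(out) // e.g. [1 10 2 11 3 12 4 13] -- two sorted runs, interleaved
}
```

The real code paths are obviously far more involved than this, but if two flush-adjacent operations ever sort and emit rows independently into the same block, this is the output you'd expect.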
Some notes from our attempts at understanding this (note that the code references are for 2.7.1 but we're still seeing this in 2.8.2):
When a trace comes in, it goes through a bit of processing; the important part is that it eventually hits the following (a simplified sketch of these structures follows the list):
- trace := i.getOrCreateTrace(id, tkn)
- https://github.com/grafana/tempo/blob/v2.7.1/modules/ingester/instance.go#L218
- trace is fetched from the (i *instance).traces[] map
- trace.Push (ctx, i.instanceID, traceBytes)
- https://github.com/grafana/tempo/blob/v2.7.1/modules/ingester/instance.go#L220
- Bytes are appended to the trace (these usually come from the different spans that compose a trace)
- t.batches = append(t.batches, trace)
- https://github.com/grafana/tempo/blob/v2.7.1/modules/ingester/trace.go#L36
- Under (t *liveTrace).batches, this means the incoming traceBytes effectively end up as the entries of (i *instance).traces[].batches[]
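For reference, here's a stripped-down sketch of those in-memory structures as we understand them; the type and field names approximate the linked instance.go / trace.go and are not copies of the real code:

```go
package ingester

import "sync"

// Approximate view of the in-memory layout described above; see the linked
// instance.go and trace.go for the real definitions.
type liveTrace struct {
	traceID []byte
	batches [][]byte // appended to on every Push(); one entry per incoming batch of spans
}

type instance struct {
	tracesMtx sync.Mutex
	traces    map[uint32]*liveTrace // keyed by a token derived from the trace ID
}

// Roughly what getOrCreateTrace + Push amount to: look the trace up by its
// token, create it if missing, then append the serialized bytes. Everything
// stays memory-only until CutCompleteTraces moves it toward a head block.
func (i *instance) push(id []byte, tkn uint32, traceBytes []byte) {
	i.tracesMtx.Lock()
	defer i.tracesMtx.Unlock()

	t, ok := i.traces[tkn]
	if !ok {
		t = &liveTrace{traceID: id}
		i.traces[tkn] = t
	}
	t.batches = append(t.batches, traceBytes)
}
```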
At this point the traces are only in memory. The path for these traces to be transferred to disk is through the following (a rough sketch of the entry points follows this list):
- (i *instance) CutCompleteTraces(cutoff time.Duration, immediate bool)
- https://github.com/grafana/tempo/blob/v2.7.1/modules/ingester/instance.go#L238
- called in (i *Ingester) Flush()
- https://github.com/grafana/tempo/blob/v2.7.1/modules/ingester/flush.go#L82
- this loops through all instances (related to tenant head blocks) and cuts complete traces as part of a shutdown
- Called in (i *Ingester) sweepInstance(instance *instance, immediate bool)
- https://github.com/grafana/tempo/blob/v2.7.1/modules/ingester/flush.go#L163
- sweepInstance is called in different situations:
  - under (i *Ingester) FlushHandler(w http.ResponseWriter, r *http.Request), triggered by the /flush endpoint
  - and by sweepAllInstances, which is called:
    - by the flush handler (same as above) if no instance is provided in the HTTP call (flush all)
    - by flushRemaining, which is called by the shutdown handler (/shutdown)
    - by (i *Ingester) stopping, a slightly different but similar condition; it is invoked by the parent service
    - finally by the (i *Ingester) loop, which is the main/normal operation. I believe that under normal conditions, with no call to /flush, there should be only one, non-concurrent call to the function.
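Putting those entry points together, our mental model looks roughly like the sketch below. The names mirror Tempo's functions but the bodies are placeholders for discussion only; the point is that the periodic loop, the /flush handler, and the shutdown paths can all reach the cut/sweep path at around the same time.

```go
package ingester

import (
	"net/http"
	"time"
)

// Placeholder types/bodies only; names mirror the Tempo code discussed above.
type Ingester struct {
	cfg struct{ FlushCheckPeriod time.Duration }
}

func (i *Ingester) sweepAllInstances(immediate bool) {
	// ...calls sweepInstance -> CutCompleteTraces for every tenant instance...
}

// 1. Periodic loop: normal operation, fires every FlushCheckPeriod (10s for us).
func (i *Ingester) loop(stop <-chan struct{}) {
	ticker := time.NewTicker(i.cfg.FlushCheckPeriod)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			i.sweepAllInstances(false)
		case <-stop:
			return
		}
	}
}

// 2. On command: the /flush handler sweeps immediately, regardless of where
//    the periodic loop happens to be.
func (i *Ingester) FlushHandler(w http.ResponseWriter, r *http.Request) {
	i.sweepAllInstances(true)
	w.WriteHeader(http.StatusNoContent)
}

// 3. On shutdown: /shutdown (flushRemaining) and the parent service's
//    stopping hook both cut and flush whatever is left.
func (i *Ingester) stopping() {
	i.sweepAllInstances(true)
}
```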
In summary, we write from memory to disk under the following conditions:
- Periodic loop, every cfg.FlushCheckPeriod (10s in our case)
- On command, if the ingester receives a /flush http request
- On shutdown, potentially through both /shutdown and the parent process, from what I can tell via the context or some other channel.