
Flushes triggered by /flush and /shutdown interfere with natural flush_check_period flushes, causing interleaved parquet blocks and broken ingesters #5942


Describe the bug

This is about a 95% overlap with the problem described in Issue 2129, where the chain of failure is

  1. Blocks are not being cleared from Tempo's PVC because
  2. Blocks are not being flushed because
  3. One of the ingester's flush queues stops processing flush operations because
  4. A flush operation in the queue hangs indefinitely because

In the case of 2129, the ultimate source of the issue was

a racy deadlock in the azure-storage-blob-go SDK we use

In our case, this appears to be a problem entirely within the Tempo ingester itself, whereby it creates parquet blocks that take hundreds to thousands of times longer to process than normal. This prevents flushes, which means the blocks stick around forever in a Kubernetes context where the ingesters run in a StatefulSet.

We've observed a few things with this bug:

  1. There appears to be a correlation between an external (i.e., user or kubernetes) call to /flush or /shutdown and the creation of a bad block
  2. There may also be a correlation between a large number of TRACE_TOO_LARGE errors with respect to a block and the creation of a bad block
    • this may just be a red herring though
  3. No matter what precipitated these bad blocks, the result is always the same: the rows within the parquet file are sorted-interleaved

By sorted-interleaved, I mean what we see in the following image:

[Image: annotated view of the bad block's rows]

As the annotations note, the block contains two sets of interleaved rows, each set sorted by its trace IDs.
It essentially looks like two blocks were sorted individually and then folded into one another like two decks of cards.
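
To make the pattern concrete, here's a tiny, purely illustrative Go sketch (not Tempo code) of what we mean by "sorted-interleaved": two runs that are each sorted by trace ID, riffled together into a sequence that is no longer globally sorted.

```go
// Illustrative only: two independently sorted runs of (made-up) trace IDs,
// riffled together like two halves of a deck of cards.
package main

import (
	"fmt"
	"sort"
)

func main() {
	runA := []string{"0a01", "0a07", "0a1f", "0a33"} // sorted on its own
	runB := []string{"0a02", "0a05", "0a21", "0a40"} // sorted on its own

	// Riffle the two runs, as if two blocks were sorted individually and
	// then folded into one parquet file.
	var rows []string
	for i := range runA {
		rows = append(rows, runA[i], runB[i])
	}

	fmt.Println("rows:", rows)
	fmt.Println("globally sorted:", sort.StringsAreSorted(rows)) // false
}
```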

To Reproduce

One process that has reliably reproduced this problem for us is to simply hit the /flush endpoint over and over, e.g.,

while true ; do curl localhost:3100/flush; done

or

while true ; do curl localhost:3100/flush; sleep 9; done

or

while true ; do curl localhost:3100/flush; sleep 29; done

Eventually some of these flushes will trigger the interleaving, and the resulting blocks take upwards of 30 minutes to process.

Expected behavior

All blocks go through the usual process (sketched after this list) of

  1. completing
  2. completed
  3. flushing
  4. flushed
  5. removed
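
For reference, here's a hypothetical enumeration of those stages in Go; the names and comments mirror the list above, not necessarily Tempo's internal constants.

```go
// Hypothetical lifecycle enumeration mirroring the list above.
package main

import "fmt"

type blockState int

const (
	completing blockState = iota // head block is being cut into a complete block
	completed                    // complete block sits on the ingester's disk
	flushing                     // block is being written to backend storage
	flushed                      // backend write finished
	removed                      // local copy deleted, freeing space on the PVC
)

func (s blockState) String() string {
	return [...]string{"completing", "completed", "flushing", "flushed", "removed"}[s]
}

func main() {
	for s := completing; s <= removed; s++ {
		fmt.Println(s)
	}
}
```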

Environment:

  • Infrastructure: Kubernetes, distributed Tempo
  • Deployment tool: helm

Additional Context

We've got a few working theories, but our primary one relates to the potential for simultaneous flush operations (a sketch of this follows the list below):

  • one caused by the natural operational cycle of an instance within the ingester
  • one caused by some external entity shutting down the ingester (/flush or /shutdown)
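
Purely as an illustration of that theory (not a claim about where this happens in Tempo's code), here's how two flush passes that each sort only their own batch, but append rows to the same output concurrently, would produce exactly the pattern in the screenshot: two individually sorted runs interleaved like riffled cards.

```go
// Illustrative sketch of the hypothesized effect of concurrent flush passes.
package main

import (
	"fmt"
	"sort"
	"sync"
	"time"
)

func appendSorted(out *[]string, mu *sync.Mutex, batch []string, wg *sync.WaitGroup) {
	defer wg.Done()
	sort.Strings(batch) // each pass sorts only what it holds
	for _, id := range batch {
		mu.Lock()
		*out = append(*out, id) // rows from both passes land in the same output
		mu.Unlock()
		time.Sleep(time.Millisecond) // give the other pass a chance to interleave
	}
}

func main() {
	batchA := []string{"0a33", "0a01", "0a1f", "0a07"} // seen by the periodic flush
	batchB := []string{"0a40", "0a02", "0a21", "0a05"} // seen by the /flush-triggered flush

	var (
		rows []string
		mu   sync.Mutex
		wg   sync.WaitGroup
	)
	wg.Add(2)
	go appendSorted(&rows, &mu, batchA, &wg)
	go appendSorted(&rows, &mu, batchB, &wg)
	wg.Wait()

	fmt.Println(rows) // two individually sorted runs; interleaving depends on scheduling
}
```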

Some notes from our attempts at understanding this (note that the code references are for 2.7.1 but we're still seeing this in 2.8.2):

When a trace comes in, it goes through a bit of processing; the important part is that it eventually hits

At this point the traces are only in memory. The path for these files to be transferred to disk is through:

In summary, we write from memory to disk under the following conditions (sketched after the list):

  1. Periodic loop, every cfg.FlushCheckPeriod (10s in our case)
  2. On command, if the ingester receives a /flush http request
  3. On shutdown, potentially through both /shutdown and from the parent process; from what I can tell this happens via the context or some other channel
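
Here's a schematic of those three triggers. The names, structure, and handler wiring are illustrative rather than Tempo's internals; the point is that nothing in this toy version serializes the callers, so an on-command or shutdown flush can overlap a periodic one.

```go
// Schematic of the three write-to-disk triggers listed above (illustrative only).
package main

import (
	"fmt"
	"net/http"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
	"time"
)

var inFlight int32 // how many flushes are running right now

func cutToDisk(reason string) {
	if n := atomic.AddInt32(&inFlight, 1); n > 1 {
		fmt.Printf("WARNING: %q flush overlaps %d other flush(es)\n", reason, n-1)
	}
	defer atomic.AddInt32(&inFlight, -1)
	time.Sleep(2 * time.Second) // stand-in for cutting/completing a block
	fmt.Printf("flush done (%s)\n", reason)
}

func main() {
	// 1. Periodic loop, every FlushCheckPeriod (10s in this sketch).
	go func() {
		for range time.Tick(10 * time.Second) {
			cutToDisk("periodic")
		}
	}()

	// 2. On command, via an HTTP /flush request.
	http.HandleFunc("/flush", func(w http.ResponseWriter, r *http.Request) {
		go cutToDisk("http /flush")
		w.WriteHeader(http.StatusNoContent)
	})
	go func() { _ = http.ListenAndServe(":3100", nil) }()

	// 3. On shutdown: a signal from the parent process (or /shutdown) also
	//    ends in a final flush through the same path.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM, os.Interrupt)
	<-sigs
	cutToDisk("shutdown")
}
```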
