Description
Flyte & Flytekit version
Flyte v1.15.X
Describe the bug
Alerts fired because the P99 workflow acceptance latency exceeded 30s. We use these alerts to detect problems with Flyte Propeller. The reported latency was extremely high and did not match any recent executions.

While investigating, I found a workflow that had completed several hours earlier and was somehow re-evaluated by Flyte Propeller. Further investigation showed that Flyte Scheduler had triggered the workflow, since it was on a scheduled launch plan.
It appears that the first workflow execution failed, and this was recorded in Propeller's terminated-workflow tracking store. A couple of hours later, the second workflow execution was triggered. This created a new CRD but did not create a new DB model, because the previous execution already existed. The second workflow CRD was ignored because Flyte Propeller still tracked it as terminated. Only after the statically configured LRU cache evicted the entry was the second CRD finally processed, which produced a very large acceptance latency.
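To make the suspected sequence concrete, here is a minimal sketch of the behavior described above. It is not Propeller's actual code: `TerminatedTracker`, `should_process`, the capacity of 2, and the execution names are all hypothetical, and it assumes the second CRD is looked up under the same key as the first execution.

```python
from collections import OrderedDict


class TerminatedTracker:
    """Hypothetical fixed-size LRU set of workflows that reached a terminal state."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()

    def mark_terminated(self, name):
        self._entries[name] = None
        self._entries.move_to_end(name)
        while len(self._entries) > self.capacity:
            # Evict the least recently used entry once the cache is full.
            self._entries.popitem(last=False)

    def is_terminated(self, name):
        return name in self._entries


def should_process(tracker, crd_name):
    """A CRD is skipped for as long as its name is tracked as terminated."""
    return not tracker.is_terminated(crd_name)


tracker = TerminatedTracker(capacity=2)

# First scheduled execution fails and is tracked as terminated.
tracker.mark_terminated("scheduled-exec")

# Second CRD arrives under the same name and is ignored.
skipped = not should_process(tracker, "scheduled-exec")

# Unrelated workflows terminate and push the old entry out of the cache.
tracker.mark_terminated("other-exec-1")
tracker.mark_terminated("other-exec-2")

# Only now is the second CRD picked up, long after it was created.
processed_after_eviction = should_process(tracker, "scheduled-exec")
```

In this sketch the delay before the second CRD is processed depends only on how quickly other workflows fill the cache, which would be consistent with the latency not matching any recent execution.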
Expected behavior
I would expect that a duplicate execution does not create a duplicate workflow CRD.
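One way to picture the expected behavior is a guard that checks for an existing execution record before creating the CRD. This is only a sketch under stated assumptions: `launch_execution`, the dict standing in for the database, and the list standing in for the cluster's CRDs are hypothetical and do not reflect how Flyte Admin or Flyte Scheduler are actually structured.

```python
def launch_execution(db, cluster_crds, name):
    """Create the CRD only if no execution record exists yet (hypothetical guard).

    Returns True if a new execution was created, False if it was a duplicate.
    """
    if name in db:
        # Duplicate trigger: the execution is already known, so no second CRD.
        return False
    db[name] = "QUEUED"
    cluster_crds.append(name)
    return True


db = {}
cluster_crds = []

first = launch_execution(db, cluster_crds, "scheduled-exec")
second = launch_execution(db, cluster_crds, "scheduled-exec")
```

With this ordering, the second trigger becomes a no-op instead of leaving a CRD for Propeller to pick up later.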
Additional context to reproduce
No response
Screenshots
No response
Are you sure this issue hasn't been raised already?
- Yes
Have you read the Code of Conduct?
- Yes