Description
Apache Airflow version
2.1.4
What happened
SLAMiss is firing notifications (a Slack notification, as defined by the sla_miss_callback), but every time it calls the sla_miss_callback it sends notifications for the same set of tasks. It seems as though the notification_sent flag in the database is never set to true. This happens when a large number of SLA misses need to be processed at the same time.
The use case is backfilling a frequently running DAG whose start date is about one month in the past. This leaves around 14k SLA misses to be processed all at the same time.
What you expected to happen
We expected sla_miss_callback to be called and, by the end of managing the SLAs, the processed SLA misses to no longer need processing. Each SLA miss should be handled once and not picked up again the next time SLAs are managed.
We found the root cause of this issue. The DAGFileProcessor times out before the transaction that sets notification_sent = True for the SLA misses is committed to the database. This is a somewhat unusual in-between case: the timeout is long enough for the sla_miss_callback to run, but not long enough for all of the flags to be updated in the database. As a result, the same SLA misses are processed again every time we manage SLAs.
The offending line in the code base is the commit call at the end of manage_slas. When it tries to commit the changes to all 14k records, the DAGFileProcessor times out in the middle of that call.
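To illustrate why a single commit at the end is fragile here, below is a minimal sketch using Python's built-in sqlite3 module rather than Airflow itself. The sla_miss table is a simplified stand-in, and make_db, mark_sent, pending and SimulatedTimeout are hypothetical names invented for this example. The "timeout" is simulated by raising an exception after a fixed number of updates. Committing in batches is shown only as one possible way to keep partial progress, not as the fix adopted by the project.

```python
import sqlite3


class SimulatedTimeout(Exception):
    """Stands in for the file processor being stopped mid-run (assumption)."""


def make_db(n_rows):
    # Simplified stand-in for the SLA miss table: one flag per row.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sla_miss ("
        "id INTEGER PRIMARY KEY, "
        "notification_sent INTEGER NOT NULL DEFAULT 0)"
    )
    conn.executemany(
        "INSERT INTO sla_miss (id) VALUES (?)", [(i,) for i in range(n_rows)]
    )
    conn.commit()
    return conn


def mark_sent(conn, batch_size, die_after):
    """Flag rows as notified, committing every `batch_size` updates.

    After `die_after` updates the simulated timeout fires and any
    uncommitted work is rolled back.
    """
    ids = [
        row[0]
        for row in conn.execute(
            "SELECT id FROM sla_miss WHERE notification_sent = 0 ORDER BY id"
        )
    ]
    try:
        for count, row_id in enumerate(ids, start=1):
            conn.execute(
                "UPDATE sla_miss SET notification_sent = 1 WHERE id = ?",
                (row_id,),
            )
            if count % batch_size == 0:
                conn.commit()  # progress up to here survives a later timeout
            if count == die_after:
                raise SimulatedTimeout
        conn.commit()  # final commit for any remaining rows
    except SimulatedTimeout:
        conn.rollback()  # uncommitted updates are lost


def pending(conn):
    # Rows that would be picked up again on the next run.
    query = "SELECT COUNT(*) FROM sla_miss WHERE notification_sent = 0"
    return conn.execute(query).fetchone()[0]


# One commit at the very end: the timeout discards every update.
single = make_db(100)
mark_sent(single, batch_size=1000, die_after=90)

# Commit every 10 rows: only the unfinished batch is lost.
batched = make_db(100)
mark_sent(batched, batch_size=10, die_after=95)
```

In the single-commit case every row is still pending after the simulated timeout, so the next run would notify about all of them again, which matches the behaviour described above. In the batched case only the rows after the last completed batch remain pending.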
How to reproduce
Generate many SLA misses all at once. This can be triggered by setting a DAG's start date in the past and scheduling it to run frequently. Then, once manage_slas is called, all of the SLA misses are processed at the same time, causing a pile-up in the system.
Next, the timeout has to be just right: long enough that sla_miss_callback runs, but short enough that the transaction is not committed to the database. The exact value depends on the system the reproduction runs on.
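For sizing the first step, a rough estimate of how many SLA misses a given start date and schedule would produce can be sketched as follows. The function name expected_sla_misses and the 3-minute interval are assumptions chosen for illustration; the actual schedule of the affected DAG is not stated in this report. The estimate assumes every past run of every SLA-bearing task misses its SLA.

```python
from datetime import datetime, timedelta


def expected_sla_misses(start, now, schedule_interval, tasks_with_sla=1):
    """Upper bound on SLA misses, assuming every past run misses its SLA."""
    if now <= start:
        return 0
    # Number of complete schedule intervals between start and now.
    runs = (now - start) // schedule_interval
    return runs * tasks_with_sla


# Hypothetical values: a 30-day backfill window and a 3-minute schedule.
estimate = expected_sla_misses(
    start=datetime(2021, 9, 1),
    now=datetime(2021, 10, 1),
    schedule_interval=timedelta(minutes=3),
)
```

With these hypothetical values the estimate is in the same range as the roughly 14k misses described above.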
Operating System
macOS Big Sur 11.3.1
Versions of Apache Airflow Providers
n/a
Deployment
Astronomer
Deployment details
No response
Anything else
No response
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct