Replies: 3 comments
-
Sure. You have not explained what deployment you used, so it is hard to say what exactly the scope of the changes you propose would be. Note that not every component of Airflow currently has an HTTP server, so you would have to change that; also, for the HTTP servers we do have we use FastAPI, and it might not be a good idea to require that everywhere. But yes, personally I can say this seems like a good idea, though that is just my personal opinion. Changing the general approach and applying it to all components (and the Helm chart) is the kind of change that would likely require a devlist discussion. You can follow our community page to see how to subscribe, start a "[DISCUSSION]" thread, and see if you can get consensus; once you see consensus, you can call for a vote. More about the decision-making and voting process here: https://www.apache.org/foundation/voting.html
-
Thank you for the thoughtful response and questions! Let me address each point with specific technical details:

Deployment Context

Environment:
Scope of Proposed Changes

I should clarify: I'm specifically proposing this for the scheduler component only, not all Airflow components. Here's why:

Scheduler-Specific Problem
Other Components
Technical Implementation Approach

Based on your FastAPI concern, I'd like to propose a lightweight approach that avoids heavy dependencies.

Note: I'm not deeply familiar with the internal structure of the Airflow source code, so the following represents my rough understanding of how this could be implemented. I'd very much appreciate guidance from the maintainers on the best architectural approach and would be happy to adjust the implementation based on your recommendations.

Lightweight Built-in Health Server

Please note: the following is my initial attempt at designing this based on my limited understanding of Airflow's architecture. I'd greatly appreciate feedback on whether this approach aligns with Airflow's design patterns and coding standards.
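To make the idea concrete, here is a minimal sketch of the stdlib-only approach described above. Everything here is hypothetical: the function name `start_health_server`, the port, and the `is_healthy` callback are my own illustrative choices, not existing Airflow APIs. In a real integration the callback would check the scheduler's heartbeat timestamp.

```python
# Hypothetical sketch: a stdlib-only health server (no FastAPI dependency).
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def start_health_server(is_healthy, host="0.0.0.0", port=8974):
    """Serve GET /health in a daemon thread.

    `is_healthy` is a zero-argument callable; here it is a placeholder for
    whatever heartbeat check the scheduler would actually perform.
    """

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/health":
                self.send_error(404)
                return
            healthy = is_healthy()
            body = json.dumps(
                {"status": "healthy" if healthy else "unhealthy"}
            ).encode()
            # 200 lets the liveness probe pass; 503 makes it fail.
            self.send_response(200 if healthy else 503)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):
            # Suppress per-request logging to keep scheduler logs clean.
            pass

    server = HTTPServer((host, port), HealthHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server
```

Running the server in a daemon thread means it can never block scheduler shutdown, and the standard library `http.server` avoids adding any new dependency.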
Performance and Security Implications

Note: the following analysis is based on theoretical assumptions and design expectations. I haven't conducted actual performance testing or security analysis yet, so these estimates should be validated through proper testing and community review.

Performance Impact (Estimated)
Security Considerations (Theoretical)
-
I will proceed to close this discussion for the moment, as we are currently adopting Airflow 3.x. We found that the architecture has been greatly enhanced; in particular, there is a new dag_processor component which has offloaded the DAG parsing job from the scheduler. That should greatly reduce the overload issue on the scheduler seen in Airflow 2.x. Hence, instead of working on a pre-Airflow-3.x issue, we would love to move forward with the new architecture.
-
Incident Timeline:
Log Patterns Observed
Monitoring Data Patterns
Memory usage: 85-95% of container limits
Process count: Normal (scheduler process still running)
CPU usage: Normal or slightly elevated
Liveness probe success rate: Drops to 0% while scheduler process remains active
Current Behavior
Expected Behavior
Kubernetes should automatically detect unhealthy scheduler pods experiencing OOM conditions and restart them without manual intervention, ensuring high availability and operational reliability.
Impact
Proposed Solutions
1. HTTP Health Endpoint for Liveness Probes
Add an optional lightweight HTTP health endpoint to the scheduler:
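For context, recent Airflow 2.x releases already ship an optional scheduler health check server that is enabled through configuration; to the best of my recollection it looks roughly like this (the option names below are from memory, so please verify them against the docs for your Airflow version):

```ini
# airflow.cfg — built-in scheduler health check server (Airflow 2.4+;
# option names from memory, verify against your version's docs)
[scheduler]
enable_health_check = True
scheduler_health_check_server_port = 8974
```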
Helm Chart Configuration:
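A hedged sketch of what the Helm values could look like. The key names under `scheduler.livenessProbe` and the port are illustrative assumptions, not necessarily the chart's actual schema:

```yaml
# Hypothetical values.yaml sketch: HTTP liveness probe for the scheduler.
scheduler:
  livenessProbe:
    httpGet:
      path: /health
      port: 8974          # assumed health-server port
    initialDelaySeconds: 30
    periodSeconds: 30
    timeoutSeconds: 10
    failureThreshold: 5   # restart after ~2.5 minutes of failures
```

An HTTP probe fails fast under memory pressure because it does not need to fork a new Python interpreter inside the starved container, which is exactly the failure mode described above.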
2. Enhanced Probe Configuration Options
Provide alternative probe configurations in the official Helm chart:
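As one concrete alternative, here is a sketch of an exec-based probe built on the existing `airflow jobs check` CLI command (the timings and thresholds are illustrative, and flag availability should be checked against your Airflow version):

```yaml
# Sketch: exec-based liveness probe using the Airflow CLI.
scheduler:
  livenessProbe:
    exec:
      command:
        - sh
        - -c
        - airflow jobs check --job-type SchedulerJob --hostname "$(hostname)"
    initialDelaySeconds: 60
    periodSeconds: 60
    timeoutSeconds: 20    # generous: the CLI must start a full interpreter
    failureThreshold: 5
```

The trade-off is that this probe spawns a Python process on every check, which can itself fail or hang when the container is already near its memory limit.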
Supporting Evidence
Error Log Examples
Related Issues Research
Based on comprehensive GitHub repository analysis:
Backward Compatibility
All proposed solutions maintain backward compatibility:
Implementation Priority
High Priority - This affects production reliability and requires manual intervention, violating Kubernetes self-healing principles. The solution addresses a fundamental operational gap in the current architecture.
Alternative Workarounds
Current mitigation strategies:
However, these workarounds don't address the fundamental design issue and require additional operational overhead.
Would the maintainers be open to a PR implementing the HTTP health endpoint approach? This seems like the most robust solution that follows Kubernetes best practices while maintaining backward compatibility.