Replies: 3 comments
-
Sure. You have not explained what deployment you used, so it is hard to say what exactly the scope of the changes you propose would be. Note that not every component of Airflow currently has an HTTP server, so you would have to change that; also, for the HTTP servers we do have we use FastAPI, and it might not be a good idea to require that everywhere. But yes, personally I can say this seems like a good idea, though that is just my personal opinion. Changing the general approach and applying it to all components (and the Helm chart) is the kind of change that would likely require a devlist discussion. You can follow our community page to see how to subscribe, start a "[DISCUSSION]" thread, and see if you can get consensus; once you see consensus, you can call for a vote. More about the decision-making and voting process here: https://www.apache.org/foundation/voting.html
-
Thank you for the thoughtful response and questions! Let me address each point with specific technical details:

Deployment Context

Environment:
Scope of Proposed Changes

I should clarify: I'm specifically proposing this for the scheduler component only, not all Airflow components. Here's why:

Scheduler-Specific Problem
Other Components
Technical Implementation Approach

Based on your FastAPI concern, I'd like to propose a lightweight approach that avoids heavy dependencies.

Note: I'm not deeply familiar with the internal structure of the Airflow source code, so the following represents my rough understanding of how this could be implemented. I'd very much appreciate guidance from the maintainers on the best architectural approach and would be happy to adjust the implementation based on your recommendations.

Lightweight Built-in Health Server

Please note: the following is my initial attempt at designing this based on my limited understanding of Airflow's architecture. I'd greatly appreciate feedback on whether this approach aligns with Airflow's design patterns and coding standards.
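To make the idea concrete, here is a minimal sketch of the stdlib-only approach described above. Everything here is hypothetical: the function name `start_health_server`, the port, and the `is_healthy` callback are my own illustrative choices, not existing Airflow APIs. In a real integration the callback would check the scheduler's heartbeat timestamp.

```python
# Hypothetical sketch: a stdlib-only health server (no FastAPI dependency).
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def start_health_server(is_healthy, host="0.0.0.0", port=8974):
    """Serve GET /health in a daemon thread.

    `is_healthy` is a zero-argument callable; here it is a placeholder for
    whatever heartbeat check the scheduler would actually perform.
    """

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/health":
                self.send_error(404)
                return
            healthy = is_healthy()
            body = json.dumps(
                {"status": "healthy" if healthy else "unhealthy"}
            ).encode()
            # 200 lets the liveness probe pass; 503 makes it fail.
            self.send_response(200 if healthy else 503)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):
            # Suppress per-request logging to keep scheduler logs clean.
            pass

    server = HTTPServer((host, port), HealthHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server
```

Running the server in a daemon thread means it can never block scheduler shutdown, and the standard library `http.server` avoids adding any new dependency.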
Performance and Security Implications

Note: the following analysis is based on theoretical assumptions and design expectations. I haven't conducted actual performance testing or security analysis yet, so these estimates should be validated through proper testing and community review.

Performance Impact (Estimated)
Security Considerations (Theoretical)
-
I will proceed to close this discussion for the moment, as we are currently adopting Airflow 3.x. We found that the architecture has been greatly enhanced; in particular, there is a new dag_processor component which has offloaded the DAG parsing job from the scheduler. That should greatly reduce the overload issue on the scheduler seen in Airflow 2.x. Hence, instead of working on a pre-Airflow-3.x issue, we would love to move forward with the new architecture.
-
Incident Timeline:
Log Patterns Observed
Monitoring Data Patterns
Memory usage: 85-95% of container limits
Process count: Normal (scheduler process still running)
CPU usage: Normal or slightly elevated
Liveness probe success rate: Drops to 0% while scheduler process remains active
Current Behavior
Expected Behavior
Kubernetes should automatically detect unhealthy scheduler pods experiencing OOM conditions and restart them without manual intervention, ensuring high availability and operational reliability.
Impact
Proposed Solutions
1. HTTP Health Endpoint for Liveness Probes
Add an optional lightweight HTTP health endpoint to the scheduler:
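For context, recent Airflow 2.x releases already ship an optional scheduler health check server that is enabled through configuration; to the best of my recollection it looks roughly like this (the option names below are from memory, so please verify them against the docs for your Airflow version):

```ini
# airflow.cfg — built-in scheduler health check server (Airflow 2.4+;
# option names from memory, verify against your version's docs)
[scheduler]
enable_health_check = True
scheduler_health_check_server_port = 8974
```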
Helm Chart Configuration:
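A hedged sketch of what the Helm values could look like. The key names under `scheduler.livenessProbe` and the port are illustrative assumptions, not necessarily the chart's actual schema:

```yaml
# Hypothetical values.yaml sketch: HTTP liveness probe for the scheduler.
scheduler:
  livenessProbe:
    httpGet:
      path: /health
      port: 8974          # assumed health-server port
    initialDelaySeconds: 30
    periodSeconds: 30
    timeoutSeconds: 10
    failureThreshold: 5   # restart after ~2.5 minutes of failures
```

An HTTP probe fails fast under memory pressure because it does not need to fork a new Python interpreter inside the starved container, which is exactly the failure mode described above.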
2. Enhanced Probe Configuration Options
Provide alternative probe configurations in the official Helm chart:
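As one concrete alternative, here is a sketch of an exec-based probe built on the existing `airflow jobs check` CLI command (the timings and thresholds are illustrative, and flag availability should be checked against your Airflow version):

```yaml
# Sketch: exec-based liveness probe using the Airflow CLI.
scheduler:
  livenessProbe:
    exec:
      command:
        - sh
        - -c
        - airflow jobs check --job-type SchedulerJob --hostname "$(hostname)"
    initialDelaySeconds: 60
    periodSeconds: 60
    timeoutSeconds: 20    # generous: the CLI must start a full interpreter
    failureThreshold: 5
```

The trade-off is that this probe spawns a Python process on every check, which can itself fail or hang when the container is already near its memory limit.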
Supporting Evidence
Error Log Examples
Related Issues Research
Based on comprehensive GitHub repository analysis:
Backward Compatibility
All proposed solutions maintain backward compatibility:
Implementation Priority
High Priority - This affects production reliability and requires manual intervention, violating Kubernetes self-healing principles. The solution addresses a fundamental operational gap in the current architecture.
Alternative Workarounds
Current mitigation strategies:
However, these workarounds don't address the fundamental design issue and require additional operational overhead.
Would the maintainers be open to a PR implementing the HTTP health endpoint approach? This seems like the most robust solution that follows Kubernetes best practices while maintaining backward compatibility.