AI-powered CI/CD guardian that automatically detects, diagnoses, and fixes pipeline failures. Reduces mean-time-to-green by up to 65% through intelligent remediation strategies.
## Features

- Intelligent Failure Detection: ML-based classification of failure types
- Automated Remediation: Self-healing actions for common failure patterns
- Cost Analysis: Track cloud spend from unnecessary reruns
- Pattern Library: Pre-built detectors for flaky tests, OOM, race conditions
- Multi-Platform Support: GitHub Actions, GitLab CI, Jenkins, CircleCI
- ROI Dashboard: Measure time saved and reliability improvements
## Table of Contents

- Installation
- Quick Start
- Architecture
- Configuration
- Remediation Strategies
- Pattern Library
- Monitoring
- API Reference
- Contributing
## Installation

### GitHub Actions

Add to `.github/workflows/your-workflow.yml`:
```yaml
name: CI with Self-Healing

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Run Tests
        id: tests
        run: |
          npm test

      - name: Self-Healing Guard
        if: failure()
        uses: your-org/self-healing-pipeline-guard@v1
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}
          strategy: 'auto-detect'
          max-retries: 3
          # Optional integrations
          slack-webhook: ${{ secrets.SLACK_WEBHOOK }}
          jira-token: ${{ secrets.JIRA_TOKEN }}
          pagerduty-token: ${{ secrets.PAGERDUTY_TOKEN }}
```
### Python Package

```bash
pip install self-healing-pipeline-guard
```
### GitLab CI

Add to `.gitlab-ci.yml`:

```yaml
include:
  - remote: 'https://your-org.com/self-healing-guard/gitlab-template.yml'

variables:
  HEALING_STRATEGY: "progressive"
  HEALING_MAX_RETRIES: 3
```
## Quick Start

- Create `.github/healing-config.yml`:
```yaml
healing:
  strategies:
    - name: "flaky-test-detector"
      enabled: true
      threshold: 0.3  # 30% failure rate
      action: "retry-with-backoff"

    - name: "oom-detector"
      enabled: true
      action: "increase-resources"

    - name: "dependency-failure"
      enabled: true
      action: "clear-cache-retry"

  notifications:
    slack:
      channel: "#ci-alerts"
      on_recovery: true
      on_failure: true

  cost_tracking:
    enabled: true
    alert_threshold: 100  # Alert if healing costs > $100/day

  escalation:
    after_failures: 3
    create_ticket: true
    assign_to: "@oncall"
```
- The guard automatically activates on pipeline failures!
You can also drive the guard directly from Python:

```python
from healing_guard import PipelineGuard, Strategy

# Initialize the guard for a repository
guard = PipelineGuard(
    ci_provider="github",
    repo="your-org/your-repo"
)

# Configure strategies
guard.add_strategy(
    Strategy.FLAKY_TEST_RETRY,
    confidence_threshold=0.8,
    max_retries=3
)

# Analyze a failure (failure_logs is the raw log text fetched from your CI provider)
failure = guard.analyze_failure(
    job_id="12345",
    logs=failure_logs
)

# Execute healing
result = guard.heal(failure)
print(f"Healing result: {result.status}")
print(f"Time saved: {result.time_saved_minutes} minutes")
```
## Architecture

```mermaid
graph TB
    subgraph "CI/CD Pipeline"
        A[Job Execution] --> B{Failure Detected}
        B -->|Yes| C[Guard Webhook]
    end

    subgraph "Self-Healing Guard"
        C --> D[Failure Analyzer]
        D --> E{Pattern Matcher}
        E -->|Flaky Test| F[Retry Strategy]
        E -->|OOM| G[Resource Strategy]
        E -->|Network| H[Backoff Strategy]
        E -->|Unknown| I[Escalation Strategy]
        F & G & H --> J[Remediation Engine]
        J --> K[Execute Fix]
        I --> L[Create Incident]
    end

    subgraph "Observability"
        K --> M[ROI Calculator]
        M --> N[Lang-Observatory]
        L --> O[Jira/ServiceNow]
    end
```
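In SDK terms, the analyzer-to-remediation path maps onto the calls shown in the Quick Start. A minimal sketch of a webhook handler under that assumption (`handle_pipeline_failure` and the `"recovered"` status string are illustrative, not part of the published API):

```python
from healing_guard import PipelineGuard, Strategy

guard = PipelineGuard(ci_provider="github", repo="your-org/your-repo")
guard.add_strategy(Strategy.FLAKY_TEST_RETRY, confidence_threshold=0.8, max_retries=3)

def handle_pipeline_failure(job_id: str, logs: str) -> None:
    """Illustrative webhook handler: analyze the failure, then heal or escalate."""
    failure = guard.analyze_failure(job_id=job_id, logs=logs)  # Failure Analyzer
    result = guard.heal(failure)                               # Pattern Matcher + Remediation Engine
    if result.status != "recovered":
        # Unknown patterns fall through to the escalation path (incident/ticket creation).
        print(f"Escalating job {job_id}: {result.status}")
```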
## Configuration

Create `.github/healing-strategies.yml`:
```yaml
strategies:
  flaky_test_detector:
    enabled: true
    ml_model: "gradient_boost"  # or "neural_net"
    features:
      - "test_name"
      - "failure_rate"
      - "time_of_day"
      - "recent_commits"
    thresholds:
      confidence: 0.75
      historical_failures: 3
    actions:
      primary: "retry_with_isolation"
      fallback: "skip_and_notify"

  resource_exhaustion:
    enabled: true
    patterns:
      - "Cannot allocate memory"
      - "No space left on device"
      - "OOMKilled"
    actions:
      - increase_memory: "+2GB"
      - clear_caches: true
      - scale_horizontally: true

  dependency_conflicts:
    enabled: true
    detection_method: "semantic_diff"
    actions:
      - clear_package_cache: true
      - pin_dependencies: true
      - isolate_environment: true

cost_control:
  enabled: true
  limits:
    daily_budget: 500  # USD
    per_repo_budget: 50
    per_job_retry_limit: 10
  optimization:
    prefer_spot_instances: true
    downgrade_after_failures: 2
    cache_aggressive: true
  alerts:
    slack: "#cost-alerts"
    threshold_percentages: [50, 80, 95]
  reporting:
    export_to: "lang-observatory"
    granularity: "hourly"
```
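To sanity-check this file before committing it, you can load it with any YAML parser. A minimal sketch assuming PyYAML is installed; the checks are illustrative and not part of the healing_guard API:

```python
import yaml

# Load the strategy configuration and report what is enabled.
with open(".github/healing-strategies.yml") as f:
    config = yaml.safe_load(f)

for name, spec in config["strategies"].items():
    state = "enabled" if spec.get("enabled") else "disabled"
    print(f"{name}: {state}")

print("Daily healing budget (USD):", config["cost_control"]["limits"]["daily_budget"])
```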
## Remediation Strategies

| Strategy | Detection | Action | Success Rate |
|---|---|---|---|
| `flaky-test-retry` | Test fails < 50% of the time | Retry with isolation | 89% |
| `oom-recovery` | Memory errors in logs | Increase resources | 94% |
| `network-backoff` | Connection timeouts | Exponential backoff | 87% |
| `cache-corruption` | Checksum mismatches | Clear and rebuild | 91% |
| `race-condition` | Timing-dependent failures | Sequential execution | 85% |
| `dependency-hell` | Version conflicts | Pin and isolate | 88% |
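As intuition for the first row, flaky-test detection can be approximated by a failure-rate heuristic over recent runs. The sketch below is illustrative only and does not reflect the library's actual ML-based classifier:

```python
from collections import deque

def is_probably_flaky(recent_results: deque, threshold: float = 0.3) -> bool:
    """Heuristic: a test that fails sometimes, but not always, is a retry candidate."""
    if not recent_results:
        return False
    failure_rate = recent_results.count("fail") / len(recent_results)
    # Deterministic breakage (failure rate above 50%) is not treated as flaky.
    return threshold <= failure_rate <= 0.5

# Example: 3 failures in the last 10 runs -> retry with isolation
history = deque(["pass"] * 7 + ["fail"] * 3, maxlen=50)
print(is_probably_flaky(history))  # True
```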
### Custom Strategies

You can also define and register your own strategies:

```python
from healing_guard import CustomStrategy, RemediationAction

class MonorepoStrategy(CustomStrategy):
    """Handle monorepo-specific failures."""

    def detect(self, failure_context):
        # Only trigger when the failure is confined to packages changed in this commit range
        changed_packages = self.get_changed_packages(
            failure_context.commit_range
        )
        failed_packages = self.extract_failed_packages(
            failure_context.logs
        )
        return bool(failed_packages & changed_packages)

    def remediate(self, failure_context):
        # Only rebuild the affected packages
        affected = self.get_affected_packages(failure_context)
        return RemediationAction(
            type="selective-rebuild",
            params={
                "packages": affected,
                "parallel": True,
                "cache": "warm"
            }
        )

# Register the custom strategy
guard.register_strategy(MonorepoStrategy())
```
## Pattern Library

Pre-built patterns for flaky and timing-sensitive test failures:

```yaml
patterns:
  timing_sensitive:
    regex: "(timeout|deadline|elapsed)"
    indicators:
      - "assertion failed after \\d+ms"
      - "expected .* within \\d+ seconds"
    remediation:
      - increase_timeout: 2x
      - add_retry_logic: true

  external_dependency:
    regex: "(connection refused|service unavailable)"
    indicators:
      - "failed to connect to"
      - "HTTP 5\\d\\d"
    remediation:
      - mock_external_services: true
      - add_circuit_breaker: true
```
Patterns for infrastructure and resource failures:

```yaml
patterns:
  docker_layer_cache:
    indicators:
      - "layer does not exist"
      - "blob unknown"
    remediation:
      - rebuild_without_cache: true

  disk_pressure:
    indicators:
      - "no space left on device"
      - "disk quota exceeded"
    remediation:
      - cleanup_artifacts: true
      - prune_docker: true
      - expand_volume: "+10GB"
```
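Conceptually, a pattern entry is applied by scanning the failed job's log for its indicators. The sketch below shows one way to do that with the standard library; it mirrors the YAML above but is not the guard's internal matcher:

```python
import re

# Illustrative subset of the pattern library: name -> indicator regexes
PATTERNS = {
    "timing_sensitive": [r"assertion failed after \d+ms", r"expected .* within \d+ seconds"],
    "disk_pressure": [r"no space left on device", r"disk quota exceeded"],
}

def classify_failure(log_text: str) -> list[str]:
    """Return the names of all patterns whose indicators appear in the log."""
    return [
        name
        for name, indicators in PATTERNS.items()
        if any(re.search(ind, log_text, re.IGNORECASE) for ind in indicators)
    ]

print(classify_failure("FATAL: write failed: No space left on device"))  # ['disk_pressure']
```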
## Monitoring

The guard automatically tracks and reports:
- Time Saved: Minutes of developer time saved
- Cost Reduction: Cloud costs avoided through smart retries
- Reliability Improvement: MTTR and success rate changes
- Pattern Insights: Most common failure types
Access at: https://your-org.com/healing-dashboard
### Metrics

```yaml
metrics:
  # Healing effectiveness
  - healing_guard_recovery_rate
  - healing_guard_time_saved_minutes
  - healing_guard_cost_saved_usd

  # Failure patterns
  - healing_guard_failure_classification
  - healing_guard_flaky_test_detection

  # Performance
  - healing_guard_analysis_duration_seconds
  - healing_guard_remediation_duration_seconds
```
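If you scrape these with Prometheus, the same metric names can also be emitted from your own tooling with the standard Python client. A minimal sketch assuming `prometheus_client` is installed; the port and values are placeholders:

```python
from prometheus_client import Gauge, Histogram, start_http_server

# Illustrative exporter for a few of the metric names listed above.
recovery_rate = Gauge("healing_guard_recovery_rate", "Fraction of failures healed automatically")
time_saved = Gauge("healing_guard_time_saved_minutes", "Developer minutes saved by automated healing")
analysis_time = Histogram("healing_guard_analysis_duration_seconds", "Time spent analyzing a failure")

start_http_server(9105)   # arbitrary port for the /metrics endpoint
recovery_rate.set(0.91)   # placeholder value
time_saved.set(42)        # placeholder value
with analysis_time.time():
    pass                  # ... failure analysis would run here ...
```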
## API Reference

### REST API

```
# Get healing status
GET /api/v1/repos/{owner}/{repo}/healing/status

# Get failure analysis
GET /api/v1/repos/{owner}/{repo}/failures/{job_id}/analysis

# Trigger manual healing
POST /api/v1/repos/{owner}/{repo}/failures/{job_id}/heal
{
  "strategy": "aggressive",
  "notify": true
}
```
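For example, triggering a manual heal over HTTP could look like the snippet below, using `requests`. The base URL and bearer-token header are assumptions; substitute whatever your deployment exposes:

```python
import requests

BASE_URL = "https://your-org.com/api/v1"            # assumed API host
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}  # auth scheme is an assumption

resp = requests.post(
    f"{BASE_URL}/repos/your-org/your-repo/failures/12345/heal",
    json={"strategy": "aggressive", "notify": True},
    headers=HEADERS,
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```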
### Python SDK

```python
from healing_guard import GuardClient

client = GuardClient(api_key="your-key")

# Get repository health
health = client.get_repo_health("your-org/your-repo")
print(f"Flaky test rate: {health.flaky_rate}%")
print(f"Self-healing success: {health.healing_success_rate}%")

# Analyze a specific failure
analysis = client.analyze_failure(
    repo="your-org/your-repo",
    job_id="12345"
)

# Get recommendations
for rec in analysis.recommendations:
    print(f"- {rec.action}: {rec.confidence}% confidence")
```
## Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.
```bash
# Clone repository
git clone https://github.com/your-org/self-healing-pipeline-guard
cd self-healing-pipeline-guard

# Install dependencies
poetry install --with dev

# Run tests
pytest tests/ -v

# Run linting
black . && flake8 . && mypy .
```
## License

This project is licensed under the Apache License 2.0; see the LICENSE file for details.
## Related Projects

- Lang-Observatory - Observability platform
- Claude-Flow - Agent orchestration
- Pipeline-Analyzer - Deep pipeline analytics
## Support

- 📧 Email: [email protected]
- 💬 Discord: Join our community
- 📖 Documentation: Full docs
- 🎫 Issues: GitHub Issues