DataHub Recipes Manager

A comprehensive solution for managing DataHub metadata, ingestion recipes, and policies through version control, CI/CD pipelines, and a modern web interface. This project enables teams to treat DataHub metadata as code with proper versioning, testing, and deployment practices.

πŸš€ Features

Core Capabilities

  • πŸ“‹ Recipe Management: Create, deploy, and manage DataHub ingestion recipes with templates and environment-specific configurations
  • 🏷️ Metadata Management: Manage tags, glossary terms, domains, structured properties, data contracts, and assertions
  • πŸ” Policy Management: Version control DataHub access policies with automated deployment
  • 🌍 Multi-Environment Support: Separate configurations for dev, staging, and production environments
  • πŸ”„ CI/CD Integration: Automated workflows for testing, validation, and deployment via GitHub Actions
  • 🌐 Web Interface: Modern Django-based UI for managing all aspects of your DataHub metadata
  • πŸ“Š Staging Workflow: Stage changes locally before deploying to DataHub instances

Metadata-as-Code Features

  • Individual JSON Files: Each metadata entity (assertion, tag, glossary term, etc.) is stored as its own JSON file
  • MCP File Processing: Uses DataHub's metadata-file source for batch operations
  • URN Mutation: Automatically mutates URNs for cross-environment deployments (see the sketch after this list)
  • GitHub Integration: Seamless integration with GitHub for version control and CI/CD
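
URN mutation is, at heart, string rewriting on DataHub URNs so that an entity defined against one environment resolves correctly in another. A minimal sketch of the idea; the mutate_urn helper and the rewrite-the-fabric approach are illustrative assumptions, not this project's exact implementation:

# Hypothetical sketch of cross-environment URN mutation (not the
# project's actual implementation). DataHub dataset URNs embed an
# environment fabric (e.g. PROD) as their final component:
#   urn:li:dataset:(urn:li:dataPlatform:postgres,public.users,PROD)

def mutate_urn(urn, target_env):
    """Rewrite the environment fabric of a dataset URN."""
    prefix = "urn:li:dataset:("
    if not (urn.startswith(prefix) and urn.endswith(")")):
        return urn  # leave non-dataset URNs untouched
    platform, name, _fabric = urn[len(prefix):-1].rsplit(",", 2)
    return f"{prefix}{platform},{name},{target_env.upper()})"

print(mutate_urn(
    "urn:li:dataset:(urn:li:dataPlatform:postgres,public.users,PROD)",
    "dev",
))
# urn:li:dataset:(urn:li:dataPlatform:postgres,public.users,DEV)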

Advanced Features

  • Platform Instance Mapping: Map entities between different platform instances across environments
  • Mutation System: Apply transformations to metadata when moving between environments
  • Secrets Management: Secure handling of sensitive data through GitHub Secrets integration
  • Connection Management: Multiple DataHub connection configurations with health monitoring

πŸ—οΈ Architecture Overview

The system separates metadata management into distinct layers:

  1. Templates - Reusable, parameterized patterns with environment variable placeholders
  2. Environment Variables - Environment-specific configuration values stored as YAML
  3. Instances - A template combined with environment-specific variables, ready for deployment (see the sketch after this list)
  4. Metadata Files - Individual JSON files for each metadata entity (tags, glossary, domains, etc.)
  5. MCP Files - Batch metadata change proposals processed by DataHub workflows
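
As an illustration of how the first three layers compose, here is a minimal sketch that renders a parameterized template into a deployable instance using values from an environment YAML file. The inline template, the ${VAR} placeholder convention, and the use of string.Template are assumptions for illustration, not the project's actual rendering code:

# Hypothetical sketch: render a recipe template into a deployable
# instance using environment-specific parameters. The placeholder
# convention and file contents are illustrative assumptions.
from string import Template
import yaml  # PyYAML

template_text = """
source:
  type: postgres
  config:
    host_port: "${POSTGRES_HOST}:${POSTGRES_PORT}"
    database: "${POSTGRES_DATABASE}"
"""

# Environment-specific values, as they might appear under params/environments/
env_params = yaml.safe_load("""
parameters:
  POSTGRES_HOST: analytics-db.company.com
  POSTGRES_PORT: 5432
  POSTGRES_DATABASE: analytics
""")["parameters"]

# string.Template converts values via str(), so integers like the port work
instance = Template(template_text).substitute(env_params)
print(instance)  # a concrete recipe instance, ready to deploy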

πŸ“ Project Structure

datahub-recipes-manager/
β”œβ”€β”€ .github/workflows/           # GitHub Actions for CI/CD
β”‚   β”œβ”€β”€ manage-*.yml            # Metadata processing workflows
β”‚   β”œβ”€β”€ manage-assertions.yml   # Assertion processing
β”‚   └── README.md               # Workflow documentation
β”œβ”€β”€ docker/                     # Container deployment
β”‚   β”œβ”€β”€ Dockerfile              # Application container
β”‚   β”œβ”€β”€ docker-compose.yml      # Development setup
β”‚   └── nginx/                  # Web server configuration
β”œβ”€β”€ docs/                       # Comprehensive documentation
β”‚   β”œβ”€β”€ Environment_Variables.md # DataHub CLI environment variables
β”‚   β”œβ”€β”€ Troubleshooting.md      # Common issues and solutions
β”‚   └── *.md                    # Additional guides
β”œβ”€β”€ helm/                       # Kubernetes deployment charts
β”œβ”€β”€ metadata-manager/           # Metadata entity storage
β”‚   β”œβ”€β”€ dev/                    # Development environment
β”‚   β”‚   β”œβ”€β”€ assertions/            # Individual assertion JSON files
β”‚   β”‚   β”œβ”€β”€ tags/                  # Tag MCP files
β”‚   β”‚   β”œβ”€β”€ glossary/              # Glossary MCP files
β”‚   β”‚   β”œβ”€β”€ domains/               # Domain MCP files
β”‚   β”‚   β”œβ”€β”€ structured_properties/ # Properties MCP files
β”‚   β”‚   β”œβ”€β”€ metadata_tests/        # Test MCP files
β”‚   β”‚   └── data_products/         # Data product MCP files
β”‚   β”œβ”€β”€ staging/               # Staging environment
β”‚   └── prod/                  # Production environment
β”œβ”€β”€ recipes/                   # DataHub ingestion recipes
β”‚   β”œβ”€β”€ templates/             # Parameterized recipe templates
β”‚   β”œβ”€β”€ instances/             # Environment-specific instances
β”‚   └── pulled/                # Recipes pulled from DataHub
β”œβ”€β”€ params/                    # Environment variables and parameters
β”‚   β”œβ”€β”€ environments/          # Environment-specific YAML configs
β”‚   └── default_params.yaml    # Global defaults
β”œβ”€β”€ policies/                  # DataHub access policies
β”œβ”€β”€ scripts/                   # Python utilities and automation
β”‚   β”œβ”€β”€ mcps/                  # Metadata change proposal utilities
β”‚   β”œβ”€β”€ assertions/            # Assertion management scripts
β”‚   β”œβ”€β”€ domains/               # Domain management scripts
β”‚   β”œβ”€β”€ glossary/              # Glossary management scripts
β”‚   └── tags/                  # Tag management scripts
β”œβ”€β”€ web_ui/                    # Django web application
β”‚   β”œβ”€β”€ metadata_manager/      # Metadata management app
β”‚   β”œβ”€β”€ templates/             # HTML templates
β”‚   β”œβ”€β”€ static/                # CSS, JavaScript, images
β”‚   └── migrations/            # Database schema migrations
└── utils/                     # Shared utilities and DataHub API wrappers

πŸš€ Quick Start

Prerequisites

  • Python 3.8+ with pip
  • DataHub instance with API access
  • Personal Access Token for DataHub authentication
  • Git for version control
  • GitHub account (for CI/CD features)

Installation

  1. Clone the repository:

    git clone https://github.com/your-org/datahub-recipes-manager.git
    cd datahub-recipes-manager
  2. Set up Python environment:

    # Create virtual environment
    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
    
    # Install dependencies
    pip install -r requirements.txt
  3. Configure environment variables:

    # Copy example environment file
    cp .env.example .env
    
    # Edit with your DataHub connection details
    nano .env

    Required environment variables:

    DATAHUB_GMS_URL=http://your-datahub-instance:8080
    DATAHUB_GMS_TOKEN=your-personal-access-token
    DJANGO_SECRET_KEY=your-secret-key
  4. Initialize the database:

    cd web_ui
    python manage.py migrate
    python manage.py collectstatic --noinput
    cd ..
  5. Start the web interface:

    cd web_ui
    python manage.py runserver

    Access the web interface at: http://localhost:8000

🌐 Web Interface

The Django-based web interface provides comprehensive management capabilities:

Dashboard Features

  • Connection Status: Real-time DataHub connectivity monitoring
  • Metadata Overview: Summary of tags, glossary terms, domains, and other entities
  • Recent Activity: Latest changes and deployments
  • Environment Health: Status across dev, staging, and production

Metadata Management

  • Tags: Create, edit, and manage DataHub tags with hierarchical relationships
  • Glossary: Manage business glossary terms and their relationships
  • Domains: Organize data assets into logical business domains
  • Structured Properties: Define and manage custom metadata properties
  • Data Contracts: Version control data quality contracts
  • Assertions: Create and manage data quality assertions

Recipe Management

  • Templates: Create reusable recipe patterns with parameterization
  • Instances: Deploy recipes with environment-specific configurations
  • Environment Variables: Manage secrets and configuration separately
  • Deployment Status: Track which recipes are deployed where

Policy Management

  • Access Policies: Define who can access what data
  • Platform Policies: Manage platform-level permissions
  • Policy Templates: Create reusable policy patterns

πŸ”„ Metadata-as-Code Workflow

Individual JSON Files (Assertions)

Assertions are stored as individual JSON files in metadata-manager/<environment>/assertions/:

{
  "id": "assertion_123",
  "type": "FRESHNESS",
  "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:postgres,public.users,PROD)",
  "source": "EXTERNAL",
  "config": {
    "type": "FRESHNESS",
    "freshnessAssertion": {
      "schedule": {
        "cron": "0 9 * * *",
        "timezone": "UTC"
      }
    }
  }
}
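
Because every assertion lives in its own file, tooling can operate on them with plain filesystem walks. A minimal sketch of loading all assertions for one environment; the required-key check mirrors the example above but is an illustrative assumption, not the project's actual validation:

# Hypothetical sketch: load and sanity-check all assertion JSON files
# for one environment. The required keys mirror the example above;
# the validation rules are illustrative, not the project's own.
import json
from pathlib import Path

REQUIRED_KEYS = {"id", "type", "entityUrn", "config"}

def load_assertions(env):
    """Load every assertion JSON file for one environment."""
    assertions = []
    for path in sorted(Path("metadata-manager", env, "assertions").glob("*.json")):
        data = json.loads(path.read_text())
        missing = REQUIRED_KEYS - data.keys()
        if missing:
            raise ValueError(f"{path}: missing keys {sorted(missing)}")
        assertions.append(data)
    return assertions

for assertion in load_assertions("dev"):
    print(assertion["id"], "->", assertion["entityUrn"])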

MCP Files (Batch Operations)

Other metadata entities use MCP (Metadata Change Proposal) files for batch operations:

{
  "version": "1.0",
  "source": {
    "type": "metadata-file",
    "config": {}
  },
  "sink": {
    "type": "datahub-rest",
    "config": {
      "server": "${DATAHUB_GMS_URL}",
      "token": "${DATAHUB_GMS_TOKEN}"
    }
  },
  "entities": [
    {
      "entityType": "tag",
      "entityUrn": "urn:li:tag:PII",
      "aspects": [
        {
          "aspectName": "tagKey",
          "aspect": {
            "name": "PII"
          }
        }
      ]
    }
  ]
}
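
In CI, a workflow step can hand each MCP file to the DataHub CLI for ingestion. A rough sketch of such a step, assuming each MCP file is a self-contained recipe as in the example above; datahub ingest -c is the standard CLI command, but this driver script is an illustration, not the project's workflow code:

# Hypothetical sketch: iterate an environment's MCP files and hand
# each to the DataHub CLI. `datahub ingest -c <file>` is the standard
# CLI command; treating each MCP file as a self-contained recipe is
# an assumption based on the example above.
import subprocess
from pathlib import Path

def ingest_mcp_files(env, dry_run=False):
    for mcp_file in sorted(Path("metadata-manager", env, "tags").glob("*.json")):
        cmd = ["datahub", "ingest", "-c", str(mcp_file)]
        if dry_run:
            print("would run:", " ".join(cmd))
            continue
        subprocess.run(cmd, check=True)

ingest_mcp_files("dev", dry_run=True)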

πŸ”§ Environment Configuration

Environment variables are stored as structured YAML files in params/environments/<env>/:

# params/environments/prod/analytics-db.yml
name: "Analytics Database"
description: "Production analytics database connection"
recipe_type: "postgres"
parameters:
  POSTGRES_HOST: "analytics-db.company.com"
  POSTGRES_PORT: 5432
  POSTGRES_DATABASE: "analytics"
  INCLUDE_VIEWS: true
  INCLUDE_TABLES: "*"
secret_references:
  - POSTGRES_USERNAME
  - POSTGRES_PASSWORD
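
Note the split between parameters and secret_references: secret values never appear in the YAML itself. A sketch of how a deployment step might merge the two, resolving secrets from process environment variables (where GitHub Actions exposes repository secrets); the merge logic is an assumption for illustration:

# Hypothetical sketch: merge plain parameters with secrets resolved
# from the process environment. Illustrative, not the project's own
# resolution logic.
import os
import yaml  # PyYAML

def resolve_params(config_path):
    with open(config_path) as fh:
        config = yaml.safe_load(fh)
    params = dict(config.get("parameters", {}))
    for name in config.get("secret_references", []):
        value = os.environ.get(name)
        if value is None:
            raise KeyError(f"secret {name} not set in the environment")
        params[name] = value
    return params

params = resolve_params("params/environments/prod/analytics-db.yml")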

πŸš€ GitHub Actions Workflows

The project includes comprehensive CI/CD workflows:

Metadata Processing Workflows

  • manage-tags.yml - Processes tag MCP files
  • manage-glossary.yml - Processes glossary MCP files
  • manage-domains.yml - Processes domain MCP files
  • manage-structured-properties.yml - Processes properties MCP files
  • manage-metadata-tests.yml - Processes test MCP files
  • manage-data-products.yml - Processes data product MCP files
  • manage-assertions.yml - Processes individual assertion JSON files

Key Features

  • Automatic Triggers: Run on changes to relevant files
  • Environment Matrix: Process dev, staging, and prod separately (see the sketch after this list)
  • Dry Run Support: PRs run in validation mode
  • Error Handling: Comprehensive error reporting and rollback
  • Artifact Storage: Store processing results and logs
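
For the environment matrix mentioned above, a workflow helper might derive which environments and entity types to process from the list of changed paths. A sketch, assuming the metadata-manager/<env>/<entity_type>/ layout from the project structure; the helper itself is illustrative:

# Hypothetical sketch: derive (environment, entity_type) pairs from
# changed file paths, as a matrix-style workflow might. The path
# layout follows metadata-manager/<env>/<entity_type>/ above.
from pathlib import PurePosixPath

def changed_targets(changed_files):
    """Map changed file paths to (environment, entity_type) pairs."""
    targets = set()
    for f in changed_files:
        parts = PurePosixPath(f).parts
        if len(parts) >= 3 and parts[0] == "metadata-manager":
            targets.add((parts[1], parts[2]))  # (env, entity_type)
    return targets

print(changed_targets([
    "metadata-manager/dev/tags/pii.json",
    "metadata-manager/prod/glossary/revenue.json",
]))
# {('dev', 'tags'), ('prod', 'glossary')}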

GitHub Secrets Setup

Configure these secrets in your GitHub repository:

# Global DataHub connection
DATAHUB_GMS_URL=https://your-datahub.company.com:8080
DATAHUB_GMS_TOKEN=your-global-token

# Environment-specific (optional)
DATAHUB_GMS_URL_DEV=https://dev-datahub.company.com:8080
DATAHUB_GMS_TOKEN_DEV=your-dev-token
DATAHUB_GMS_URL_PROD=https://prod-datahub.company.com:8080
DATAHUB_GMS_TOKEN_PROD=your-prod-token
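
When an environment-specific secret exists, it should win over the global one. A minimal sketch of that fallback resolution, assuming a hypothetical resolve_secret helper:

# Hypothetical sketch: prefer an environment-specific secret
# (e.g. DATAHUB_GMS_URL_DEV) and fall back to the global one
# (DATAHUB_GMS_URL). Illustrative only.
import os

def resolve_secret(base, env):
    value = os.environ.get(f"{base}_{env.upper()}") or os.environ.get(base)
    if value is None:
        raise KeyError(f"neither {base}_{env.upper()} nor {base} is set")
    return value

gms_url = resolve_secret("DATAHUB_GMS_URL", "dev")
token = resolve_secret("DATAHUB_GMS_TOKEN", "dev")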

πŸ” Security Best Practices

Secrets Management

  • GitHub Secrets: Store sensitive values in GitHub repository secrets
  • Environment Separation: Use environment-specific secrets when possible
  • Token Rotation: Regularly rotate DataHub Personal Access Tokens
  • Least Privilege: Grant minimal necessary permissions

Access Control

  • Branch Protection: Require reviews for production deployments
  • Environment Protection: Use GitHub Environments for sensitive deployments
  • Audit Logging: All changes are tracked in Git history

🐳 Docker Deployment

Development

# Start development environment
docker-compose up -d

# Access web interface
open http://localhost:8000

Production

# Build and deploy production
docker-compose -f docker-compose.prod.yml up -d

# With NGINX reverse proxy
docker-compose -f docker-compose.prod.yml up -d nginx

Kubernetes

# Deploy with Helm
helm install datahub-recipes-manager ./helm/datahub-recipes-manager/

# Configure values
helm upgrade datahub-recipes-manager ./helm/datahub-recipes-manager/ \
  --set datahub.url=https://your-datahub.com:8080 \
  --set datahub.token=your-token

πŸ”§ DataHub CLI Environment Variables

This project uses the DataHub CLI extensively. The following environment variables are supported:

Connection Configuration

  • DATAHUB_GMS_URL (default: http://localhost:8080) - DataHub GMS instance URL
  • DATAHUB_GMS_TOKEN (default: None) - Personal Access Token for authentication
  • DATAHUB_GMS_HOST (default: localhost) - GMS host (prefer using DATAHUB_GMS_URL)
  • DATAHUB_GMS_PORT (default: 8080) - GMS port (prefer using DATAHUB_GMS_URL)
  • DATAHUB_GMS_PROTOCOL (default: http) - Protocol (prefer using DATAHUB_GMS_URL)

CLI Behavior

  • DATAHUB_SKIP_CONFIG (default: false) - Skip creating configuration file
  • DATAHUB_TELEMETRY_ENABLED (default: true) - Enable/disable telemetry
  • DATAHUB_TELEMETRY_TIMEOUT (default: 10) - Telemetry timeout in seconds
  • DATAHUB_DEBUG (default: false) - Enable debug logging

Docker Configuration

  • DATAHUB_VERSION (default: head) - DataHub docker image version
  • ACTIONS_VERSION (default: head) - DataHub actions container version
  • DATAHUB_ACTIONS_IMAGE (default: acryldata/datahub-actions) - Actions image name

For detailed configuration guidance, see docs/Environment_Variables.md.

πŸ§ͺ Testing

Run Tests Locally

# Run all tests
bash test/run_all_tests.sh

# Run specific test categories
python -m pytest scripts/test_*.py
python -m pytest web_ui/tests/

# Test DataHub connectivity
python scripts/test_connection.py
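
A connectivity test ultimately just hits the GMS health endpoint with the configured token. A minimal sketch of what such a check might look like (not the actual scripts/test_connection.py; the /health endpoint matches the curl check in Troubleshooting below):

# Hypothetical sketch of a connectivity check, not the actual
# scripts/test_connection.py. Hits the GMS /health endpoint used by
# the curl check in the Troubleshooting section.
import os
import requests

url = os.environ["DATAHUB_GMS_URL"].rstrip("/") + "/health"
headers = {}
token = os.environ.get("DATAHUB_GMS_TOKEN")
if token:
    headers["Authorization"] = f"Bearer {token}"

resp = requests.get(url, headers=headers, timeout=10)
resp.raise_for_status()
print("DataHub GMS reachable:", resp.status_code)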

CI/CD Testing

  • Automated Testing: All PRs run comprehensive test suites
  • Integration Tests: Validate DataHub API interactions
  • Mock Testing: Test workflows without live DataHub connections
  • Validation: Ensure all metadata files are properly formatted

πŸ› οΈ Troubleshooting

Common Connection Issues

  1. 403 Forbidden Errors

    # Check environment variables
    echo $DATAHUB_GMS_URL
    echo $DATAHUB_GMS_TOKEN
    
    # Test connectivity
    curl -f $DATAHUB_GMS_URL/health
    datahub check
  2. Token Issues

    • Generate new token: DataHub UI β†’ Settings β†’ Access Tokens
    • Verify token permissions in DataHub
    • Check token expiration
  3. URL Format Issues

    # Correct formats
    export DATAHUB_GMS_URL="http://localhost:8080"
    export DATAHUB_GMS_URL="https://your-instance.acryl.io/gms"

For comprehensive troubleshooting, see docs/Troubleshooting.md.

πŸ“š Documentation

Detailed guides live in the docs/ directory, including Environment_Variables.md (DataHub CLI configuration) and Troubleshooting.md (common issues and solutions).

🀝 Contributing

We welcome contributions! Please follow these steps:

  1. Fork the repository and create a feature branch
  2. Make your changes with appropriate tests
  3. Follow code style guidelines (use black for Python formatting)
  4. Update documentation as needed
  5. Submit a pull request with a clear description

Development Setup

# Install development dependencies
pip install -r requirements-dev.txt

# Run code formatting
black .

# Run linting
flake8 .

# Run tests before submitting
bash test/run_all_tests.sh

πŸ“„ License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

πŸ™ Acknowledgments

  • DataHub Project - The open-source metadata platform
  • Acryl Data - Maintainers of the DataHub SDK
  • Community Contributors - Everyone who has helped improve this project

πŸ†˜ Support

  • GitHub Issues: Report bugs and request features
  • Discussions: Ask questions and share ideas
  • Documentation: Check the docs/ directory for detailed guides
  • Community: Join the DataHub Slack community

Ready to get started? Follow the Quick Start guide above, or jump into the web interface for a guided experience!
