A comprehensive solution for managing DataHub metadata, ingestion recipes, and policies through version control, CI/CD pipelines, and a modern web interface. This project enables teams to treat DataHub metadata as code with proper versioning, testing, and deployment practices.
- Recipe Management: Create, deploy, and manage DataHub ingestion recipes with templates and environment-specific configurations
- Metadata Management: Manage tags, glossary terms, domains, structured properties, data contracts, and assertions
- Policy Management: Version control DataHub access policies with automated deployment
- Multi-Environment Support: Separate configurations for dev, staging, and production environments
- CI/CD Integration: Automated workflows for testing, validation, and deployment via GitHub Actions
- Web Interface: Modern Django-based UI for managing all aspects of your DataHub metadata
- Staging Workflow: Stage changes locally before deploying to DataHub instances
- Individual JSON Files: Each metadata entity (assertion, tag, glossary term, etc.) is stored as individual JSON files
- MCP File Processing: Uses DataHub's metadata-file source for batch operations
- URN Mutation: Automatically mutates URNs for cross-environment deployments (see the sketch after this list)
- GitHub Integration: Seamless integration with GitHub for version control and CI/CD
- Platform Instance Mapping: Map entities between different platform instances across environments
- Mutation System: Apply transformations to metadata when moving between environments
- Secrets Management: Secure handling of sensitive data through GitHub Secrets integration
- Connection Management: Multiple DataHub connection configurations with health monitoring
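
To make the URN mutation idea concrete, here is a minimal sketch of what such a transformation can look like. The `mutate_urn` helper and its environment/instance-name conventions are illustrative assumptions for this example, not the project's actual API:

```python
import re
from typing import Dict


def mutate_urn(urn: str, target_env: str, instance_map: Dict[str, str]) -> str:
    """Swap the environment qualifier in a dataset URN and remap platform
    instance names (hypothetical helper, not the project's actual API)."""
    # Dataset URNs end with an environment qualifier, e.g. ",PROD)".
    urn = re.sub(r",(DEV|STAGING|PROD)\)$", f",{target_env})", urn)
    # Apply any platform-instance renames, e.g. a dev database name
    # that differs from its production counterpart.
    for source, target in instance_map.items():
        urn = urn.replace(source, target)
    return urn


print(mutate_urn(
    "urn:li:dataset:(urn:li:dataPlatform:postgres,dev_analytics.public.users,DEV)",
    "PROD",
    {"dev_analytics": "analytics"},
))
# urn:li:dataset:(urn:li:dataPlatform:postgres,analytics.public.users,PROD)
```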
The system separates metadata management into distinct layers:
- Templates - Reusable, parameterized patterns with environment variable placeholders
- Environment Variables - Environment-specific configuration values stored as YAML
- Instances - Combination of templates and environment variables for deployment (illustrated in the sketch after this list)
- Metadata Files - Individual JSON files for each metadata entity (tags, glossary, domains, etc.)
- MCP Files - Batch metadata change proposals processed by DataHub workflows
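
As a rough illustration of how the first three layers combine, the sketch below renders a template into a deployable instance by substituting environment values. The file paths and the `${PLACEHOLDER}` convention are assumptions for the example, following the repository layout shown below:

```python
from string import Template

import yaml

# Hypothetical paths following the repository layout shown below.
template_text = open("recipes/templates/postgres.yml").read()
env_config = yaml.safe_load(open("params/environments/prod/analytics-db.yml"))

# Fill ${PLACEHOLDER} markers in the template with the environment's
# parameter values to produce a concrete, deployable instance.
rendered = Template(template_text).safe_substitute(env_config["parameters"])

with open("recipes/instances/prod/analytics-db.yml", "w") as f:
    f.write(rendered)
```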
```
datahub-recipes-manager/
├── .github/workflows/           # GitHub Actions for CI/CD
│   ├── manage-*.yml             # Metadata processing workflows
│   ├── manage-assertions.yml    # Assertion processing
│   └── README.md                # Workflow documentation
├── docker/                      # Container deployment
│   ├── Dockerfile               # Application container
│   ├── docker-compose.yml       # Development setup
│   └── nginx/                   # Web server configuration
├── docs/                        # Comprehensive documentation
│   ├── Environment_Variables.md # DataHub CLI environment variables
│   ├── Troubleshooting.md       # Common issues and solutions
│   └── *.md                     # Additional guides
├── helm/                        # Kubernetes deployment charts
├── metadata-manager/            # Metadata entity storage
│   ├── dev/                     # Development environment
│   │   ├── assertions/          # Individual assertion JSON files
│   │   ├── tags/                # Tag MCP files
│   │   ├── glossary/            # Glossary MCP files
│   │   ├── domains/             # Domain MCP files
│   │   ├── structured_properties/ # Properties MCP files
│   │   ├── metadata_tests/      # Test MCP files
│   │   └── data_products/       # Data product MCP files
│   ├── staging/                 # Staging environment
│   └── prod/                    # Production environment
├── recipes/                     # DataHub ingestion recipes
│   ├── templates/               # Parameterized recipe templates
│   ├── instances/               # Environment-specific instances
│   └── pulled/                  # Recipes pulled from DataHub
├── params/                      # Environment variables and parameters
│   ├── environments/            # Environment-specific YAML configs
│   └── default_params.yaml      # Global defaults
├── policies/                    # DataHub access policies
├── scripts/                     # Python utilities and automation
│   ├── mcps/                    # Metadata change proposal utilities
│   ├── assertions/              # Assertion management scripts
│   ├── domains/                 # Domain management scripts
│   ├── glossary/                # Glossary management scripts
│   └── tags/                    # Tag management scripts
├── web_ui/                      # Django web application
│   ├── metadata_manager/        # Metadata management app
│   ├── templates/               # HTML templates
│   ├── static/                  # CSS, JavaScript, images
│   └── migrations/              # Database schema migrations
└── utils/                       # Shared utilities and DataHub API wrappers
```
- Python 3.8+ with pip
- DataHub instance with API access
- Personal Access Token for DataHub authentication
- Git for version control
- GitHub account (for CI/CD features)
1. Clone the repository:

   ```bash
   git clone https://github.com/your-org/datahub-recipes-manager.git
   cd datahub-recipes-manager
   ```

2. Set up the Python environment:

   ```bash
   # Create a virtual environment
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate

   # Install dependencies
   pip install -r requirements.txt
   ```

3. Configure environment variables:

   ```bash
   # Copy the example environment file
   cp .env.example .env

   # Edit with your DataHub connection details
   nano .env
   ```

   Required environment variables:

   ```bash
   DATAHUB_GMS_URL=http://your-datahub-instance:8080
   DATAHUB_GMS_TOKEN=your-personal-access-token
   DJANGO_SECRET_KEY=your-secret-key
   ```

4. Initialize the database:

   ```bash
   cd web_ui
   python manage.py migrate
   python manage.py collectstatic --noinput
   cd ..
   ```

5. Start the web interface:

   ```bash
   cd web_ui
   python manage.py runserver
   ```

   Access the web interface at http://localhost:8000.
The Django-based web interface provides comprehensive management capabilities:
- Connection Status: Real-time DataHub connectivity monitoring
- Metadata Overview: Summary of tags, glossary terms, domains, and other entities
- Recent Activity: Latest changes and deployments
- Environment Health: Status across dev, staging, and production
- Tags: Create, edit, and manage DataHub tags with hierarchical relationships
- Glossary: Manage business glossary terms and their relationships
- Domains: Organize data assets into logical business domains
- Structured Properties: Define and manage custom metadata properties
- Data Contracts: Version control data quality contracts
- Assertions: Create and manage data quality assertions
- Templates: Create reusable recipe patterns with parameterization
- Instances: Deploy recipes with environment-specific configurations
- Environment Variables: Manage secrets and configuration separately
- Deployment Status: Track which recipes are deployed where
- Access Policies: Define who can access what data
- Platform Policies: Manage platform-level permissions
- Policy Templates: Create reusable policy patterns
Assertions are stored as individual JSON files in `metadata-manager/<environment>/assertions/`:
```json
{
  "id": "assertion_123",
  "type": "FRESHNESS",
  "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:postgres,public.users,PROD)",
  "source": "EXTERNAL",
  "config": {
    "type": "FRESHNESS",
    "freshnessAssertion": {
      "schedule": {
        "cron": "0 9 * * *",
        "timezone": "UTC"
      }
    }
  }
}
```
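
Because each assertion lives in its own file, a simple pre-deployment check can walk the directory and validate file shape before anything is sent to DataHub. A minimal sketch, assuming the top-level keys shown above are required:

```python
import json
from pathlib import Path
from typing import List

REQUIRED_KEYS = {"id", "type", "entityUrn", "config"}  # assumed minimal schema


def validate_assertion_files(env: str) -> List[str]:
    """Report assertion JSON files that are unparseable or missing keys."""
    errors = []
    for path in Path(f"metadata-manager/{env}/assertions").glob("*.json"):
        try:
            data = json.loads(path.read_text())
        except json.JSONDecodeError as exc:
            errors.append(f"{path.name}: invalid JSON ({exc})")
            continue
        missing = REQUIRED_KEYS - data.keys()
        if missing:
            errors.append(f"{path.name}: missing keys {sorted(missing)}")
    return errors


if __name__ == "__main__":
    for problem in validate_assertion_files("dev"):
        print(problem)
```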
Other metadata entities use MCP (Metadata Change Proposal) files for batch operations:
```json
{
  "version": "1.0",
  "source": {
    "type": "metadata-file",
    "config": {}
  },
  "sink": {
    "type": "datahub-rest",
    "config": {
      "server": "${DATAHUB_GMS_URL}",
      "token": "${DATAHUB_GMS_TOKEN}"
    }
  },
  "entities": [
    {
      "entityType": "tag",
      "entityUrn": "urn:li:tag:PII",
      "aspects": [
        {
          "aspectName": "tagKey",
          "aspect": {
            "name": "PII"
          }
        }
      ]
    }
  ]
}
```
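
For orientation, a single entry like the tag above can also be emitted directly with the DataHub Python SDK. This is a minimal sketch (using `TagPropertiesClass` rather than the raw `tagKey` aspect), not the exact code the workflows run:

```python
import os

from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import TagPropertiesClass

# Connect with the same variables the MCP files reference.
emitter = DatahubRestEmitter(
    gms_server=os.environ["DATAHUB_GMS_URL"],
    token=os.environ.get("DATAHUB_GMS_TOKEN"),
)

# Upsert one aspect for the PII tag, roughly what the batch workflow
# does for each entity entry in an MCP file.
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn="urn:li:tag:PII",
        aspect=TagPropertiesClass(name="PII"),
    )
)
```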
Environment variables are stored as structured YAML files in `params/environments/<env>/`:
```yaml
# params/environments/prod/analytics-db.yml
name: "Analytics Database"
description: "Production analytics database connection"
recipe_type: "postgres"
parameters:
  POSTGRES_HOST: "analytics-db.company.com"
  POSTGRES_PORT: 5432
  POSTGRES_DATABASE: "analytics"
  INCLUDE_VIEWS: true
  INCLUDE_TABLES: "*"
secret_references:
  - POSTGRES_USERNAME
  - POSTGRES_PASSWORD
```
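
At deploy time, the `secret_references` entries are resolved from the runtime environment (populated from GitHub Secrets in CI) rather than from the YAML itself. A sketch of that merge, with a hypothetical `resolve_parameters` helper:

```python
import os

import yaml


def resolve_parameters(param_file: str) -> dict:
    """Merge plain parameters with secrets resolved from the environment."""
    config = yaml.safe_load(open(param_file))
    params = dict(config.get("parameters", {}))
    for secret_name in config.get("secret_references", []):
        params[secret_name] = os.environ[secret_name]  # fail fast if unset
    return params


params = resolve_parameters("params/environments/prod/analytics-db.yml")
```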
The project includes comprehensive CI/CD workflows:
- `manage-tags.yml` - Processes tag MCP files
- `manage-glossary.yml` - Processes glossary MCP files
- `manage-domains.yml` - Processes domain MCP files
- `manage-structured-properties.yml` - Processes properties MCP files
- `manage-metadata-tests.yml` - Processes test MCP files
- `manage-data-products.yml` - Processes data product MCP files
- `manage-assertions.yml` - Processes individual assertion JSON files
- Automatic Triggers: Run on changes to relevant files
- Environment Matrix: Process dev, staging, and prod separately
- Dry Run Support: PRs run in validation mode
- Error Handling: Comprehensive error reporting and rollback
- Artifact Storage: Store processing results and logs
Configure these secrets in your GitHub repository:
```bash
# Global DataHub connection
DATAHUB_GMS_URL=https://your-datahub.company.com:8080
DATAHUB_GMS_TOKEN=your-global-token

# Environment-specific (optional)
DATAHUB_GMS_URL_DEV=https://dev-datahub.company.com:8080
DATAHUB_GMS_TOKEN_DEV=your-dev-token
DATAHUB_GMS_URL_PROD=https://prod-datahub.company.com:8080
DATAHUB_GMS_TOKEN_PROD=your-prod-token
```
- GitHub Secrets: Store sensitive values in GitHub repository secrets
- Environment Separation: Use environment-specific secrets when possible
- Token Rotation: Regularly rotate DataHub Personal Access Tokens
- Least Privilege: Grant minimal necessary permissions
- Branch Protection: Require reviews for production deployments
- Environment Protection: Use GitHub Environments for sensitive deployments
- Audit Logging: All changes are tracked in Git history
```bash
# Start development environment
docker-compose up -d

# Access web interface
open http://localhost:8000
```

```bash
# Build and deploy production
docker-compose -f docker-compose.prod.yml up -d

# With NGINX reverse proxy
docker-compose -f docker-compose.prod.yml up -d nginx
```
```bash
# Deploy with Helm
helm install datahub-recipes-manager ./helm/datahub-recipes-manager/

# Configure values
helm upgrade datahub-recipes-manager ./helm/datahub-recipes-manager/ \
  --set datahub.url=https://your-datahub.com:8080 \
  --set datahub.token=your-token
```
This project uses the DataHub CLI extensively. The following environment variables are supported:
- `DATAHUB_GMS_URL` (default: `http://localhost:8080`) - DataHub GMS instance URL
- `DATAHUB_GMS_TOKEN` (default: `None`) - Personal Access Token for authentication
- `DATAHUB_GMS_HOST` (default: `localhost`) - GMS host (prefer `DATAHUB_GMS_URL`)
- `DATAHUB_GMS_PORT` (default: `8080`) - GMS port (prefer `DATAHUB_GMS_URL`)
- `DATAHUB_GMS_PROTOCOL` (default: `http`) - Protocol (prefer `DATAHUB_GMS_URL`)
- `DATAHUB_SKIP_CONFIG` (default: `false`) - Skip creating a configuration file
- `DATAHUB_TELEMETRY_ENABLED` (default: `true`) - Enable/disable telemetry
- `DATAHUB_TELEMETRY_TIMEOUT` (default: `10`) - Telemetry timeout in seconds
- `DATAHUB_DEBUG` (default: `false`) - Enable debug logging
- `DATAHUB_VERSION` (default: `head`) - DataHub Docker image version
- `ACTIONS_VERSION` (default: `head`) - DataHub actions container version
- `DATAHUB_ACTIONS_IMAGE` (default: `acryldata/datahub-actions`) - Actions image name
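
The host/port/protocol variables exist for backward compatibility; a full URL takes precedence. Illustrative resolution logic (an assumption about the precedence described above, not the CLI's exact implementation):

```python
import os


def resolve_gms_url() -> str:
    """Prefer DATAHUB_GMS_URL; otherwise assemble a URL from the legacy
    protocol/host/port variables."""
    url = os.environ.get("DATAHUB_GMS_URL")
    if url:
        return url
    protocol = os.environ.get("DATAHUB_GMS_PROTOCOL", "http")
    host = os.environ.get("DATAHUB_GMS_HOST", "localhost")
    port = os.environ.get("DATAHUB_GMS_PORT", "8080")
    return f"{protocol}://{host}:{port}"
```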
For detailed configuration guidance, see Environment Variables Documentation.
```bash
# Run all tests
bash test/run_all_tests.sh

# Run specific test categories
python -m pytest scripts/test_*.py
python -m pytest web_ui/tests/

# Test DataHub connectivity
python scripts/test_connection.py
```
- Automated Testing: All PRs run comprehensive test suites
- Integration Tests: Validate DataHub API interactions
- Mock Testing: Test workflows without live DataHub connections
- Validation: Ensure all metadata files are properly formatted
1. 403 Forbidden Errors

   ```bash
   # Check environment variables
   echo $DATAHUB_GMS_URL
   echo $DATAHUB_GMS_TOKEN

   # Test connectivity
   curl -f $DATAHUB_GMS_URL/health
   datahub check
   ```

2. Token Issues

   - Generate a new token: DataHub UI → Settings → Access Tokens
   - Verify token permissions in DataHub
   - Check token expiration

3. URL Format Issues

   ```bash
   # Correct formats
   export DATAHUB_GMS_URL="http://localhost:8080"
   export DATAHUB_GMS_URL="https://your-instance.acryl.io/gms"
   ```
For comprehensive troubleshooting, see Troubleshooting Guide.
- Environment Variables Guide - Complete environment variable reference
- Troubleshooting Guide - Common issues and solutions
- Workflow Documentation - GitHub Actions workflow details
- DataHub Documentation - Official DataHub documentation
We welcome contributions! Please follow these steps:
- Fork the repository and create a feature branch
- Make your changes with appropriate tests
- Follow code style guidelines (use `black` for Python formatting)
- Update documentation as needed
- Submit a pull request with a clear description
```bash
# Install development dependencies
pip install -r requirements-dev.txt

# Run code formatting
black .

# Run linting
flake8 .

# Run tests before submitting
bash test/run_all_tests.sh
```
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
- DataHub Project - The open-source metadata platform
- Acryl Data - Maintainers of the DataHub SDK
- Community Contributors - Everyone who has helped improve this project
- GitHub Issues: Report bugs and request features
- Discussions: Ask questions and share ideas
- Documentation: Check the docs/ directory for detailed guides
- Community: Join the DataHub Slack community
Ready to get started? Follow the Quick Start guide above, or jump into the web interface for a guided experience!