This **Data Lake Platform** is a comprehensive Docker-based solution for modern data engineering workflows.

Compre & Alugue Agora

📌 Project Overview

This Docker Compose configuration provides a complete data lake architecture with all necessary components for modern data processing:

  • Data ingestion (batch and streaming)
  • Storage (structured and unstructured)
  • Processing (batch and real-time)
  • Orchestration
  • Visualization
  • Data version control

πŸ—οΈ Architecture Components

πŸ“₯ Ingestion Layer

Service Description Ports Credentials
Redpanda Kafka-compatible message broker with Schema Registry 9092, 8081-8082, 29092 -
Debezium Change Data Capture (CDC) connector 8083 -
MongoDB RS 3-node replica set with Percona Server for MongoDB 27017-29017 admin/pass
MinIO S3-compatible object storage with pre-configured "datalake" bucket 9000-9001 admin/password
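As a sketch of how the ingestion pieces fit together: Debezium's Kafka Connect REST API (port 8083 above) accepts connector definitions as JSON. The payload below is a minimal, hypothetical MongoDB source connector; the connector name, topic prefix, and the replica-set name `rs0` in the connection string are illustrative assumptions, not values confirmed by this compose file.

```python
import json
from urllib import request

def mongo_connector_payload(name="mongo-cdc", topic_prefix="dl"):
    """Build a hypothetical Debezium MongoDB source-connector definition.

    The credentials match the table above; the replica-set name "rs0"
    is an assumption -- check the rs-init script for the real one.
    """
    return {
        "name": name,
        "config": {
            "connector.class": "io.debezium.connector.mongodb.MongoDbConnector",
            "mongodb.connection.string": "mongodb://admin:pass@rs101:27017/?replicaSet=rs0",
            "topic.prefix": topic_prefix,
        },
    }

def register(payload, url="http://localhost:8083/connectors"):
    """POST the connector definition to Kafka Connect (stack must be running)."""
    req = request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    return request.urlopen(req)

# register(mongo_connector_payload())  # only works with the stack up
```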

💾 Storage Layer

| Service | Description | Ports | Credentials |
| --- | --- | --- | --- |
| PostgreSQL | Metadata storage for Airflow and Superset | 5432 | bn_airflow/bitnami1 (Airflow), superset/superset |
| Redis | Task queue backend for Airflow and Superset | 6379 | bitnami1456 |
| MongoDB RS | Document storage for processed data | 27017-29017 | admin/pass |

⚙️ Processing Layer

| Service | Description | Ports | Credentials |
| --- | --- | --- | --- |
| Apache Spark | Distributed processing with Jupyter Lab | 28080, 7077, 8081, 8888 | - |
| Dremio | SQL query engine with data virtualization | 9047, 31010, 32010 | - |
| Nessie | Git-like data version control for tables | 19120 | - |
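To read the MinIO "datalake" bucket from Spark, the usual route is Hadoop's s3a connector. A minimal sketch of the settings, assuming the in-network endpoint is `minio:9000` (the service hostname is an assumption about the compose network) and the credentials from the ingestion table above:

```python
def minio_spark_conf(endpoint="http://minio:9000",
                     access_key="admin", secret_key="password"):
    """Hadoop s3a settings for pointing Spark at the MinIO bucket.

    The hostname "minio" is an assumed compose service name; from the
    host you would use localhost:9000 instead.
    """
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.connection.ssl.enabled": "false",
    }

# Applied when building a session inside Jupyter Lab (pyspark assumed
# available in the Spark image; the bucket path is illustrative):
#
#   builder = SparkSession.builder.appName("datalake")
#   for k, v in minio_spark_conf().items():
#       builder = builder.config(k, v)
#   spark = builder.getOrCreate()
#   df = spark.read.parquet("s3a://datalake/raw/events")
```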

📊 Visualization & Monitoring

| Service | Description | Ports | Credentials |
| --- | --- | --- | --- |
| Superset | Business intelligence platform with Redis caching | 8088 | - |
| Redpanda Console | Kafka topic monitoring and management | 8090 | - |
| Airflow UI | Workflow orchestration and monitoring | 8080 | user/bitnami123 |
| Spark UI | Spark cluster monitoring | 28080, 4040-4045 | - |

🔄 Orchestration

| Service | Description | Ports | Credentials |
| --- | --- | --- | --- |
| Airflow | Complete workflow management system with Celery | 8080 | user/bitnami123 |
| Redis | Backend for Airflow and Superset task queues | 6379 | bitnami1456 |

🚀 Deployment Guide

Prerequisites

  • Docker 20.10+
  • Docker Compose 2.4+
  • 8GB+ RAM recommended (16GB for optimal performance)
  • At least 20GB disk space

Starting the Environment

docker-compose up -d

Design

┌───────────────────────────────────────────────────────────────┐
│                        Data Producers                         │
│  (Applications, Databases, IoT Devices, Files, APIs, etc.)    │
└───────────────────────────────┬───────────────────────────────┘
                                │
                                v
┌───────────────────────────────────────────────────────────────┐
│                        Ingestion Layer                        │
│  ┌─────────────┐    ┌─────────────┐    ┌──────────────────┐   │
│  │ Debezium CDC│    │   Spark     │    │   Redpanda       │   │
│  │ (Mongo/Post)│    │ (Streaming) │    │ (Message Queue)  │   │
│  └──────┬──────┘    └──────┬──────┘    └────────┬─────────┘   │
│         │                  │                    │             │
└─────────┼──────────────────┼────────────────────┼─────────────┘
          │                  │                    │
          v                  v                    v
┌───────────────────────────────────────────────────────────────┐
│                        Storage Layer                          │
│  ┌─────────────┐    ┌──────────────────┐    ┌─────────────┐   │
│  │  MinIO      │    │ MongoDB Replica  │    │ PostgreSQL  │   │
│  │ (Raw Zone)  │    │ (Processed Data) │    │ (Metadata)  │   │
│  └──────┬──────┘    └────────┬─────────┘    └──────┬──────┘   │
└─────────┼────────────────────┼─────────────────────┼──────────┘
          │                    │                     │
          v                    v                     v
┌───────────────────────────────────────────────────────────────┐
│                       Processing Layer                        │
│  ┌─────────────┐    ┌──────────────────┐    ┌─────────────┐   │
│  │   Spark     │    │     Dremio       │    │   Nessie    │   │
│  │ (Batch/ML)  │    │ (SQL Virtualiz.) │    │ (Data Git)  │   │
│  └──────┬──────┘    └────────┬─────────┘    └──────┬──────┘   │
└─────────┼────────────────────┼─────────────────────┼──────────┘
          │                    │                     │
          v                    v                     v
┌───────────────────────────────────────────────────────────────┐
│                      Visualization Layer                      │
│  ┌─────────────┐    ┌──────────────────┐    ┌─────────────┐   │
│  │  Superset   │    │    Airflow UI    │    │ Spark UI    │   │
│  │ (BI Tools)  │    │ (Orchestration)  │    │ (Monitoring)│   │
│  └─────────────┘    └──────────────────┘    └─────────────┘   │
└───────────────────────────────────────────────────────────────┘

🔄 Initialization Process

  1. MongoDB replica set initialization
    Automatic via rs-init container (3-node replica set: rs101, rs102, rs103)

  2. MinIO bucket creation
    Pre-configured "datalake" bucket with sample data from ./minio-data directory

  3. Airflow database initialization
    PostgreSQL database automatically created with:

    • Database: bitnami_airflow
    • User: bn_airflow
    • Password: bitnami1
  4. Superset database initialization
    PostgreSQL database automatically created with:

    • Database: superset
    • User: superset
    • Password: superset
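Once initialization finishes, client code connects with ordinary connection strings built from the values above. A minimal sketch, assuming the in-network hostnames match the container names and assuming a replica-set name of `rs0` (neither is confirmed by this README; check the rs-init script and compose file):

```python
def mongo_uri(hosts=("rs101:27017", "rs102:27017", "rs103:27017"),
              user="admin", password="pass", replica_set="rs0"):
    """Replica-set URI for the 3-node MongoDB cluster.

    replica_set="rs0" is an assumed name, not taken from this README.
    """
    return (f"mongodb://{user}:{password}@{','.join(hosts)}/"
            f"?replicaSet={replica_set}")

def airflow_pg_uri(user="bn_airflow", password="bitnami1",
                   host="postgresql", db="bitnami_airflow"):
    """Airflow metadata DB URI; host "postgresql" is an assumed service name."""
    return f"postgresql://{user}:{password}@{host}:5432/{db}"
```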

🌐 Accessing Services

| Service | URL | Credentials |
| --- | --- | --- |
| Airflow | http://localhost:8080 | user / bitnami123 |
| Superset | http://localhost:8088 | (Setup during first access) |
| MinIO Console | http://localhost:9001 | admin / password |
| Spark UI | http://localhost:28080 | - |
| Jupyter Lab | http://localhost:8888 | (No authentication) |
| Redpanda Console | http://localhost:8090 | - |
| Dremio | http://localhost:9047 | (Setup during first access) |
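Some of these services take a minute or two to come up after `docker-compose up -d`. A small stdlib-only sketch for checking which of the UIs above are answering yet (ports taken from the table; `localhost` assumed):

```python
import socket

# Ports from the "Accessing Services" table above.
SERVICES = {
    "Airflow": 8080, "Superset": 8088, "MinIO Console": 9001,
    "Spark UI": 28080, "Jupyter Lab": 8888,
    "Redpanda Console": 8090, "Dremio": 9047,
}

def port_open(port, host="localhost", timeout=1.0):
    """Return True if something is listening on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_all():
    """Map each service name to whether its port currently accepts connections."""
    return {name: port_open(port) for name, port in SERVICES.items()}
```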

🔧 Configuration Details

💻 Resource Allocation

  • MongoDB Nodes: 1 CPU, 1GB RAM each
    (rs101, rs102, rs103 containers)
  • Redpanda: 1 CPU, 1GB RAM
    (Kafka-compatible message broker)
  • Spark: 2 CPU, 2GB RAM
    (Includes Jupyter Lab and Spark cluster)
  • Airflow: 1 CPU, 1GB RAM (each component)
    (scheduler, worker, and webserver)
  • Superset: 1 CPU, 1GB RAM (each component)
    (webserver and worker)

💾 Persistent Volumes

  • psmdb_data1-3: MongoDB data volumes
    (For replica set members)
  • postgres_data: PostgreSQL database storage
    (Used by Airflow and Superset)
  • redis_data: Redis task queue storage
    (For Airflow and Superset caching)
  • ./minio-data: MinIO local folder
    (Mounted to container for initial data)

🛠️ Troubleshooting

🔄 Initialization Issues

# Check MongoDB replica set status
docker exec -it rs101 mongosh --eval "rs.status()"

# Verify MinIO bucket creation
docker logs minio

📚 Additional Resources

  • Core Technologies
  • Data Storage
  • Data Processing
  • Visualization
