This Docker Compose configuration provides a complete data lake architecture with all necessary components for modern data processing:
- Data ingestion (batch and streaming)
- Storage (structured and unstructured)
- Processing (batch and real-time)
- Orchestration
- Visualization
- Data version control

| Service | Description | Ports | Credentials |
|---|---|---|---|
| Redpanda | Kafka-compatible message broker with Schema Registry | 9092, 8081-8082, 29092 | - |
| Debezium | Change Data Capture (CDC) connector | 8083 | - |
| MongoDB RS | 3-node replica set with Percona Server for MongoDB | 27017-29017 | admin/pass |
| MinIO | S3-compatible object storage with pre-configured "datalake" bucket | 9000-9001 | admin/password |

| Service | Description | Ports | Credentials |
|---|---|---|---|
| PostgreSQL | Metadata storage for Airflow and Superset | 5432 | bn_airflow/bitnami1 (Airflow), superset/superset (Superset) |
| Redis | Task queue backend for Airflow and Superset | 6379 | bitnami1456 |
| MongoDB RS | Document storage for processed data | 27017-29017 | admin/pass |

| Service | Description | Ports | Credentials |
|---|---|---|---|
| Apache Spark | Distributed processing with Jupyter Lab | 28080, 7077, 8081, 8888 | - |
| Dremio | SQL query engine with data virtualization | 9047, 31010, 32010 | - |
| Nessie | Git-like data version control for tables | 19120 | - |

| Service | Description | Ports | Credentials |
|---|---|---|---|
| Superset | Business intelligence platform with Redis caching | 8088 | - |
| Redpanda Console | Kafka topic monitoring and management | 8090 | - |
| Airflow UI | Workflow orchestration and monitoring | 8080 | user/bitnami123 |
| Spark UI | Spark cluster monitoring | 28080, 4040-4045 | - |

| Service | Description | Ports | Credentials |
|---|---|---|---|
| Airflow | Complete workflow management system with Celery | 8080 | user/bitnami123 |
| Redis | Backend for Airflow and Superset task queues | 6379 | bitnami1456 |

- Docker 20.10+
- Docker Compose 2.4+
- 8GB+ RAM recommended (16GB for optimal performance)
- At least 20GB disk space

```bash
docker-compose up -d
```
```
┌───────────────────────────────────────────────────────────────┐
│                        Data Producers                         │
│   (Applications, Databases, IoT Devices, Files, APIs, etc.)   │
└───────────────────────────────┬───────────────────────────────┘
                                │
                                v
┌───────────────────────────────────────────────────────────────┐
│                        Ingestion Layer                        │
│   ┌─────────────┐   ┌─────────────┐   ┌──────────────────┐    │
│   │ Debezium CDC│   │    Spark    │   │     Redpanda     │    │
│   │ (Mongo/Post)│   │ (Streaming) │   │ (Message Queue)  │    │
│   └──────┬──────┘   └──────┬──────┘   └────────┬─────────┘    │
│          │                 │                   │              │
└──────────┼─────────────────┼───────────────────┼──────────────┘
           │                 │                   │
           v                 v                   v
┌───────────────────────────────────────────────────────────────┐
│                         Storage Layer                         │
│   ┌─────────────┐   ┌──────────────────┐   ┌─────────────┐    │
│   │    MinIO    │   │ MongoDB Replica  │   │ PostgreSQL  │    │
│   │ (Raw Zone)  │   │ (Processed Data) │   │ (Metadata)  │    │
│   └──────┬──────┘   └────────┬─────────┘   └──────┬──────┘    │
└──────────┼───────────────────┼────────────────────┼───────────┘
           │                   │                    │
           v                   v                    v
┌───────────────────────────────────────────────────────────────┐
│                       Processing Layer                        │
│   ┌─────────────┐   ┌──────────────────┐   ┌─────────────┐    │
│   │    Spark    │   │      Dremio      │   │   Nessie    │    │
│   │ (Batch/ML)  │   │ (SQL Virtualiz.) │   │ (Data Git)  │    │
│   └──────┬──────┘   └────────┬─────────┘   └──────┬──────┘    │
└──────────┼───────────────────┼────────────────────┼───────────┘
           │                   │                    │
           v                   v                    v
┌───────────────────────────────────────────────────────────────┐
│                      Visualization Layer                      │
│   ┌─────────────┐   ┌──────────────────┐   ┌─────────────┐    │
│   │  Superset   │   │    Airflow UI    │   │  Spark UI   │    │
│   │ (BI Tools)  │   │ (Orchestration)  │   │ (Monitoring)│    │
│   └─────────────┘   └──────────────────┘   └─────────────┘    │
└───────────────────────────────────────────────────────────────┘
```

- MongoDB replica set initialization: automatic via the `rs-init` container (3-node replica set: rs101, rs102, rs103).
- MinIO bucket creation: pre-configured "datalake" bucket with sample data from the `./minio-data` directory.
- Airflow database initialization: PostgreSQL database automatically created with:
  - Database: `bitnami_airflow`
  - User: `bn_airflow`
  - Password: `bitnami1`
- Superset database initialization: PostgreSQL database automatically created with:
  - Database: `superset`
  - User: `superset`
  - Password: `superset`
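
The database names and credentials listed above can be assembled into connection URIs for a quick check from the host. The following is a minimal sketch, assuming PostgreSQL is reachable on `localhost:5432` (the port given in the services table); the helper `postgres_uri` is a hypothetical name introduced only for this illustration.

```python
from urllib.parse import quote


def postgres_uri(user: str, password: str, database: str,
                 host: str = "localhost", port: int = 5432) -> str:
    """Build a PostgreSQL connection URI; user and password are percent-encoded."""
    return (f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
            f"@{host}:{port}/{database}")


# Values copied from the initialization list above; host and port are assumptions.
AIRFLOW_DB_URI = postgres_uri("bn_airflow", "bitnami1", "bitnami_airflow")
SUPERSET_DB_URI = postgres_uri("superset", "superset", "superset")

if __name__ == "__main__":
    print(AIRFLOW_DB_URI)
    print(SUPERSET_DB_URI)
```

Either URI can be passed directly to a client such as `psql`.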

| Service | URL | Credentials |
|---|---|---|
| Airflow | http://localhost:8080 | user / bitnami123 |
| Superset | http://localhost:8088 | (Set up on first access) |
| MinIO Console | http://localhost:9001 | admin / password |
| Spark UI | http://localhost:28080 | - |
| Jupyter Lab | http://localhost:8888 | (No authentication) |
| Redpanda Console | http://localhost:8090 | - |
| Dremio | http://localhost:9047 | (Set up on first access) |
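
Once the stack is running, it can be useful to confirm that each web interface is actually listening. The sketch below only attempts a TCP connection to each port from the table above; it assumes the services are published on `localhost` and does not check that the application behind the port is healthy. The names `SERVICES`, `is_port_open`, and `check_services` are illustrative.

```python
import socket

# Host ports copied from the access table above.
SERVICES = {
    "Airflow": 8080,
    "Superset": 8088,
    "MinIO Console": 9001,
    "Spark UI": 28080,
    "Jupyter Lab": 8888,
    "Redpanda Console": 8090,
    "Dremio": 9047,
}


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(host: str = "localhost") -> dict:
    """Map each service name to whether its port accepts connections."""
    return {name: is_port_open(host, port) for name, port in SERVICES.items()}


if __name__ == "__main__":
    for name, up in check_services().items():
        print(f"{name:<18} {'up' if up else 'down'}")
```

A port reported as "down" shortly after startup may simply mean the container is still initializing.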

- MongoDB nodes: 1 CPU, 1GB RAM each (rs101, rs102, rs103 containers)
- Redpanda: 1 CPU, 1GB RAM (Kafka-compatible message broker)
- Spark: 2 CPU, 2GB RAM (includes Jupyter Lab and Spark cluster)
- Airflow: 1 CPU, 1GB RAM per component (scheduler, worker, and webserver)
- Superset: 1 CPU, 1GB RAM per component (webserver and worker)
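
Adding up the figures above gives a rough ceiling for the listed services. This is a sketch under two assumptions: that each figure applies per container, and that Spark with Jupyter Lab counts as a single container. Services without a listed allocation (PostgreSQL, Redis, MinIO, Dremio, Nessie, Debezium) are not included.

```python
# (CPUs, RAM in GB, number of containers) per service, from the list above.
ALLOCATIONS = {
    "MongoDB nodes": (1, 1, 3),  # rs101, rs102, rs103
    "Redpanda": (1, 1, 1),
    "Spark": (2, 2, 1),          # assumed to be one container
    "Airflow": (1, 1, 3),        # scheduler, worker, webserver
    "Superset": (1, 1, 2),       # webserver, worker
}


def total_resources(allocations):
    """Return (total CPUs, total RAM in GB) across all containers."""
    cpus = sum(cpu * count for cpu, _, count in allocations.values())
    ram_gb = sum(ram * count for _, ram, count in allocations.values())
    return cpus, ram_gb


print(total_resources(ALLOCATIONS))  # → (11, 11)
```

If every listed container reached its allocation, the total would be 11 CPUs and 11GB of RAM, which is above the 8GB figure in the prerequisites and below the 16GB suggested for optimal performance.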

- `psmdb_data1-3`: MongoDB data volumes (for replica set members)
- `postgres_data`: PostgreSQL database storage (used by Airflow and Superset)
- `redis_data`: Redis task queue storage (for Airflow and Superset caching)
- `./minio-data`: MinIO local folder (mounted into the container for initial data)

```bash
# Check MongoDB replica set status
docker exec -it rs101 mongosh --eval "rs.status()"

# Verify MinIO bucket creation
docker logs minio
```

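
The document returned by `rs.status()` contains a `members` array whose entries carry `name`, `health`, and `stateStr` fields. The following sketch shows one way to reduce such a document to a short health summary once it has been parsed into a Python dict; how the output is captured and parsed is left out. The function name `summarize_replica_set` and the sample input are illustrative, not output from this stack.

```python
def summarize_replica_set(status: dict) -> dict:
    """Summarize an rs.status() document: member count, healthy count, primary."""
    members = status.get("members", [])
    primaries = [m["name"] for m in members if m.get("stateStr") == "PRIMARY"]
    healthy = [m["name"] for m in members if m.get("health") == 1]
    return {
        "members": len(members),
        "healthy": len(healthy),
        "primary": primaries[0] if len(primaries) == 1 else None,
        # Healthy here means exactly one primary and every member reporting health 1.
        "ok": len(primaries) == 1 and len(healthy) == len(members),
    }


# Illustrative input only: host names, ports, and states are made up for the example.
sample_status = {
    "members": [
        {"name": "rs101:27017", "health": 1, "stateStr": "PRIMARY"},
        {"name": "rs102:27017", "health": 1, "stateStr": "SECONDARY"},
        {"name": "rs103:27017", "health": 1, "stateStr": "SECONDARY"},
    ]
}

if __name__ == "__main__":
    print(summarize_replica_set(sample_status))
```

A summary with `"ok": False` right after startup may only mean that the election has not finished yet.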
- Redpanda Documentation - Kafka-compatible streaming platform
- Apache Spark Documentation - Distributed processing framework
- Apache Airflow Documentation - Workflow orchestration
- Apache Superset Documentation - Business intelligence tool
- MongoDB Replica Sets - Official replication guide
- MinIO Documentation - S3-compatible object storage
- PostgreSQL Docs - Relational database system
- Dremio Documentation - SQL query engine
- Nessie Documentation - Data version control
- Debezium Docs - Change Data Capture (CDC)
- Superset Tutorials - Creating dashboards
- Redpanda Console - Kafka topic management