Welcome to Spark Ecosystem, a fully operational local setup for experimenting with a Spark-based stack. This project leverages docker-compose
to create a cluster with Spark, Kafka, LocalStack, RabbitMQ, PostgreSQL, Hive, Airflow, and other supporting tools—all running locally! 🐳✨
- Apache Spark: A Spark master and two workers to execute distributed computations.
- Kafka: Message streaming with Kafka, including topic initialization.
- Hive Metastore: Centralized metadata storage for Spark SQL.
- LocalStack: Mock AWS services for development and testing (e.g., S3, SQS, Events).
- RabbitMQ: Message broker for your pub/sub or task queue needs.
- PostgreSQL: RDBMS support for applications like Hive and Airflow.
- Airflow: Workflow orchestration made easy.
- Superset: Data visualization and dashboarding tool.
- Jupyter: Interactive notebooks for data exploration with PySpark.
- Prometheus & Grafana: Monitoring and visualization of your cluster.
Here's a simplified text-based diagram representing the cluster architecture:
```
+--------------------------+  +--------------------------+  +--------------------------+
|       Spark Master       |  |      Spark Worker A      |  |      Spark Worker B      |
|   (spark-master:9090)    |  |  (spark-worker-a:9091)   |  |  (spark-worker-b:9093)   |
+-------------+------------+  +-------------+------------+  +-------------+------------+
              |                             |                             |
              +-----------------------------+-----------------------------+
              |           Spark Cluster Network (spark-kafka)            |
              +-----------------------------------------------------------+

+--------------------------+  +--------------------------+  +--------------------------+
|          Kafka           |  |      Hive Metastore      |  |        PostgreSQL        |
|       (kafka:9092)       |  |  (hive-metastore:9083)   |  |     (postgres:5432)      |
+-------------+------------+  +-------------+------------+  +-------------+------------+
              |                             |                             |
              +-----------------------------+-----------------------------+
              |                  Infrastructure Services                 |
              +-----------------------------------------------------------+

+--------------------------+  +--------------------------+  +--------------------------+
|        LocalStack        |  |         RabbitMQ         |  |         Airflow          |
|    (localstack:4566)     |  |  (rabbitmq:5672/15672)   |  | (airflow-webserver:8080) |
+-------------+------------+  +-------------+------------+  +-------------+------------+
              |                             |                             |
              +-----------------------------+-----------------------------+
              |               Monitoring & Data Exploration              |
              +-----------------------------------------------------------+

+--------------------------+  +--------------------------+  +--------------------------+
|        Prometheus        |  |         Grafana          |  |         Jupyter          |
|   (prometheus:19090)     |  |      (grafana:3000)      |  |      (jupyter:8888)      |
+--------------------------+  +--------------------------+  +--------------------------+

                              +--------------------------+
                              |         Superset         |
                              |     (superset:8081)      |
                              +--------------------------+
```
Note: Ports shown are host ports. Container ports might differ but are mapped accordingly. All services communicate within the spark-kafka
Docker network.
This cluster is designed for learning and experimentation. Here are a few examples of what you can do:
- Run PySpark jobs in Jupyter Notebook:
  - Access Jupyter at http://localhost:8888.
  - Create a new PySpark notebook and start experimenting with Spark DataFrames, SQL, and MLlib.
  - Example PySpark code (you can run this in Jupyter):

    ```python
    from pyspark.sql import SparkSession
    import time

    spark = SparkSession.builder.appName("EasyApp").getOrCreate()

    data = [("Anna", 25), ("Peter", 40), ("Joan", 35), ("Maria", 28)]
    columns = ["Name", "Age"]
    df = spark.createDataFrame(data, columns)
    df.show()

    filtered_df = df.filter(df.Age > 30)
    filtered_df.show()

    time.sleep(300)  # Keep the app running for Spark UI exploration
    spark.stop()
    ```
- Stream data with Kafka and process with Spark Streaming:
  - Kafka is running and topics can be created using the `init-kafka` service.
  - You can produce and consume messages to Kafka topics from Spark Streaming applications running in Jupyter or via `spark-submit`; see the sketch below.
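  As a sketch of that flow (the topic name `demo-topic` is a placeholder; use whichever topics your `init-kafka` service creates, and the Kafka connector package matching your Spark version must be on the classpath, e.g. via `spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:<version>`):

  ```python
  # Minimal Structured Streaming sketch: read from a Kafka topic and print to console.
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("KafkaStreamDemo").getOrCreate()

  # Inside the Docker network use kafka:9092; from the host use localhost:9092.
  stream_df = (
      spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "kafka:9092")
      .option("subscribe", "demo-topic")  # placeholder topic name
      .option("startingOffsets", "earliest")
      .load()
  )

  # Kafka keys/values arrive as bytes; cast to strings for inspection.
  messages = stream_df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

  query = messages.writeStream.outputMode("append").format("console").start()
  query.awaitTermination()
  ```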
- Query data using Spark SQL and the Hive Metastore:
  - Hive Metastore is configured to use PostgreSQL as the backend.
  - You can create tables in the Hive Metastore and query them using Spark SQL through Jupyter or the Spark Thrift Server.
  - Connect to the Spark Thrift Server (`localhost:10000`) using tools like Beeline or JDBC/ODBC clients.
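  A minimal sketch of that round trip from PySpark (the `people` table is just an example name; `enableHiveSupport()` makes the session register table metadata in the Hive Metastore, so it survives across sessions):

  ```python
  # Sketch: create and query a Hive-backed table from PySpark.
  from pyspark.sql import SparkSession

  spark = (
      SparkSession.builder.appName("HiveDemo")
      .enableHiveSupport()  # route table metadata through the Hive Metastore
      .getOrCreate()
  )

  spark.sql("CREATE TABLE IF NOT EXISTS people (name STRING, age INT) USING parquet")
  spark.sql("INSERT INTO people VALUES ('Anna', 25), ('Peter', 40)")
  spark.sql("SELECT name FROM people WHERE age > 30").show()
  ```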
- Visualize data with Superset:
  - Access Superset at http://localhost:8081.
  - Connect Superset to the Spark Thrift Server or directly to PostgreSQL to create dashboards and visualizations.
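  When registering the Thrift Server as a database in Superset, the SQLAlchemy URI typically uses the Hive dialect, e.g. `hive://hive@localhost:10000/default` (assuming the PyHive driver is available in the Superset image; from inside the Docker network, replace `localhost` with the Thrift Server's service name).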
- Experiment with AWS services locally using LocalStack:
  - LocalStack provides mock AWS services like S3, SQS, and Events.
  - Configure Spark to use S3-like storage by pointing the S3 endpoint at LocalStack (http://localhost:4566); see the sketch below.
  - Use the provided SQS scripts in the `sqs/` directory to interact with mock SQS queues.
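  A sketch of pointing Spark's S3A connector at LocalStack (the bucket name `demo-bucket` and the `test`/`test` credentials are placeholders; LocalStack accepts arbitrary keys, and `hadoop-aws` plus its AWS SDK dependency must be on the Spark classpath):

  ```python
  # Sketch: write a DataFrame to LocalStack's mock S3 via the s3a:// connector.
  from pyspark.sql import SparkSession

  spark = (
      SparkSession.builder.appName("LocalStackS3Demo")
      # Inside the Docker network use http://localstack:4566;
      # from the host use http://localhost:4566.
      .config("spark.hadoop.fs.s3a.endpoint", "http://localstack:4566")
      .config("spark.hadoop.fs.s3a.access.key", "test")         # any value works with LocalStack
      .config("spark.hadoop.fs.s3a.secret.key", "test")         # any value works with LocalStack
      .config("spark.hadoop.fs.s3a.path.style.access", "true")  # needed for LocalStack's endpoint
      .getOrCreate()
  )

  df = spark.createDataFrame([("Anna", 25), ("Peter", 40)], ["Name", "Age"])
  df.write.mode("overwrite").parquet("s3a://demo-bucket/people")  # create the bucket first
  ```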
- Orchestrate workflows with Airflow:
  - Access Airflow at http://localhost:8080.
  - Create and schedule DAGs to orchestrate Spark jobs, data pipelines, or other tasks within the ecosystem.
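  A minimal DAG sketch (the script path and schedule are examples; this assumes `spark-submit` is available in the Airflow container, and newer Airflow versions use `schedule=` instead of `schedule_interval=`):

  ```python
  # Sketch: an Airflow DAG that submits a Spark job with BashOperator.
  # Place this file under the airflow-data/ DAGs directory so the scheduler picks it up.
  from datetime import datetime

  from airflow import DAG
  from airflow.operators.bash import BashOperator

  with DAG(
      dag_id="spark_job_example",
      start_date=datetime(2024, 1, 1),
      schedule_interval="@daily",
      catchup=False,
  ) as dag:
      submit_job = BashOperator(
          task_id="submit_spark_job",
          bash_command=(
              "spark-submit --master spark://spark-master:7077 "
              "/opt/airflow/dags/jobs/easy_app.py"  # hypothetical script path
          ),
      )
  ```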
Ensure you have the following installed:
- Docker 🐋
- Docker Compose 📦
- Clone this repository:

  ```bash
  git clone git@github.com:cherrera20/spark-ecosystem-cluster.git
  cd spark-ecosystem-cluster
  ```
- Build and start the services:

  ```bash
  docker-compose up --build
  ```
- Access the services using their respective URLs/ports:

  | Service          | URL/Port               |
  |------------------|------------------------|
  | Spark Master     | http://localhost:9090  |
  | Spark Worker A   | http://localhost:9091  |
  | Spark Worker B   | http://localhost:9093  |
  | Kafka            | localhost:9092         |
  | Hive Metastore   | localhost:9083         |
  | Airflow          | http://localhost:8080  |
  | Superset         | http://localhost:8081  |
  | Grafana          | http://localhost:3000  |
  | Prometheus       | http://localhost:19090 |
  | Jupyter Notebook | http://localhost:8888  |
- Stopping the cluster: to stop and remove all containers, networks, and volumes defined in `docker-compose.yml`, run:

  ```bash
  docker-compose down
  ```
```
├── docker-compose.yml   # Compose file defining the ecosystem
├── spark/               # Spark configurations and Dockerfiles
├── data/                # Data storage for Hive and Spark
├── notebooks/           # Jupyter notebooks for PySpark experiments
├── prometheus/          # Prometheus configuration
├── superset/            # Superset setup
├── airflow-data/        # Airflow DAGs, logs, and config
├── hive/                # Hive-specific configurations
├── sqs/                 # Scripts for interacting with mock SQS (LocalStack)
├── config/              # (New) For storing service-specific config files (future improvement)
├── docs/                # (New) For more extensive documentation (future improvement)
├── scripts/             # (New) Utility scripts (future improvement)
```
Note: The `config/`, `docs/`, and `scripts/` directories are currently empty and are suggested for future organizational improvements.
- Spark Configuration:
  - Modify `spark/config/spark-defaults.conf` (you might need to create the `config/` directory first) to adjust Spark settings like memory allocation, cores, etc.
  - Edit `spark/config/metrics.properties` for the Prometheus monitoring configuration.
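  If you'd rather try a setting without rebuilding images, the same keys can also be set per session from PySpark (the values below are illustrative, not the project's defaults):

  ```python
  # Sketch: override common spark-defaults.conf keys for a single session.
  from pyspark.sql import SparkSession

  spark = (
      SparkSession.builder.appName("TunedApp")
      .config("spark.executor.memory", "1g")        # per-executor heap; example value
      .config("spark.executor.cores", "1")          # cores per executor; example value
      .config("spark.sql.shuffle.partitions", "8")  # fewer partitions suit small local data
      .getOrCreate()
  )
  ```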
- Kafka Topics:
  - Kafka topics are initialized by the `init-kafka` service in `docker-compose.yml`.
  - Modify the `command` section of the `init-kafka` service in `docker-compose.yml` to create different or additional Kafka topics upon cluster startup.
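  Topics can also be created ad hoc from the host without editing the compose file, for example with the `kafka-python` client (an extra dependency, not bundled with this project):

  ```python
  # Sketch: create a Kafka topic from the host (pip install kafka-python).
  from kafka.admin import KafkaAdminClient, NewTopic

  admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
  admin.create_topics([NewTopic(name="demo-topic", num_partitions=1, replication_factor=1)])
  admin.close()
  ```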
- Service Ports:
  - Service ports are defined in the `ports` section of each service in `docker-compose.yml`.
  - Change the host ports (e.g., in `9090:8080`, `9090` is the host port) in `docker-compose.yml` if you need to avoid conflicts with other applications running on your host machine. Be careful not to change the container ports (e.g., `8080` in `9090:8080`) unless you know what you are doing.
- Environment Variables:
  - Many services are configured via environment variables in `docker-compose.yml`.
  - Adjust the environment variables in the `environment` section of each service to customize its behavior (e.g., Spark worker memory, Kafka settings, database credentials).
- Port Conflicts: If you encounter errors related to ports already being in use, modify the host ports in the `docker-compose.yml` file as described in the "Customization" section.
- Service Startup Issues: Check the logs of the failing service using `docker-compose logs <service_name>`. Common issues might be related to:
  - Dependencies not met: Ensure services are started in the correct order (check `depends_on` in `docker-compose.yml`).
  - Configuration errors: Double-check configuration files (e.g., `hive-site.xml`, `prometheus.yml`, `docker-compose.yml`) for typos or incorrect settings.
  - Resource limits: If your system has limited resources, try reducing the resource allocation for services (e.g., Spark worker memory, cores) in `docker-compose.yml`.
- Jupyter Connection Issues: If you cannot connect to the Spark context from Jupyter notebooks, ensure:
  - The Spark Master service is running correctly.
  - The `SPARK_MASTER` environment variable in the `jupyter` service in `docker-compose.yml` is correctly set to `spark://spark-master:7077`.
  - There are no network connectivity issues between the Jupyter and Spark Master containers (they should be in the same `spark-kafka` network); a quick connectivity check is sketched below.
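  A quick connectivity check you can paste into a notebook (assumes the default master URL above):

  ```python
  # Sketch: verify from Jupyter that the Spark Master is reachable.
  # If getOrCreate() hangs or fails, check the master URL and the Docker network.
  from pyspark.sql import SparkSession

  spark = (
      SparkSession.builder.appName("ConnectivityCheck")
      .master("spark://spark-master:7077")  # must match SPARK_MASTER in docker-compose.yml
      .getOrCreate()
  )

  print(spark.sparkContext.master)  # should print spark://spark-master:7077
  print(spark.range(5).count())     # runs a trivial job on the cluster
  spark.stop()
  ```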
If you'd like to contribute to this project, feel free to:
- Report issues: If you find bugs or unexpected behavior, please open an issue on GitHub.
- Suggest enhancements: If you have ideas for new features or improvements, please open an issue or submit a pull request.
- Submit pull requests: If you've implemented a fix or a new feature, submit a pull request with a clear description of your changes.
This project is intended for local learning and testing. It provides a complete Spark ecosystem running under Docker Compose, designed as a convenient and isolated environment for:
- Learning about Apache Spark and its ecosystem components (Kafka, Hive, etc.).
- Experimenting with different Spark configurations and workloads.
- Developing and testing Spark applications locally before deploying to production environments.
- Exploring data visualization with Superset and workflow orchestration with Airflow in a Spark context.
Limitations:
- This is a local development environment and is not intended for production use.
- Performance may be limited by the resources of your local machine.
- Security configurations are simplified for local development and are not production-grade.