🌐 Modern Data Pipeline GCP

Python Airflow DBT GCP Docker CI/CD


📚 Table of Contents

  • Description
  • Project Overview
  • Project Structure
  • Tech Stack
  • Quick Start
  • Using Docker Compose with Airflow
  • Testing
  • Monitoring & Logging
  • CI/CD
  • Roadmap
  • Contributing Guidelines
  • Author
  • License

🔍 Description

A production-grade, modular ETL workflow built with Apache Airflow, DBT, and Google Cloud Platform (GCP) services.
Designed for orchestrated extraction, transformation, and loading, with integrated data quality, monitoring, and CI/CD best practices.


🚀 Project Overview

This repository contains an end-to-end ETL workflow designed for scalability, maintainability, and cloud readiness.
It showcases how to orchestrate data pipelines using Airflow, enrich and transform data with DBT, and deploy the solution using containerized environments and CI/CD pipelines.

🔄 Pipeline Highlights

This repository delivers an end-to-end data pipeline with:

  • Extraction from a PostgreSQL source and an external exchange rate API
  • Enrichment and Transformation using DBT models
  • Loading to CSV, Google Sheets, and optionally to BigQuery
  • Orchestration via Airflow DAGs — modular and containerized (see the DAG sketch after this list)
  • Built-in data quality checks, logging, and alerts
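A minimal sketch of how such a DAG could be wired together. This is illustrative only: the dag_id and task callables (extract_source, export_outputs) are placeholders rather than the repository's actual names, and it assumes Airflow 2.4+.

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def extract_source(**context):
    # Placeholder: pull rows from the PostgreSQL source and stage them as CSV.
    ...

def export_outputs(**context):
    # Placeholder: write transformed tables to CSV / Google Sheets / BigQuery.
    ...

with DAG(
    dag_id="etl_pipeline_sketch",      # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_source)
    transform = BashOperator(task_id="transform", bash_command="dbt run --select mock")
    load = PythonOperator(task_id="load", python_callable=export_outputs)

    extract >> transform >> load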

✅ Features

  • Modular structure:
    • extract
    • transform
    • load
    • validate
    • notify
  • Docker + Docker Compose for consistent local execution
  • DBT for SQL-based modeling and schema tests
  • Airflow DAG to sequence the pipeline steps
  • Integration with GCP services (a Secret Manager sketch follows this list):
    • BigQuery
    • Sheets API
    • Secret Manager
  • Robust logging and optional Stackdriver integration
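To illustrate the Secret Manager piece, a minimal sketch that reads one secret with the official google-cloud-secret-manager client; the project and secret names are placeholders, and this is not necessarily how the repository accesses secrets.

from google.cloud import secretmanager

def get_secret(project_id: str, secret_id: str, version: str = "latest") -> str:
    # Fetch one secret version and return its payload as text.
    client = secretmanager.SecretManagerServiceClient()
    name = f"projects/{project_id}/secrets/{secret_id}/versions/{version}"
    response = client.access_secret_version(request={"name": name})
    return response.payload.data.decode("UTF-8")

# Example (placeholder names):
# db_password = get_secret("my-gcp-project", "dbt-password")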

📁 modern-data-pipeline-gcp – Project Root

.
├── .venv                               # Virtual environment 
├── transformations                     # Data transformation logic
│   ├── dags
│   │   ├── dag_etl.py
│   │   ├── dag_rates.py
│   │   └── dag_reports.py
│   ├── data                            # CSV exports
│   │   ├── mock_order_items.csv
│   │   ├── mock_orders.csv
│   │   ├── mock_products.csv
│   │   ├── mock_rates.csv
│   │   └── mock_users.csv
│   ├── dbt
│   │   └── .dbt
│   │       ├── .user.yml
│   │       └── profiles.yml
│   ├── models                          # DBT models
│   │   └── mock
│   │       ├── mock_order_items.sql
│   │       ├── mock_orders.sql
│   │       ├── mock_products.sql
│   │       ├── mock_users.sql
│   │       └── schema.yml
│   ├── reports
│   │   ├── active_clients_without_sales.csv
│   │   ├── order_by_status.csv
│   │   └── sales_by_clients.csv
│   ├── scripts                         # Python scripts for pipeline steps
│   │   ├── export_csv.py
│   │   ├── export_sheets.py
│   │   ├── load_exchange_rates.py
│   │   ├── push_to_bigquery.py
│   │   ├── run.sh
│   │   └── upload_tables.py
│   ├── utils                           # Custom utility functions
│   │   └── quality_checks.py
│   ├── airflow
│   ├── dbt_project.yml                 # DBT project config
│   ├── requirements-dev.txt            # Development dependencies
│   ├── requirements.txt                # Python dependencies
│   ├── run_pipeline.py
│   └── wait-for-postgres.sh
├── .gitignore
├── LICENSE
└── README.md                           # Project documentation
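As a concrete (but hypothetical) illustration of the exchange-rate step handled by load_exchange_rates.py and dag_rates.py, a minimal fetch sketch follows; the URL and response shape are assumptions about a generic JSON API, not the actual provider or implementation.

import os
import requests

def fetch_rates(base_currency: str = "USD") -> dict:
    # Placeholder endpoint: substitute the provider that EXCHANGE_API_KEY belongs to.
    api_key = os.environ["EXCHANGE_API_KEY"]
    url = f"https://api.example.com/v1/latest?base={base_currency}"
    response = requests.get(url, headers={"Authorization": f"Bearer {api_key}"}, timeout=30)
    response.raise_for_status()
    return response.json()["rates"]  # assumed response field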

🛠️ Tech Stack

Layer              Technologies
Orchestration      Airflow on Docker (locally) or Cloud Composer
Transformation     DBT models deployed to BigQuery
Source data        PostgreSQL
Destinations       CSV, Google Sheets, BigQuery
Cloud infra        GCP: BigQuery, Sheets API, Secret Manager
Containerization   Docker & Docker Compose
Language           Python
CI/CD              GitHub Actions
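For the BigQuery destination listed above, a minimal sketch using the official google-cloud-bigquery client. The table naming is illustrative, and this is not necessarily how push_to_bigquery.py is implemented.

import pandas as pd
from google.cloud import bigquery

def push_dataframe(df: pd.DataFrame, table_id: str) -> None:
    # table_id has the form "project.dataset.table",
    # e.g. f"{BQ_PROJECT_ID}.{BQ_DATASET}.mock_orders" (placeholder).
    client = bigquery.Client()
    job = client.load_table_from_dataframe(df, table_id)
    job.result()  # block until the load job finishes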

🚀 Quick Start

📋 Requirements

  • Docker & Docker Compose
  • Python 3.9+ (for local development)
  • GCP project with access to BigQuery, Sheets API, and Secret Manager
  • PostgreSQL instance for source data
  • make (optional, if using Makefile shortcuts)

⚙️ Setup & Usage

1. Clone & Navigate

git clone https://github.com/CamilaJaviera91/modern-data-pipeline-gcp.git
cd modern-data-pipeline-gcp

2. Environment Setup

Copy .env.example to .env and supply:

# ⚙️ Airflow
AIRFLOW_UID=...
AIRFLOW__CORE__EXECUTOR=...
AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=...
AIRFLOW__CORE__DAGS_FOLDER=...
AIRFLOW__LOGGING__BASE_LOG_FOLDER=...
AIRFLOW__WEBSERVER__SECRET_KEY=...

# 🐘 DBT / Database (PostgreSQL)
DBT_HOST=...
DBT_HOST_TEST=...
DBT_USER=...
DBT_PASSWORD=...
DBT_DBNAME=...
DBT_SCHEMA=...
DBT_PORT=...

# 💱 Exchange Rates API
EXCHANGE_API_KEY=...

# ☁️ Google Cloud / BigQuery
GOOGLE_CREDENTIALS_PATH=...
BQ_PROJECT_ID=...
BQ_DATASET=...
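At runtime the scripts can read these values from the environment; here is a minimal sketch with python-dotenv (assuming python-dotenv is listed in requirements.txt):

import os
from dotenv import load_dotenv

load_dotenv()  # read .env from the project root into the process environment

DBT_HOST = os.environ["DBT_HOST"]                 # required
BQ_PROJECT_ID = os.getenv("BQ_PROJECT_ID", "")    # optional, empty default
EXCHANGE_API_KEY = os.environ["EXCHANGE_API_KEY"]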

3. Install Locally (optional)

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pip install -r requirements-dev.txt

4. Start Services (Postgres + Airflow)

sudo systemctl start docker 
sudo systemctl status docker   # verify that it's running

docker-compose up --build

5. Initialize DBT

cd transformations
dbt init

6. Run Pipeline

For a daily production run:

./run.sh run

Mock mode only:

./run.sh run --select enrich_exchange_rates

🐳 Using Docker Compose with Airflow

This guide shows the basic commands to start and manage Airflow using Docker Compose.

💻 Commands

1. Stop and clean everything

docker compose down -v --remove-orphans
  • Stops and removes containers, networks, and volumes.

  • Use this to reset your environment completely.

2. Initialize Airflow

docker compose run airflow-init
  • Runs a one-time container to set up Airflow’s database and config.

  • Run this once before starting Airflow.

3. Build Airflow and Postgres

docker compose build
  • Builds the service images defined in docker-compose.yml.

4. Start Airflow and Postgres in the background

docker compose up -d
  • Starts all services defined in docker-compose.yml.

  • Runs containers in detached mode (background).

🔄 Typical workflow

Run these commands in order to start fresh:

docker compose down -v --remove-orphans
docker compose run airflow-init #just once
docker compose build
docker compose up -d --remove-orphans

💡 Tips

  • Make sure Docker and Docker Compose are installed.

  • To see logs, use:

docker compose logs -f

🧪 Testing

  • DBT tests for schema, uniqueness, and relationships
  • Airflow DAG validation: airflow dags list, airflow dags test
  • Unit tests for custom Python functions in /scripts or /dags
  • CI pipeline (planned): Linting, formatting, DAG validation, DBT compile

Run tests to validate your pipeline:

./run.sh test
# or
dbt test --select mock
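A unit test for a custom quality check might look like the following pytest sketch; check_no_nulls here is a hypothetical helper, not necessarily the API exposed by utils/quality_checks.py.

import pandas as pd

def check_no_nulls(df: pd.DataFrame, column: str) -> bool:
    # Hypothetical helper: True when the column has no missing values.
    return bool(df[column].notna().all())

def test_check_no_nulls_passes_on_clean_data():
    df = pd.DataFrame({"order_id": [1, 2, 3]})
    assert check_no_nulls(df, "order_id")

def test_check_no_nulls_fails_on_missing_values():
    df = pd.DataFrame({"order_id": [1, None, 3]})
    assert not check_no_nulls(df, "order_id")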

📡 Monitoring & Logging

  • Airflow task logs viewable via the web UI
  • Custom loggers for API responses and ETL steps (a minimal setup sketch follows this list)
  • Optionally integrates with Stackdriver Logging and Alerting
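A minimal custom-logger setup in the spirit of the bullets above; the logger name and format are illustrative.

import logging

logger = logging.getLogger("etl")  # illustrative logger name
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("Fetched %d exchange rates", 42)  # example usage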

🚀 CI/CD

  • GitHub Actions workflows include:
    • Linting + formatting checks
    • DBT compilation and tests
    • Docker image builds
    • Deployment to Cloud Composer (planned)

🗺️ Roadmap

  • ✅ Initialize core modular pipeline
  • 🧪 Add unit tests for Python & DBT logic
  • 🔁 Implement full CI/CD with automated deploy to GCP
  • 🔄 Extend support to additional sinks (Snowflake, S3, etc.)
  • ⏰ Enable scheduling on Cloud Composer

🤝 Contributing Guidelines

Thank you for your interest in contributing to this project!

🧰 How to Contribute:

  1. Fork the repository.
  2. Clone your fork:
    git clone https://github.com/<your-username>/modern-data-pipeline-gcp.git
  3. Create a new branch:
    git checkout -b feature/your-feature-name
  4. Make your changes and commit:
    git commit -m "Add new feature"
  5. Push to your fork:
    git push origin feature/your-feature-name
  6. Submit a pull request to the main branch.

👩‍💻 Author

Camila Javiera Muñoz Navarro
Data Engineer & Analyst | BigQuery | Airflow | Python | GCP
GitHub | LinkedIn | Portfolio


⭐ If you find this project useful, give it a ⭐️ and share your feedback or ideas in Issues!


📄 License

This project is licensed under the MIT License.
