ndaniel/boat-data-etl
Boat Data - ETL pipeline

A simple, production-grade ETL pipeline for cleaning and validating raw boat sales data. The test data is from Kaggle. The pipeline:

  • Cleans UTF-8 encoded CSV files containing non-ASCII characters.
  • Converts currency and year fields into structured formats.
  • Validates the schema with Pandera.

A Jupyter notebook with an exploratory analysis of the same data, including graphs, is also included in the repository.

Usage

You can run the ETL pipeline as follows.

  boat-etl \
    -i data/boat_data.csv \
    -o output/validated_boat_data.csv

Docker Usage

You can run the ETL pipeline inside a Docker container for reproducibility and ease of deployment.

1. Build the Docker image

docker build -t boat-etl .

2. Run the ETL pipeline

Run the pipeline using default parameters (as set in CMD of Dockerfile):

docker run --rm boat-etl
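The `CMD` defaults mentioned above might look like the following sketch. The actual Dockerfile in this repository may differ; the base image, install step, and default paths here are assumptions:

```dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY . .
RUN pip install .
# Default arguments; override them by passing -i/-o to `docker run`.
CMD ["boat-etl", "-i", "data/boat_data.csv", "-o", "output/validated_boat_data.csv"]
```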

Or specify input and output paths:

docker run --rm \
  -v $(pwd)/data:/app/data \
  -v $(pwd)/output:/app/output \
  boat-etl \
  -i data/boat_data.csv \
  -o output/validated_boat_data.csv
