A simple, production-grade ETL pipeline for cleaning and validating raw boat sales data; the test data comes from Kaggle. This script:
- Cleans UTF-8 encoded CSV files with non-ASCII characters.
- Converts currency and year fields into structured formats.
- Validates schema using Pandera.
- A companion Jupyter notebook contains an exploratory analysis of the same data, with graphs.
You can run the ETL pipeline as follows:

```shell
boat-etl \
  -i data/boat_data.csv \
  -o output/validated_boat_data.csv
```
You can also run the ETL pipeline inside a Docker container for reproducibility and ease of deployment. First, build the image:

```shell
docker build -t boat-etl .
```
Run the pipeline using the default parameters (as set by `CMD` in the Dockerfile):

```shell
docker run --rm boat-etl
```
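For reference, the `CMD` defaults might be wired up like this hypothetical Dockerfile sketch; the real base image, install step, and entrypoint may differ:

```dockerfile
# Hypothetical sketch -- the project's actual Dockerfile may differ.
FROM python:3.11-slim
WORKDIR /app
COPY . .
RUN pip install --no-cache-dir .
# The CLI itself is the entrypoint; CMD supplies default arguments,
# which any arguments passed to `docker run` will override.
ENTRYPOINT ["boat-etl"]
CMD ["-i", "data/boat_data.csv", "-o", "output/validated_boat_data.csv"]
```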
Or specify input and output paths, mounting the local `data` and `output` directories into the container:

```shell
docker run --rm \
  -v $(pwd)/data:/app/data \
  -v $(pwd)/output:/app/output \
  boat-etl \
  -i data/boat_data.csv \
  -o output/validated_boat_data.csv
```