
Data Engineer - Take Home Assignment

This is a take-home assignment for the Data Engineer position, completed by Javier Navarro Véliz.

Project Overview

The goal of this project is to extract weather data from the weather.gov API, transform it, store it in a local SQLite database, and then run analytical SQL queries against it.

How to run

1. Clone the repository

git clone https://github.com/chaperyolo/takehome-weather-de
cd takehome-weather-de

2. Set up your environment

python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

3. Set up environment variables

Duplicate the example file and edit it as needed:

cp .env.example .env

Or create your own .env file with the following structure:

STATION_ID=KGPH
API_BASE_URL=https://api.weather.gov
USER_AGENT=(example_weather_app, [email protected])
DB_PATH=db/weather_data.db
QUERIES_DIR=queries

4. Run the pipeline

python3 main.py

This will:

  • Create a local SQLite database
  • Fetch data from the past 7 days (or, on subsequent runs, only observations newer than the last stored record)
  • Insert new observations while avoiding duplicates (see the sketch after this list)
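
For reference, here is a minimal sketch of that incremental fetch-and-insert behaviour. The weather_data table and observation_timestamp column come from the notes below; the remaining column names (temperature, wind_speed, humidity), the helper name, and the response handling are simplified assumptions, and the actual main.py may differ.

from datetime import datetime, timedelta, timezone

import requests

def sync_observations(conn, station_id, base_url, user_agent):
    # Start from the newest stored record, or 7 days back on the first run.
    (last,) = conn.execute(
        "SELECT MAX(observation_timestamp) FROM weather_data"
    ).fetchone()
    start = last or (datetime.now(timezone.utc) - timedelta(days=7)).isoformat()

    resp = requests.get(
        f"{base_url}/stations/{station_id}/observations",
        params={"start": start},
        headers={"User-Agent": user_agent},
        timeout=30,
    )
    resp.raise_for_status()

    for feature in resp.json()["features"]:
        props = feature["properties"]
        # INSERT OR IGNORE silently skips rows whose timestamp already
        # exists, assuming a UNIQUE constraint on observation_timestamp.
        conn.execute(
            "INSERT OR IGNORE INTO weather_data "
            "(observation_timestamp, temperature, wind_speed, humidity) "
            "VALUES (?, ?, ?, ?)",
            (
                props["timestamp"],
                (props.get("temperature") or {}).get("value"),
                (props.get("windSpeed") or {}).get("value"),
                (props.get("relativeHumidity") or {}).get("value"),
            ),
        )
    conn.commit()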

5. Run the SQL queries

python3 run_queries.py

This script executes all .sql queries inside the queries/ folder and prints the results.
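
In essence, the script does something like the following sketch (the actual run_queries.py may differ in details such as output formatting):

import os
import sqlite3
from pathlib import Path

db_path = os.getenv("DB_PATH", "db/weather_data.db")
queries_dir = Path(os.getenv("QUERIES_DIR", "queries"))

with sqlite3.connect(db_path) as conn:
    # Run every .sql file in the queries folder and print its result rows.
    for sql_file in sorted(queries_dir.glob("*.sql")):
        print(f"--- {sql_file.name} ---")
        for row in conn.execute(sql_file.read_text()):
            print(row)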

Environment Variables

The following variables should be defined in your .env file:

Variable      Description                  Default (if not provided)
STATION_ID    NOAA station ID              KGPH
API_BASE_URL  Base URL of the weather API  https://api.weather.gov
USER_AGENT    Required for API access      (example_weather_app, [email protected])
DB_PATH       SQLite DB location           db/weather_data.db
QUERIES_DIR   Folder with .sql queries     queries

If any of these are not provided, the script falls back to the default values above without raising an error.
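
A minimal sketch of this fallback behaviour, assuming python-dotenv is used to load the .env file (the actual loading code may differ):

import os
from dotenv import load_dotenv

load_dotenv()  # no-op if .env is absent

# Every variable has a default, so a missing entry never raises.
STATION_ID = os.getenv("STATION_ID", "KGPH")
API_BASE_URL = os.getenv("API_BASE_URL", "https://api.weather.gov")
USER_AGENT = os.getenv("USER_AGENT", "(example_weather_app, [email protected])")
DB_PATH = os.getenv("DB_PATH", "db/weather_data.db")
QUERIES_DIR = os.getenv("QUERIES_DIR", "queries")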

Notes & Assumptions

  • The pipeline avoids inserting duplicate records using INSERT OR IGNORE.
  • If data already exists in the database, the script fetches only newer observations.
  • Date filtering is based on the observation_timestamp column.
  • Important: One of the queries computes weekly metrics starting from last Monday. Because the pipeline only fetches the last 7 days of data, if the script is run for the first time mid-week (e.g., Wednesday), there may not be enough historical data to cover the full week (Monday and Tuesday would be missing). This design choice follows the assignment’s instruction to fetch only 7 days of data.
  • Creating an index on observation_timestamp in the weather_data table may improve performance for larger datasets (not included in this version for simplicity; a sketch follows this list).
  • All measurements (temperature, wind speed, humidity) are assumed to have consistent units across all observations for a given station. Therefore, the columns for unit of measurement are not included in the database schema. If data from multiple stations with varying measurement units were to be integrated, the schema should be extended to include unit fields (e.g., temperature_unit, wind_speed_unit) to ensure data consistency.
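
The index suggested above could be added with a one-liner like this (the index name is hypothetical):

import sqlite3

with sqlite3.connect("db/weather_data.db") as conn:
    # Speeds up the date filtering on observation_timestamp for large tables.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_observation_timestamp "
        "ON weather_data (observation_timestamp)"
    )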

Compatibility

This project has been tested with the following Python versions:

  • Python 3.9
  • Python 3.10
  • Python 3.12

Other versions may work, but these are the ones confirmed to be compatible.

exploration.ipynb

This notebook was used during the development phase to understand the structure of the data retrieved from the API. It helped in:

  • Inspecting raw responses from the station and observations endpoints.
  • Understanding the format, frequency, and variability of the data.
  • Defining which fields to extract and store in the database.
  • Verifying that timestamps, temperature, wind speed, and humidity values were consistent and usable.
  • Performing basic quality checks like null value inspection and date range validation.

It is not required for running the pipeline but is included as documentation of the exploratory data analysis process.
