This is a take-home assignment for the Data Engineer position, completed by Javier Navarro Véliz.
The goal of this project is to extract, transform, and store weather data from the weather.gov API, and then run analytical queries on the data using SQL.
```bash
git clone https://github.com/chaperyolo/takehome-weather-de
cd takehome-weather-de
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

Duplicate the example file and edit it as needed:

```bash
cp .env.example .env
```

Or create your own `.env` file with the following structure:
```
STATION_ID=KGPH
API_BASE_URL=https://api.weather.gov
USER_AGENT=(example_weather_app, [email protected])
DB_PATH=db/weather_data.db
QUERIES_DIR=queries
```

Run the pipeline:

```bash
python3 main.py
```

This will:
- Create a local SQLite database
- Fetch data from the past 7 days (or from the last record onwards on subsequent runs; see the sketch below)
- Insert new observations while avoiding duplicates
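For context, here is a minimal sketch of how the fetch window described above might be derived. The helper name is hypothetical and is not the repo's actual code; it assumes the `weather_data` table already exists and stores ISO-8601 timestamps:

```python
# Hypothetical sketch: derive the fetch start from the last stored
# observation, falling back to 7 days ago on the first run.
import sqlite3
from datetime import datetime, timedelta, timezone

def get_fetch_start(db_path: str) -> datetime:
    """Return the timestamp to fetch observations from."""
    with sqlite3.connect(db_path) as conn:
        row = conn.execute(
            "SELECT MAX(observation_timestamp) FROM weather_data"
        ).fetchone()
    if row and row[0]:
        # Continue from the most recent record already stored.
        return datetime.fromisoformat(row[0])
    # First run: go back exactly 7 days, per the assignment.
    return datetime.now(timezone.utc) - timedelta(days=7)
```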
```bash
python3 run_queries.py
```

This script executes all `.sql` queries inside the `queries/` folder and prints the results.
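A rough sketch of what this script does, assuming each `.sql` file holds a single `SELECT` statement (the actual implementation may differ):

```python
# Sketch: run every .sql file in the queries folder and print the rows.
import sqlite3
from pathlib import Path

def run_all_queries(db_path: str = "db/weather_data.db",
                    queries_dir: str = "queries") -> None:
    with sqlite3.connect(db_path) as conn:
        for sql_file in sorted(Path(queries_dir).glob("*.sql")):
            print(f"-- {sql_file.name}")
            for row in conn.execute(sql_file.read_text()):
                print(row)

if __name__ == "__main__":
    run_all_queries()
```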
The following variables should be defined in your `.env` file:

| Variable | Description | Default (if not provided) |
|---|---|---|
| `STATION_ID` | NOAA station ID | `KGPH` |
| `API_BASE_URL` | Base URL of the weather API | `https://api.weather.gov` |
| `USER_AGENT` | Required for API access | `(example_weather_app, [email protected])` |
| `DB_PATH` | SQLite DB location | `db/weather_data.db` |
| `QUERIES_DIR` | Folder with `.sql` queries | `queries` |
If any of these are not provided, the script falls back to the default values above without raising an error.
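A sketch of how this fallback might look, assuming `python-dotenv` is used (the repo's actual loading code may differ):

```python
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env if present; a missing file is simply ignored

STATION_ID = os.getenv("STATION_ID", "KGPH")
API_BASE_URL = os.getenv("API_BASE_URL", "https://api.weather.gov")
USER_AGENT = os.getenv("USER_AGENT", "(example_weather_app, [email protected])")
DB_PATH = os.getenv("DB_PATH", "db/weather_data.db")
QUERIES_DIR = os.getenv("QUERIES_DIR", "queries")
```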
- The pipeline avoids inserting duplicate records using `INSERT OR IGNORE` (see the schema sketch below).
- If data already exists in the database, the script fetches only newer observations.
- Date filtering is based on the `observation_timestamp` column.
- Important: One of the queries computes weekly metrics starting from last Monday (see the query sketch below). Because the pipeline only fetches the last 7 days of data, if the script is run for the first time mid-week (e.g., Wednesday), there may not be enough historical data to cover the full week (Monday and Tuesday would be missing). This design choice follows the assignment’s instruction to fetch only 7 days of data.
- Creating an index on `observation_timestamp` in the `weather_data` table may improve performance for larger datasets (not included in this version for simplicity).
- All measurements (temperature, wind speed, humidity) are assumed to have consistent units across all observations for a given station, so unit-of-measurement columns are not included in the database schema. If data from multiple stations with varying units were integrated, the schema should be extended with unit fields (e.g., `temperature_unit`, `wind_speed_unit`) to ensure data consistency.
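To make the dedup and indexing notes concrete, here is a hypothetical schema sketch; the actual column list and constraints in the repo may differ:

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS weather_data (
    station_id            TEXT NOT NULL,
    observation_timestamp TEXT NOT NULL,
    temperature           REAL,
    wind_speed            REAL,
    humidity              REAL,
    UNIQUE (station_id, observation_timestamp)
);
-- The optional index mentioned above (omitted in this version):
-- CREATE INDEX idx_weather_ts ON weather_data (observation_timestamp);
"""

with sqlite3.connect("db/weather_data.db") as conn:
    conn.executescript(SCHEMA)
    # Re-inserting the same (station_id, timestamp) pair is silently skipped.
    conn.execute(
        "INSERT OR IGNORE INTO weather_data VALUES (?, ?, ?, ?, ?)",
        ("KGPH", "2024-01-01T12:00:00+00:00", 3.5, 10.2, 78.0),
    )
```

The `UNIQUE` constraint is what makes `INSERT OR IGNORE` a safe no-op when the same observation is fetched twice.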
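And one possible way to express the "since last Monday" filter in SQLite; the real query in `queries/` may be written differently:

```python
import sqlite3

# date('now', '-6 days', 'weekday 1') resolves to the Monday of the
# current week (weekday 1 = Monday in SQLite's date modifiers).
WEEKLY_METRICS = """
SELECT AVG(temperature) AS avg_temperature,
       MAX(wind_speed)  AS max_wind_speed,
       AVG(humidity)    AS avg_humidity
FROM weather_data
WHERE date(observation_timestamp) >= date('now', '-6 days', 'weekday 1');
"""

with sqlite3.connect("db/weather_data.db") as conn:
    print(conn.execute(WEEKLY_METRICS).fetchone())
```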
This project has been tested with the following Python versions:
- Python 3.9
- Python 3.10
- Python 3.12
Other versions may work, but these are the ones confirmed to be compatible.
This notebook was used during the development phase to understand the structure of the data retrieved from the API. It helped in:
- Inspecting raw responses from the station and observations endpoints.
- Understanding the format, frequency, and variability of the data.
- Defining which fields to extract and store in the database.
- Verifying that timestamps, temperature, wind speed, and humidity values were consistent and usable.
- Performing basic quality checks like null value inspection and date range validation (see the sketch below).
It is not required for running the pipeline but is included as documentation of the exploratory data analysis process.
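For reference, the exploration steps could be reproduced with something like the following hypothetical reconstruction; the field names follow the GeoJSON layout returned by api.weather.gov, and the flattening logic is illustrative rather than the notebook's exact code:

```python
import requests
import pandas as pd

resp = requests.get(
    "https://api.weather.gov/stations/KGPH/observations",
    headers={"User-Agent": "(example_weather_app, [email protected])"},
    timeout=30,
)
resp.raise_for_status()

# Flatten the GeoJSON features into a tabular frame.
df = pd.DataFrame(
    [
        {
            "observation_timestamp": p["timestamp"],
            "temperature": p["temperature"]["value"],
            "wind_speed": p["windSpeed"]["value"],
            "humidity": p["relativeHumidity"]["value"],
        }
        for p in (f["properties"] for f in resp.json()["features"])
    ]
)

# Basic quality checks: null inspection and date range validation.
print(df.isna().sum())
print(df["observation_timestamp"].min(), df["observation_timestamp"].max())
```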