This is a take-home assignment for the Data Engineer position, completed by Javier Navarro Véliz.
The goal of this project is to extract, transform, and store weather data from the weather.gov API, and then run analytical queries on the data using SQL.
```bash
git clone https://github.com/chaperyolo/takehome-weather-de
cd takehome-weather-de
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

Duplicate the example file and edit it as needed:

```bash
cp .env.example .env
```

Or create your own `.env` file with the following structure:
```
STATION_ID=KGPH
API_BASE_URL=https://api.weather.gov
USER_AGENT=(example_weather_app, [email protected])
DB_PATH=db/weather_data.db
QUERIES_DIR=queries
```

Run the pipeline:

```bash
python3 main.py
```

This will:
- Create a local SQLite database
- Fetch data from the past 7 days (or from the last record onwards on subsequent runs; see the sketch below)
- Insert new observations while avoiding duplicates
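For context, here is a minimal sketch of how the fetch window described above might be derived. The helper name is hypothetical and is not the repo's actual code; it assumes the `weather_data` table already exists and stores ISO-8601 timestamps:

```python
# Hypothetical sketch: derive the fetch start from the last stored
# observation, falling back to 7 days ago on the first run.
import sqlite3
from datetime import datetime, timedelta, timezone

def get_fetch_start(db_path: str) -> datetime:
    """Return the timestamp to fetch observations from."""
    with sqlite3.connect(db_path) as conn:
        row = conn.execute(
            "SELECT MAX(observation_timestamp) FROM weather_data"
        ).fetchone()
    if row and row[0]:
        # Continue from the most recent record already stored.
        return datetime.fromisoformat(row[0])
    # First run: go back exactly 7 days, per the assignment.
    return datetime.now(timezone.utc) - timedelta(days=7)
```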
```bash
python3 run_queries.py
```

This script executes all `.sql` queries inside the `queries/` folder and prints the results.
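A rough sketch of what this script does, assuming each `.sql` file holds a single `SELECT` statement (the actual implementation may differ):

```python
# Sketch: run every .sql file in the queries folder and print the rows.
import sqlite3
from pathlib import Path

def run_all_queries(db_path: str = "db/weather_data.db",
                    queries_dir: str = "queries") -> None:
    with sqlite3.connect(db_path) as conn:
        for sql_file in sorted(Path(queries_dir).glob("*.sql")):
            print(f"-- {sql_file.name}")
            for row in conn.execute(sql_file.read_text()):
                print(row)

if __name__ == "__main__":
    run_all_queries()
```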
The following variables should be defined in your `.env` file:

| Variable | Description | Default (if not provided) |
|---|---|---|
| `STATION_ID` | NOAA station ID | `KGPH` |
| `API_BASE_URL` | Base URL of the weather API | `https://api.weather.gov` |
| `USER_AGENT` | Required for API access | `(example_weather_app, [email protected])` |
| `DB_PATH` | SQLite DB location | `db/weather_data.db` |
| `QUERIES_DIR` | Folder with `.sql` queries | `queries` |
If any of these are not provided, the script falls back to the default values above without raising an error.
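A sketch of how this fallback might look, assuming `python-dotenv` is used (the repo's actual loading code may differ):

```python
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env if present; a missing file is simply ignored

STATION_ID = os.getenv("STATION_ID", "KGPH")
API_BASE_URL = os.getenv("API_BASE_URL", "https://api.weather.gov")
USER_AGENT = os.getenv("USER_AGENT", "(example_weather_app, [email protected])")
DB_PATH = os.getenv("DB_PATH", "db/weather_data.db")
QUERIES_DIR = os.getenv("QUERIES_DIR", "queries")
```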
- The pipeline avoids inserting duplicate records using `INSERT OR IGNORE` (see the schema sketch below).
- If data already exists in the database, the script fetches only newer observations.
- Date filtering is based on the `observation_timestamp` column.
- Important: One of the queries computes weekly metrics starting from last Monday (see the query sketch below). Because the pipeline only fetches the last 7 days of data, if the script is run for the first time mid-week (e.g., Wednesday), there may not be enough historical data to cover the full week (Monday and Tuesday would be missing). This design choice follows the assignment’s instruction to fetch only 7 days of data.
- Creating an index on `observation_timestamp` in the `weather_data` table may improve performance for larger datasets (not included in this version for simplicity).
- All measurements (temperature, wind speed, humidity) are assumed to have consistent units across all observations for a given station, so unit-of-measurement columns are not included in the database schema. If data from multiple stations with varying units were integrated, the schema should be extended with unit fields (e.g., `temperature_unit`, `wind_speed_unit`) to ensure data consistency.
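To make the dedup and indexing notes concrete, here is a hypothetical schema sketch; the actual column list and constraints in the repo may differ:

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS weather_data (
    station_id            TEXT NOT NULL,
    observation_timestamp TEXT NOT NULL,
    temperature           REAL,
    wind_speed            REAL,
    humidity              REAL,
    UNIQUE (station_id, observation_timestamp)
);
-- The optional index mentioned above (omitted in this version):
-- CREATE INDEX idx_weather_ts ON weather_data (observation_timestamp);
"""

with sqlite3.connect("db/weather_data.db") as conn:
    conn.executescript(SCHEMA)
    # Re-inserting the same (station_id, timestamp) pair is silently skipped.
    conn.execute(
        "INSERT OR IGNORE INTO weather_data VALUES (?, ?, ?, ?, ?)",
        ("KGPH", "2024-01-01T12:00:00+00:00", 3.5, 10.2, 78.0),
    )
```

The `UNIQUE` constraint is what makes `INSERT OR IGNORE` a safe no-op when the same observation is fetched twice.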
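And one possible way to express the "since last Monday" filter in SQLite; the real query in `queries/` may be written differently:

```python
import sqlite3

# date('now', '-6 days', 'weekday 1') resolves to the Monday of the
# current week (weekday 1 = Monday in SQLite's date modifiers).
WEEKLY_METRICS = """
SELECT AVG(temperature) AS avg_temperature,
       MAX(wind_speed)  AS max_wind_speed,
       AVG(humidity)    AS avg_humidity
FROM weather_data
WHERE date(observation_timestamp) >= date('now', '-6 days', 'weekday 1');
"""

with sqlite3.connect("db/weather_data.db") as conn:
    print(conn.execute(WEEKLY_METRICS).fetchone())
```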
This project has been tested with the following Python versions:
- Python 3.9
- Python 3.10
- Python 3.12
Other versions may work, but these are the ones confirmed to be compatible.
This notebook was used during the development phase to understand the structure of the data retrieved from the API. It helped in:
- Inspecting raw responses from the station and observations endpoints.
- Understanding the format, frequency, and variability of the data.
- Defining which fields to extract and store in the database.
- Verifying that timestamps, temperature, wind speed, and humidity values were consistent and usable.
- Performing basic quality checks like null value inspection and date range validation (see the sketch below).
It is not required for running the pipeline but is included as documentation of the exploratory data analysis process.
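For reference, the exploration steps could be reproduced with something like the following hypothetical reconstruction; the field names follow the GeoJSON layout returned by api.weather.gov, and the flattening logic is illustrative rather than the notebook's exact code:

```python
import requests
import pandas as pd

resp = requests.get(
    "https://api.weather.gov/stations/KGPH/observations",
    headers={"User-Agent": "(example_weather_app, [email protected])"},
    timeout=30,
)
resp.raise_for_status()

# Flatten the GeoJSON features into a tabular frame.
df = pd.DataFrame(
    [
        {
            "observation_timestamp": p["timestamp"],
            "temperature": p["temperature"]["value"],
            "wind_speed": p["windSpeed"]["value"],
            "humidity": p["relativeHumidity"]["value"],
        }
        for p in (f["properties"] for f in resp.json()["features"])
    ]
)

# Basic quality checks: null inspection and date range validation.
print(df.isna().sum())
print(df["observation_timestamp"].min(), df["observation_timestamp"].max())
```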