Data Ingestion and WMS for OpenStreetMap Notes
This repository handles downloading, processing, and publishing OSM notes data. It provides:
- Notes ingestion from OSM Planet and API
- Real-time synchronization with the main OSM database
- WMS (Web Map Service) layer publication
- Data monitoring and validation
Note: The analytics, data warehouse, and ETL components have been moved to OSM-Notes-Analytics.
Important: This repository contains only code and configuration files. All data processed by this system comes from OpenStreetMap (OSM) and is licensed under the Open Database License (ODbL). The processed data (notes, boundaries, etc.) stored in the database is derived from OSM and must comply with OSM's licensing requirements.
- OSM Data License: Open Database License (ODbL)
- OSM Copyright: OpenStreetMap contributors
- OSM Attribution: Required when using or distributing OSM data
For more information about OSM licensing, see: https://www.openstreetmap.org/copyright
Important: This system processes personal data from OpenStreetMap, including usernames (which may contain real names) and geographic locations (which may reveal where users live or frequent). We are committed to GDPR compliance.
- Privacy Policy: See docs/GDPR_Privacy_Policy.md for detailed information about data processing, retention, and your rights
- GDPR Procedures: See docs/GDPR_Procedures.md for procedures on handling data subject requests
- GDPR SQL Scripts: See sql/gdpr/README.md for SQL scripts to handle GDPR requests
You have the right to:
- Access your personal data (Article 15)
- Rectification of inaccurate data (Article 16)
- Erasure (Right to be forgotten) under specific conditions (Article 17)
- Data portability (Article 20)
- Object to processing (Article 21)
To exercise your rights, contact: [email protected]
For production, install the directories for persistent logs:

```bash
# Install directories (requires sudo)
sudo bin/scripts/install_directories.sh
```

Note: For development/testing, you can skip this step. The system will automatically use fallback mode (/tmp directories). See docs/LOCAL_SETUP.md for details.
For production use, the daemon mode is recommended for lower latency (30-60 seconds vs 15 minutes):
```bash
# Install systemd service
sudo cp examples/systemd/osm-notes-api-daemon.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable osm-notes-api-daemon
sudo systemctl start osm-notes-api-daemon
```

See docs/Process_API.md, "Daemon Mode" section, for details.
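To verify that the daemon is running and to follow its output, the standard systemd commands can be used (a quick check; the service name matches the unit installed above):

```bash
# Check that the daemon is active
sudo systemctl status osm-notes-api-daemon

# Follow the daemon output in real time
sudo journalctl -u osm-notes-api-daemon -f
```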
If systemd is not available, you can use cron:
```bash
*/15 * * * * ~/OSM-Notes-Ingestion/bin/process/processAPINotes.sh
```
The configuration file contains the properties needed to configure this tool, especially the database properties.
These are the main functions of this project:
- Notes Ingestion: Downloads notes from the OSM Planet dump and keeps the data in sync with the main OSM database via API calls. This is scheduled with cron and handles the whole ingestion flow.
- Country Boundaries: Updates the current country and maritime boundary information. This should be run once a month.
- WMS Layer: Copies the notes data to a separate set of tables that back the WMS layer. This is driven by database triggers on the main tables (see the sketch after this list).
- Data Monitoring: Monitors the synchronization by comparing the daily Planet dump with the notes in the database. This is optional and can be scheduled daily with cron.
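As an illustration of the trigger-based approach, the sketch below shows the general idea. The actual objects are created by sql/wms/prepareDatabase.sql (see the WMS section later in this document); the table and column names used here (wms.notes_wms, note_id, longitude, latitude) are placeholders, not the project's real schema.

```bash
# Hypothetical illustration only; the real trigger lives in sql/wms/prepareDatabase.sql.
psql -d notes << 'EOF'
CREATE OR REPLACE FUNCTION wms.insert_new_note() RETURNS trigger AS $$
BEGIN
  -- Keep only the location and creation date of the new note.
  INSERT INTO wms.notes_wms (note_id, geom, created_at)
  VALUES (NEW.note_id,
          ST_SetSRID(ST_MakePoint(NEW.longitude, NEW.latitude), 4326),
          NEW.created_at);
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER notes_wms_insert AFTER INSERT ON notes
  FOR EACH ROW EXECUTE FUNCTION wms.insert_new_note();
EOF
```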
For analytics, data warehouse, ETL, and profile generation, see the OSM-Notes-Analytics repository.
For web visualization and interactive exploration of user and country profiles, see the OSM-Notes-Viewer repository.
This project uses a Git submodule for shared code (lib/osm-common/):
- Common Functions (commonFunctions.sh): Core utility functions
- Validation Functions (validationFunctions.sh): Data validation
- Error Handling (errorHandlingFunctions.sh): Error handling and recovery
- Logger (bash_logger.sh): Logging library (log4j-style)
These functions are shared with OSM-Notes-Analytics via the OSM-Notes-Common submodule.
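Scripts in this repository load these shared functions at startup. A minimal sketch of the pattern (the SCRIPT_BASE_DIRECTORY variable name is illustrative; check the actual scripts for the exact mechanism):

```bash
# Illustrative only: resolve the repository root and source the shared functions.
SCRIPT_BASE_DIRECTORY="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)"
source "${SCRIPT_BASE_DIRECTORY}/lib/osm-common/commonFunctions.sh"
source "${SCRIPT_BASE_DIRECTORY}/lib/osm-common/validationFunctions.sh"
source "${SCRIPT_BASE_DIRECTORY}/lib/osm-common/errorHandlingFunctions.sh"
source "${SCRIPT_BASE_DIRECTORY}/lib/osm-common/bash_logger.sh"
```

This is also why the submodule must be initialized before running any script: without lib/osm-common/, the source calls fail with the error shown below.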
```bash
# Clone with submodules (recommended)
git clone --recurse-submodules https://github.com/angoca/OSM-Notes-Ingestion.git

# Or initialize after cloning
git clone https://github.com/angoca/OSM-Notes-Ingestion.git
cd OSM-Notes-Ingestion
git submodule update --init --recursive
```

If you encounter the error /lib/osm-common/commonFunctions.sh: No such file or directory, the submodule has not been initialized. To fix:
```bash
# Initialize and update submodules
git submodule update --init --recursive

# Verify submodule exists
ls -la lib/osm-common/commonFunctions.sh

# If still having issues, re-initialize completely
git submodule deinit -f lib/osm-common
git submodule update --init --recursive
```

To check submodule status:

```bash
git submodule status
```

If the submodule is not properly initialized, you'll see a - prefix in the status output.
If you encounter authentication errors:
For SSH (recommended):
```bash
# Test SSH connection to GitHub
ssh -T [email protected]

# If connection fails, set up SSH keys:
# 1. Generate SSH key: ssh-keygen -t ed25519
# 2. Add public key to GitHub: cat ~/.ssh/id_ed25519.pub
# 3. Add key at: https://github.com/settings/keys
```

For HTTPS:

```bash
# Use GitHub Personal Access Token instead of password
# Create token at: https://github.com/settings/tokens
# Then clone: git clone https://[email protected]/...
```

See Submodule Troubleshooting Guide for detailed instructions.
If you're new to this project and want to understand the codebase or contribute, follow this reading path:
- Start Here (15 min)
  - Read this README.md (you're here!)
  - Understand the project purpose and main functions
  - Review the directory structure below
- Project Context (30 min)
  - Read HISTORY.md - Project history and origins (OpenNotesLatam)
  - Read docs/Rationale.md - Why this project exists
  - Read docs/Documentation.md - System architecture overview
- Core Processing (45 min)
  - Read docs/Process_API.md - API processing workflow
  - Read docs/Process_Planet.md - Planet file processing
- Entry Points (20 min)
  - Read bin/ENTRY_POINTS.md - Which scripts can be called directly
  - Understand the main entry points: processAPINotes.sh, processPlanetNotes.sh, updateCountries.sh
- Testing (30 min)
  - Read docs/Testing_Guide.md - How to run and write tests
  - Review docs/Test_Execution_Guide.md - Test execution workflows
- Deep Dive (as needed)
  - Explore specific components in bin/, sql/, tests/
  - Review docs/README.md for the complete documentation index
```
OSM-Notes-Ingestion/
├── bin/ # Executable scripts
│ ├── process/ # Main processing scripts (entry points)
│ ├── monitor/ # Monitoring and validation scripts
│ ├── wms/ # WMS layer management
│ ├── scripts/ # Utility scripts
│ └── lib/ # Shared library functions
├── sql/ # SQL scripts (mirrors bin/ structure)
│ ├── process/ # Database operations for processing
│ ├── monitor/ # Monitoring queries
│ ├── wms/ # WMS layer SQL
│ └── analysis/ # Performance analysis scripts
├── tests/ # Comprehensive test suite
│ ├── unit/ # Unit tests (bash, SQL)
│ ├── integration/ # Integration tests
│ └── mock_commands/ # Mock commands for testing
├── docs/ # Complete documentation
│ ├── Documentation.md # System architecture
│ ├── Rationale.md # Project motivation
│ ├── Process_API.md # API processing details
│ └── Process_Planet.md # Planet processing details
├── etc/ # Configuration files
│ └── properties.sh # Main configuration
├── lib/osm-common/ # Git submodule (shared functions)
├── awk/ # AWK scripts (XML to CSV conversion)
├── overpass/ # Overpass API queries
├── json/ # JSON schemas and test data
├── xsd/ # XML Schema definitions
└── data/ # Data files and backups
```
- Entry Points: Only scripts in bin/process/ should be called directly (see ENTRY_POINTS.md)
- Processing Flow: API → Database → WMS (see Documentation.md)
- Testing: 101 test suites covering all components (see Testing_Guide.md)
- Configuration: All settings in etc/properties.sh
- Clone with submodules: git clone --recurse-submodules https://github.com/angoca/OSM-Notes-Ingestion.git
- Configure database (see the Database Configuration section)
- Run tests: ./tests/run_all_tests.sh
- Read entry points: cat bin/ENTRY_POINTS.md
- Explore documentation: cat docs/README.md
For complete documentation navigation, see docs/README.md.
The whole initial process takes several hours, even days, to complete before the data can be used.
Notes initial load
- 12 minutes: Downloading the countries and maritime areas.
  - Countries processing: ~10 minutes (6 parallel threads)
  - Maritime boundaries processing: ~2.5 minutes (6 parallel threads)
  - This process pauses between calls because the public Overpass instance limits the number of requests per minute. If another Overpass instance without this throttling is used, the pause could be removed or reduced.
- 1 minute: Downloading the Planet notes file.
- 5 minutes: Processing the XML notes file.
- 15 minutes: Inserting notes into the database.
- 8 minutes: Processing and consolidating notes from partitions.
- 3 hours: Locating notes in the appropriate country (parallel processing).
  - This DB process is executed in parallel with multiple threads.
WMS layer
- 1 minute: creating the objects.
Notes synchronization
The synchronization process time depends on the frequency of the calls and the number of comment actions. If the notes API call is executed every 15 minutes, the complete process takes less than 2 minutes to complete.
This is a simplified version of what you need to execute to run this project on Ubuntu.
```bash
# Configure the PostgreSQL database.
sudo apt -y install postgresql
sudo systemctl start postgresql.service
sudo su - postgres
psql << EOF
CREATE USER notes SUPERUSER;
CREATE DATABASE notes WITH OWNER notes;
EOF
exit

# PostGIS extension for Postgres.
sudo apt -y install postgis
psql -d notes << EOF
CREATE EXTENSION postgis;
EOF

# Generalized Search Tree extension for Postgres.
psql -d notes << EOF
CREATE EXTENSION btree_gist;
EOF

# Tool to download in parallel threads.
sudo apt install -y aria2

# Tools to validate XML (optional, only if SKIP_XML_VALIDATION=false).
sudo apt -y install libxml2-utils

# Process parts in parallel.
sudo apt install -y parallel

# jq (required for JSON/GeoJSON validation).
sudo apt install -y jq

# Tools to process geometries.
sudo apt -y install npm
sudo npm install -g osmtogeojson

# JSON validator.
sudo npm install ajv
sudo npm install -g ajv-cli

# Mail sender for notifications.
sudo apt install -y mutt

# GDAL tools for geometry processing.
sudo add-apt-repository ppa:ubuntugis/ppa
sudo apt-get -y install gdal-bin
```
Even if you do not install the prerequisites beforehand, each script validates that the components it needs are available before running.
To run the notes database synchronization, configure the crontab (crontab -e) like this:

```bash
# Runs the API extraction every 15 minutes.
# processAPINotes.sh automatically handles:
# - Initial setup: Creates tables and loads historical data if missing
# - Regular sync: Planet synchronization when the API limit (10,000 notes) is reached and a new dump is available
*/15 * * * * ~/OSM-Notes-Ingestion/bin/process/processAPINotes.sh

# Runs the boundaries update. Once a month.
# Note: Do NOT use the --base flag here. The --base flag is only for a complete system reset.
0 12 1 * * ~/OSM-Notes-Ingestion/bin/process/updateCountries.sh
```
Note: Everything is automatic! Simply configure processAPINotes.sh in cron. It will:
- Handle initial setup automatically on first run (creates tables, loads historical data, loads countries)
- Process API notes every 15 minutes
- Automatically sync with Planet when needed (10K notes + new dump)
No manual setup or separate processPlanetNotes.sh cron entry is required.
For ETL and Analytics scheduling, see the OSM-Notes-Analytics repository.
Before everything, you need to configure the database access and other properties. Important: The actual configuration files are not tracked in Git for security reasons. You must create them from the example files:
```bash
# Copy example files to create your local configuration
cp etc/properties.sh.example etc/properties.sh
cp etc/wms.properties.sh.example etc/wms.properties.sh

# Edit the files with your database credentials and settings
vi etc/properties.sh
vi etc/wms.properties.sh
```

The example files contain default values and detailed comments. Replace the example values (like myuser, changeme, [email protected]) with your actual configuration.
Important: Make sure to update the DOWNLOAD_USER_AGENT email address
in etc/properties.sh. This follows OpenStreetMap best practices and allows
server administrators to contact you if there are issues with your requests.
Main configuration file: etc/properties.sh (created from etc/properties.sh.example)
You specify the database name and the user to access it.
Other properties tune the parallelism used when locating notes in countries, or point to alternative URLs for Overpass or the OSM API.
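A minimal sketch of what the configuration might contain. DBNAME, DOWNLOAD_USER_AGENT, and SKIP_XML_VALIDATION are mentioned elsewhere in this README; the other variable names below are illustrative only, so check etc/properties.sh.example for the real ones:

```bash
# Illustrative values only; see etc/properties.sh.example for the actual variables.
DBNAME=notes                     # Database that stores the notes
DB_USER=myuser                   # Database user (illustrative name)
DOWNLOAD_USER_AGENT="OSM-Notes-Ingestion/[email protected]"  # Contact, per OSM best practices
SKIP_XML_VALIDATION=false        # Validate downloaded XML against the XSD files
MAX_THREADS=6                    # Parallel threads for locating notes (illustrative name)
OVERPASS_INTERPRETER="https://overpass-api.de/api/interpreter"  # Overpass endpoint (illustrative name)
OSM_API="https://api.openstreetmap.org/api/0.6"                 # OSM API base URL (illustrative name)
```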
There are two ways to download OSM notes:
- All notes from the Planet dump (published as a daily backup).
- Near real-time notes from the API.
These two methods are used in this tool to initialize the DB and poll the API
periodically.
Both mechanisms are implemented by scripts under the bin directory:
- processAPINotes.sh
- processPlanetNotes.sh
However, to configure from scratch, you just need to call
processAPINotes.sh.
If processAPINotes.sh cannot find the base tables, it will invoke processPlanetNotes.sh and processPlanetNotes.sh --base, which create the basic elements in the database and populate them:
- Download countries and maritime areas.
- Download the Planet dump, validate it, and convert it to CSV to import it into the database. The conversion from the XML Planet dump to CSV is done with AWK scripts.
- Get the location of the notes.
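In practice, a setup from scratch therefore needs only one manual step (assuming the repository is cloned in the home directory, as in the cron examples below):

```bash
# First run: the script detects the missing base tables and bootstraps the database
# (boundaries, Planet dump, note locations) before processing the API notes.
~/OSM-Notes-Ingestion/bin/process/processAPINotes.sh
```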
If processAPINotes.sh gets more than 10,000 notes from an API call, it synchronizes the database by calling processPlanetNotes.sh, following this process:
- Download the notes from the Planet.
- Remove the duplicates from the ones already in the DB.
- Process the new ones.
- Associate new notes with a country or maritime area.
If processAPINotes.sh gets fewer than 10,000 notes, it processes them directly.
Note: If more than 10,000 notes accumulate between two processAPINotes.sh calls on the same day, the database remains unsynchronized until the Planet dump is updated the next UTC day.
That's why frequent API calls are recommended.
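A simplified sketch of this decision logic (pseudo-Bash for illustration; the real flow lives in bin/process/processAPINotes.sh and differs in its details):

```bash
# Illustrative pseudo-logic only; count_api_notes and insert_api_notes are hypothetical helpers.
NEW_NOTES="$(count_api_notes)"

if [ "${NEW_NOTES}" -gt 10000 ]; then
  # Too many changes for the API path: resynchronize from the latest Planet dump.
  ~/OSM-Notes-Ingestion/bin/process/processPlanetNotes.sh
else
  # Normal case: load the recently modified notes and comments directly.
  insert_api_notes
fi
```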
You can run processAPINotes.sh from a crontab every 15 minutes, to process
notes almost in real-time.
You can export the LOG_LEVEL variable, and then call the scripts normally.
```bash
export LOG_LEVEL=DEBUG
./processAPINotes.sh
```
The levels are (case-sensitive):
- TRACE
- DEBUG
- INFO
- WARN
- ERROR
- FATAL
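The level can also be set for scheduled runs. Most cron implementations (for example, Vixie cron on Ubuntu) accept variable assignments at the top of the crontab, so an illustrative setup could be:

```bash
# Set the log level for every job in this crontab
LOG_LEVEL=INFO
*/15 * * * * ~/OSM-Notes-Ingestion/bin/process/processAPINotes.sh
```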
These are the table types in the database (a query to inspect them follows the list):
- Base tables (notes and note_comments) are the most important, holding the whole history. They don't belong to a specific schema.
- API tables contain the data for recently modified notes and comments. The data from these tables is then bulk-loaded into the base tables. They don't belong to a specific schema, but use a suffix.
- Sync tables contain the data from the most recent Planet download. They don't belong to a specific schema, but use a suffix.
- WMS tables are used to publish the WMS layer. Their schema is wms. They contain a simplified version of the notes with only the location and age.
- The dwh schema contains the data warehouse tables (managed by OSM-Notes-Analytics). See OSM-Notes-Analytics for details.
- Check tables are used for monitoring, to compare the previous day's notes from the normal API flow with the notes from the last day's Planet dump.
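To see how these groups look in a populated database, a quick inspection with psql can help. This is only a sketch: the _api and _sync suffix patterns are assumptions about the naming, so adjust them if your tables differ:

```bash
# List the base and check tables plus anything in the wms schema.
# The %_api / %_sync patterns are assumed suffixes, not guaranteed names.
psql -d notes -c "
  SELECT schemaname, tablename
  FROM pg_tables
  WHERE schemaname = 'wms'
     OR tablename IN ('notes', 'note_comments')
     OR tablename LIKE '%\_check'
     OR tablename LIKE '%\_api'
     OR tablename LIKE '%\_sync'
  ORDER BY schemaname, tablename;"
```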
Some directories have their own README file to explain their content. These files include details about how to run or troubleshoot the scripts.
- bin contains all executable scripts for ingestion and WMS.
- bin/monitor contains scripts to monitor the notes database, validating that it has the same data as the Planet, and sending email messages with the differences.
- bin/process has the main scripts to load the notes database, from the Planet dump and via API calls.
- bin/wms contains scripts for WMS (Web Map Service) layer management.
- etc holds the configuration files for many scripts.
- json holds JSON files for schemas and testing.
- lib holds libraries used in the project; currently only a modified version of a bash logger.
- overpass holds queries to download the country and maritime boundaries with Overpass.
- sld holds files to style the WMS layer on GeoServer.
- sql contains most of the SQL statements to be executed in Postgres. It follows the same directory structure as bin, where each file name prefix matches the corresponding script in the other directory. This directory also contains a script to keep a copy of the note locations in case the whole Planet process has to be re-executed, and the script to remove everything related to this project from the DB.
- sql/monitor holds scripts to check the notes database by comparing it with a Planet dump.
- sql/process has all SQL scripts to load the notes database.
- sql/wms provides the mechanism to publish a WMS layer from the notes. This is the only exception among the files under sql, because this feature is implemented only in SQL scripts; there is no bash script for it. This is the only location of the files related to WMS layer publishing.
- For DWH/ETL SQL scripts, see OSM-Notes-Analytics.
- tests holds the set of scripts to perform tests. This is not part of a unit test set.
- xsd contains the structure of the XML documents to be retrieved (XML Schema). This helps validate the structure of the documents, preventing errors during import from the Planet dump and API calls.
- awk contains all the AWK extraction scripts for the data retrieved from the Planet dump and API calls. They convert XML to CSV efficiently with minimal dependencies.
Periodically, you can run the script processCheckPlanetNotes.sh to monitor and validate that executions are correct and that notes processing has not had errors.
This script creates two tables, one for notes and one for comments, with the suffix _check.
By querying the tables with and without the suffix, you can get the differences; this works best around 6h UTC, when the OSM Planet file is published.
This comparison shows the differences between the API process and the Planet data.
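For instance, a simple comparison of note counts could look like the sketch below. The table name notes_check and the column note_id are assumptions here; the authoritative queries are under sql/monitor:

```bash
# Sketch only: compare how many notes each side has, and list a few IDs missing from the base table.
psql -d notes -c "
  SELECT (SELECT COUNT(*) FROM notes)       AS notes_from_api,
         (SELECT COUNT(*) FROM notes_check) AS notes_from_planet;"

psql -d notes -c "
  SELECT c.note_id
  FROM notes_check AS c
  LEFT JOIN notes AS n ON n.note_id = c.note_id
  WHERE n.note_id IS NULL
  LIMIT 10;"
```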
If you find many differences, especially for comments older than one day, it means the script failed in the past, and the best option is to recreate the database with the processPlanetNotes.sh script.
It is also recommended to create an issue in this GitHub repository, providing
as much information as possible.
This is the way to create the objects for the WMS layer.
More information is in the README.md file under the sql/wms directory.
Use the WMS manager script for easy installation and management:
```bash
# Install WMS components
~/OSM-Notes-Ingestion/bin/wms/wmsManager.sh install

# Check installation status
~/OSM-Notes-Ingestion/bin/wms/wmsManager.sh status

# Remove WMS components
~/OSM-Notes-Ingestion/bin/wms/wmsManager.sh deinstall

# Show help
~/OSM-Notes-Ingestion/bin/wms/wmsManager.sh help
```

For manual installation, execute the SQL directly:

```bash
psql -d notes -v ON_ERROR_STOP=1 -f ~/OSM-Notes-Ingestion/sql/wms/prepareDatabase.sql
```

These are the external dependencies to make it work:
- OSM Planet dump, which provides a daily file with all notes and comments. The file is XML and weighs several hundred MB compressed.
- Overpass, used to download the current boundaries of the countries and maritime areas.
- OSM API, used to get the most recent notes and comments. The current API version supported is 0.6.
- The whole process relies on a PostgreSQL database. It uses intensive SQL operations to achieve good performance when processing the data.
The external dependencies are mostly fixed; however, their endpoints can be changed in the properties file.
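For reference, the notes endpoint of OSM API 0.6 can be queried by hand, which is handy when checking connectivity. This is only a manual check, not something the ingestion scripts require:

```bash
# Fetch the notes inside a small bounding box directly from the OSM API (manual check only).
curl -s "https://api.openstreetmap.org/api/0.6/notes?bbox=-0.2,51.4,0.0,51.6&limit=5"
```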
These are the external libraries and runtime requirements:
- bash_logger, a tool to write log4j-like messages in Bash. It is included as part of the project.
- Bash 4 or higher, because the main code is written in this scripting language.
- Linux and its command-line tools, because the scripts rely heavily on standard command-line utilities.
You can use the following script to remove components from this tool. This is useful if you have to recreate some parts, but the rest is working fine.
```bash
# Remove all components from the database (uses default from properties: notes)
~/OSM-Notes-Ingestion/bin/cleanupAll.sh

# Clean only partitions
~/OSM-Notes-Ingestion/bin/cleanupAll.sh -p

# Change database in etc/properties.sh (DBNAME variable)
# Then run cleanup for that database
~/OSM-Notes-Ingestion/bin/cleanupAll.sh
```

Note: This script handles all components including partition tables, dependencies, and temporary files automatically. Manual cleanup is not recommended as it may leave partition tables or dependencies unresolved.
You can start looking for help by reading the README.md files. You can also run the scripts with -h or --help. There are a few GitHub wiki pages with interesting information. You can also take a look at the code, which is thoroughly documented. Finally, you can create an issue or contact the author.
The project includes comprehensive testing infrastructure with 101 test suite files (~1,000+ individual tests) covering all ingestion system components.
```bash
# Run all tests (recommended)
./tests/run_all_tests.sh

# Run simple tests (no sudo required)
./tests/run_tests_simple.sh

# Run integration tests
./tests/run_integration_tests.sh

# Run quality tests
./tests/run_quality_tests.sh

# Run logging pattern validation tests
./tests/run_logging_validation_tests.sh

# Run sequential tests by level
./tests/run_tests_sequential.sh quick # 15-20 min
```

- Unit Tests: 86 bash suites + 6 SQL suites
- Integration Tests: 8 end-to-end workflow suites
- Parallel Processing: 1 comprehensive suite with 21 tests
- Validation Tests: Data validation, XML processing, error handling
- Performance Tests: Parallel processing, edge cases, optimization
- Quality Tests: Code quality, conventions, formatting
- Logging Pattern Tests: Logging pattern validation and compliance
- WMS Tests: Web Map Service integration and configuration
- ✅ Data Processing: XML/CSV processing, transformations
- ✅ System Integration: Database operations, API integration, WMS services
- ✅ Quality Assurance: Code quality, error handling, edge cases
- ✅ Infrastructure: Monitoring, configuration, tools and utilities
- ✅ Logging Patterns: Logging pattern validation and compliance across all scripts
- Testing Suites Reference - Complete list of all testing suites
- Testing Guide - Testing guidelines and workflows
- Testing Workflows Overview - CI/CD testing workflows
For detailed testing information, see the Testing Suites Reference documentation.
The project uses PostgreSQL for data storage. Before running the scripts, ensure proper database configuration:
- Install PostgreSQL:
  ```bash
  sudo apt-get update && sudo apt-get install postgresql postgresql-contrib
  ```
- Configure authentication (choose one option):
  Option A: Trust authentication (recommended for development)
  ```bash
  sudo nano /etc/postgresql/15/main/pg_hba.conf
  # Change 'peer' to 'trust' for local connections
  sudo systemctl restart postgresql
  ```
  Option B: Password authentication
  ```bash
  echo "localhost:5432:notes:myuser:your_password" > ~/.pgpass
  chmod 600 ~/.pgpass
  ```
- Test connection:
  ```bash
  psql -U myuser -d notes -c "SELECT 1;"
  ```
The project is configured to use:
- Database: notes (default from etc/properties.sh.example)
- User: myuser
- Authentication: peer (uses the system user)
Configuration is stored in etc/properties.sh (created from etc/properties.sh.example).
Important: Always create etc/properties.sh from the example file before
running the scripts. The actual file is not tracked in Git for security
reasons.
For troubleshooting, check the PostgreSQL logs and ensure proper authentication configuration.
The configuration files (etc/properties.sh and etc/wms.properties.sh) are
already in .gitignore and will not be committed to the repository. This
ensures your local credentials and settings remain secure.
To set up your local configuration:
```bash
# Create your local configuration files from the examples
cp etc/properties.sh.example etc/properties.sh
cp etc/wms.properties.sh.example etc/wms.properties.sh

# Edit with your local settings
vi etc/properties.sh
vi etc/wms.properties.sh
```

Note: The example files (.example) are tracked in Git and serve as templates. Your local files (without .example) contain your actual credentials and are ignored by Git.
Andres Gomez (@AngocA) was the main developer of this idea. He thanks Jose Luis Ceron Sarria for all his help designing the architecture, defining the data modeling and implementing the infrastructure on the cloud.