A modern Python 3.13+ microservices system for processing Discogs database exports into multiple storage backends.
Transform the entire Discogs music database into queryable graph and relational databases. This system downloads monthly data dumps from Discogs, efficiently parses XML files, and stores the data in both Neo4j (graph database) and PostgreSQL (relational database) for different query patterns and use cases.
Discogsography consists of five microservices that work together to process, monitor, and explore the complete Discogs database:
- Dashboard - Real-time monitoring dashboard with WebSocket updates for all services
- Extractor - Downloads Discogs XML dumps from S3, validates checksums, parses XML to JSON, and publishes to message queues
- Graphinator - Consumes messages and builds a graph database in Neo4j with relationships between artists, labels, releases, and masters
- Tableinator - Consumes messages and stores denormalized data in PostgreSQL for fast queries and full-text search
- Discovery - AI-powered music discovery service with semantic search, recommendations, and advanced analytics
```mermaid
graph TD
    S3[("Discogs S3<br/>Data Dumps")]
    EXT[["Extractor<br/>(XML → JSON)"]]
    RMQ{{"RabbitMQ<br/>Message Queue"}}
    NEO4J[(Neo4j<br/>Graph DB)]
    PG[(PostgreSQL<br/>Relational DB)]
    GRAPH[["Graphinator"]]
    TABLE[["Tableinator"]]
    DASH[["Dashboard<br/>(Monitoring)"]]
    DISCO[["Discovery<br/>(AI Analytics)"]]

    S3 -->|Download & Parse| EXT
    EXT -->|Publish Messages| RMQ
    RMQ -->|Consume| GRAPH
    RMQ -->|Consume| TABLE
    GRAPH -->|Store| NEO4J
    TABLE -->|Store| PG
    DISCO -.->|Query & Analyze| NEO4J
    DISCO -.->|Query & Analyze| PG
    DISCO -.->|AI Processing| DISCO
    DASH -.->|Monitor| EXT
    DASH -.->|Monitor| GRAPH
    DASH -.->|Monitor| TABLE
    DASH -.->|Monitor| DISCO
    DASH -.->|Query Stats| RMQ
    DASH -.->|Query Stats| NEO4J
    DASH -.->|Query Stats| PG

    style S3 fill:#e1f5fe
    style RMQ fill:#fff3e0
    style NEO4J fill:#f3e5f5
    style PG fill:#e8f5e9
    style DASH fill:#fce4ec
    style DISCO fill:#e3f2fd
```
- 🔄 Automatic Updates: Periodic checks for new Discogs data releases (configurable interval, default 15 days)
- ⚡ Efficient Processing: Hash-based deduplication to avoid reprocessing unchanged records (see the sketch after this list)
- 🚀 High Performance: Multi-threaded XML parsing and concurrent message processing
- 🛡️ Fault Tolerance: Message acknowledgment, automatic retries, and graceful shutdown
- 📊 Progress Tracking: Real-time progress monitoring with detailed statistics
- 🐋 Production-Ready Docker: Full Docker Compose setup with security hardening
- 🔒 Type Safety: Comprehensive type hints and strict mypy validation
- 🔐 Security First: Bandit scanning, secure coding practices, and container hardening
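To make the deduplication idea concrete, here is a minimal, hypothetical sketch (not the extractor's actual code): hash each record's canonical JSON form and skip any record whose hash matches the one stored previously.

```python
# Hypothetical illustration of hash-based deduplication; the real services
# persist hashes in their databases rather than in an in-memory dict.
import hashlib
import json


def record_hash(record: dict) -> str:
    """Hash the record's canonical JSON form so key order doesn't matter."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def should_process(record: dict, known_hashes: dict[str, str]) -> bool:
    """Return True when the record is new or changed since the last run."""
    new_hash = record_hash(record)
    if known_hashes.get(record["id"]) == new_hash:
        return False  # unchanged record -> skip reprocessing
    known_hashes[record["id"]] = new_hash
    return True
```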
- 📖 CLAUDE.md - Detailed technical documentation for development
- 🤖 Task Automation - Taskipy commands and workflows
- 🔒 Docker Security - Container security best practices
- 🏗️ Dockerfile Standards - Dockerfile implementation standards
- Python: 3.13+ (Install with uv)
- Docker: Docker Desktop or Docker Engine with Compose
- Storage: ~100GB free disk space for Discogs data
- Memory: 8GB+ RAM recommended
- Network: Stable internet for initial download (~50GB)
1. Clone the repository:

   ```bash
   git clone https://github.com/SimplicityGuy/discogsography.git
   cd discogsography
   ```

2. Start all services:

   ```bash
   docker-compose up -d
   ```

3. Monitor the logs:

   ```bash
   docker-compose logs -f extractor
   ```

4. Access the services:

   | Service | URL/Connection | Credentials |
   | --- | --- | --- |
   | 📊 Dashboard | http://localhost:8003 | No auth required |
   | 🎵 Discovery | http://localhost:8005 | No auth required |
   | 🐰 RabbitMQ | http://localhost:15672 | discogsography / discogsography |
   | 🔗 Neo4j Browser | http://localhost:7474 | neo4j / discogsography |
   | 🐘 PostgreSQL | localhost:5433 | discogsography / discogsography |
1. Install the uv package manager:

   ```bash
   curl -LsSf https://astral.sh/uv/install.sh | sh
   ```

2. Install dependencies:

   ```bash
   uv sync --all-extras
   ```

3. Set up pre-commit hooks:

   ```bash
   uv run pre-commit install
   ```

4. Set environment variables:

   ```bash
   export AMQP_CONNECTION="amqp://guest:guest@localhost:5672/"
   export NEO4J_ADDRESS="bolt://localhost:7687"
   export NEO4J_USERNAME="neo4j"
   export NEO4J_PASSWORD="password"
   export POSTGRES_ADDRESS="localhost:5433"
   export POSTGRES_USERNAME="postgres"
   export POSTGRES_PASSWORD="password"
   export POSTGRES_DATABASE="discogsography"
   ```

5. Run the services:

   ```bash
   # Terminal 1 - Dashboard
   uv run python dashboard/dashboard.py

   # Terminal 2 - Extractor
   uv run python extractor/extractor.py

   # Terminal 3 - Graphinator
   uv run python graphinator/graphinator.py

   # Terminal 4 - Tableinator
   uv run python tableinator/tableinator.py
   ```
All services are configured via environment variables. Copy `.env.example` to `.env` and customize as needed:
| Variable | Description | Default | Service |
| --- | --- | --- | --- |
| `AMQP_CONNECTION` | RabbitMQ connection string | Required | All |
| `DISCOGS_ROOT` | Path for downloaded files | `/discogs-data` | Extractor |
| `PERIODIC_CHECK_DAYS` | Days between update checks | `15` | Extractor |
| `NEO4J_ADDRESS` | Neo4j bolt address | Required | Dashboard, Graphinator |
| `NEO4J_USERNAME` | Neo4j username | Required | Dashboard, Graphinator |
| `NEO4J_PASSWORD` | Neo4j password | Required | Dashboard, Graphinator |
| `POSTGRES_ADDRESS` | PostgreSQL host:port | Required | Dashboard, Tableinator |
| `POSTGRES_USERNAME` | PostgreSQL username | Required | Dashboard, Tableinator |
| `POSTGRES_PASSWORD` | PostgreSQL password | Required | Dashboard, Tableinator |
| `POSTGRES_DATABASE` | PostgreSQL database | Required | Dashboard, Tableinator |
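For reference, here is a minimal sketch of how a service might read the required variables. The project centralizes this in `common/config.py`, so treat the dataclass and field names below as illustrative assumptions rather than the actual API.

```python
# Illustrative settings loader; field names are assumptions for this sketch.
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    amqp_connection: str
    neo4j_address: str
    neo4j_username: str
    neo4j_password: str


def _require(name: str) -> str:
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


def load_settings() -> Settings:
    return Settings(
        amqp_connection=_require("AMQP_CONNECTION"),
        neo4j_address=_require("NEO4J_ADDRESS"),
        neo4j_username=_require("NEO4J_USERNAME"),
        neo4j_password=_require("NEO4J_PASSWORD"),
    )
```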
The complete Discogs dataset includes:
| Data Type | Count | Storage |
| --- | --- | --- |
| 📀 Releases | ~15 million | ~40GB |
| 🎤 Artists | ~2 million | ~5GB |
| 🎵 Masters | ~2 million | ~3GB |
| 🏢 Labels | ~1.5 million | ~2GB |
Total Requirements:
- 📥 Download: ~50GB of compressed XML files
- 💾 Storage: ~100GB for extracted data
- ⏱️ Processing: 2-6 hours (varies by hardware)
Explore complex relationships in the music industry:
```cypher
MATCH (a:Artist {name: "Pink Floyd"})-[:BY]-(r:Release)
RETURN r.title, r.year
ORDER BY r.year
LIMIT 10
```

```cypher
MATCH (member:Artist)-[:MEMBER_OF]->(band:Artist {name: "The Beatles"})
RETURN member.name, member.real_name
```

```cypher
MATCH (r:Release)-[:ON]->(l:Label {name: "Blue Note"})
WHERE r.year >= 1950 AND r.year <= 1970
RETURN r.title, r.artist, r.year
ORDER BY r.year
```

```cypher
MATCH (a1:Artist {name: "Miles Davis"})-[:COLLABORATED_WITH]-(a2:Artist)
RETURN DISTINCT a2.name
ORDER BY a2.name
```
Fast structured queries on denormalized data:
```sql
SELECT
    data->>'title' as title,
    data->>'artist' as artist,
    data->>'year' as year
FROM releases
WHERE data->>'title' ILIKE '%dark side%'
ORDER BY (data->>'year')::int DESC
LIMIT 10;
```

```sql
SELECT
    data->>'title' as title,
    data->>'year' as year,
    data->'genres' as genres
FROM releases
WHERE data->>'artist' = 'Miles Davis'
  AND (data->>'year')::int BETWEEN 1950 AND 1960
ORDER BY (data->>'year')::int;
```

```sql
SELECT
    genre,
    COUNT(*) as release_count,
    MIN((data->>'year')::int) as first_release,
    MAX((data->>'year')::int) as last_release
FROM releases,
    jsonb_array_elements_text(data->'genres') as genre
GROUP BY genre
ORDER BY release_count DESC
LIMIT 20;
```
Access the real-time monitoring dashboard at http://localhost:8003:
- Service Health: Live status of all microservices
- Queue Metrics: Message rates, depths, and consumer counts
- Database Stats: Connection pools and storage usage
- Activity Log: Recent system events and processing updates
- WebSocket Updates: Real-time data without page refresh (see the client sketch below)
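The same updates can be consumed programmatically. Below is a hedged sketch using the third-party `websockets` package; the `/ws` endpoint path is an assumption, so check `dashboard/dashboard.py` for the actual route.

```python
# Minimal WebSocket client sketch; the "/ws" path is an assumed endpoint.
import asyncio

import websockets


async def watch_dashboard() -> None:
    async with websockets.connect("ws://localhost:8003/ws") as ws:
        async for message in ws:  # each message is one real-time update
            print(message)


asyncio.run(watch_dashboard())
```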
Monitor and debug your system with built-in tools:
```bash
# Check service logs for errors
uv run task check-errors

# Monitor RabbitMQ queues in real-time
uv run task monitor

# Comprehensive system health dashboard
uv run task system-monitor

# View logs for all services
uv run task logs
```
Each service provides detailed telemetry:
- Processing Rates: Records/second for each data type
- Queue Health: Depth, consumer count, throughput
- Error Tracking: Failed messages, retry counts
- Performance: Processing time, memory usage
- Stall Detection: Alerts when processing stops (a sketch of the idea follows this list)
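Stall detection can be as simple as tracking the time of the last successful message. The sketch below is illustrative only; the class and the five-minute threshold are assumptions, not the services' actual implementation.

```python
# Illustrative stall detector: alert when no message has been processed
# for longer than a configurable threshold (default here is 5 minutes).
import time


class StallDetector:
    def __init__(self, stall_after_seconds: float = 300.0) -> None:
        self.stall_after = stall_after_seconds
        self.last_processed = time.monotonic()

    def record_progress(self) -> None:
        """Call after each successfully processed message."""
        self.last_processed = time.monotonic()

    def is_stalled(self) -> bool:
        return (time.monotonic() - self.last_processed) > self.stall_after
```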
The project leverages cutting-edge Python tooling:
| Tool | Purpose | Configuration |
| --- | --- | --- |
| uv | 10-100x faster package management | `pyproject.toml` |
| ruff | Lightning-fast linting & formatting | `pyproject.toml` |
| mypy | Strict static type checking | `pyproject.toml` |
| bandit | Security vulnerability scanning | `pyproject.toml` |
| pre-commit | Git hooks for code quality | `.pre-commit-config.yaml` |
Comprehensive test coverage with multiple test types:
```bash
# Run all tests (excluding E2E)
uv run task test

# Run with coverage report
uv run task test-cov

# Run specific test suites
uv run pytest tests/extractor/    # Extractor tests
uv run pytest tests/graphinator/  # Graphinator tests
uv run pytest tests/tableinator/  # Tableinator tests
uv run pytest tests/dashboard/    # Dashboard tests
```

```bash
# One-time browser setup
uv run playwright install chromium
uv run playwright install-deps chromium

# Run E2E tests (automatic server management)
uv run task test-e2e

# Run with specific browser
uv run pytest tests/dashboard/test_dashboard_ui.py -m e2e --browser firefox
```
```bash
# Setup development environment
uv sync --all-extras
uv run task init  # Install pre-commit hooks

# Before committing
uv run task lint      # Run linting
uv run task format    # Format code
uv run task test      # Run tests
uv run task security  # Security scan

# Or run everything at once
uv run pre-commit run --all-files
```
```
discogsography/
├── 📦 common/                # Shared utilities and configuration
│   ├── config.py             # Centralized configuration management
│   └── health_server.py      # Health check endpoint server
├── 📊 dashboard/             # Real-time monitoring dashboard
│   ├── dashboard.py          # FastAPI backend with WebSocket
│   └── static/               # Frontend HTML/CSS/JS
├── 📥 extractor/             # Discogs data ingestion service
│   ├── extractor.py          # Main processing logic
│   └── discogs.py            # S3 download and validation
├── 🔗 graphinator/           # Neo4j graph database service
│   └── graphinator.py        # Graph relationship builder
├── 🐘 tableinator/           # PostgreSQL storage service
│   └── tableinator.py        # Relational data management
├── 🔧 utilities/             # Operational tools
│   ├── check_errors.py       # Log analysis
│   ├── monitor_queues.py     # Real-time queue monitoring
│   └── system_monitor.py     # System health dashboard
├── 🧪 tests/                 # Comprehensive test suite
├── 📚 docs/                  # Additional documentation
├── 🐋 docker-compose.yml     # Container orchestration
└── 📦 pyproject.toml         # Project configuration
```
All logger calls (`logger.info`, `logger.warning`, `logger.error`) in this project follow a consistent emoji pattern for visual clarity: each message starts with an emoji followed by exactly one space before the message text.
| Emoji | Usage | Example |
| --- | --- | --- |
| 🚀 | Startup messages | `logger.info("🚀 Starting service...")` |
| ✅ | Success/completion messages | `logger.info("✅ Operation completed successfully")` |
| ❌ | Errors | `logger.error("❌ Failed to connect to database")` |
| ⚠️ | Warnings | `logger.warning("⚠️ Connection timeout, retrying...")` |
| 🛑 | Shutdown/stop messages | `logger.info("🛑 Shutting down gracefully")` |
| 📊 | Progress/statistics | `logger.info("📊 Processed 1000 records")` |
| 📥 | Downloads | `logger.info("📥 Starting download of data")` |
| ⬇️ | Downloading files | `logger.info("⬇️ Downloading file.xml")` |
| 🔄 | Processing operations | `logger.info("🔄 Processing batch of messages")` |
| ⏳ | Waiting/pending | `logger.info("⏳ Waiting for messages...")` |
| 📋 | Metadata operations | `logger.info("📋 Loaded metadata from cache")` |
| 🔍 | Checking/searching | `logger.info("🔍 Checking for updates...")` |
| 📄 | File operations | `logger.info("📄 File created successfully")` |
| 🆕 | New versions | `logger.info("🆕 Found newer version available")` |
| ⏰ | Periodic operations | `logger.info("⏰ Running periodic check")` |
| 🔧 | Setup/configuration | `logger.info("🔧 Creating database indexes")` |
| 🐰 | RabbitMQ connections | `logger.info("🐰 Connected to RabbitMQ")` |
| 🔗 | Neo4j connections | `logger.info("🔗 Connected to Neo4j")` |
| 🐘 | PostgreSQL operations | `logger.info("🐘 Connected to PostgreSQL")` |
| 💾 | Database save operations | `logger.info("💾 Updated artist ID=123 in Neo4j")` |
| 🏥 | Health server | `logger.info("🏥 Health server started on port 8001")` |
| ⏩ | Skipping operations | `logger.info("⏩ Skipped artist ID=123 (no changes)")` |
logger.info("π Starting Discogs data extractor")
logger.error("β Failed to connect to Neo4j: connection refused")
logger.warning("β οΈ Slow consumer detected, processing delayed")
logger.info("β
All files processed successfully")
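If you want to enforce the convention in tests, a loose, hypothetical checker might look like this (it treats any leading non-ASCII run as the emoji, which is good enough for a lint-style check):

```python
# Hypothetical convention checker: one leading emoji, exactly one space.
import re

_CONVENTION = re.compile(r"^[^\x00-\x7F]+ \S")


def follows_convention(message: str) -> bool:
    """True when the message starts with an emoji and exactly one space."""
    return bool(_CONVENTION.match(message))


assert follows_convention("🚀 Starting service...")
assert not follows_convention("Starting service...")  # no emoji prefix
assert not follows_convention("🚀  double space")  # more than one space
```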
The graph database models complex music industry relationships:
| Node | Description | Key Properties |
| --- | --- | --- |
| `Artist` | Musicians, bands, producers | id, name, real_name, profile |
| `Label` | Record labels and imprints | id, name, profile, parent_label |
| `Master` | Master recordings | id, title, year, main_release |
| `Release` | Physical/digital releases | id, title, year, country, format |
| `Genre` | Musical genres | name |
| `Style` | Sub-genres and styles | name |
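A hedged sketch of how a node from the table above might be upserted with the official `neo4j` Python driver; the Cypher and property names mirror the table, but the real graphinator logic may differ.

```python
# Illustrative Artist upsert with the official neo4j driver.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "discogsography"))


def upsert_artist(artist: dict) -> None:
    query = (
        "MERGE (a:Artist {id: $id}) "
        "SET a.name = $name, a.real_name = $real_name, a.profile = $profile"
    )
    with driver.session() as session:
        session.run(
            query,
            id=artist["id"],
            name=artist.get("name"),
            real_name=artist.get("real_name"),
            profile=artist.get("profile"),
        )
```

Keying the `MERGE` on `id` keeps the operation idempotent, which matters when messages are redelivered.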
```
🎤 Artist Relationships:
├── MEMBER_OF ────────→ Artist (band membership)
├── ALIAS_OF ─────────→ Artist (alternative names)
├── COLLABORATED_WITH → Artist (collaborations)
└── PERFORMED_ON ─────→ Release (credits)

📀 Release Relationships:
├── BY ───────────────→ Artist (performer credits)
├── ON ───────────────→ Label (release label)
├── DERIVED_FROM ─────→ Master (master recording)
├── IS ───────────────→ Genre (genre classification)
└── IS ───────────────→ Style (style classification)

🏢 Label Relationships:
└── SUBLABEL_OF ──────→ Label (parent/child labels)

🎵 Classification:
└── Style -[:PART_OF]→ Genre (hierarchy)
```
Optimized for fast queries and full-text search:
```sql
-- Artists table with JSONB for flexible schema
CREATE TABLE artists (
    data_id VARCHAR PRIMARY KEY,
    hash VARCHAR NOT NULL UNIQUE,
    data JSONB NOT NULL
);
CREATE INDEX idx_artists_name ON artists ((data->>'name'));
CREATE INDEX idx_artists_gin ON artists USING GIN (data);

-- Labels table
CREATE TABLE labels (
    data_id VARCHAR PRIMARY KEY,
    hash VARCHAR NOT NULL UNIQUE,
    data JSONB NOT NULL
);
CREATE INDEX idx_labels_name ON labels ((data->>'name'));

-- Masters table
CREATE TABLE masters (
    data_id VARCHAR PRIMARY KEY,
    hash VARCHAR NOT NULL UNIQUE,
    data JSONB NOT NULL
);
CREATE INDEX idx_masters_title ON masters ((data->>'title'));
CREATE INDEX idx_masters_year ON masters ((data->>'year'));

-- Releases table with extensive indexing
CREATE TABLE releases (
    data_id VARCHAR PRIMARY KEY,
    hash VARCHAR NOT NULL UNIQUE,
    data JSONB NOT NULL
);
CREATE INDEX idx_releases_title ON releases ((data->>'title'));
CREATE INDEX idx_releases_artist ON releases ((data->>'artist'));
CREATE INDEX idx_releases_year ON releases ((data->>'year'));
CREATE INDEX idx_releases_gin ON releases USING GIN (data);
```
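Against this schema, an upsert only needs to touch rows whose hash changed. A hedged sketch with `psycopg2` (the real tableinator may use a different driver; this just illustrates the `ON CONFLICT` pattern):

```python
# Illustrative upsert: skip the UPDATE when the stored hash already matches.
import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect(
    host="localhost",
    port=5433,
    user="discogsography",
    password="discogsography",
    dbname="discogsography",
)


def upsert_release(data_id: str, content_hash: str, data: dict) -> None:
    sql = """
        INSERT INTO releases (data_id, hash, data)
        VALUES (%s, %s, %s)
        ON CONFLICT (data_id) DO UPDATE
        SET hash = EXCLUDED.hash, data = EXCLUDED.data
        WHERE releases.hash IS DISTINCT FROM EXCLUDED.hash
    """
    with conn, conn.cursor() as cur:
        cur.execute(sql, (data_id, content_hash, Json(data)))
```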
Typical processing rates on modern hardware:
| Service | Records/Second | Bottleneck |
| --- | --- | --- |
| 📥 Extractor | 5,000-10,000 | XML parsing, I/O |
| 🔗 Graphinator | 1,000-2,000 | Neo4j transactions |
| 🐘 Tableinator | 3,000-5,000 | PostgreSQL inserts |
Minimum:

- CPU: 4 cores
- RAM: 8GB
- Storage: 200GB HDD
- Network: 10 Mbps

Recommended:

- CPU: 8+ cores
- RAM: 16GB+
- Storage: 200GB+ SSD (NVMe preferred)
- Network: 100 Mbps+
Neo4j Configuration:

```properties
# neo4j.conf
dbms.memory.heap.initial_size=4g
dbms.memory.heap.max_size=4g
dbms.memory.pagecache.size=2g
```

PostgreSQL Configuration:

```ini
# postgresql.conf
shared_buffers = 4GB
work_mem = 256MB
maintenance_work_mem = 1GB
effective_cache_size = 12GB
```

RabbitMQ Configuration:

```yaml
# RabbitMQ prefetch for consumers
PREFETCH_COUNT: 100  # Adjust based on processing speed
```
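In Python, the prefetch limit is applied on the channel before consuming. Below is a hedged sketch with `pika` (the services may use an async client such as `aio-pika` instead; the queue name here is illustrative):

```python
# Illustrative consumer with a prefetch cap of 100 unacknowledged messages.
import pika

connection = pika.BlockingConnection(
    pika.URLParameters("amqp://guest:guest@localhost:5672/")
)
channel = connection.channel()
channel.basic_qos(prefetch_count=100)  # cap in-flight deliveries per consumer


def on_message(ch, method, properties, body):
    # ... process the message, then acknowledge it ...
    ch.basic_ack(delivery_tag=method.delivery_tag)


channel.basic_consume(queue="discogsography-releases", on_message_callback=on_message)
channel.start_consuming()
```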
- Use SSD/NVMe for the `/discogs-data` directory
- Enable compression for PostgreSQL tables
- Configure Neo4j for SSD optimization
- Use separate disks for databases if possible
Download issues:

```bash
# Check connectivity
curl -I https://discogs-data-dumps.s3.us-west-2.amazonaws.com

# Verify disk space
df -h /discogs-data

# Check permissions
ls -la /discogs-data
```

Solutions:

- ✅ Ensure internet connectivity
- ✅ Verify 100GB+ free space
- ✅ Check directory permissions
RabbitMQ connection issues:

```bash
# Check RabbitMQ status
docker-compose ps rabbitmq
docker-compose logs rabbitmq

# Test connection
curl -u discogsography:discogsography http://localhost:15672/api/overview
```

Solutions:

- ✅ Wait for RabbitMQ startup (30-60s)
- ✅ Check firewall settings
- ✅ Verify credentials in `.env`
Neo4j:

```bash
# Check Neo4j status
docker-compose logs neo4j
curl http://localhost:7474

# Test bolt connection
echo "MATCH (n) RETURN count(n);" | cypher-shell -u neo4j -p discogsography
```

PostgreSQL:

```bash
# Check PostgreSQL status
docker-compose logs postgres

# Test connection
PGPASSWORD=discogsography psql -h localhost -U discogsography -d discogsography -c "SELECT 1;"
```
1. 🔍 Check Service Health

   ```bash
   curl http://localhost:8000/health  # Extractor
   curl http://localhost:8001/health  # Graphinator
   curl http://localhost:8002/health  # Tableinator
   curl http://localhost:8003/health  # Dashboard
   curl http://localhost:8004/health  # Discovery
   ```

2. 📋 Monitor Real-time Logs

   ```bash
   # All services
   uv run task logs

   # Specific service
   docker-compose logs -f extractor
   ```

3. 🔎 Analyze Errors

   ```bash
   # Check for errors across all services
   uv run task check-errors

   # Monitor queue health
   uv run task monitor
   ```

4. 🗄️ Verify Data Storage

   ```cypher
   // Neo4j: Check node counts
   MATCH (n) RETURN labels(n)[0] as type, count(n) as count;
   ```

   ```sql
   -- PostgreSQL: Check table counts
   SELECT 'artists' as table_name, COUNT(*) FROM artists
   UNION ALL SELECT 'releases', COUNT(*) FROM releases
   UNION ALL SELECT 'labels', COUNT(*) FROM labels
   UNION ALL SELECT 'masters', COUNT(*) FROM masters;
   ```
We welcome contributions! Please follow these guidelines:
1. Fork & Clone

   ```bash
   git clone https://github.com/YOUR_USERNAME/discogsography.git
   cd discogsography
   ```

2. Set Up the Development Environment

   ```bash
   uv sync --all-extras
   uv run task init  # Install pre-commit hooks
   ```

3. Create a Feature Branch

   ```bash
   git checkout -b feature/amazing-feature
   ```

4. Make Changes

   - Write clean, documented code
   - Add comprehensive tests
   - Update relevant documentation

5. Validate Changes

   ```bash
   uv run task lint      # Fix any linting issues
   uv run task test      # Ensure tests pass
   uv run task security  # Check for vulnerabilities
   ```

6. Commit with Conventional Commits

   ```bash
   git commit -m "feat: add amazing feature"
   # Types: feat, fix, docs, style, refactor, test, chore
   ```

7. Push & Create a PR

   ```bash
   git push origin feature/amazing-feature
   ```
- Code Style: Follow ruff linting and formatting
- Type Hints: Required for all functions
- Tests: Maintain >80% coverage
- Docs: Update the README and docstrings
- Logging: Use the emoji conventions (see above)
- Security: Pass bandit checks
Keep dependencies up-to-date with the provided upgrade script:
```bash
# Safely upgrade all dependencies (minor/patch versions)
./scripts/upgrade-packages.sh

# Preview what would be upgraded
./scripts/upgrade-packages.sh --dry-run

# Include major version upgrades
./scripts/upgrade-packages.sh --major
```
The script includes:
- 💾 Automatic backups before upgrades
- ✅ Git safety checks (requires a clean working directory)
- 🧪 Automatic testing after upgrades
- 📦 Comprehensive dependency management across all services
See `scripts/README.md` for more maintenance scripts.
This project is licensed under the MIT License - see the LICENSE file for details.
- 🎵 Discogs for providing the monthly data dumps
- 🐍 The Python community for excellent libraries and tools
- 🌟 All contributors who help improve this project
- 🚀 uv for blazing-fast package management
- 🔥 Ruff for lightning-fast linting
- 🐛 Bug Reports: GitHub Issues
- 💡 Feature Requests: GitHub Discussions
- 💬 Questions: Discussions Q&A
- 📖 CLAUDE.md - Detailed technical documentation
- 🤖 Task Automation - Available tasks and workflows
- 🔒 Docker Security - Security best practices
- 🏗️ Dockerfile Standards - Container standards
- 📦 Service READMEs - Individual service documentation
This project is actively maintained. We welcome contributions, bug reports, and feature requests!