🎶 Using the Discogs database export for local graph exploration. 🎶

discogsography


A modern Python 3.13+ microservices system for processing Discogs database exports into multiple storage backends.

Transform the entire Discogs music database into queryable graph and relational databases. This system downloads monthly data dumps from Discogs, efficiently parses XML files, and stores the data in both Neo4j (graph database) and PostgreSQL (relational database) for different query patterns and use cases.

Overview

Discogsography consists of five microservices that work together to process, monitor, and explore the complete Discogs database:

  1. Dashboard - Real-time monitoring dashboard with WebSocket updates for all services
  2. Extractor - Downloads Discogs XML dumps from S3, validates checksums, parses XML to JSON, and publishes to message queues
  3. Graphinator - Consumes messages and builds a graph database in Neo4j with relationships between artists, labels, releases, and masters
  4. Tableinator - Consumes messages and stores denormalized data in PostgreSQL for fast queries and full-text search
  5. Discovery - AI-powered music discovery service with semantic search, recommendations, and advanced analytics

Architecture

graph TD
    S3[("Discogs S3<br/>Data Dumps")]
    EXT[["Extractor<br/>(XML → JSON)"]]
    RMQ{{"RabbitMQ<br/>Message Queue"}}
    NEO4J[(Neo4j<br/>Graph DB)]
    PG[(PostgreSQL<br/>Relational DB)]
    GRAPH[["Graphinator"]]
    TABLE[["Tableinator"]]
    DASH[["Dashboard<br/>(Monitoring)"]]
    DISCO[["Discovery<br/>(AI Analytics)"]]

    S3 -->|Download & Parse| EXT
    EXT -->|Publish Messages| RMQ
    RMQ -->|Consume| GRAPH
    RMQ -->|Consume| TABLE
    GRAPH -->|Store| NEO4J
    TABLE -->|Store| PG

    DISCO -.->|Query & Analyze| NEO4J
    DISCO -.->|Query & Analyze| PG
    DISCO -.->|AI Processing| DISCO

    DASH -.->|Monitor| EXT
    DASH -.->|Monitor| GRAPH
    DASH -.->|Monitor| TABLE
    DASH -.->|Monitor| DISCO
    DASH -.->|Query Stats| RMQ
    DASH -.->|Query Stats| NEO4J
    DASH -.->|Query Stats| PG

    style S3 fill:#e1f5fe
    style RMQ fill:#fff3e0
    style NEO4J fill:#f3e5f5
    style PG fill:#e8f5e9
    style DASH fill:#fce4ec
    style DISCO fill:#e3f2fd

Key Features

  • 🔄 Automatic Updates: Periodic checking for new Discogs data releases (configurable interval, default 15 days)
  • ⚡ Efficient Processing: Hash-based deduplication to avoid reprocessing unchanged records
  • 🚀 High Performance: Multi-threaded XML parsing and concurrent message processing
  • 🛡️ Fault Tolerance: Message acknowledgment, automatic retries, and graceful shutdown
  • 📊 Progress Tracking: Real-time progress monitoring with detailed statistics
  • 🐋 Production-Ready Docker: Full Docker Compose setup with security hardening
  • 🔒 Type Safety: Comprehensive type hints and strict mypy validation
  • 🔐 Security First: Bandit scanning, secure coding practices, and container hardening


🚀 Quick Start

Prerequisites

  • Python: 3.13+ (Install with uv)
  • Docker: Docker Desktop or Docker Engine with Compose
  • Storage: ~100GB free disk space for Discogs data
  • Memory: 8GB+ RAM recommended
  • Network: Stable internet for initial download (~50GB)

Using Docker Compose (Recommended)

  1. Clone the repository:

    git clone https://github.com/SimplicityGuy/discogsography.git
    cd discogsography
  2. Start all services:

    docker-compose up -d
  3. Monitor the logs:

    docker-compose logs -f extractor
  4. Access the services:

    | Service | URL/Connection | Credentials |
    |---------|----------------|-------------|
    | 📊 Dashboard | http://localhost:8003 | No auth required |
    | 🎵 Discovery | http://localhost:8005 | No auth required |
    | 🐰 RabbitMQ | http://localhost:15672 | discogsography / discogsography |
    | 🔗 Neo4j Browser | http://localhost:7474 | neo4j / discogsography |
    | 🐘 PostgreSQL | localhost:5433 | discogsography / discogsography |

Local Development

  1. Install uv package manager:

    curl -LsSf https://astral.sh/uv/install.sh | sh
  2. Install dependencies:

    uv sync --all-extras
  3. Set up pre-commit hooks:

    uv run pre-commit install
  4. Set environment variables:

    export AMQP_CONNECTION="amqp://guest:guest@localhost:5672/"
    export NEO4J_ADDRESS="bolt://localhost:7687"
    export NEO4J_USERNAME="neo4j"
    export NEO4J_PASSWORD="password"
    export POSTGRES_ADDRESS="localhost:5433"
    export POSTGRES_USERNAME="postgres"
    export POSTGRES_PASSWORD="password"
    export POSTGRES_DATABASE="discogsography"
  5. Run services:

    # Terminal 1 - Dashboard
    uv run python dashboard/dashboard.py
    
    # Terminal 2 - Extractor
    uv run python extractor/extractor.py
    
    # Terminal 3 - Graphinator
    uv run python graphinator/graphinator.py
    
    # Terminal 4 - Tableinator
    uv run python tableinator/tableinator.py

βš™οΈ Configuration

Environment Variables

All services are configured via environment variables. Copy .env.example to .env and customize as needed:

| Variable | Description | Default | Service |
|----------|-------------|---------|---------|
| AMQP_CONNECTION | RabbitMQ connection string | Required | All |
| DISCOGS_ROOT | Path for downloaded files | /discogs-data | Extractor |
| PERIODIC_CHECK_DAYS | Days between update checks | 15 | Extractor |
| NEO4J_ADDRESS | Neo4j bolt address | Required | Dashboard, Graphinator |
| NEO4J_USERNAME | Neo4j username | Required | Dashboard, Graphinator |
| NEO4J_PASSWORD | Neo4j password | Required | Dashboard, Graphinator |
| POSTGRES_ADDRESS | PostgreSQL host:port | Required | Dashboard, Tableinator |
| POSTGRES_USERNAME | PostgreSQL username | Required | Dashboard, Tableinator |
| POSTGRES_PASSWORD | PostgreSQL password | Required | Dashboard, Tableinator |
| POSTGRES_DATABASE | PostgreSQL database | Required | Dashboard, Tableinator |
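A service might load these variables along the following lines. This is an illustrative sketch (only three variables shown), not the project's actual common/config.py:

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceConfig:
    amqp_connection: str
    discogs_root: str
    periodic_check_days: int

    @classmethod
    def from_env(cls) -> "ServiceConfig":
        # Required variables raise KeyError if missing; optional ones fall
        # back to the documented defaults.
        return cls(
            amqp_connection=os.environ["AMQP_CONNECTION"],
            discogs_root=os.environ.get("DISCOGS_ROOT", "/discogs-data"),
            periodic_check_days=int(os.environ.get("PERIODIC_CHECK_DAYS", "15")),
        )
```

Failing fast on missing required variables at startup keeps misconfiguration errors close to their cause.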

💿 Data Volume

The complete Discogs dataset includes:

| Data Type | Count | Storage |
|-----------|-------|---------|
| 📀 Releases | ~15 million | ~40GB |
| 🎤 Artists | ~2 million | ~5GB |
| 🎵 Masters | ~2 million | ~3GB |
| 🏢 Labels | ~1.5 million | ~2GB |

Total Requirements:

  • 📥 Download: ~50GB compressed XML files
  • 💾 Storage: ~100GB for extracted data
  • ⏱️ Processing: 2-6 hours (varies by hardware)

📊 Usage Examples

🔗 Neo4j Graph Queries

Explore complex relationships in the music industry:

Find all albums by an artist

MATCH (a:Artist {name: "Pink Floyd"})-[:BY]-(r:Release)
RETURN r.title, r.year
ORDER BY r.year
LIMIT 10

Discover band members

MATCH (member:Artist)-[:MEMBER_OF]->(band:Artist {name: "The Beatles"})
RETURN member.name, member.real_name

Explore label catalogs

MATCH (r:Release)-[:ON]->(l:Label {name: "Blue Note"})
WHERE r.year >= 1950 AND r.year <= 1970
RETURN r.title, r.artist, r.year
ORDER BY r.year

Find artist collaborations

MATCH (a1:Artist {name: "Miles Davis"})-[:COLLABORATED_WITH]-(a2:Artist)
RETURN DISTINCT a2.name
ORDER BY a2.name
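Queries like these can also be issued from Python with the official Neo4j driver. The sketch below reuses the first query above; the function name is illustrative, and connection details follow the defaults documented in this README:

```python
ALBUMS_BY_ARTIST = """
MATCH (a:Artist {name: $name})-[:BY]-(r:Release)
RETURN r.title AS title, r.year AS year
ORDER BY r.year
LIMIT 10
"""


def albums_by_artist(uri: str, user: str, password: str, name: str) -> list[dict]:
    # Lazy import so the module loads even without the driver installed.
    from neo4j import GraphDatabase  # pip install neo4j (5.x)

    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        # execute_query handles sessions, retries, and parameter binding.
        records, _, _ = driver.execute_query(ALBUMS_BY_ARTIST, name=name)
        return [r.data() for r in records]
```

Against a running stack: albums_by_artist("bolt://localhost:7687", "neo4j", "discogsography", "Pink Floyd"). Parameter binding ($name) avoids string interpolation in Cypher.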

🐘 PostgreSQL Queries

Fast structured queries on denormalized data:

Full-text search releases

SELECT
    data->>'title' as title,
    data->>'artist' as artist,
    data->>'year' as year
FROM releases
WHERE data->>'title' ILIKE '%dark side%'
ORDER BY (data->>'year')::int DESC
LIMIT 10;

Artist discography

SELECT
    data->>'title' as title,
    data->>'year' as year,
    data->'genres' as genres
FROM releases
WHERE data->>'artist' = 'Miles Davis'
AND (data->>'year')::int BETWEEN 1950 AND 1960
ORDER BY (data->>'year')::int;

Genre statistics

SELECT
    genre,
    COUNT(*) as release_count,
    MIN((data->>'year')::int) as first_release,
    MAX((data->>'year')::int) as last_release
FROM releases,
     jsonb_array_elements_text(data->'genres') as genre
GROUP BY genre
ORDER BY release_count DESC
LIMIT 20;
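From Python, the full-text search above can be run as a parameterized query. This is a sketch assuming psycopg 3; table and column names follow the schema shown later in this README:

```python
SEARCH_RELEASES = """
SELECT data->>'title'  AS title,
       data->>'artist' AS artist,
       data->>'year'   AS year
FROM releases
WHERE data->>'title' ILIKE %(pattern)s
ORDER BY (data->>'year')::int DESC
LIMIT %(limit)s;
"""


def search_releases(conninfo: str, term: str, limit: int = 10) -> list[tuple]:
    # Lazy import so the module loads even without psycopg installed.
    import psycopg  # pip install psycopg (v3)

    with psycopg.connect(conninfo) as conn:
        # Server-side parameter binding; never interpolate user input into SQL.
        return conn.execute(
            SEARCH_RELEASES, {"pattern": f"%{term}%", "limit": limit}
        ).fetchall()
```

For example, search_releases("host=localhost port=5433 user=discogsography password=discogsography dbname=discogsography", "dark side") mirrors the ILIKE query above.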

📈 Monitoring & Operations

📊 Dashboard

Access the real-time monitoring dashboard at http://localhost:8003:

  • Service Health: Live status of all microservices
  • Queue Metrics: Message rates, depths, and consumer counts
  • Database Stats: Connection pools and storage usage
  • Activity Log: Recent system events and processing updates
  • WebSocket Updates: Real-time data without page refresh

πŸ” Debug Utilities

Monitor and debug your system with built-in tools:

# Check service logs for errors
uv run task check-errors

# Monitor RabbitMQ queues in real-time
uv run task monitor

# Comprehensive system health dashboard
uv run task system-monitor

# View logs for all services
uv run task logs

📊 Metrics

Each service provides detailed telemetry:

  • Processing Rates: Records/second for each data type
  • Queue Health: Depth, consumer count, throughput
  • Error Tracking: Failed messages, retry counts
  • Performance: Processing time, memory usage
  • Stall Detection: Alerts when processing stops
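The per-type rate and stall-detection metrics can be approximated with a small counter like the one below. This is a sketch; names and thresholds are illustrative, and the services' real telemetry is more involved:

```python
import time


class ThroughputMeter:
    """Track records/second and flag stalls for one data type."""

    def __init__(self, stall_after_s: float = 60.0) -> None:
        self.start = time.monotonic()
        self.last_record = self.start
        self.count = 0
        self.stall_after_s = stall_after_s

    def record(self, n: int = 1) -> None:
        # Call once per processed record (or batch of n records).
        self.count += n
        self.last_record = time.monotonic()

    def rate(self) -> float:
        # Average records/second since the meter was created.
        elapsed = time.monotonic() - self.start
        return self.count / elapsed if elapsed > 0 else 0.0

    def stalled(self) -> bool:
        # No records for stall_after_s seconds counts as a stall.
        return time.monotonic() - self.last_record > self.stall_after_s
```

One meter per data type (artists, labels, masters, releases) gives the per-type rates shown on the dashboard.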

πŸ‘¨β€πŸ’» Development

πŸ› οΈ Modern Python Stack

The project leverages cutting-edge Python tooling:

| Tool | Purpose | Configuration |
|------|---------|---------------|
| uv | 10-100x faster package management | pyproject.toml |
| ruff | Lightning-fast linting & formatting | pyproject.toml |
| mypy | Strict static type checking | pyproject.toml |
| bandit | Security vulnerability scanning | pyproject.toml |
| pre-commit | Git hooks for code quality | .pre-commit-config.yaml |

🧪 Testing

Comprehensive test coverage with multiple test types:

# Run all tests (excluding E2E)
uv run task test

# Run with coverage report
uv run task test-cov

# Run specific test suites
uv run pytest tests/extractor/      # Extractor tests
uv run pytest tests/graphinator/    # Graphinator tests
uv run pytest tests/tableinator/    # Tableinator tests
uv run pytest tests/dashboard/      # Dashboard tests

🎭 E2E Testing with Playwright

# One-time browser setup
uv run playwright install chromium
uv run playwright install-deps chromium

# Run E2E tests (automatic server management)
uv run task test-e2e

# Run with specific browser
uv run pytest tests/dashboard/test_dashboard_ui.py -m e2e --browser firefox

🔧 Development Workflow

# Setup development environment
uv sync --all-extras
uv run task init  # Install pre-commit hooks

# Before committing
uv run task lint     # Run linting
uv run task format   # Format code
uv run task test     # Run tests
uv run task security # Security scan

# Or run everything at once
uv run pre-commit run --all-files

πŸ“ Project Structure

discogsography/
β”œβ”€β”€ πŸ“¦ common/              # Shared utilities and configuration
β”‚   β”œβ”€β”€ config.py           # Centralized configuration management
β”‚   └── health_server.py    # Health check endpoint server
β”œβ”€β”€ πŸ“Š dashboard/           # Real-time monitoring dashboard
β”‚   β”œβ”€β”€ dashboard.py        # FastAPI backend with WebSocket
β”‚   └── static/             # Frontend HTML/CSS/JS
β”œβ”€β”€ πŸ“₯ extractor/           # Discogs data ingestion service
β”‚   β”œβ”€β”€ extractor.py        # Main processing logic
β”‚   └── discogs.py          # S3 download and validation
β”œβ”€β”€ πŸ”— graphinator/         # Neo4j graph database service
β”‚   └── graphinator.py      # Graph relationship builder
β”œβ”€β”€ 🐘 tableinator/         # PostgreSQL storage service
β”‚   └── tableinator.py      # Relational data management
β”œβ”€β”€ πŸ”§ utilities/           # Operational tools
β”‚   β”œβ”€β”€ check_errors.py     # Log analysis
β”‚   β”œβ”€β”€ monitor_queues.py   # Real-time queue monitoring
β”‚   └── system_monitor.py   # System health dashboard
β”œβ”€β”€ πŸ§ͺ tests/               # Comprehensive test suite
β”œβ”€β”€ πŸ“ docs/                # Additional documentation
β”œβ”€β”€ πŸ‹ docker-compose.yml   # Container orchestration
└── πŸ“¦ pyproject.toml       # Project configuration

Logging Conventions

All logger calls (logger.info, logger.warning, logger.error) in this project follow a consistent emoji pattern for visual clarity. Each message starts with an emoji followed by exactly one space before the message text.

Emoji Key

| Emoji | Usage | Example |
|-------|-------|---------|
| 🚀 | Startup messages | logger.info("🚀 Starting service...") |
| ✅ | Success/completion messages | logger.info("✅ Operation completed successfully") |
| ❌ | Errors | logger.error("❌ Failed to connect to database") |
| ⚠️ | Warnings | logger.warning("⚠️ Connection timeout, retrying...") |
| 🛑 | Shutdown/stop messages | logger.info("🛑 Shutting down gracefully") |
| 📊 | Progress/statistics | logger.info("📊 Processed 1000 records") |
| 📥 | Downloads | logger.info("📥 Starting download of data") |
| ⬇️ | Downloading files | logger.info("⬇️ Downloading file.xml") |
| 🔄 | Processing operations | logger.info("🔄 Processing batch of messages") |
| ⏳ | Waiting/pending | logger.info("⏳ Waiting for messages...") |
| 📋 | Metadata operations | logger.info("📋 Loaded metadata from cache") |
| 🔍 | Checking/searching | logger.info("🔍 Checking for updates...") |
| 📄 | File operations | logger.info("📄 File created successfully") |
| 🆕 | New versions | logger.info("🆕 Found newer version available") |
| ⏰ | Periodic operations | logger.info("⏰ Running periodic check") |
| 🔧 | Setup/configuration | logger.info("🔧 Creating database indexes") |
| 🐰 | RabbitMQ connections | logger.info("🐰 Connected to RabbitMQ") |
| 🔗 | Neo4j connections | logger.info("🔗 Connected to Neo4j") |
| 🐘 | PostgreSQL operations | logger.info("🐘 Connected to PostgreSQL") |
| 💾 | Database save operations | logger.info("💾 Updated artist ID=123 in Neo4j") |
| 🏥 | Health server | logger.info("🏥 Health server started on port 8001") |
| ⏩ | Skipping operations | logger.info("⏩ Skipped artist ID=123 (no changes)") |

Example Usage

logger.info("🚀 Starting Discogs data extractor")
logger.error("❌ Failed to connect to Neo4j: connection refused")
logger.warning("⚠️ Slow consumer detected, processing delayed")
logger.info("✅ All files processed successfully")
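A tiny helper can keep messages on-convention (emoji, then exactly one space, then the text). The emoji_message function and the EMOJI table below are illustrative, not part of the codebase:

```python
import logging

logger = logging.getLogger("discogsography")

# A few entries from the emoji key above; extend as needed.
EMOJI = {
    "startup": "🚀",
    "success": "✅",
    "error": "❌",
    "warning": "⚠️",
}


def emoji_message(kind: str, text: str) -> str:
    # Prefix the message with its emoji and exactly one space, per convention.
    return f"{EMOJI[kind]} {text}"


logger.info(emoji_message("startup", "Starting Discogs data extractor"))
```

Centralizing the prefix makes the convention checkable (e.g. a lint rule or unit test over EMOJI) instead of relying on copy-paste.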

πŸ—„οΈ Data Schema

πŸ”— Neo4j Graph Model

The graph database models complex music industry relationships:

Node Types

| Node | Description | Key Properties |
|------|-------------|----------------|
| Artist | Musicians, bands, producers | id, name, real_name, profile |
| Label | Record labels and imprints | id, name, profile, parent_label |
| Master | Master recordings | id, title, year, main_release |
| Release | Physical/digital releases | id, title, year, country, format |
| Genre | Musical genres | name |
| Style | Sub-genres and styles | name |

Relationships

🎤 Artist Relationships:
├── MEMBER_OF ──────→ Artist (band membership)
├── ALIAS_OF ───────→ Artist (alternative names)
├── COLLABORATED_WITH → Artist (collaborations)
└── PERFORMED_ON ───→ Release (credits)

📀 Release Relationships:
├── BY ────────────→ Artist (performer credits)
├── ON ────────────→ Label (release label)
├── DERIVED_FROM ──→ Master (master recording)
├── IS ────────────→ Genre (genre classification)
└── IS ────────────→ Style (style classification)

🏢 Label Relationships:
└── SUBLABEL_OF ───→ Label (parent/child labels)

🎵 Classification:
└── Style -[:PART_OF]→ Genre (hierarchy)

🐘 PostgreSQL Schema

Optimized for fast queries and full-text search:

-- Artists table with JSONB for flexible schema
CREATE TABLE artists (
    data_id VARCHAR PRIMARY KEY,
    hash VARCHAR NOT NULL UNIQUE,
    data JSONB NOT NULL
);
CREATE INDEX idx_artists_name ON artists ((data->>'name'));
CREATE INDEX idx_artists_gin ON artists USING GIN (data);

-- Labels table
CREATE TABLE labels (
    data_id VARCHAR PRIMARY KEY,
    hash VARCHAR NOT NULL UNIQUE,
    data JSONB NOT NULL
);
CREATE INDEX idx_labels_name ON labels ((data->>'name'));

-- Masters table
CREATE TABLE masters (
    data_id VARCHAR PRIMARY KEY,
    hash VARCHAR NOT NULL UNIQUE,
    data JSONB NOT NULL
);
CREATE INDEX idx_masters_title ON masters ((data->>'title'));
CREATE INDEX idx_masters_year ON masters ((data->>'year'));

-- Releases table with extensive indexing
CREATE TABLE releases (
    data_id VARCHAR PRIMARY KEY,
    hash VARCHAR NOT NULL UNIQUE,
    data JSONB NOT NULL
);
CREATE INDEX idx_releases_title ON releases ((data->>'title'));
CREATE INDEX idx_releases_artist ON releases ((data->>'artist'));
CREATE INDEX idx_releases_year ON releases ((data->>'year'));
CREATE INDEX idx_releases_gin ON releases USING GIN (data);
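The hash column suggests an idempotent upsert of roughly this shape, where rows whose content hash is unchanged are skipped. This is a sketch assuming psycopg-style placeholders, not the tableinator's actual statement:

```python
UPSERT_RELEASE = """
INSERT INTO releases (data_id, hash, data)
VALUES (%(data_id)s, %(hash)s, %(data)s)
ON CONFLICT (data_id) DO UPDATE
    SET hash = EXCLUDED.hash,
        data = EXCLUDED.data
    WHERE releases.hash IS DISTINCT FROM EXCLUDED.hash;
"""


def upsert_release(conn, data_id: str, content_hash: str, data_json: str) -> None:
    # conn: an open psycopg connection. Rows with an identical hash are
    # left untouched, so message replays and re-parses become cheap no-ops.
    conn.execute(
        UPSERT_RELEASE,
        {"data_id": data_id, "hash": content_hash, "data": data_json},
    )
```

The WHERE clause on the DO UPDATE arm is what makes re-delivered messages inexpensive: PostgreSQL skips the write entirely when the stored hash already matches.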

⚡ Performance & Optimization

📊 Processing Speed

Typical processing rates on modern hardware:

| Service | Records/Second | Bottleneck |
|---------|----------------|------------|
| 📥 Extractor | 5,000-10,000 | XML parsing, I/O |
| 🔗 Graphinator | 1,000-2,000 | Neo4j transactions |
| 🐘 Tableinator | 3,000-5,000 | PostgreSQL inserts |

💻 Hardware Requirements

Minimum Specifications

  • CPU: 4 cores
  • RAM: 8GB
  • Storage: 200GB HDD
  • Network: 10 Mbps

Recommended Specifications

  • CPU: 8+ cores
  • RAM: 16GB+
  • Storage: 200GB+ SSD (NVMe preferred)
  • Network: 100 Mbps+

🚀 Optimization Guide

Database Tuning

Neo4j Configuration:

# neo4j.conf
dbms.memory.heap.initial_size=4g
dbms.memory.heap.max_size=4g
dbms.memory.pagecache.size=2g

PostgreSQL Configuration:

-- postgresql.conf
shared_buffers = 4GB
work_mem = 256MB
maintenance_work_mem = 1GB
effective_cache_size = 12GB

Message Queue Optimization

# RabbitMQ prefetch for consumers
PREFETCH_COUNT: 100  # Adjust based on processing speed
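As a rule of thumb, a prefetch near throughput × acknowledgment latency keeps consumers busy without hoarding messages (Little's law). The helper below is an illustrative heuristic, not project tooling; the headroom factor and bounds are assumptions:

```python
def suggest_prefetch(msgs_per_second: float, ack_latency_s: float,
                     headroom: float = 2.0, floor: int = 1, cap: int = 1000) -> int:
    # Little's law: messages in flight ~= throughput * latency. The headroom
    # factor covers bursts; floor/cap keep the result in a sane range.
    in_flight = msgs_per_second * ack_latency_s * headroom
    return max(floor, min(cap, round(in_flight)))
```

For example, a graphinator-like consumer at ~2,000 msg/s with an assumed 25 ms acknowledgment latency gives suggest_prefetch(2000, 0.025) == 100, in line with the default above.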

Storage Performance

  • Use SSD/NVMe for /discogs-data directory
  • Enable compression for PostgreSQL tables
  • Configure Neo4j for SSD optimization
  • Use separate disks for databases if possible

🔧 Troubleshooting

❌ Common Issues & Solutions

Extractor Download Failures

# Check connectivity
curl -I https://discogs-data-dumps.s3.us-west-2.amazonaws.com

# Verify disk space
df -h /discogs-data

# Check permissions
ls -la /discogs-data

Solutions:

  • ✅ Ensure internet connectivity
  • ✅ Verify 100GB+ free space
  • ✅ Check directory permissions

RabbitMQ Connection Issues

# Check RabbitMQ status
docker-compose ps rabbitmq
docker-compose logs rabbitmq

# Test connection
curl -u discogsography:discogsography http://localhost:15672/api/overview

Solutions:

  • ✅ Wait for RabbitMQ startup (30-60s)
  • ✅ Check firewall settings
  • ✅ Verify credentials in .env

Database Connection Errors

Neo4j:

# Check Neo4j status
docker-compose logs neo4j
curl http://localhost:7474

# Test bolt connection
echo "MATCH (n) RETURN count(n);" | cypher-shell -u neo4j -p discogsography

PostgreSQL:

# Check PostgreSQL status
docker-compose logs postgres

# Test connection
PGPASSWORD=discogsography psql -h localhost -U discogsography -d discogsography -c "SELECT 1;"

πŸ› Debugging Guide

  1. 📋 Check Service Health

    curl http://localhost:8000/health  # Extractor
    curl http://localhost:8001/health  # Graphinator
    curl http://localhost:8002/health  # Tableinator
    curl http://localhost:8003/health  # Dashboard
    curl http://localhost:8004/health  # Discovery
  2. 📊 Monitor Real-time Logs

    # All services
    uv run task logs
    
    # Specific service
    docker-compose logs -f extractor
  3. πŸ” Analyze Errors

    # Check for errors across all services
    uv run task check-errors
    
    # Monitor queue health
    uv run task monitor
  4. πŸ—„οΈ Verify Data Storage

    -- Neo4j: Check node counts
    MATCH (n) RETURN labels(n)[0] as type, count(n) as count;
    -- PostgreSQL: Check table counts
    SELECT 'artists' as table_name, COUNT(*) FROM artists
    UNION ALL
    SELECT 'releases', COUNT(*) FROM releases
    UNION ALL
    SELECT 'labels', COUNT(*) FROM labels
    UNION ALL
    SELECT 'masters', COUNT(*) FROM masters;
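The health checks in step 1 can be automated with the standard library alone. The port list below is taken from step 1; the script itself is an illustrative sketch:

```python
import urllib.error
import urllib.request

# Health-check ports from step 1 above.
HEALTH_PORTS = {
    "extractor": 8000,
    "graphinator": 8001,
    "tableinator": 8002,
    "dashboard": 8003,
    "discovery": 8004,
}


def check_health(host: str = "localhost", timeout: float = 2.0) -> dict[str, bool]:
    # Return {service: True/False} for each /health endpoint; any network
    # error (refused, timeout, DNS) is reported as unhealthy.
    status: dict[str, bool] = {}
    for name, port in HEALTH_PORTS.items():
        url = f"http://{host}:{port}/health"
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                status[name] = resp.status == 200
        except (urllib.error.URLError, OSError):
            status[name] = False
    return status
```

Running check_health() against a healthy stack should report True for all five services; False entries point at which container or process to inspect next.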

🤝 Contributing

We welcome contributions! Please follow these guidelines:

📋 Contribution Process

  1. Fork & Clone

    git clone https://github.com/YOUR_USERNAME/discogsography.git
    cd discogsography
  2. Setup Development Environment

    uv sync --all-extras
    uv run task init  # Install pre-commit hooks
  3. Create Feature Branch

    git checkout -b feature/amazing-feature
  4. Make Changes

    • Write clean, documented code
    • Add comprehensive tests
    • Update relevant documentation
  5. Validate Changes

    uv run task lint      # Fix any linting issues
    uv run task test      # Ensure tests pass
    uv run task security  # Check for vulnerabilities
  6. Commit with Conventional Commits

    git commit -m "feat: add amazing feature"
    # Types: feat, fix, docs, style, refactor, test, chore
  7. Push & Create PR

    git push origin feature/amazing-feature

πŸ“ Development Standards

  • Code Style: Follow ruff and black formatting
  • Type Hints: Required for all functions
  • Tests: Maintain >80% coverage
  • Docs: Update README and docstrings
  • Logging: Use emoji conventions (see above)
  • Security: Pass bandit checks

🔧 Maintenance

Package Upgrades

Keep dependencies up-to-date with the provided upgrade script:

# Safely upgrade all dependencies (minor/patch versions)
./scripts/upgrade-packages.sh

# Preview what would be upgraded
./scripts/upgrade-packages.sh --dry-run

# Include major version upgrades
./scripts/upgrade-packages.sh --major

The script includes:

  • 🔒 Automatic backups before upgrades
  • ✅ Git safety checks (requires clean working directory)
  • 🧪 Automatic testing after upgrades
  • 📦 Comprehensive dependency management across all services

See scripts/README.md for more maintenance scripts.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments

  • 🎡 Discogs for providing the monthly data dumps
  • 🐍 The Python community for excellent libraries and tools
  • 🌟 All contributors who help improve this project
  • πŸš€ uv for blazing-fast package management
  • πŸ”₯ Ruff for lightning-fast linting

💬 Support & Community

Project Status

This project is actively maintained. We welcome contributions, bug reports, and feature requests!


Made with ❤️ by the Discogsography community
