Architecture documentation

Nils Ove Tendenes edited this page Dec 5, 2025 · 43 revisions

Introduction

This wiki provides a structured overview of the system architecture behind data.norge.no, the national data catalog that supports the sharing and discovery of public sector data in Norway, and of its supporting systems.

The architecture consists of the following subsystems:

  • Registration – For registering and maintaining metadata.
  • Harvesting – For ingesting and enriching metadata.
  • Portal – For searching and browsing metadata.
  • Metadata Quality – For computing scores on metadata about datasets.
  • Identity & Access Management (IAM) – For authentication and authorization.

System context diagram

The architecture supports a number of resource types (standards), as well as a standard for Dataset Quality.

💡 Tip: Click on the images to get a full view of the diagrams. In the full view, every container (deployable unit) is also clickable and will redirect you to the actual repository on GitHub.

Registration

Registration is a system for registering and maintaining metadata, intended for organizations that do not have their own data catalog. Employees within these organizations use Registration to create and manage metadata for:

  • Dataset
  • Data Service
  • Service
  • Concept

The resulting data catalogs are treated like any other external data catalog and are harvested through the same harvesting process.

Registration container diagram

Harvesting

Harvesting is a system for ingesting metadata from data catalogs. These catalogs can either be external, from organizations that maintain their own catalogs, or internal, from catalogs created using the Registration system.

Harvesting ingests:

  • Dataset
  • Data Service
  • Service
  • Event
  • Concept
  • Information Model

Harvesting consumes metadata in RDF format, enriches the content, and transforms it into formats that are better suited for frontend applications and search indexing.
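
As a rough illustration of the transformation step, the sketch below groups RDF-style triples by subject and flattens them into a JSON document of the kind a search index could ingest. The triple tuples, the `PREFIXES` table, and the `to_search_document` helper are hypothetical; the actual harvesters and their output formats are not described on this page.

```python
import json
from collections import defaultdict

# Hypothetical prefix table used to shorten predicate IRIs into field names.
PREFIXES = {
    "http://purl.org/dc/terms/": "dct",
    "http://www.w3.org/ns/dcat#": "dcat",
}


def short_name(iri: str) -> str:
    """Turn a predicate IRI into a compact field name such as 'dct_title'."""
    for namespace, prefix in PREFIXES.items():
        if iri.startswith(namespace):
            return f"{prefix}_{iri[len(namespace):]}"
    return iri


def to_search_document(subject: str, triples: list[tuple[str, str, str]]) -> dict:
    """Collect all values for one subject into a flat, index-friendly dict."""
    fields = defaultdict(list)
    for s, p, o in triples:
        if s == subject:
            fields[short_name(p)].append(o)
    return {"uri": subject, **fields}


# Illustrative input: (subject, predicate, object) tuples standing in for parsed RDF.
triples = [
    ("https://example.org/dataset/1", "http://purl.org/dc/terms/title", "Road network"),
    ("https://example.org/dataset/1", "http://www.w3.org/ns/dcat#keyword", "roads"),
    ("https://example.org/dataset/1", "http://www.w3.org/ns/dcat#keyword", "transport"),
    ("https://example.org/dataset/2", "http://purl.org/dc/terms/title", "Other"),
]

document = to_search_document("https://example.org/dataset/1", triples)
print(json.dumps(document, indent=2))
```

Every field holds a list, so repeated predicates such as keywords need no special handling when the document is indexed.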

Harvesting container diagram

Portal

Portal is a system for searching and browsing metadata related to:

  • Dataset
  • Data Service
  • Service
  • Concept
  • Event
  • Information Model

Portal consumes metadata from Harvesting and builds various indexes and databases to support:

  • Full-text search
  • Semantic search (e.g., using RAG)
  • SPARQL queries
  • Other advanced discovery features
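
To make the full-text search item concrete, here is a minimal sketch of an inverted index, the data structure that search engines build over harvested text. The document ids and texts are invented for illustration, and the Portal's real indexes are assumed to live in a dedicated search engine rather than in code like this.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(documents: dict[str, str]) -> dict[str, set[str]]:
    """Map every token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return ids of documents that contain every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical metadata records keyed by id.
documents = {
    "dataset-1": "Road network for Norway",
    "dataset-2": "Public transport timetables",
    "concept-1": "Road: a way for vehicles",
}
index = build_index(documents)
hits = search(index, "road")
```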

Portal also integrates metadata quality scores from the Metadata Quality system to help users assess the quality of the information they are viewing.

Portal container diagram

Metadata Quality

Metadata Quality is a system for computing quality scores on metadata about datasets. It consumes metadata on datasets from Harvesting and evaluates them based on predefined criteria to produce a quality score.

This score is then used by the Portal to present an indication of metadata quality to end users.
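
The idea of scoring against predefined criteria can be sketched as follows. The criteria names, the field names, and the equal weighting are assumptions made for this example; the criteria actually used by Metadata Quality are not listed on this page.

```python
from typing import Callable

# Hypothetical criteria: each name maps to a check on the dataset metadata.
CRITERIA: dict[str, Callable[[dict], bool]] = {
    "has_title": lambda d: bool(d.get("title")),
    "has_description": lambda d: bool(d.get("description")),
    "has_license": lambda d: bool(d.get("license")),
    "has_keywords": lambda d: len(d.get("keywords", [])) > 0,
}


def quality_score(dataset: dict) -> tuple[int, list[str]]:
    """Return a 0-100 score and the names of the criteria that failed."""
    failed = [name for name, check in CRITERIA.items() if not check(dataset)]
    passed = len(CRITERIA) - len(failed)
    return round(100 * passed / len(CRITERIA)), failed


# A dataset that satisfies three of the four criteria (no license given).
score, failed = quality_score({
    "title": "Road network",
    "description": "All public roads",
    "keywords": ["roads"],
})
```

Returning the failed criteria alongside the score lets a frontend explain why a dataset scored the way it did.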

Metadata Quality container diagram

Identity & Access Management (IAM)

IAM is a system for handling authentication and authorization across the platform. It integrates with multiple identity providers and supports the OpenID Connect protocol to enable secure and flexible user access.

IAM ensures that users are properly authenticated and that access to features or data is controlled based on roles and permissions.
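
A role-based access check of this kind could look like the sketch below. The claim layout (`roles` entries with `organization` and `role`) and the permission names are hypothetical; the token format issued by the platform's IAM is not specified here.

```python
# Hypothetical mapping from a role to the permissions it grants.
ROLE_PERMISSIONS = {
    "admin": {"read", "write", "publish"},
    "write": {"read", "write"},
    "read": {"read"},
}


def is_allowed(claims: dict, organization: str, permission: str) -> bool:
    """Check whether the token claims grant a permission for an organization."""
    for grant in claims.get("roles", []):
        if grant.get("organization") != organization:
            continue
        if permission in ROLE_PERMISSIONS.get(grant.get("role"), set()):
            return True
    return False


# Claims as they might appear in an already validated token (illustrative).
claims = {
    "sub": "user-1",
    "roles": [{"organization": "org-123", "role": "write"}],
}
can_write = is_allowed(claims, "org-123", "write")
```

The check assumes the token signature and expiry have already been verified; it only covers the authorization decision.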

Identity and Access Management container diagram

Open-Source Alternatives for Proprietary Dependencies

This document outlines alternative open-source solutions for proprietary dependencies identified in the system architecture, along with migration strategies and implementation considerations.

Overview

The current system architecture includes several components that rely on proprietary or non-open-source licenses. This section provides guidance on replacing these dependencies with fully open-source alternatives while maintaining system functionality and performance.

Search and Data Processing Services

Confluent Platform and Elasticsearch Components

Current Dependencies:

  • Confluent Kafka (Community License)
  • Confluent Schema Registry (Community License)
  • Elasticsearch 8.10+ (SSPL/Elastic License)

Open-Source Alternatives:

Apache Kafka

  • Replacement for: Confluent Kafka
  • License: Apache 2.0
  • Benefits: Confluent's distribution is built on Apache Kafka, so core broker and client functionality is the same
  • Migration Complexity: Low - configuration changes only

Schema Registry Alternatives

  • Apicurio Registry

    • License: Apache 2.0
    • Features: Avro schema management, REST API compatibility
    • Migration Complexity: Low - drop-in replacement
  • Karapace

    • License: Apache 2.0
    • Features: Kafka-native schema registry, Avro support
    • Migration Complexity: Low - configuration updates required

Search Engine Alternatives

  • OpenSearch

    • License: Apache 2.0
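    • Features: Fork of Elasticsearch 7.10.2 with a largely compatible REST API; security features included in the default distribution
    • Migration Complexity: Low to Moderate - compatible with the Elasticsearch 7.x API, so clients, index settings, and data from 8.x need to be verified or reindexed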
  • Elasticsearch 8.16+

    • License: AGPLv3 (OSI approved)
    • Features: Latest features, improved performance
    • Migration Complexity: Low - version upgrade

Migration Strategy:

  1. Configuration Updates: Replace Confluent-specific configurations with Apache Kafka equivalents
  2. Schema Registry Migration: Export existing schemas and import to new registry
  3. Serializer Updates: Switch to native Avro serializers
  4. Docker Compose Updates: Replace Confluent images with Apache Kafka and alternative registry
  5. Testing: Validate Avro schema compatibility and message format integrity
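
One way to approach the message format check in step 5 is sketched below. It assumes the current producers use the Confluent serializers, whose wire format prefixes each Avro payload with a magic byte `0` and a 4-byte big-endian schema id. Under that assumption, existing messages stay readable only if the replacement registry serves the same ids. The helper names and sample messages are hypothetical.

```python
import struct

MAGIC_BYTE = 0  # first byte of the Confluent wire format


def encode_header(schema_id: int) -> bytes:
    """Build the 5-byte prefix: magic byte plus big-endian 32-bit schema id."""
    return struct.pack(">bI", MAGIC_BYTE, schema_id)


def read_schema_id(message: bytes) -> int:
    """Extract the schema id from a message, rejecting unknown framing."""
    if len(message) < 5:
        raise ValueError("message too short for a schema registry header")
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte: {magic}")
    return schema_id


def unknown_schema_ids(messages: list[bytes], registered_ids: set[int]) -> set[int]:
    """Return schema ids used by messages but absent from the new registry."""
    return {read_schema_id(m) for m in messages} - registered_ids


# Two sample messages; the payload bytes stand in for Avro-encoded data.
messages = [encode_header(7) + b"payload", encode_header(42) + b"payload"]
missing = unknown_schema_ids(messages, registered_ids={7})
```

A non-empty result means some stored messages reference schemas the new registry cannot resolve, which is worth catching before consumers are switched over.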

Impact Assessment:

  • Core business logic remains unchanged
  • Infrastructure and configuration changes only
  • No application code modifications required

Data Storage and Persistence

MongoDB Replacement

Current Dependency:

  • MongoDB (SSPL License)

Open-Source Alternative:

  • PostgreSQL
    • License: PostgreSQL License (OSI approved)
    • Features: JSON/JSONB support, full-text search, ACID compliance
    • Migration Complexity: Low to Moderate

Migration Strategy:

Metadata Collections (Low Complexity)

  1. Dependency Update: Replace spring-boot-starter-data-mongodb with spring-boot-starter-data-jpa
  2. Driver Addition: Add PostgreSQL driver dependency
  3. Annotation Conversion: Convert MongoDB annotations to JPA annotations
  4. Configuration Update: Update application.yml with PostgreSQL connection properties

RDF Storage Collections (Low Complexity)

  1. Custom Logic Refactoring: Minor updates to RDF storage logic
  2. JSON/JSONB Integration: Leverage PostgreSQL's JSON capabilities for enhanced indexing
  3. Repository Abstraction: Maintain Spring Data repository pattern
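
The JSONB approach can be pictured with the sketch below, which maps a MongoDB-style document to a row with a primary key and a JSON body. The table name, column names, and sample document are hypothetical. The SQL is PostgreSQL syntax shown as strings only and is not executed here; the `%s` placeholders follow the parameter style of the psycopg driver, which is an assumption about tooling.

```python
import json

# PostgreSQL statements for a hypothetical table that keeps each document as JSONB.
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS metadata_documents (
    id   TEXT PRIMARY KEY,
    body JSONB NOT NULL
)
"""
# A GIN index supports containment queries on the JSONB body.
CREATE_INDEX = (
    "CREATE INDEX IF NOT EXISTS metadata_documents_body_idx "
    "ON metadata_documents USING GIN (body)"
)
# Upsert so that re-running the migration overwrites rather than duplicates.
UPSERT = """
INSERT INTO metadata_documents (id, body) VALUES (%s, %s::jsonb)
ON CONFLICT (id) DO UPDATE SET body = EXCLUDED.body
"""


def to_row(document: dict) -> tuple[str, str]:
    """Split a MongoDB-style document into a primary key and a JSON body."""
    body = {key: value for key, value in document.items() if key != "_id"}
    return str(document["_id"]), json.dumps(body, sort_keys=True)


# Illustrative document with an RDF payload stored as a string field.
row = to_row({"_id": "dataset-1", "title": "Road network", "turtle": "<s> <p> <o> ."})
```

Keeping the document body intact in one JSONB column mirrors the existing document model, so the repository layer changes while the stored shape stays familiar.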

Benefits:

  • Dual storage strategy remains identical
  • Enhanced indexing capabilities with JSONB
  • Transactional (ACID) guarantees by default
  • Reduced licensing concerns

Implementation

See this page for more information about how to implement the migration.

Identity and Access Management

Current IAM Integration

Current Dependency:

  • Altinn integration (Norwegian government service)
  • OpenID Connect implementation

Open-Source Alternatives:

  • Keycloak

    • License: Apache 2.0
    • Features: Full OpenID Connect support, user federation, social login
    • Migration Complexity: Low - standard OpenID Connect implementation
  • Auth0

    • License: Proprietary hosted service, not open source (check current terms)
    • Features: OpenID Connect, social providers, enterprise features
    • Migration Complexity: Low - configuration changes, but it does not remove the proprietary dependency
  • Custom OpenID Connect Provider

    • License: Depends on implementation
    • Features: Full control over user management and authentication flows
    • Migration Complexity: Moderate - requires development effort

Migration Considerations:

  • OpenID Connect standard ensures compatibility
  • User data migration may be required
  • Authentication flows remain consistent
  • Authorization policies may need adjustment
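
Because candidate providers all publish OpenID Connect discovery metadata, a quick compatibility check can be sketched as below. The required field list follows OpenID Connect Discovery 1.0. The issuer URL and the sample metadata are hypothetical, and fetching the document over HTTP is left out so the sketch stays self-contained.

```python
# Metadata fields that OpenID Connect Discovery 1.0 marks as REQUIRED.
REQUIRED_FIELDS = (
    "issuer",
    "authorization_endpoint",
    "jwks_uri",
    "response_types_supported",
    "subject_types_supported",
    "id_token_signing_alg_values_supported",
)


def discovery_url(issuer: str) -> str:
    """Build the well-known discovery URL for an issuer."""
    return issuer.rstrip("/") + "/.well-known/openid-configuration"


def check_provider(issuer: str, metadata: dict) -> list[str]:
    """Return a list of problems found in a provider's discovery metadata."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in metadata]
    if metadata.get("issuer") != issuer:
        problems.append("issuer does not match the configured value")
    return problems


# Hypothetical issuer and metadata, as a provider might return them.
issuer = "https://iam.example.org/realms/demo"
metadata = {
    "issuer": issuer,
    "authorization_endpoint": issuer + "/protocol/openid-connect/auth",
    "jwks_uri": issuer + "/protocol/openid-connect/certs",
    "response_types_supported": ["code"],
    "subject_types_supported": ["public"],
    "id_token_signing_alg_values_supported": ["RS256"],
}
problems = check_provider(issuer, metadata)
```

Running the same check against the current provider and each candidate gives a like-for-like view of what client configuration would need to change.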

Implementation Roadmap

Phase 1: Search and Data Processing

  1. Elasticsearch Migration: Upgrade to AGPLv3 version or migrate to OpenSearch
  2. Schema Registry Replacement: Implement Apicurio Registry or Karapace
  3. Kafka Migration: Switch from Confluent to Apache Kafka

Phase 2: Data Storage

  1. MongoDB Migration: Implement PostgreSQL with JSONB support
  2. Repository Updates: Convert Spring Data repositories
  3. Testing and Validation: Ensure data integrity and performance

Phase 3: Identity Management

  1. IAM Evaluation: Assess current OpenID Connect implementation
  2. Provider Selection: Choose appropriate open-source IAM solution
  3. Migration Planning: Develop user data migration strategy

Benefits of Migration

Legal and Compliance

  • Full Open Source Compliance: Eliminates proprietary license restrictions
  • Reduced Legal Risk: Clear licensing terms for all components
  • Vendor Independence: Freedom from vendor lock-in

Technical Benefits

  • Community Support: Access to broader open-source community
  • Transparency: Full visibility into component source code
  • Customization: Ability to modify and extend components as needed

Operational Benefits

  • Cost Reduction: Elimination of proprietary licensing costs
  • Flexibility: Freedom to deploy in any environment
  • Innovation: Ability to contribute back to open-source projects

Risk Mitigation

Technical Risks

  • Compatibility Testing: Comprehensive testing of all integrations
  • Performance Validation: Ensure performance meets requirements
  • Rollback Planning: Maintain ability to revert changes if needed

Operational Risks

  • Training Requirements: Team education on new components
  • Documentation Updates: Maintain current system documentation
  • Monitoring Updates: Adjust monitoring and alerting systems

Conclusion

The migration to fully open-source alternatives is feasible with low to moderate complexity. The primary effort involves infrastructure and configuration changes; the MongoDB replacement additionally requires dependency and annotation changes in the persistence layer. The benefits of full open-source compliance, reduced licensing costs, and increased flexibility outweigh the migration effort required.

Each component can be migrated independently, allowing for a phased approach that minimizes risk and ensures system stability throughout the transition process.