Architecture documentation

Introduction

This wiki provides a structured overview of the system architecture behind data.norge.no, a national data catalog designed to support the sharing and discovery of public sector data in Norway, and of its supporting systems.

The architecture consists of the following subsystems:

Registration – For registering and maintaining metadata.
Harvesting – For ingesting and enriching metadata.
Portal – For searching and browsing metadata.
Metadata Quality – For computing scores on metadata about datasets.
Identity & Access Management (IAM) – For authentication and authorization.

It supports the following resource types (standards):

Dataset
Data Service
Service
Event
Concept
Information Model

It also supports a standard for Dataset Quality.

💡 Tip: Click on the images to get a full view of the diagrams. In the full view, every container (deployable unit) is also clickable and will redirect you to the actual repository on GitHub.

Registration

Registration is a system for registering and maintaining metadata, intended for organizations that do not have their own data catalog. Employees within these organizations use Registration to create and manage metadata for:

Dataset
Data Service
Service
Concept

The resulting data catalogs are treated like any other external data catalog and are harvested through the same harvesting process.

Harvesting

Harvesting is a system for ingesting metadata from data catalogs. These catalogs can either be external, from organizations that maintain their own catalogs, or internal, from catalogs created using the Registration system.

Harvesting ingests:

Dataset
Data Service
Service
Event
Concept
Information Model

Harvesting consumes metadata in RDF format, enriches the content, and transforms it into formats that are better suited for frontend applications and search indexing.
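
To illustrate the transformation step, the following minimal Python sketch flattens RDF-style triples into one search-friendly document per resource. It is not the actual harvester: the triples are plain tuples standing in for parsed RDF, and the `flatten_triples` helper, the example IRI under `example.org`, and the short-key naming convention are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical input: (subject, predicate, object) tuples standing in for parsed RDF.
TRIPLES = [
    ("https://example.org/datasets/1", "http://purl.org/dc/terms/title", "Road network"),
    ("https://example.org/datasets/1", "http://www.w3.org/ns/dcat#keyword", "roads"),
    ("https://example.org/datasets/1", "http://www.w3.org/ns/dcat#keyword", "transport"),
]


def local_name(iri: str) -> str:
    """Return the part of an IRI after the last '#' or '/'."""
    return iri.rsplit("#", 1)[-1].rsplit("/", 1)[-1]


def flatten_triples(triples):
    """Group triples by subject into dictionaries suited for indexing."""
    grouped = defaultdict(lambda: defaultdict(list))
    for subject, predicate, obj in triples:
        grouped[subject][local_name(predicate)].append(obj)

    documents = []
    for subject, fields in grouped.items():
        doc = {"uri": subject}
        for key, values in fields.items():
            # Single values become scalars, repeated values stay lists.
            doc[key] = values[0] if len(values) == 1 else values
        documents.append(doc)
    return documents
```

Calling `flatten_triples(TRIPLES)` yields a single document with a scalar `title` and a list-valued `keyword`, which is the kind of shape a frontend or search index can consume directly.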

Portal

Portal is a system for searching and browsing metadata related to:

Dataset
Data Service
Service
Concept
Event
Information Model

Portal consumes metadata from Harvesting and builds various indexes and databases to support:

Full-text search
Semantic search (e.g., using RAG)
SPARQL queries
More advanced discovery features

Portal also integrates metadata quality scores from the Metadata Quality system to help users assess the quality of the information they are viewing.

Metadata Quality

Metadata Quality is a system for computing quality scores on metadata about datasets. It consumes metadata on datasets from Harvesting and evaluates them based on predefined criteria to produce a quality score.

This score is then used by the Portal to present an indication of metadata quality to end users.
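
A score based on predefined criteria could, for example, be computed as a weighted share of satisfied checks. The Python sketch below shows that idea only; the criteria names, weights, and field names are hypothetical and are not taken from the Metadata Quality system.

```python
# Hypothetical criteria: each name maps to (weight, check function).
CRITERIA = {
    "has_title": (2, lambda d: bool(d.get("title"))),
    "has_description": (2, lambda d: bool(d.get("description"))),
    "has_license": (3, lambda d: bool(d.get("license"))),
    "has_keywords": (1, lambda d: len(d.get("keyword", [])) > 0),
}


def quality_score(dataset: dict) -> float:
    """Return the weighted share of satisfied criteria, from 0.0 to 1.0."""
    total = sum(weight for weight, _ in CRITERIA.values())
    earned = sum(weight for weight, check in CRITERIA.values() if check(dataset))
    return earned / total
```

Keeping the criteria in a table rather than in branching code makes it straightforward to add or reweight checks without touching the scoring function.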

Identity & Access Management (IAM)

IAM is a system for handling authentication and authorization across the platform. It integrates with multiple identity providers and supports the OpenID Connect protocol to enable secure and flexible user access.

IAM ensures that users are properly authenticated and that access to features or data is controlled based on roles and permissions.
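
Role-based access control of this kind can be sketched as a lookup from roles to permissions. The role names and permission strings in this Python example are hypothetical; in practice the roles would come from the claims issued by the identity provider.

```python
# Hypothetical role-to-permission mapping; real roles come from the identity provider.
ROLE_PERMISSIONS = {
    "admin": {"catalog:read", "catalog:write", "catalog:publish"},
    "editor": {"catalog:read", "catalog:write"},
    "reader": {"catalog:read"},
}


def is_allowed(roles, permission: str) -> bool:
    """Return True if any of the user's roles grants the permission."""
    return any(permission in ROLE_PERMISSIONS.get(role, set()) for role in roles)
```

Unknown roles grant nothing, so a user whose token carries an unrecognized role is denied by default.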

Open-Source Alternatives for Proprietary Dependencies

This document outlines alternative open-source solutions for proprietary dependencies identified in the system architecture, along with migration strategies and implementation considerations.

Overview

The current system architecture includes several components that rely on proprietary or non-open-source licenses. This section provides guidance on replacing these dependencies with fully open-source alternatives while maintaining system functionality and performance.

Search and Data Processing Services

Confluent Platform Components

Current Dependencies:

Confluent Kafka (Community License)
Confluent Schema Registry (Community License)
Elasticsearch 8.10+ (SSPL/Elastic License)

Open-Source Alternatives:

Apache Kafka

Replacement for: Confluent Kafka
License: Apache 2.0
Benefits: Same core broker and client functionality, since Confluent's Kafka distribution is built on Apache Kafka
Migration Complexity: Low - configuration changes only

Schema Registry Alternatives

Apicurio Registry
- License: Apache 2.0
- Features: Avro schema management, REST API compatible with Confluent Schema Registry
- Migration Complexity: Low - largely a drop-in replacement, though registry URLs and serializer settings need updating
Karapace
- License: Apache 2.0
- Features: Kafka-native schema registry, Avro support
- Migration Complexity: Low - configuration updates required

Search Engine Alternatives

OpenSearch
- License: Apache 2.0
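- Features: Forked from Elasticsearch 7.10.2 and largely API-compatible with Elasticsearch 7.x, with security features included in the default distribution
- Migration Complexity: Low to Moderate - APIs are similar, but Elasticsearch 8.x clients and features added after the fork need to be checked against OpenSearch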
Elasticsearch 8.16+
- License: AGPLv3 (OSI approved), offered as an option alongside SSPL and the Elastic License
- Features: Latest features, improved performance
- Migration Complexity: Low - version upgrade

Migration Strategy:

Configuration Updates: Replace Confluent-specific configurations with Apache Kafka equivalents
Schema Registry Migration: Export existing schemas and import to new registry
Serializer Updates: Switch to native Avro serializers
Docker Compose Updates: Replace Confluent images with Apache Kafka and alternative registry
Testing: Validate Avro schema compatibility and message format integrity
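
The schema export and import step can be sketched against the REST API of Confluent Schema Registry, which Karapace also implements and which Apicurio Registry exposes through a compatibility API under its own base path. The Python sketch below is a starting point under stated assumptions: the `copy_schemas` and `http_json` helpers are hypothetical, schema IDs are not preserved, schema references are not handled, and authentication is left out.

```python
import json
import urllib.request
from urllib.parse import quote


def http_json(method: str, url: str, body=None):
    """Send a JSON request and return the decoded JSON response."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    request = urllib.request.Request(
        url,
        data=data,
        method=method,
        headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def copy_schemas(source_url: str, target_url: str, http=http_json) -> int:
    """Copy every version of every subject from source to target, oldest first."""
    copied = 0
    for subject in http("GET", f"{source_url}/subjects"):
        path = f"subjects/{quote(subject, safe='')}/versions"
        for version in sorted(http("GET", f"{source_url}/{path}")):
            entry = http("GET", f"{source_url}/{path}/{version}")
            # The schema type may be absent for Avro, so default to AVRO.
            payload = {
                "schema": entry["schema"],
                "schemaType": entry.get("schemaType", "AVRO"),
            }
            http("POST", f"{target_url}/{path}", payload)
            copied += 1
    return copied
```

Registering versions in ascending order lets the target registry apply its compatibility checks in the same sequence as the source. The `http` parameter is injectable so that the copy logic can be exercised against a stub before it is pointed at real registries.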

Impact Assessment:

Core business logic remains unchanged
Infrastructure and configuration changes only
No application code modifications expected beyond serializer and client configuration

Data Storage and Persistence

MongoDB Replacement

Current Dependency:

MongoDB (SSPL License)

Open-Source Alternative:

PostgreSQL
- License: PostgreSQL License (OSI approved)
- Features: JSON/JSONB support, full-text search, ACID compliance
- Migration Complexity: Low to Moderate

Migration Strategy:

Metadata Collections (Low Complexity)

Dependency Update: Replace spring-boot-starter-data-mongodb with spring-boot-starter-data-jpa
Driver Addition: Add PostgreSQL driver dependency
Annotation Conversion: Convert MongoDB annotations to JPA annotations
Configuration Update: Update application.yml with PostgreSQL connection properties

RDF Storage Collections (Low Complexity)

Custom Logic Refactoring: Minor updates to RDF storage logic
JSON/JSONB Integration: Leverage PostgreSQL's JSON capabilities for enhanced indexing
Repository Abstraction: Maintain Spring Data repository pattern
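
To show what JSONB-backed storage might look like, the Python sketch below maps a MongoDB-style document to a row with a text key and a JSONB payload. It is illustrative only: the table name `dataset_meta`, the index name, and the `to_row` helper are hypothetical, and the actual migration would go through the Spring Data repositories mentioned above rather than through Python.

```python
import json

# PostgreSQL DDL for a document-style table; table and index names are hypothetical.
CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS dataset_meta (
    id  TEXT PRIMARY KEY,
    doc JSONB NOT NULL
);
CREATE INDEX IF NOT EXISTS dataset_meta_doc_idx ON dataset_meta USING GIN (doc);
"""

# Upsert statement using %s placeholders, as accepted by psycopg-style drivers.
UPSERT_SQL = (
    "INSERT INTO dataset_meta (id, doc) VALUES (%s, %s::jsonb) "
    "ON CONFLICT (id) DO UPDATE SET doc = EXCLUDED.doc"
)


def to_row(mongo_doc: dict) -> tuple:
    """Convert a MongoDB-style document into (id, json_text) for the JSONB column."""
    doc = dict(mongo_doc)  # copy so the caller's document is not mutated
    doc_id = str(doc.pop("_id"))
    return doc_id, json.dumps(doc, sort_keys=True, ensure_ascii=False)
```

The GIN index on the `doc` column is what supports containment queries on the stored JSON, which is the indexing capability referred to above.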

Benefits:

Dual storage strategy (metadata collections and RDF storage collections) remains unchanged
Enhanced indexing capabilities with JSONB
Mature transactional (ACID) guarantees
Reduced licensing concerns

Implementation

See this page for more information about how to implement the migration.

Identity and Access Management

Current IAM Integration

Current Dependency:

Altinn integration (Norwegian government service)
OpenID Connect implementation

Open-Source Alternatives:

Keycloak
- License: Apache 2.0
- Features: Full OpenID Connect support, user federation, social login
- Migration Complexity: Low - standard OpenID Connect implementation
Auth0
- License: Proprietary hosted service (check current terms), so it does not by itself meet the open-source goal
- Features: OpenID Connect, social providers, enterprise features
- Migration Complexity: Low - configuration changes
Custom OpenID Connect Provider
- License: Depends on implementation
- Features: Full control over user management and authentication flows
- Migration Complexity: Moderate - requires development effort

Migration Considerations:

OpenID Connect standard ensures compatibility
User data migration may be required
Authentication flows remain consistent
Authorization policies may need adjustment
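
Because compatibility rests on the OpenID Connect standard, a candidate provider can be sanity-checked through its discovery document before any migration work starts. The Python sketch below assumes the discovery metadata has already been fetched and parsed; the helper names are hypothetical, and the check covers only the presence of required fields and the issuer match, not the full validation rules of the specification.

```python
# Metadata fields that OpenID Connect Discovery 1.0 marks as required.
# token_endpoint is also required unless only the implicit flow is used.
REQUIRED_FIELDS = (
    "issuer",
    "authorization_endpoint",
    "jwks_uri",
    "response_types_supported",
    "subject_types_supported",
    "id_token_signing_alg_values_supported",
)


def discovery_url(issuer: str) -> str:
    """Build the well-known discovery URL for an issuer."""
    return issuer.rstrip("/") + "/.well-known/openid-configuration"


def check_provider_metadata(issuer: str, metadata: dict) -> list:
    """Return a list of problems found in a provider's discovery document."""
    problems = [
        f"missing field: {name}" for name in REQUIRED_FIELDS if name not in metadata
    ]
    if metadata.get("issuer") != issuer:
        problems.append("issuer does not match the configured issuer")
    return problems
```

An empty result means the provider publishes the basic metadata a standard client needs; user data migration and authorization policies still have to be assessed separately.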

Implementation Roadmap

Phase 1: Search and Data Processing

Elasticsearch Migration: Upgrade to an AGPLv3-licensed version (8.16+) or migrate to OpenSearch
Schema Registry Replacement: Implement Apicurio Registry or Karapace
Kafka Migration: Switch from Confluent to Apache Kafka

Phase 2: Data Storage

MongoDB Migration: Implement PostgreSQL with JSONB support
Repository Updates: Convert Spring Data repositories
Testing and Validation: Ensure data integrity and performance

Phase 3: Identity Management

IAM Evaluation: Assess current OpenID Connect implementation
Provider Selection: Choose appropriate open-source IAM solution
Migration Planning: Develop user data migration strategy

Benefits of Migration

Legal and Compliance

Full Open Source Compliance: Eliminates proprietary license restrictions
Reduced Legal Risk: Clear licensing terms for all components
Vendor Independence: Freedom from vendor lock-in

Technical Benefits

Community Support: Access to broader open-source community
Transparency: Full visibility into component source code
Customization: Ability to modify and extend components as needed

Operational Benefits

Cost Reduction: Elimination of proprietary licensing costs
Flexibility: Freedom to deploy in any environment
Innovation: Ability to contribute back to open-source projects

Risk Mitigation

Technical Risks

Compatibility Testing: Comprehensive testing of all integrations
Performance Validation: Ensure performance meets requirements
Rollback Planning: Maintain ability to revert changes if needed

Operational Risks

Training Requirements: Team education on new components
Documentation Updates: Maintain current system documentation
Monitoring Updates: Adjust monitoring and alerting systems

Conclusion

The migration to fully open-source alternatives is feasible with moderate complexity. The primary effort involves infrastructure and configuration changes rather than application code modifications. The benefits of full open-source compliance, reduced licensing costs, and increased flexibility outweigh the migration effort required.

Each component can be migrated independently, allowing for a phased approach that minimizes risk and ensures system stability throughout the transition process.
