Architecture documentation
This wiki provides a structured overview of the system architecture behind data.norge.no, a national data catalog designed to support the sharing and discovery of public sector data in Norway, and of its supporting systems.
The architecture consists of the following subsystems:
- Registration – For registering and maintaining metadata.
- Harvesting – For ingesting and enriching metadata.
- Portal – For searching and browsing metadata.
- Metadata Quality – For computing scores on metadata about datasets.
- Identity & Access Management (IAM) – For authentication and authorization.
and supports the following resource types (standards):
and a standard for Dataset Quality
💡 Tip: Click on the images to get a full view of the diagrams. In the full view, every container (deployable unit) is also clickable and will redirect you to the actual repository on GitHub.
Registration is a system for registering and maintaining metadata, intended for organizations that do not have their own data catalog. Employees within these organizations use Registration to create and manage metadata for:
- Dataset
- Data Service
- Service
- Concept
The resulting data catalogs are treated like any other external data catalog and are harvested through the same harvesting process.
Harvesting is a system for ingesting metadata from data catalogs. These catalogs can either be external, from organizations that maintain their own catalogs, or internal, from catalogs created using the Registration system.
Harvesting ingests:
- Dataset
- Data Service
- Service
- Event
- Concept
- Information Model
Harvesting consumes metadata in RDF format, enriches the content, and transforms it into formats that are better suited for frontend applications and search indexing.
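To make the transformation step concrete, here is a minimal sketch of turning RDF triples into flat documents suitable for a search index. It is not the actual Harvesting implementation: the function name, the output fields, and the example URIs are assumptions made for illustration, and the triples are plain Python tuples rather than the output of an RDF parser. The predicate IRIs are the standard RDF, DCAT and Dublin Core terms.

```python
from collections import defaultdict

# Standard vocabulary IRIs (RDF, DCAT, Dublin Core Terms).
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DCAT_DATASET = "http://www.w3.org/ns/dcat#Dataset"
DCT_TITLE = "http://purl.org/dc/terms/title"
DCAT_KEYWORD = "http://www.w3.org/ns/dcat#keyword"

# Hypothetical input: (subject, predicate, object) tuples from a harvested catalog.
EXAMPLE_TRIPLES = [
    ("https://example.org/datasets/1", RDF_TYPE, DCAT_DATASET),
    ("https://example.org/datasets/1", DCT_TITLE, "Road traffic counts"),
    ("https://example.org/datasets/1", DCAT_KEYWORD, "traffic"),
    ("https://example.org/datasets/1", DCAT_KEYWORD, "roads"),
    ("https://example.org/publishers/9", DCT_TITLE, "Example Agency"),
]


def to_search_documents(triples):
    """Group triples by subject and emit one flat dict per dcat:Dataset."""
    by_subject = defaultdict(lambda: defaultdict(list))
    for subject, predicate, obj in triples:
        by_subject[subject][predicate].append(obj)

    documents = []
    for subject, props in by_subject.items():
        if DCAT_DATASET not in props.get(RDF_TYPE, []):
            continue  # this sketch only handles datasets
        documents.append({
            "uri": subject,
            "title": props.get(DCT_TITLE, [None])[0],
            "keywords": sorted(props.get(DCAT_KEYWORD, [])),
        })
    return documents


print(to_search_documents(EXAMPLE_TRIPLES))
```

A flat document like this is easier for a frontend or a search index to consume than the graph it came from, which is the motivation the paragraph above describes.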
Portal is a system for searching and browsing metadata related to:
- Dataset
- Data Service
- Service
- Concept
- Event
- Information Model
Portal consumes metadata from Harvesting and builds various indexes and databases to support:
- Full-text search
- Semantic search (e.g., using RAG)
- SPARQL queries
- Other, more advanced discovery features
Portal also integrates metadata quality scores from the Metadata Quality system to help users assess the quality of the information they are viewing.
Metadata Quality is a system for computing quality scores on metadata about datasets. It consumes metadata on datasets from Harvesting and evaluates them based on predefined criteria to produce a quality score.
This score is then used by the Portal to present an indication of metadata quality to end users.
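As an illustration of scoring against predefined criteria, the sketch below computes a weighted completeness score. The criteria, weights, and field names are hypothetical and are not the rules used by the Metadata Quality system. They only show the general shape of such a computation.

```python
# Hypothetical criteria: metadata field -> weight (integers, summing to 100).
CRITERIA = {
    "title": 30,
    "description": 30,
    "license": 20,
    "keywords": 20,
}


def quality_score(metadata):
    """Return a 0-100 score: the share of weighted criteria that are filled in."""
    earned = sum(
        weight for field, weight in CRITERIA.items() if metadata.get(field)
    )
    return round(100 * earned / sum(CRITERIA.values()))


print(quality_score({"title": "Road traffic counts", "description": "Hourly counts"}))
```

Empty strings and empty lists count as missing here, since `metadata.get(field)` is falsy for them.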
IAM is a system for handling authentication and authorization across the platform. It integrates with multiple identity providers and supports the OpenID Connect protocol to enable secure and flexible user access.
IAM ensures that users are properly authenticated and that access to features or data is controlled based on roles and permissions.
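The role-based check described above can be sketched as follows. The claim name `grants`, the role names, and their ordering are assumptions made for this example. The actual token format used by IAM is not documented here and may differ.

```python
# Hypothetical role names, ordered from least to most privileged.
ROLE_RANK = {"read": 1, "write": 2, "admin": 3}


def is_allowed(claims, organization_id, required_role):
    """Return True if the claims grant at least required_role for the organization."""
    needed = ROLE_RANK[required_role]
    for grant in claims.get("grants", []):
        if grant.get("organization") != organization_id:
            continue
        # Unknown roles rank as 0 and therefore never satisfy a requirement.
        if ROLE_RANK.get(grant.get("role"), 0) >= needed:
            return True
    return False


# Hypothetical claims, as they might look after a token has been validated.
claims = {
    "sub": "user-1",
    "grants": [{"organization": "123456789", "role": "write"}],
}
print(is_allowed(claims, "123456789", "read"))
```

The sketch assumes the token signature and expiry have already been verified. It covers only the authorization decision.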
This document outlines alternative open-source solutions for proprietary dependencies identified in the system architecture, along with migration strategies and implementation considerations.
The current system architecture includes several components that rely on proprietary or non-open-source licenses. This section provides guidance on replacing these dependencies with fully open-source alternatives while maintaining system functionality and performance.
Current Dependencies:
- Confluent Kafka (Community License)
- Confluent Schema Registry (Community License)
- Elasticsearch 8.10+ (SSPL/Elastic License)
Open-Source Alternatives:
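- Apache Kafka
  - Replacement for: Confluent Kafka
  - License: Apache 2.0
  - Benefits: Identical core functionality to Confluent Kafka
  - Migration Complexity: Low - configuration changes only
- Apicurio Registry
  - Replacement for: Confluent Schema Registry
  - License: Apache 2.0
  - Features: Avro schema management, REST API compatible with Confluent Schema Registry
  - Migration Complexity: Low - drop-in replacement
- Karapace
  - Replacement for: Confluent Schema Registry
  - License: Apache 2.0
  - Features: Kafka-native schema registry, Avro support
  - Migration Complexity: Low - configuration updates required
- OpenSearch
  - Replacement for: Elasticsearch
  - License: Apache 2.0
  - Features: API compatibility with Elasticsearch 7.10 (the version OpenSearch was forked from), built-in security features
  - Migration Complexity: Low to Moderate - use of Elasticsearch 8.x APIs and of the official 8.x clients needs to be verified, since compatibility beyond 7.10 is not guaranteed
- Elasticsearch 8.16+
  - Replacement for: Elasticsearch 8.10+
  - License: AGPLv3 (OSI approved)
  - Features: Latest features, improved performance
  - Migration Complexity: Low - version upgrade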
Migration Strategy:
- Configuration Updates: Replace Confluent-specific configurations with Apache Kafka equivalents
- Schema Registry Migration: Export existing schemas and import to new registry
- Serializer Updates: Switch to native Avro serializers
- Docker Compose Updates: Replace Confluent images with Apache Kafka and alternative registry
- Testing: Validate Avro schema compatibility and message format integrity
Impact Assessment:
- Core business logic remains unchanged
- Infrastructure and configuration changes only
- No application code modifications required
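The schema export and import step in the migration strategy could be scripted roughly as shown below. This sketch assumes that both the source and the target registry expose the Confluent Schema Registry REST endpoints (`/subjects`, `/subjects/{subject}/versions`), which Karapace and Apicurio's compatibility API are designed to do. The function name and the `source_get` / `target_post` callables are hypothetical. In practice they would wrap an HTTP client pointed at each registry.

```python
from urllib.parse import quote


def copy_schemas(source_get, target_post):
    """Copy every version of every subject from one registry to another.

    source_get(path) returns the parsed JSON response of a GET on the source.
    target_post(path, body) sends `body` as JSON in a POST to the target.
    Returns the number of schema versions copied.
    """
    copied = 0
    for subject in source_get("/subjects"):
        encoded = quote(subject, safe="")
        # Register versions in ascending order so history is replayed as written.
        for version in sorted(source_get(f"/subjects/{encoded}/versions")):
            entry = source_get(f"/subjects/{encoded}/versions/{version}")
            body = {"schema": entry["schema"]}
            if "schemaType" in entry:  # absent for Avro, which is the default
                body["schemaType"] = entry["schemaType"]
            target_post(f"/subjects/{encoded}/versions", body)
            copied += 1
    return copied
```

A copy like this does not necessarily preserve schema IDs. Messages written with the Confluent wire format embed the schema ID, so existing topic data may only remain readable if the target registry ends up with the same IDs. This should be checked as part of the testing step.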
Current Dependency:
- MongoDB (SSPL License)
Open-Source Alternative:
- PostgreSQL
  - License: PostgreSQL License (OSI approved)
  - Features: JSON/JSONB support, full-text search, ACID compliance
  - Migration Complexity: Low to Moderate
Migration Strategy:
- Dependency Update: Replace `spring-boot-starter-data-mongodb` with `spring-boot-starter-data-jpa`
- Driver Addition: Add PostgreSQL driver dependency
- Annotation Conversion: Convert MongoDB annotations to JPA annotations
- Configuration Update: Update `application.yml` with PostgreSQL connection properties
- Custom Logic Refactoring: Minor updates to RDF storage logic
- JSON/JSONB Integration: Leverage PostgreSQL's JSON capabilities for enhanced indexing
- Repository Abstraction: Maintain Spring Data repository pattern
Benefits:
- Dual storage strategy remains identical
- Enhanced indexing capabilities with JSONB
- Better ACID compliance
- Reduced licensing concerns
See this page for more information about how to implement the migration.
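At the storage level, the move from MongoDB documents to PostgreSQL with JSONB can be sketched as follows. This is not the JPA mapping from the migration strategy. It is a language-neutral illustration of the target shape, with a hypothetical table name (`resources`), column names, and helper function. The SQL uses standard PostgreSQL syntax, and the `%s` placeholders assume a psycopg-style driver.

```python
import json

# Hypothetical target table: the MongoDB _id becomes the primary key and the
# rest of the document is stored as JSONB, indexed with GIN for containment queries.
CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS resources (
    id      TEXT PRIMARY KEY,
    payload JSONB NOT NULL
);
CREATE INDEX IF NOT EXISTS resources_payload_idx
    ON resources USING GIN (payload);
"""

# Upsert, so that re-running the migration is idempotent.
UPSERT_SQL = (
    "INSERT INTO resources (id, payload) VALUES (%s, %s::jsonb) "
    "ON CONFLICT (id) DO UPDATE SET payload = EXCLUDED.payload"
)


def to_row(document):
    """Split a MongoDB-style document into (id, JSON payload) query parameters."""
    fields = dict(document)  # copy, so the caller's document is left untouched
    row_id = str(fields.pop("_id"))
    return row_id, json.dumps(fields, sort_keys=True)


print(to_row({"_id": "abc123", "title": "Road traffic counts"}))
```

Each tuple returned by `to_row` can be passed as the parameters of `UPSERT_SQL`. Documents containing non-JSON types such as dates or binary values would need an explicit conversion first.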
Current Dependencies:
- Altinn integration (Norwegian government service)
- OpenID Connect implementation
Open-Source Alternatives:
- Keycloak
  - License: Apache 2.0
  - Features: Full OpenID Connect support, user federation, social login
  - Migration Complexity: Low - standard OpenID Connect implementation
- Auth0 Community Edition
  - License: Various (check current terms); Auth0 is a commercial service, so confirm that the edition under consideration is actually available under an open-source license
  - Features: OpenID Connect, social providers, enterprise features
  - Migration Complexity: Low - configuration changes
- Custom OpenID Connect Provider
  - License: Depends on implementation
  - Features: Full control over user management and authentication flows
  - Migration Complexity: Moderate - requires development effort
Migration Considerations:
- OpenID Connect standard ensures compatibility
- User data migration may be required
- Authentication flows remain consistent
- Authorization policies may need adjustment
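One way to check the compatibility claim when evaluating a candidate provider is to inspect its OpenID Connect discovery document, which the OpenID Connect Discovery specification places at `/.well-known/openid-configuration` under the issuer URL. The sketch below validates an already-fetched document. The function names and the example issuer URL are hypothetical, and the list of required fields follows the Discovery specification.

```python
# Metadata fields marked REQUIRED by OpenID Connect Discovery 1.0.
REQUIRED_METADATA = (
    "issuer",
    "authorization_endpoint",
    "jwks_uri",
    "response_types_supported",
    "subject_types_supported",
    "id_token_signing_alg_values_supported",
)


def discovery_url(issuer):
    """Build the discovery document URL for an issuer."""
    return issuer.rstrip("/") + "/.well-known/openid-configuration"


def metadata_problems(config, issuer):
    """List the problems found in a discovery document; empty means it looks usable."""
    problems = [f"missing {name}" for name in REQUIRED_METADATA if name not in config]
    # The issuer in the document must match the issuer that was asked for.
    if "issuer" in config and config["issuer"] != issuer:
        problems.append("issuer mismatch")
    return problems


# Hypothetical issuer, in the URL style used by Keycloak realms.
print(discovery_url("https://sso.example.org/realms/demo"))
```

Running the same check against the current provider and each candidate gives a quick view of which endpoints and signing algorithms would change in a migration.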
- Elasticsearch Migration: Upgrade to AGPLv3 version or migrate to OpenSearch
- Schema Registry Replacement: Implement Apicurio Registry or Karapace
- Kafka Migration: Switch from Confluent to Apache Kafka
- MongoDB Migration: Implement PostgreSQL with JSONB support
- Repository Updates: Convert Spring Data repositories
- Testing and Validation: Ensure data integrity and performance
- IAM Evaluation: Assess current OpenID Connect implementation
- Provider Selection: Choose appropriate open-source IAM solution
- Migration Planning: Develop user data migration strategy
- Full Open Source Compliance: Eliminates proprietary license restrictions
- Reduced Legal Risk: Clear licensing terms for all components
- Vendor Independence: Freedom from vendor lock-in
- Community Support: Access to broader open-source community
- Transparency: Full visibility into component source code
- Customization: Ability to modify and extend components as needed
- Cost Reduction: Elimination of proprietary licensing costs
- Flexibility: Freedom to deploy in any environment
- Innovation: Ability to contribute back to open-source projects
- Compatibility Testing: Comprehensive testing of all integrations
- Performance Validation: Ensure performance meets requirements
- Rollback Planning: Maintain ability to revert changes if needed
- Training Requirements: Team education on new components
- Documentation Updates: Maintain current system documentation
- Monitoring Updates: Adjust monitoring and alerting systems
The migration to fully open-source alternatives is feasible with moderate complexity. The primary effort involves infrastructure and configuration changes rather than application code modifications. The benefits of full open-source compliance, reduced licensing costs, and increased flexibility outweigh the migration effort required.
Each component can be migrated independently, allowing for a phased approach that minimizes risk and ensures system stability throughout the transition process.