From 3062e73ad055bfa867f6950cb16e94f23c210eb8 Mon Sep 17 00:00:00 2001
From: Razvan-Daniel Mihai <84674+razvan@users.noreply.github.com>
Date: Tue, 6 Dec 2022 12:08:27 +0100
Subject: [PATCH 01/19] First draft of ADR025.

---
 .../pages/adr/ADR025-database-connection.adoc | 106 ++++++++++++++++++
 1 file changed, 106 insertions(+)
 create mode 100644 modules/contributor/pages/adr/ADR025-database-connection.adoc

diff --git a/modules/contributor/pages/adr/ADR025-database-connection.adoc b/modules/contributor/pages/adr/ADR025-database-connection.adoc
new file mode 100644
index 000000000..441a2f403
--- /dev/null
+++ b/modules/contributor/pages/adr/ADR025-database-connection.adoc
@@ -0,0 +1,106 @@
= [short title of solved problem and solution]
Doc Writer
v0.1, YYYY-MM-DD
:status: draft

* Status: {draft}
* Deciders: [list everyone involved in the decision]
* Date: [YYYY-MM-DD when the decision was last updated]

Technical Story: https://github.com/stackabletech/hive-operator/issues/148

== Context and Problem Statement

Many products supported by the Stackable Data Platform require databases to store metadata. Currently there is no uniform, consistent way to define database connections. In addition, some Stackable operators define database credentials to be provided inline and in plain text in the cluster definitions.

A quick analysis of the status-quo regarding database connection definitions shows how different operators handle them:

* Apache Hive : the cluster custom resource defined a field called "database" with access credentials in clear text.
* Apache Airflow and Apache Superset: uses a field called "credentialSecret" that contains multiple different database connection definitions. In case of Airflow, this secret only supports the Celery executor.
* Apache Druid: uses a field called "metadataStorageDatabase" where access credentials are expected to be inline and in plain text.

== Decision Drivers

Here we attempt to standardize the way database connections are defined across the Stackable platform in such a way that:

* Different database systems are supported.
* Access credentials are dinfined in Kubernetes Secret objects
* Database connections can be reused across product services and even product installations.

== Considered Options

To achieve the acceptance criteria defined above, we propose a new Kubernetes resource called `DatabaseConnection` with the following fields:

[cols="1,1"]
|===
|Field name | Description
|credentials
|A string with the name of a `Secret` containing at least a user name and a password field. Additional fields are allowed.
|driver
|A string with the database driver name. This is a generic field that identifies the type of the database used.
|protocol
|The protocol prefix of the final connection string. Most Java based products will use `jdbc:`.
|host
|A string with the host name to connect to.
|instance
|A string with the database instance to connect to. Optional.
|port
|A positive integer with the TCP port used for the connection. Optional.
|properties
|A dictionary of additional properties for driver tuning like number of client threads, various buffer sizes and so on. Some drivers, like `derby`, use this to define the database name and whether the DB should be automatically created or not.
Optional
|===

The `Secret` object referenced by `credentials` must contain two fields named `USER_NAME` and `PASSWORD` but can contain additional fields like first name, last name, email, user role and so on.

=== Examples

These examples showcase the spec change required from the current status:

The current Druid metadata database connection

[source,yaml]
---
metadataStorageDatabase:
  dbType: postgresql
  connString: jdbc:postgresql://druid-postgresql/druid
  host: druid-postgresql
  port: 5432
  user: druid
  password: druid

becomes

[source,yaml]
---
metadataStorageDatabase: druid-metadata-connection

where `druid-metadata-connection` is a standalone `DatabaseConnection` resource defined as follows

[source,yaml]
---
apiVersion: databaseconnection.stackable.tech/v1alpha1
kind: DatabaseConnection
metadata:
  name: druid-metadata-connection
spec:
  driver: postgresql
  host: druid-postgresql
  port: 5432
  protocol: jdbc:postgresql
  instance: druid
  credentials: druid-metadata-credentials

and the credentials field contains the name of a Kubernetes `Secret` defined as:

[source,yaml]
---
apiVersion: v1
kind: Secret
metadata:
  name: druid-metadata-credentials
type: Opaque
data:
  USER_NAME: druid
  PASSWORD: druid

== Decision Outcome

From 44b4f8b182214e8a8a992d1a7e7a6202495f131b Mon Sep 17 00:00:00 2001
From: Razvan-Daniel Mihai <84674+razvan@users.noreply.github.com>
Date: Tue, 6 Dec 2022 13:18:12 +0100
Subject: [PATCH 02/19] Fix typo

---
 modules/contributor/pages/adr/ADR025-database-connection.adoc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/modules/contributor/pages/adr/ADR025-database-connection.adoc b/modules/contributor/pages/adr/ADR025-database-connection.adoc
index 441a2f403..d66f29d81 100644
--- a/modules/contributor/pages/adr/ADR025-database-connection.adoc
+++ b/modules/contributor/pages/adr/ADR025-database-connection.adoc
@@ -24,7 +24,7 @@ A quick analysis of the status-quo regarding database connection definitions sho
 Here we attempt to standardize the way database connections are defined across the Stackable platform in such a way that:

 * Different database systems are supported.
-* Access credentials are dinfined in Kubernetes Secret objects
+* Access credentials are defined in Kubernetes `Secret` objects.
 * Database connections can be reused across product services and even product installations.

From 181f74e9e999b71bb4b080eceaae856807dc4372 Mon Sep 17 00:00:00 2001
From: Razvan-Daniel Mihai <84674+razvan@users.noreply.github.com>
Date: Tue, 6 Dec 2022 13:20:07 +0100
Subject: [PATCH 03/19] Update modules/contributor/pages/adr/ADR025-database-connection.adoc

Co-authored-by: Sebastian Bernauer
---
 modules/contributor/pages/adr/ADR025-database-connection.adoc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/modules/contributor/pages/adr/ADR025-database-connection.adoc b/modules/contributor/pages/adr/ADR025-database-connection.adoc
index d66f29d81..938a00b6d 100644
--- a/modules/contributor/pages/adr/ADR025-database-connection.adoc
+++ b/modules/contributor/pages/adr/ADR025-database-connection.adoc
@@ -16,7 +16,7 @@ Many products supported by the Stackable Data Platform require databases to stor
 A quick analysis of the status-quo regarding database connection definitions shows how different operators handle them:

 * Apache Hive : the cluster custom resource defined a field called "database" with access credentials in clear text.
-* Apache Airflow and Apache Superset: uses a field called "credentialSecret" that contains multiple different database connection definitions. In case of Airflow, this secret only supports the Celery executor.
+* Apache Airflow and Apache Superset: uses a field called "credentialSecret" that contains multiple different database connection definitions. Even worse, it contains credentials not related to a database, such as a secret to encrypt the cookies. In case of Airflow, this secret only supports the Celery executor.
 * Apache Druid: uses a field called "metadataStorageDatabase" where access credentials are expected to be inline and in plain text.

 == Decision Drivers

From 38ce7f9a2b0a2da770bd8d3f4064a9cc79bd81ed Mon Sep 17 00:00:00 2001
From: Razvan-Daniel Mihai <84674+razvan@users.noreply.github.com>
Date: Tue, 6 Dec 2022 13:20:21 +0100
Subject: [PATCH 04/19] Update modules/contributor/pages/adr/ADR025-database-connection.adoc

Co-authored-by: Sebastian Bernauer
---
 modules/contributor/pages/adr/ADR025-database-connection.adoc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/modules/contributor/pages/adr/ADR025-database-connection.adoc b/modules/contributor/pages/adr/ADR025-database-connection.adoc
index 938a00b6d..f5597318b 100644
--- a/modules/contributor/pages/adr/ADR025-database-connection.adoc
+++ b/modules/contributor/pages/adr/ADR025-database-connection.adoc
@@ -15,7 +15,7 @@ Many products supported by the Stackable Data Platform require databases to stor
 A quick analysis of the status-quo regarding database connection definitions shows how different operators handle them:

-* Apache Hive : the cluster custom resource defined a field called "database" with access credentials in clear text.
+* Apache Hive: the cluster custom resource defined a field called "database" with access credentials in clear text.
 * Apache Airflow and Apache Superset: uses a field called "credentialSecret" that contains multiple different database connection definitions. Even worse, it contains credentials not related to a database, such as a secret to encrypt the cookies. In case of Airflow, this secret only supports the Celery executor.
 * Apache Druid: uses a field called "metadataStorageDatabase" where access credentials are expected to be inline and in plain text.

From 47e1dc67d097fa583fe05d33d86649d344d9e89a Mon Sep 17 00:00:00 2001
From: Sebastian Bernauer
Date: Tue, 6 Dec 2022 14:24:52 +0100
Subject: [PATCH 05/19] Update ADR025-database-connection.adoc

---
 .../pages/adr/ADR025-database-connection.adoc | 27 +++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/modules/contributor/pages/adr/ADR025-database-connection.adoc b/modules/contributor/pages/adr/ADR025-database-connection.adoc
index f5597318b..11617ff5a 100644
--- a/modules/contributor/pages/adr/ADR025-database-connection.adoc
+++ b/modules/contributor/pages/adr/ADR025-database-connection.adoc
@@ -103,4 +103,31 @@ data:
   USER_NAME: druid
   PASSWORD: druid

+An alternative approach could look like the following. We will discuss this approach on Wednesday, 07.12.2022
+
+[source,yaml]
+---
+apiVersion: databaseconnection.stackable.tech/v1alpha1
+kind: DatabaseConnection
+metadata:
+  name: druid-metadata-connection
+  namespace: default
+spec:
+  database:
+    postgresql:
+      host: druid-postgresql # mandatory
+      port: 5432 # defaults to some port number - depending on whether tls is enabled
+      schema: druid # defaults to druid
+      credentials: druid-postgresql-credentials # mandatory.
key username and password
+      parameters: "" # optional
+    redis:
+      host: airflow-redis-master # mandatory
+      port: 6379 # defaults to some port number - depending on whether tls is enabled
+      schema: druid # defaults to druid
+      credentials: airflow-redis-credentials # optional. key password. In case redis also supports usernames key username and password
+      parameters: "" # optional
+    derby:
+      location: /tmp/derby/ # optional, defaults to /tmp/derby-{metadata.name}/derby.db
+      parameters: "create=true" # optional
+
 == Decision Outcome

From 1f0b9869f0300bbbfb55c2b20edc36f9ea7eba38 Mon Sep 17 00:00:00 2001
From: Razvan-Daniel Mihai <84674+razvan@users.noreply.github.com>
Date: Tue, 6 Dec 2022 16:11:47 +0100
Subject: [PATCH 06/19] Change api version and a bit more structure.

---
 .../pages/adr/ADR025-database-connection.adoc | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/modules/contributor/pages/adr/ADR025-database-connection.adoc b/modules/contributor/pages/adr/ADR025-database-connection.adoc
index 11617ff5a..86c114c58 100644
--- a/modules/contributor/pages/adr/ADR025-database-connection.adoc
+++ b/modules/contributor/pages/adr/ADR025-database-connection.adoc
@@ -29,6 +29,11 @@ Here we attempt to standardize the way database connections are defined across t

 == Considered Options

+1. A generic resource definition.
+2. Database driver specific resource definition.
+
+=== A generic resource definition
+
 To achieve the acceptance criteria defined above, we propose a new Kubernetes resource called `DatabaseConnection` with the following fields:

 [cols="1,1"]
@@ -78,7 +83,7 @@ where `druid-metadata-connection` is a standalone `DatabaseConnection` resource

 [source,yaml]
 ---
-apiVersion: databaseconnection.stackable.tech/v1alpha1
+apiVersion: db.stackable.tech/v1alpha1
 kind: DatabaseConnection
 metadata:
   name: druid-metadata-connection

From 7b122bc28eaa243291a2a915c9802734c8f666af Mon Sep 17 00:00:00 2001
From: Sebastian Bernauer
Date: Wed, 7 Dec 2022 10:52:47 +0100
Subject: [PATCH 07/19] Update ADR025-database-connection.adoc

---
 .../pages/adr/ADR025-database-connection.adoc | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/modules/contributor/pages/adr/ADR025-database-connection.adoc b/modules/contributor/pages/adr/ADR025-database-connection.adoc
index 86c114c58..55b3946b0 100644
--- a/modules/contributor/pages/adr/ADR025-database-connection.adoc
+++ b/modules/contributor/pages/adr/ADR025-database-connection.adoc
@@ -126,15 +126,18 @@ spec:
       port: 5432 # defaults to some port number - depending on whether tls is enabled
       schema: druid # defaults to druid
       credentials: druid-postgresql-credentials # mandatory. key username and password
-      parameters: "" # optional
+      parameters: {} # optional
     redis:
       host: airflow-redis-master # mandatory
       port: 6379 # defaults to some port number - depending on whether tls is enabled
       schema: druid # defaults to druid
-      credentials: airflow-redis-credentials # optional. key password. In case redis also supports usernames key username and password
-      parameters: "" # optional
+      credentials: airflow-redis-credentials # optional.
key password
+      parameters: {} # optional
     derby:
       location: /tmp/derby/ # optional, defaults to /tmp/derby-{metadata.name}/derby.db
-      parameters: "create=true" # optional
+      parameters: # optional
+        create: "true"
+      custom:
+        connectionString: "postgresql://superset:superset@superset-postgresql.default.svc.cluster.local/superset"

 == Decision Outcome

From abb0534cb16ba3a411f99ad2a3820d8d12355605 Mon Sep 17 00:00:00 2001
From: Razvan-Daniel Mihai <84674+razvan@users.noreply.github.com>
Date: Thu, 8 Dec 2022 11:32:20 +0100
Subject: [PATCH 08/19] Update ADR025-database-connection.adoc

wip
---
 .../pages/adr/ADR025-database-connection.adoc | 20 +++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/modules/contributor/pages/adr/ADR025-database-connection.adoc b/modules/contributor/pages/adr/ADR025-database-connection.adoc
index 55b3946b0..a3209f463 100644
--- a/modules/contributor/pages/adr/ADR025-database-connection.adoc
+++ b/modules/contributor/pages/adr/ADR025-database-connection.adoc
@@ -137,7 +137,23 @@ spec:
     derby:
       location: /tmp/derby/ # optional, defaults to /tmp/derby-{metadata.name}/derby.db
       parameters: # optional
         create: "true"
-      custom:
-        connectionString: "postgresql://superset:superset@superset-postgresql.default.svc.cluster.local/superset"
+    genericConnectionString:
+      format: postgresql://$SUPERSET_DB_USER:$SUPERSET_DB_PASS@postgres.default.svc.local:$SUPERSET_DB_PORT/superset&param1=value1&param2=value2
+      secret: ... # optional
+        SUPERSET_DB_USER: ...
+        SUPERSET_DB_PASS: ...
+        SUPERSET_DB_PORT: ...
+    generic:
+      host: superset-postgresql.default.svc.cluster.local # optional
+      port: 5432 # optional
+      protocol: pgsql123 # optional
+      instance: superset # optional
+      credentials: name-of-secret-with-credentials #optional
+      parameters: {...} # optional
+      connectionStringFormat: "{protocol}://{credentials.user_name}:{credentials.credentials}@{host}:{port}/{instance}&[parameters,;]"
+      tls: # optional
+        verification:
+          ca_cert:
+            ...
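
For illustration only: assuming the referenced `Secret` held `USER_NAME: superset` and `PASSWORD: superset`, the `connectionStringFormat` template above would render to roughly the following string (all values are placeholders, parameters omitted):

[source]
pgsql123://superset:superset@superset-postgresql.default.svc.cluster.local:5432/superset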
+
+Hive
+
+[source,xml]
+<property>
+  <name>javax.jdo.option.ConnectionURL</name>
+  <value>jdbc:postgresql://mypostgresql.testabcd1111.us-west-2.rds.amazonaws.com:5432/mypgdb</value>
+  <description>PostgreSQL JDBC driver connection URL</description>
+</property>
+<property>
+  <name>javax.jdo.option.ConnectionDriverName</name>
+  <value>org.postgresql.Driver</value>
+  <description>PostgreSQL metastore driver class name</description>
+</property>
+<property>
+  <name>javax.jdo.option.ConnectionUserName</name>
+  <value>database_username</value>
+  <description>the username for the DB instance</description>
+</property>
+<property>
+  <name>javax.jdo.option.ConnectionPassword</name>
+  <value>database_password</value>
+  <description>the password for the DB instance</description>
+</property>
+
+Druid
+
+[source]
+druid.extensions.loadList=["postgresql-metadata-storage"]
+druid.metadata.storage.type=postgresql
+druid.metadata.storage.connector.connectURI=jdbc:postgresql:///druid
+druid.metadata.storage.connector.user=druid
+druid.metadata.storage.connector.password=diurd
+
+Superset
+
+[source]
+postgresql://{username}:{password}@{host}:{port}/{database}?sslmode=require
+
+
+Airflow
+
+[source,yaml]
+---
+apiVersion: v1
+kind: Secret
+metadata:
+  name: simple-airflow-credentials
+type: Opaque
+stringData:
+  adminUser.username: airflow
+  adminUser.firstname: Airflow
+  adminUser.lastname: Admin
+  adminUser.email: airflow@airflow.com
+  adminUser.password: airflow
+  connections.secretKey: thisISaSECRET_1234
+  connections.sqlalchemyDatabaseUri: postgresql+psycopg2://airflow:airflow@airflow-postgresql.default.svc.cluster.local/airflow
+  connections.celeryResultBackend: db+postgresql://airflow:airflow@airflow-postgresql.default.svc.cluster.local/airflow
+  connections.celeryBrokerUrl: redis://:redis@airflow-redis-master:6379/0

 == Decision Outcome

From a3fb7c02d7a297bf988803a9358c532fb59de2db Mon Sep 17 00:00:00 2001
From: Sebastian Bernauer
Date: Thu, 8 Dec 2022 14:30:31 +0100
Subject: [PATCH 10/19] Update ADR025-database-connection.adoc

---
 .../contributor/pages/adr/ADR025-database-connection.adoc | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/modules/contributor/pages/adr/ADR025-database-connection.adoc b/modules/contributor/pages/adr/ADR025-database-connection.adoc
index d69f22f64..64ae282e1 100644
--- a/modules/contributor/pages/adr/ADR025-database-connection.adoc
+++ b/modules/contributor/pages/adr/ADR025-database-connection.adoc
@@ -137,9 +137,6 @@ spec:
     location: /tmp/derby/ # optional, defaults to /tmp/derby-{metadata.name}/derby.db
     parameters: # optional
       create: "true"
-
-
-
     genericConnectionString:
       driver: postgresql
       format: postgresql://$SUPERSET_DB_USER:$SUPERSET_DB_PASS@postgres.default.svc.local:$SUPERSET_DB_PORT/superset&param1=value1&param2=value2
       secret: ... # optional
         SUPERSET_DB_USER: ...
         SUPERSET_DB_PASS: ...
         SUPERSET_DB_PORT: ...
-
-
     generic:
       driver: postgresql
       host: superset-postgresql.default.svc.cluster.local # optional

From 5d025159e2c822adef3872ea56dd8c4cd77c143d Mon Sep 17 00:00:00 2001
From: Sebastian Bernauer
Date: Thu, 8 Dec 2022 15:58:11 +0100
Subject: [PATCH 11/19] Update ADR025-database-connection.adoc

---
 .../pages/adr/ADR025-database-connection.adoc | 115 ++++++++++++++++++
 1 file changed, 115 insertions(+)

diff --git a/modules/contributor/pages/adr/ADR025-database-connection.adoc b/modules/contributor/pages/adr/ADR025-database-connection.adoc
index 64ae282e1..87253f7d5 100644
--- a/modules/contributor/pages/adr/ADR025-database-connection.adoc
+++ b/modules/contributor/pages/adr/ADR025-database-connection.adoc
@@ -217,4 +217,119 @@ stringData:
   connections.sqlalchemyDatabaseUri: postgresql+psycopg2://airflow:airflow@airflow-postgresql.default.svc.cluster.local/airflow
   connections.celeryResultBackend: db+postgresql://airflow:airflow@airflow-postgresql.default.svc.cluster.local/airflow
   connections.celeryBrokerUrl: redis://:redis@airflow-redis-master:6379/0
+
+[source,yaml]
+----
+Within operator-rs we have a commons struct for every DB that we support:
+1. postgresql
+2. mysql
+3. mariadb
+4. oracle
+5. sqlite
+6. derby
+7. redis
+8. etc...
+
+This has the advantage that all our products configure e.g. a PostgreSQL the exact same way.
+We can also add some functions on the structs for e.g. jdbc-based connection strings or similar.
+
+Every product operator has an enum containing all the structs of the DBs the product supports (or only a subset if Stackable does only support a subset)
+This has the advantage that the CRD as well as automatically generated documentation will list not only the supported dbs, but also documents all the attributes of them.
+
+Also every operator has an *individual* `generic` struct, which exposes exactly the settings the product has.
+This enables full flexibility, as all the settings of the product are configurable.
+
+---
+kind: DruidCluster
+spec:
+  metadataDB:
+    postgresql:
+      host: postgresql # mandatory
+      port: 5432 # defaults to some port number - depending on whether tls is enabled
+      schema: druid # mandatory
+      credentials: postgresql-credentials # mandatory. key username and password
+      parameters: {} # optional
+    mysql:
+      host: mysql # mandatory
+      port: XXXX # defaults to some port number - depending on whether tls is enabled
+      schema: druid # mandatory
+      credentials: mysql-credentials # mandatory. key username and password
+      parameters: {} # optional
+    derby:
+      location: /tmp/derby/ # optional, defaults to /tmp/derby-/derby.db
+    generic:
+      driver: postgresql # mandatory
+      uri: jdbc:postgresql:///druid?foo;bar # mandatory
+      credentialsSecret: my-secret # mandatory. key username + password
+# druid.metadata.storage.type=postgresql
+# druid.metadata.storage.connector.connectURI=jdbc:postgresql:///druid
+# druid.metadata.storage.connector.user=druid
+# druid.metadata.storage.connector.password=diurd
+
+---
+kind: SupersetCluster
+spec:
+  metadataDB:
+    postgresql:
+      host: postgresql # mandatory
+      port: 5432 # defaults to some port number - depending on whether tls is enabled
+      schema: superset # mandatory
+      credentials: postgresql-credentials # mandatory. key username and password
+      parameters: {} # optional
+    mysql:
+      host: mysql # mandatory
+      port: XXXX # defaults to some port number - depending on whether tls is enabled
+      schema: superset # mandatory
+      credentials: mysql-credentials # mandatory.
key username and password
+      parameters: {} # optional
+    sqlite:
+      location: /tmp/sqlite/ # optional, defaults to /tmp/sqlite-/derby.db
+    generic:
+      uriSecret: my-secret # mandatory. key uri
+      # ALTERNATIVE SOLUTION
+      uriTemplate: postgresql://$SUPERSET_DB_USER:$SUPERSET_DB_PASS@postgres.default.svc.local:$SUPERSET_DB_PORT/superset&param1=value1&param2=value2
+      templateSecret: my-secret # optional
+        SUPERSET_DB_USER: ...
+        SUPERSET_DB_PASS: ...
+        SUPERSET_DB_PORT: ...
+# postgresql://{username}:{password}@{host}:{port}/{database}?sslmode=require
+
+kind: HiveCluster
+spec:
+  metadataDB:
+    postgresql:
+      host: postgresql # mandatory
+      port: 5432 # defaults to some port number - depending on whether tls is enabled
+      schema: druid # mandatory
+      credentials: postgresql-credentials # mandatory. key username and password
+      parameters: {} # optional
+    derby:
+      location: /tmp/derby/ # optional, defaults to /tmp/derby-/derby.db
+    # Missing: MS-SQL server, Oracle
+    generic:
+      driver: org.postgresql.Driver # mandatory
+      uri: jdbc:postgresql://postgresql.us-west-2.rds.amazonaws.com:5432/mypgdb # mandatory
+      credentialsSecret: my-secret # mandatory (?). key username + password
+#  <property>
+#    <name>javax.jdo.option.ConnectionURL</name>
+#    <value>jdbc:postgresql://postgresql.us-west-2.rds.amazonaws.com:5432/mypgdb</value>
+#    <description>PostgreSQL JDBC driver connection URL</description>
+#  </property>
+#  <property>
+#    <name>javax.jdo.option.ConnectionDriverName</name>
+#    <value>org.postgresql.Driver</value>
+#    <description>PostgreSQL metastore driver class name</description>
+#  </property>
+#  <property>
+#    <name>javax.jdo.option.ConnectionUserName</name>
+#    <value>database_username</value>
+#    <description>the username for the DB instance</description>
+#  </property>
+#  <property>
+#    <name>javax.jdo.option.ConnectionPassword</name>
+#    <value>database_password</value>
+#    <description>the password for the DB instance</description>
+#  </property>
+----

== Decision Outcome

From eb76776db2b59104269aa7958721c18b1b34f9d2 Mon Sep 17 00:00:00 2001
From: Razvan-Daniel Mihai <84674+razvan@users.noreply.github.com>
Date: Thu, 8 Dec 2022 16:48:16 +0100
Subject: [PATCH 12/19] wip

---
 .../pages/adr/ADR025-database-connection.adoc | 38 +++++++++++++------
 1 file changed, 26 insertions(+), 12 deletions(-)

diff --git a/modules/contributor/pages/adr/ADR025-database-connection.adoc b/modules/contributor/pages/adr/ADR025-database-connection.adoc
index 87253f7d5..b8f87cdf0 100644
--- a/modules/contributor/pages/adr/ADR025-database-connection.adoc
+++ b/modules/contributor/pages/adr/ADR025-database-connection.adoc
@@ -1,13 +1,17 @@
-= [short title of solved problem and solution]
-Doc Writer
-v0.1, YYYY-MM-DD
+= ADR026: Standardize database connection specifications
+Razvan Mihai
+v0.1, 2022-12-08
 :status: draft

 * Status: {draft}
-* Deciders: [list everyone involved in the decision]
-* Date: [YYYY-MM-DD when the decision was last updated]
+* Deciders:
+** Felix Henning
+** Malte Sanders
+** Sebastian Bernauer
+** Razvan Mihai
+* Date: 2022-12-08

-Technical Story: https://github.com/stackabletech/hive-operator/issues/148
+Technical Story: https://github.com/stackabletech/issues/issues/238

 == Context and Problem Statement

 Many products supported by the Stackable Data Platform require databases to store metadata. Currently there is no uniform, consistent way to define database connections. In addition, some Stackable operators define database credentials to be provided inline and in plain text in the cluster definitions.

@@ -29,6 +33,11 @@ Here we attempt to standardize the way database connections are defined across t

 * Different database systems are supported.
 * Access credentials are defined in Kubernetes `Secret` objects.
-* Database connections can be reused across product services and even product installations.
+* Product configuration only allows (product) supported databases ...
+* But there is a generic way to configure additional database systems.
+* Misconfigured connections should be rejected as early as possible in the product lifecycle.
+* Generated CRD documentation is easy to follow by users.
+
+Initially we thought that database connections should be implemented as stand-alone Kubernetes resources and should be referenced in cluster definitions. This idea was thrown away mostly because sharing database connections across products is not good practice and we shouldn't encourage it.

 == Considered Options

-1. A generic resource definition.
-2. Database driver specific resource definition.
+1. `DatabaseConnection` A generic resource definition.
+2. Database driver specific resource definition.
+3. Product supported and a generic DB specifications.

-=== A generic resource definition
+=== 1. (discarded) `DatabaseConnection` A generic resource definition

-To achieve the acceptance criteria defined above, we propose a new Kubernetes resource called `DatabaseConnection` with the following fields:
+The first idea was to introduce a new Kubernetes resource called `DatabaseConnection` with the following fields:

 [cols="1,1"]
@@ -118,7 +120,9 @@ data:
   USER_NAME: druid
   PASSWORD: druid

-=== Database driver specific resource definition
+NOTE: This idea was discarded because it didn't satisfy all acceptance criteria. In particular it wouldn't be possible to catch misconfigurations at cluster creation time.

-An alternative approach could look like the following. We will discuss this approach on Wednesday, 07.12.2022
+=== (discarded) 2. Database driver specific resource definition.
+
+In an attempt to address the issues of the first option above, a more detailed specification was necessary. Here, database generic configurations are possible that can be better validated, as in the example below.

@@ -165,4 +169,7 @@ spec:
       tls: # optional
         verification:
           ca_cert:
             ...
+In addition, a second generic DB type (`genericConnectionString`) is introduced. This specification allows templating connection URLs with variables defined in secrets and it's not restricted only to user credentials.
+
+NOTE: This proposal was rejected for the same reason as the first proposal. In addition, it fails to make possible DB configurations product specific.

From 25f8b2ff8c4c77b07fc081899f50c967c047ec0a Mon Sep 17 00:00:00 2001
From: Razvan-Daniel Mihai <84674+razvan@users.noreply.github.com>
Date: Thu, 8 Dec 2022 16:52:15 +0100
Subject: [PATCH 13/19] Update adr number for db connections to 26

---
 ...logging_architecture.adoc => ADR026-logging_architecture.adoc} | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 rename modules/contributor/pages/adr/{ADR025-logging_architecture.adoc => ADR026-logging_architecture.adoc} (100%)

diff --git a/modules/contributor/pages/adr/ADR025-logging_architecture.adoc b/modules/contributor/pages/adr/ADR026-logging_architecture.adoc
similarity index 100%
rename from modules/contributor/pages/adr/ADR025-logging_architecture.adoc
rename to modules/contributor/pages/adr/ADR026-logging_architecture.adoc

From 453d105b8930a790e550975526307e65f830e5f9 Mon Sep 17 00:00:00 2001
From: Razvan-Daniel Mihai <84674+razvan@users.noreply.github.com>
Date: Thu, 8 Dec 2022 16:54:32 +0100
Subject: [PATCH 14/19] Fix adr renaming.
---
 ...logging_architecture.adoc => ADR025-logging_architecture.adoc} | 0
 ...5-database-connection.adoc => ADR026-database-connection.adoc} | 0
 2 files changed, 0 insertions(+), 0 deletions(-)
 rename modules/contributor/pages/adr/{ADR026-logging_architecture.adoc => ADR025-logging_architecture.adoc} (100%)
 rename modules/contributor/pages/adr/{ADR025-database-connection.adoc => ADR026-database-connection.adoc} (100%)

diff --git a/modules/contributor/pages/adr/ADR026-logging_architecture.adoc b/modules/contributor/pages/adr/ADR025-logging_architecture.adoc
similarity index 100%
rename from modules/contributor/pages/adr/ADR026-logging_architecture.adoc
rename to modules/contributor/pages/adr/ADR025-logging_architecture.adoc

diff --git a/modules/contributor/pages/adr/ADR025-database-connection.adoc b/modules/contributor/pages/adr/ADR026-database-connection.adoc
similarity index 100%
rename from modules/contributor/pages/adr/ADR025-database-connection.adoc
rename to modules/contributor/pages/adr/ADR026-database-connection.adoc

From b6c6811bbaf185214ffea35aab3246d320b76b62 Mon Sep 17 00:00:00 2001
From: Razvan-Daniel Mihai <84674+razvan@users.noreply.github.com>
Date: Thu, 8 Dec 2022 17:51:23 +0100
Subject: [PATCH 15/19] wip

---
 .../pages/adr/ADR026-database-connection.adoc | 108 ++++--------
 1 file changed, 24 insertions(+), 84 deletions(-)

diff --git a/modules/contributor/pages/adr/ADR026-database-connection.adoc b/modules/contributor/pages/adr/ADR026-database-connection.adoc
index b8f87cdf0..20b248cb3 100644
--- a/modules/contributor/pages/adr/ADR026-database-connection.adoc
+++ b/modules/contributor/pages/adr/ADR026-database-connection.adoc
@@ -38,11 +38,11 @@ Initially we thought that database connections should be implemented as stand-al

 == Considered Options

-1. `DatabaseConnection` A generic resource definition.
-2. Database driver specific resource definition.
-3. Product supported and a generic DB specifications.
+1. (rejected) `DatabaseConnection` A generic resource definition.
+2. (rejected) Database driver specific resource definition.
+3. (accepted) Product supported and a generic DB specifications.

-=== 1. (discarded) `DatabaseConnection` A generic resource definition
+=== 1. (rejected) `DatabaseConnection` A generic resource definition

 The first idea was to introduce a new Kubernetes resource called `DatabaseConnection` with the following fields:

@@ -120,7 +120,7 @@ data:
   USER_NAME: druid
   PASSWORD: druid

 NOTE: This idea was discarded because it didn't satisfy all acceptance criteria. In particular it wouldn't be possible to catch misconfigurations at cluster creation time.

-=== (discarded) 2. Database driver specific resource definition.
+=== (rejected) 2. Database driver specific resource definition.

 In an attempt to address the issues of the first option above, a more detailed specification was necessary. Here, database generic configurations are possible that can be better validated, as in the example below.

@@ -173,90 +173,27 @@ In addition, a second generic DB type (`genericConnectionString`) is introduced.

 NOTE: This proposal was rejected for the same reason as the first proposal. In addition, it fails to make possible DB configurations product specific.
-Hive - -[source,xml] - - javax.jdo.option.ConnectionURL - jdbc:postgresql://mypostgresql.testabcd1111.us-west-2.rds.amazonaws.com:5432/mypgdb - PostgreSQL JDBC driver connection URL - - - javax.jdo.option.ConnectionDriverName - org.postgresql.Driver - PostgreSQL metastore driver class name - - - javax.jdo.option.ConnectionUserName - database_username - the username for the DB instance - - - javax.jdo.option.ConnectionPassword - database_password - the password for the DB instance - - -Druid +=== (accepted) Product supported and a generic DB specifications. -[source] -druid.extensions.loadList=["postgresql-metadata-storage"] -druid.metadata.storage.type=postgresql -druid.metadata.storage.connector.connectURI=jdbc:postgresql:///druid -druid.metadata.storage.connector.user=druid -druid.metadata.storage.connector.password=diurd +It seems that an unique, platform wide mechanism to describe database connections that also fulfills all acceptance criteria is not feasable. Database drivers and product configurations are too diverse and cannot be forced into a type safe specification. -Superset +Thus the single, global connection manifest needs to split into two different categories, each covering a subset of the acceptance criteria: -[source] -postgresql://{username}:{password}@{host}:{port}/{database}?sslmode=require +1. A database specific mechanism. This allows to catch misconfigurations early, it promotes good documentation and uniformity inside the platform. +2. An operator specific mechanism. This is a wildcard that can be used to configure database connections that are not officially supported by the products but that can still be partially validated early. +The first mechanism requires the operator framwork to provide predefined structures and supporting functions for widely available database systems such as: PostgreSQL, MySQL, MariaDB, Oracle, SQLite, Derby, Redis and so on. This doesn't mean that all products can be configured with all DB implementations. The product definitions will only allow the subset that is officially supported by the products. -Airflow +The second mechanism is operator/product specific and it contains mostly a pass-through list of relevant **product properties**. There is at least one exception, and that is the handling of user credentials which still need to be provisioned in a secure way (as long as the product supports it). -[source,yaml] ---- -apiVersion: v1 -kind: Secret -metadata: - name: simple-airflow-credentials -type: Opaque -stringData: - adminUser.username: airflow - adminUser.firstname: Airflow - adminUser.lastname: Admin - adminUser.email: airflow@airflow.com - adminUser.password: airflow - connections.secretKey: thisISaSECRET_1234 - connections.sqlalchemyDatabaseUri: postgresql+psycopg2://airflow:airflow@airflow-postgresql.default.svc.cluster.local/airflow - connections.celeryResultBackend: db+postgresql://airflow:airflow@airflow-postgresql.default.svc.cluster.local/airflow - connections.celeryBrokerUrl: redis://:redis@airflow-redis-master:6379/0 +The following example shows how to configure the metadata storage for a Druid cluster using either one of the supported back-ends or a generic system. In a production setting only the PostgreSQL or MySQL manifests should be used. [source,yaml] ----- -Within operator-rs we have a commons struct for every DB that we support: -1. postgresql -2. mysql -3. mariadb -4. oracle -5. sqlite -6. derby -7. redis -8. etc... - -This has the advantage that all our products configure e.g. a PostgresQL the exact same way. 
-We can also add some functions on the structs for e.g. jdbc-based connections strings or similar. - -Every product operators has a enum containing all the structs of the DBs the product supports (or only a subset if Stackable does only support a subset) -This has the advantage that the CRD as well as automatically generated documentation will list not only the supported dbs, but also documents all the attributes of them. - -Also every operator has a *individual* `generic` struct, which exposes exactly the settings the product has. -This enables full flexibility, as all the settings of the product are configurable. - --- kind: DruidCluster spec: - metadataDB: + # ... + metadataStorageDatabase: postgresql: host: postgresql # mandatory port: 5432 # defaults to some port number - depending on wether tls is enabled @@ -275,11 +212,16 @@ spec: driver: postgresql # mandatory uri: jdbc:postgresql:///druid?foo;bar # mandatory credentialsSecret: my-secret # mandatory. key username + password -# druid.metadata.storage.type=postgresql -# druid.metadata.storage.connector.connectURI=jdbc:postgresql:///druid -# druid.metadata.storage.connector.user=druid -# druid.metadata.storage.connector.password=diurd +We do not discuss implementation details in this document but want to note that the `generic` manifest can be derived from all DB specific specifications. Regardless of what Db system is used, all of the above translate to a fragment of `runtime.properties` such as: + +[source] +druid.metadata.storage.type=postgresql +druid.metadata.storage.connector.connectURI=jdbc:postgresql:///druid +druid.metadata.storage.connector.user=druid +druid.metadata.storage.connector.password=diurd + +[source,yaml] --- kind: SupersetCluster spec: @@ -344,6 +286,4 @@ spec: # database_password # the password for the DB instance # ----- -== Decision Outcome From a8c83d260a35d5d32dc986a47bc7f2437143ef9f Mon Sep 17 00:00:00 2001 From: Razvan-Daniel Mihai <84674+razvan@users.noreply.github.com> Date: Fri, 9 Dec 2022 14:16:50 +0100 Subject: [PATCH 16/19] Explain product and db specific manifests. --- .../pages/adr/ADR026-database-connection.adoc | 179 +++++++++--------- 1 file changed, 91 insertions(+), 88 deletions(-) diff --git a/modules/contributor/pages/adr/ADR026-database-connection.adoc b/modules/contributor/pages/adr/ADR026-database-connection.adoc index 20b248cb3..c5a9fcb00 100644 --- a/modules/contributor/pages/adr/ADR026-database-connection.adoc +++ b/modules/contributor/pages/adr/ADR026-database-connection.adoc @@ -186,104 +186,107 @@ The first mechanism requires the operator framwork to provide predefined structu The second mechanism is operator/product specific and it contains mostly a pass-through list of relevant **product properties**. There is at least one exception, and that is the handling of user credentials which still need to be provisioned in a secure way (as long as the product supports it). +==== Database specific manifests + +Support for the following database systems is planned. Additional systems may be added in the future. + +1. PostgreSQL + +[source,yaml] +postgresql: + host: postgresql # mandatory + port: 5432 # optional, default is 5432 + instance: my-database # mandatory + credentials: my-application-credentials # mandatory. key username and password + parameters: {} # optional + tls: secure-connection-class-name # optional + auth: authentication-class-name # optional. authentication class to use. 
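
For reference, the `credentials` field above names an ordinary Kubernetes `Secret` carrying the two expected keys. A minimal sketch, with placeholder values:

[source,yaml]
apiVersion: v1
kind: Secret
metadata:
  name: my-application-credentials
type: Opaque
stringData:
  username: my-user     # key expected by the operator
  password: my-password # key expected by the operator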
+
+PostgreSQL supports multiple authentication mechanisms as described https://www.postgresql.org/docs/9.1/auth-pg-hba-conf.html[here].
+
+2. MySQL
+
+[source,yaml]
+mysql:
+  host: mysql # mandatory
+  port: 3306 # optional, default is 3306
+  instance: my-database # mandatory
+  credentials: my-application-credentials # mandatory. key username and password
+  parameters: {} # optional
+  tls: secure-connection-class-name # optional
+  auth: authentication-class-name # optional. authentication class to use.
+
+MySQL supports multiple authentication mechanisms as described https://dev.mysql.com/doc/refman/8.0/en/socket-pluggable-authentication.html[here].
+
+3. Derby
+
+Derby is often used as an embedded database for testing and prototyping ideas and implementations. It's not recommended for production use cases.
+
+[source,yaml]
+derby:
+  location: /tmp/my-database/ # optional, defaults to /tmp/derby-/derby.db
+
+
+==== Product specific manifests
+
+1. Apache Druid
+
+Apache Druid clusters can be configured with any of the DB specific manifests from above. In addition, a DB generic configuration can be specified:
+
+The following example shows how to configure the metadata storage for a Druid cluster using either one of the supported back-ends or a generic system. In a production setting only the PostgreSQL or MySQL manifests should be used.
+
+[source,yaml]
+generic:
+  driver: postgresql # mandatory
+  uri: jdbc:postgresql:///druid?foo;bar # mandatory
+  credentialsSecret: my-secret # mandatory. key username + password
+
+The above is translated into the following Java properties:
+
+[source]
+druid.metadata.storage.type=postgresql
+druid.metadata.storage.connector.connectURI=jdbc:postgresql:///druid?foo;bar
+druid.metadata.storage.connector.user=druid
+druid.metadata.storage.connector.password=diurd
+
+2. Apache Superset
+
+NOTE: Superset supports a very wide range of database systems as described https://superset.apache.org/docs/databases/installing-database-drivers[here]. Not all of them are suitable for metadata storage.
+
+Connections to Apache Hive, Apache Druid and Trino clusters deployed as part of the SDP platform can be automated by using discovery configuration maps. In this case, the only attribute to configure is the name of the discovery config map of the appropriate system.
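
For illustration, such a connection could then shrink to a single reference; the field name below is hypothetical:

[source,yaml]
druidConnection:
  configMap: my-druid-cluster # discovery ConfigMap published by the Druid operator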
+
+In addition, a generic way to configure a database connection looks as follows:
+
+[source,yaml]
+generic:
+  secret: superset-metadata-secret # mandatory. A secret name with one entry called "key". Used to encrypt metadata and session cookies.
+  template: postgresql://{{SUPERSET_DB_USER}}:{{SUPERSET_DB_PASS}}@postgres.default.svc.local/superset&param1=value1&param2=value2 # mandatory
+  templateSecret: my-secret # optional
+    SUPERSET_DB_USER: ...
+    SUPERSET_DB_PASS: ...
+
+The template attribute allows specifying the full connection string as required by Superset (and the underlying SQLAlchemy framework). Variables in the template are specified within `{{` and `}}` markers and their content is replaced with the corresponding field in the `templateSecret` object.
+
+3. Apache Hive
+
+For production environments, we recommend a PostgreSQL back-end and, for development, Derby.
+
+A generic connection can be configured as follows:
+
+[source,yaml]
+generic:
+  driver: org.postgresql.Driver # mandatory
+  uri: jdbc:postgresql://postgresql.us-west-2.rds.amazonaws.com:5432/mypgdb # mandatory
+  credentialsSecret: my-secret # mandatory (?). key username + password
+
+4. 
Apache Airflow
+
+A generic Airflow database connection can be configured in a similar fashion to Superset:
+
+[source,yaml]
+generic:
+  template: postgresql://{{AIRFLOW_DB_USER}}:{{AIRFLOW_DB_PASS}}@postgres.default.svc.local/superset&param1=value1&param2=value2 # mandatory
+  templateSecret: my-secret # optional
+    AIRFLOW_DB_USER: ...
+    AIRFLOW_DB_PASS: ...

From c571ca8139b28723c55f131dbd8cbc803d366984 Mon Sep 17 00:00:00 2001
From: Sebastian Bernauer
Date: Tue, 10 Jan 2023 15:11:23 +0100
Subject: [PATCH 17/19] Update ADR026-database-connection.adoc

---
 .../pages/adr/ADR026-database-connection.adoc | 93 +++++++++++++++++++
 1 file changed, 93 insertions(+)

diff --git a/modules/contributor/pages/adr/ADR026-database-connection.adoc b/modules/contributor/pages/adr/ADR026-database-connection.adoc
index c5a9fcb00..d4e08ac36 100644
--- a/modules/contributor/pages/adr/ADR026-database-connection.adoc
+++ b/modules/contributor/pages/adr/ADR026-database-connection.adoc
@@ -290,3 +290,96 @@ generic:
   AIRFLOW_DB_USER: ...
   AIRFLOW_DB_PASS: ...

+For the record, these were some sample CRDs that were created during discussion:
+[source,yaml]
+----
+---
+kind: DruidCluster
+spec:
+  metadataDB:
+    postgresql:
+      host: postgresql # mandatory
+      port: 5432 # defaults to some port number - depending on whether tls is enabled
+      schema: druid # mandatory
+      credentials: postgresql-credentials # mandatory. key username and password
+      parameters: {} # optional
+    mysql:
+      host: mysql # mandatory
+      port: XXXX # defaults to some port number - depending on whether tls is enabled
+      schema: druid # mandatory
+      credentials: mysql-credentials # mandatory. key username and password
+      parameters: {} # optional
+    derby:
+      location: /tmp/derby/ # optional, defaults to /tmp/derby-/derby.db
+    generic:
+      driver: postgresql # mandatory
+      uri: jdbc:postgresql:///druid?foo;bar # mandatory
+      credentialsSecret: my-secret # mandatory. key username + password
+# druid.metadata.storage.type=postgresql
+# druid.metadata.storage.connector.connectURI=jdbc:postgresql:///druid
+# druid.metadata.storage.connector.user=druid
+# druid.metadata.storage.connector.password=diurd
+---
+kind: SupersetCluster
+spec:
+  metadataDB:
+    postgresql:
+      host: postgresql # mandatory
+      port: 5432 # defaults to some port number - depending on whether tls is enabled
+      schema: superset # mandatory
+      credentials: postgresql-credentials # mandatory. key username and password
+      parameters: {} # optional
+    mysql:
+      host: mysql # mandatory
+      port: XXXX # defaults to some port number - depending on whether tls is enabled
+      schema: superset # mandatory
+      credentials: mysql-credentials # mandatory. key username and password
+      parameters: {} # optional
+    sqlite:
+      location: /tmp/sqlite/ # optional, defaults to /tmp/sqlite-/derby.db
+    generic:
+      uriSecret: my-secret # mandatory. key uri
+      # ALTERNATIVE SOLUTION
+      uriTemplate: postgresql://$SUPERSET_DB_USER:$SUPERSET_DB_PASS@postgres.default.svc.local:$SUPERSET_DB_PORT/superset&param1=value1&param2=value2
+      templateSecret: my-secret # optional
+        SUPERSET_DB_USER: ...
+        SUPERSET_DB_PASS: ...
+        SUPERSET_DB_PORT: ...
+# postgresql://{username}:{password}@{host}:{port}/{database}?sslmode=require
+kind: HiveCluster
+spec:
+  metadataDB:
+    postgresql:
+      host: postgresql # mandatory
+      port: 5432 # defaults to some port number - depending on whether tls is enabled
+      schema: druid # mandatory
+      credentials: postgresql-credentials # mandatory.
key username and password
+      parameters: {} # optional
+    derby:
+      location: /tmp/derby/ # optional, defaults to /tmp/derby-{metadata.name}/derby.db
+    # Missing: MS-SQL server, Oracle
+    generic:
+      driver: org.postgresql.Driver # mandatory
+      uri: jdbc:postgresql://postgresql.us-west-2.rds.amazonaws.com:5432/mypgdb # mandatory
+      credentialsSecret: my-secret # mandatory (?). key username + password
+      # <property>
+      #   <name>javax.jdo.option.ConnectionURL</name>
+      #   <value>jdbc:postgresql://postgresql.us-west-2.rds.amazonaws.com:5432/mypgdb</value>
+      #   <description>PostgreSQL JDBC driver connection URL</description>
+      # </property>
+      # <property>
+      #   <name>javax.jdo.option.ConnectionDriverName</name>
+      #   <value>org.postgresql.Driver</value>
+      #   <description>PostgreSQL metastore driver class name</description>
+      # </property>
+      # <property>
+      #   <name>javax.jdo.option.ConnectionUserName</name>
+      #   <value>database_username</value>
+      #   <description>the username for the DB instance</description>
+      # </property>
+      # <property>
+      #   <name>javax.jdo.option.ConnectionPassword</name>
+      #   <value>database_password</value>
+      #   <description>the password for the DB instance</description>
+      # </property>
+----

From 91b5de21c4626d02a83c4689c62f1fb8acc38539 Mon Sep 17 00:00:00 2001
From: Sebastian Bernauer
Date: Tue, 22 Aug 2023 15:35:44 +0200
Subject: [PATCH 18/19] Meeting in Ka on-site

---
 ...21-stackablectl_stacks_inital_version.adoc | 101 -----
 .../pages/adr/ADR026-database-connection.adoc | 385 ------------------
 .../contributor/partials/current_adrs.adoc    |   3 +-
 3 files changed, 2 insertions(+), 487 deletions(-)
 delete mode 100644 modules/contributor/pages/adr/ADR021-stackablectl_stacks_inital_version.adoc
 delete mode 100644 modules/contributor/pages/adr/ADR026-database-connection.adoc

diff --git a/modules/contributor/pages/adr/ADR021-stackablectl_stacks_inital_version.adoc b/modules/contributor/pages/adr/ADR021-stackablectl_stacks_inital_version.adoc
deleted file mode 100644
index 9bc600b67..000000000
--- a/modules/contributor/pages/adr/ADR021-stackablectl_stacks_inital_version.adoc
+++ /dev/null
@@ -1,101 +0,0 @@
-= ADR021: Initial Version of Stackable Stacks Functionality
-Sönke Liebau
-v0.1, 2022-06-07
-:status: accepted
-
-* Status: {status}
-* Deciders:
-** Rob Siwicki
-** Sebastian Bernauer
-** Sönke Liebau
-** Teo Klestrup-Röijezon
-* Date: 2022-06-07
-
-== Context and Problem Statement
-
-During the preparations for the first real release we noticed that the _create_test_cluster.py_ script is not really polished at all.
-We would very much like to include _stackablectl_ as the CLI tool in the first release as well as demo it on the website in the configurator.
-
-In principle _stackablectl_ is usable, but one main functionality is missing, which is the ability to apply the examples and stand up products.
-While just applying the examples is not a huge problem, some of our tool have external dependencies that we need to supply via helm charts at the moment (for example Trino needs a Postgres database).
-We need to have a way to install helm charts as part of the stacks functionality in _stackablectl_ to make it viable for rolling out example setups.
-
-The scope of this ADR is to define a minimal solution that allows defining stacks and specifying Helm charts with properties as prerequisites before applying yaml files.
-This should be defined in a way that allows us as much flexibility as possible when further defining how _stackablectl stacks_ should behave and how stacks are defined down the road, as this is an ongoing discussion.
-
-All subsequent decisions will be documented in a separate ADR.
-
-== Decision Drivers
-
-* Implementation effort should be small so this can be included in release 1
-* Chosen solution should give flexibility to extend it without breaking changes in the future
-
-== Considered Options
-
-* Do nothing
-* Implement basic definition of stacks
-* Go all in on https://porter.sh/[Porter] / https://cnab.io/[CNAB] and use it to fully define our stacks
-
-== Decision Outcome
-
-Chosen option: "Implement basic definition of stacks", because it is a lightweight solution that can be implemented with limited effort and matches the expected overall direction of _stackablectl_ well.
-There was agreement between all deciders that we do not want to marry our solution too tightly to the as yet unproven CNAB standard or Porter as a concrete implementation.
-By defining our own, thin, abstraction layer we can isolate our users from the chosen implementation technologies in the backend (Porter, CNAB, Helm, ...).
-
-=== Positive Consequences
-
-* We can use _stackablectl_ as CLI tool in the first official release
-* This allows adding CNAB bundles as the preferred implementation in a non-breaking fashion later on
-
-=== Negative Consequences
-
-* Depending on the future direction we take with _stackablectl_ we have a risk of needing to break the api surface that we create with this implementation
-
-== Pros and Cons of the Options
-
-=== Do nothing
-
-We could do nothing right now and instead use the _create_test_cluster.py_ script as our entry point for the initial release.
-
-* Good, this gives us time to design a fully thought out solution before implementing something
-* Bad, _create_test_cluster.py_ doesn't feel very polished and we do want to introducet _stackablectl_ to the world as our tool of choice
-
-=== Implement basic definition of stacks
-
-In order to generate a minimally invasive way to enable deploying Helm charts as prerequisites for our stacks we will introduce the initial definition of a stack roughly as shown below.
-The implementation details may vary, this snippet is provided more to show the overall structure, most specifically the _manifests_ attribute.
-
-Initial implementations here will be provided for applying a bunch of yaml files and installing Helm charts.
-
-[source,yaml]
-----
-  trino:
-    description: Simply stack only containing Trino
-    stackableRelease: 22.05-sbernauer
-    labels:
-      - trino
-    manifests:
-      - helmChart:
-          repository: https://charts.bitnami.com/bitnami
-          name: postgresql
-          properties:
-            - auth.username: superset
-            - auth.password: superset
-            - auth.database: superset
-      - plainYaml: stacks/trino.yaml
-----
-
-helm install --repo https://charts.bitnami.com/bitnami --set auth.username=superset --set auth.password=superset --set auth.database=superset superset-postgresql postgresql
-
-* Good, because it allows us to use _stackablectl_ in release 1 and the marketing campaign
-* Good, because it provides isolation between _stackablectl_ and bundle technologies
-* Bad, because it may require breaking changes down the line to the interface users get now
-
-=== Go all in on Porter/CNAB and use it to fully define our stacks
-
-CNAB in theory provides everything we'd need to install a stack, instead of allowing to define yaml files, helm charts and other things, we could simply bundle an entire stack as a CNAB bundle using Porter and have _stackablectl_ install this.
-
-CNAB bundles can be pushed to OCI compliant registries, so we would not need to provide our own method of listing stacks for _stackablectl_ either.
-
-* Good, because instead of reinventing the wheel we would use an existing technology
-* Bad, because we tightly couple _stackablectl_ to this technology, the adoption of which is yet to be proven
\ No newline at end of file
diff --git a/modules/contributor/pages/adr/ADR026-database-connection.adoc b/modules/contributor/pages/adr/ADR026-database-connection.adoc
deleted file mode 100644
index d4e08ac36..000000000
--- a/modules/contributor/pages/adr/ADR026-database-connection.adoc
+++ /dev/null
@@ -1,385 +0,0 @@
-= ADR026: Standardize database connection specifications
-Razvan Mihai
-v0.1, 2022-12-08
-:status: draft
-
-* Status: {draft}
-* Deciders:
-** Felix Henning
-** Malte Sanders
-** Sebastian Bernauer
-** Razvan Mihai
-* Date: 2022-12-08
-
-Technical Story: https://github.com/stackabletech/issues/issues/238
-
-== Context and Problem Statement
-
-Many products supported by the Stackable Data Platform require databases to store metadata. Currently there is no uniform, consistent way to define database conections. In addition, some Stackable operators define database credentials to be provided inline and in plain text in the cluster definitions.
-
-A quick analysis of the status-quo regarding database connection definitions shows how different operators handle them:
-
-* Apache Hive: the cluster custom resource defined a field called "database" with access credentials in clear text.
-* Apache Airflow and Apache Superset: uses a field called "credentialSecret" that contains multiple different database connection definitions. Even worse, it contains credentials not related to a database, such as a secret to encrypt the cookies. In case of Airflow, this secret only supports the Celery executor.
-* Apache Druid: uses a field called "metadataStorageDatabase" where access crdentials are expected to be inline and in plain text.
-
-== Decision Drivers
-
-Here we attempt to standardize the way database connections are defined across the Stackable platform in such a way that:
-
-* Different database systems are supported.
-* Access credentials are defined in Kubernetes `Secret`` objects.
-* Product configuration only allows (product) supported databases ...
-* But there is a generic way to configure additional database systems.
-* Misconfigured connections should be rejected as early as possible in the product lifecycle.
-* Generated CRD documentation is easy to follow by users.
-
-Initially we thought that database connections should be implemented as stand-alone Kubernetes resources and should be referenced in cluster definitions. This idea was thrown away mostly because sharing database connections across products is not good practice and we shouldn't encourage it.
-
-== Considered Options
-
-1. (rejected) `DatabaseConnection` A generic resource definition.
-2. (rejected) Database driver specific resource definition.
-3. (accepted) Product supported and a generic DB specifications.
-
-=== 1. (rejected) `DatabaseConnection` A generic resource definition
-
-The first idea was to introduce a new Kubernetes resource called `DatabaseConnection` with the following fields:
-
-[cols="1,1"]
-|===
-|Field name | Description
-|credentials
-|A string with name of a `Secret` containing at least a user name and a password field. Additional fields are allowed.
-|driver
-|A string with the database driver named. This is a generic field that identifies the type of the database used.
-|protocol
-|The protocol prefix of the final connection string. Most Java based products will use `jdbc:`.
-|host
-|A string with the host name to connect to.
-|instance
-|A string with the database instance to connect to. Optional.
-|port
-|A positive integer with the TCP port used for the connection. Optional.
-|properties
-|A dictionary of addtional properties for driver tuning like number of client threads, various buffer sizes and so on. Some drivers, like `derby` use this to define the database name and whether the DB should by automatically created or not. Optional
-|===
-
-The `Secret` object referenced by `credentials` must contain two fields named `USER_NAME` and `PASSWORD` but can contain additional fields like first name, last name, email, user role and so on.
-
-=== Examples
-
-These examples showcase the spec change required from the current status:
-
-The current Druid metadata database connection
-
-[source,yaml]
----
-metadataStorageDatabase:
-  dbType: postgresql
-  connString: jdbc:postgresql://druid-postgresql/druid
-  host: druid-postgresql
-  port: 5432
-  user: druid
-  password: druid
-
-becomes
-
-[source,yaml]
----
-metadataStorageDatabase: druid-metadata-connection
-
-where `druid-metadata-connection` is a standalone `DatabaseConnection` resource defined as follows
-
-[source,yaml]
----
-apiVersion: db.stackable.tech/v1alpha1
-kind: DatabaseConnection
-metadata:
-  name: druid-metadata-connection
-spec:
-  driver: postgresql
-  host: druid-postgresql
-  port: 5432
-  protocol: jdbc:postgresql
-  instance: druid
-  credentials: druid-metadata-credentials
-
-and the credentials field contains the name of a Kubernetes `Secret` defined as:
-
-[source,yaml]
----
-apiVersion: v1
-kind: Secret
-metadata:
-  name: druid-metadata-credentials
-type: Opaque
-data:
-  USER_NAME: druid
-  PASSWORD: druid
-
-NOTE: This idea was discarded because it didn't satisfy all acceptance criteria. In particular it wouldn't be possible to catch misconfigurations at cluster creation time.
-
-=== (rejected) 2. Database driver specific resource definition.
-
-In an attempt to address the issues of the first option above, a more detailed specification was necessary. Here, database generic configurations are possible that can be better validated, as in the example below.
-
-[source,yaml]
----
-apiVersion: databaseconnection.stackable.tech/v1alpha1
-kind: DatabaseConnection
-metadata:
-  name: druid-metadata-connection
-  namespace: default
-spec:
-  database:
-    postgresql:
-      host: druid-postgresql # mandatory
-      port: 5432 # defaults to some port number - depending on wether tls is enabled
-      schema: druid # defaults to druid
-      credentials: druid-postgresql-credentials # mandatory. key username and password
-      parameters: {} # optional
-    redis:
-      host: airflow-redis-master # mandatory
-      port: 6379 # defaults to some port number - depending on wether tls is enabled
-      schema: druid # defaults to druid
-      credentials: airflow-redis-credentials # optional. key password
-      parameters: {} # optional
-    derby:
-      location: /tmp/derby/ # optional, defaults to /tmp/derby-{metadata.name}/derby.db
-      parameters: # optional
-        create: "true"
-    genericConnectionString:
-      driver: postgresql
-      format: postgresql://$SUPERSET_DB_USER:$SUPERSET_DB_PASS@postgres.default.svc.local:$SUPERSET_DB_PORT/superset&param1=value1&param2=value2
-      secret: ... # optional
-        SUPERSET_DB_USER: ...
-        SUPERSET_DB_PASS: ...
-        SUPERSET_DB_PORT: ...
-    generic:
-      driver: postgresql
-      host: superset-postgresql.default.svc.cluster.local # optional
-      port: 5432 # optional
-      protocol: pgsql123 # optional
-      instance: superset # optional
-      credentials: name-of-secret-with-credentials #optional
-      parameters: {...} # optional
-      connectionStringFormat: "{protocol}://{credentials.user_name}:{credentials.credentials}@{host}:{port}/{instance}&[parameters,;]"
-  tls: # optional
-    verification:
-      ca_cert:
-        ...
-In addition, a second generic DB type (`genericConnectionString`) is introduced. This specification allows templating connection URLs with variables defined in secrets and it's not restricted only to user credentials.
-
-NOTE: This proposal was rejected because for the same reason as the first proposal. In addition, it fails to make possible DB configurations product specific.
-
-=== (accepted) Product supported and a generic DB specifications.
-
-It seems that an unique, platform wide mechanism to describe database connections that also fulfills all acceptance criteria is not feasable. Database drivers and product configurations are too diverse and cannot be forced into a type safe specification.
-
-Thus the single, global connection manifest needs to split into two different categories, each covering a subset of the acceptance criteria:
-
-1. A database specific mechanism. This allows to catch misconfigurations early, it promotes good documentation and uniformity inside the platform.
-2. An operator specific mechanism. This is a wildcard that can be used to configure database connections that are not officially supported by the products but that can still be partially validated early.
-
-The first mechanism requires the operator framwork to provide predefined structures and supporting functions for widely available database systems such as: PostgreSQL, MySQL, MariaDB, Oracle, SQLite, Derby, Redis and so on. This doesn't mean that all products can be configured with all DB implementations. The product definitions will only allow the subset that is officially supported by the products.
-
-The second mechanism is operator/product specific and it contains mostly a pass-through list of relevant **product properties**. There is at least one exception, and that is the handling of user credentials which still need to be provisioned in a secure way (as long as the product supports it).
-
-==== Database specific manifests
-
-Support for the following database systems is planned. Additional systems may be added in the future.
-
-1. PostgreSQL
-
-[source,yaml]
-postgresql:
-  host: postgresql # mandatory
-  port: 5432 # optional, default is 5432
-  instance: my-database # mandatory
-  credentials: my-application-credentials # mandatory. key username and password
-  parameters: {} # optional
-  tls: secure-connection-class-name # optional
-  auth: authentication-class-name # optional. authentication class to use.
-
-PostgreSQL supports multiple authentication mechanisms as described https://www.postgresql.org/docs/9.1/auth-pg-hba-conf.html[here].
-
-2. MySQL
-
-[source,yaml]
-mysql:
-  host: mysql # mandatory
-  port: 3306 # optional, default is 3306
-  instance: my-database # mandatory
-  credentials: my-application-credentials # mandatory. key username and password
-  parameters: {} # optional
-  tls: secure-connection-class-name # optional
-  auth: authentication-class-name # optional. authentication class to use.
-
-MySQL supports multiple authentication mechanisms as described https://dev.mysql.com/doc/refman/8.0/en/socket-pluggable-authentication.html[here].
-
-3. Derby
-
-Derby is used often as an embeded database for testing and prototyping ideas and implementations. It's not recommended for production usecases.
-
-[source,yaml]
-derby:
-  location: /tmp/my-database/ # optional, defaults to /tmp/derby-{metadata.name}/derby.db
-
-
-==== Product specific manifests
-
-1. Apache Druid
-
-Apache Druid clusters can be configured any of the DB specific manifests from above. In addition, a DB generic configuration can pe specified:
-
-The following example shows how to configure the metadata storage for a Druid cluster using either one of the supported back-ends or a generic system. In a production setting only the PostgreSQL or MySQL manifests should be used.
-
-[source,yaml]
-generic:
-  driver: postgresql # mandatory
-  uri: jdbc:postgresql:///druid?foo;bar # mandatory
-  credentialsSecret: my-secret # mandatory. key username + password
-
-The above is translated into the following Java properties:
-
-[source]
-druid.metadata.storage.type=postgresql
-druid.metadata.storage.connector.connectURI=jdbc:postgresql:///druid?foo;bar
-druid.metadata.storage.connector.user=druid
-druid.metadata.storage.connector.password=diurd
-
-2. Apache Superset
-
-NOTE: Superset supports a very wide range of database systems as described https://superset.apache.org/docs/databases/installing-database-drivers[here]. Not all of them are suitable for metadata storage.
-
-Connections to Apache Hive, Apache Druid and Trino clusters deployed as part of the SDP platform can be automated by using discovery configuration maps. In this case, the only attribute to configure is the name of the discovery config map of the appropriate system.
-
-In addition, a generic way to configure a database connection looks as follows:
-
-[source,yaml]
-generic:
-  secret: superset-metadata-secret # mandatory. A secret naming with one entry called "key". Used to encrypt metadata and session cookies.
-  template: postgresql://{{SUPERSET_DB_USER}}:{{SUPERSET_DB_PASS}}@postgres.default.svc.local/superset&param1=value1&param2=value2 # mandatory
-  templateSecret: my-secret # optional
-    SUPERSET_DB_USER: ...
-    SUPERSET_DB_PASS: ...
-
-The template attribute allows to specify the full connection string as required by Superset (and the underlying SQLAlchemy framework). Variables in the template are specified within `{{` and `}}` markers and threir contents is replaced with the corresponding field in the `templateSecret` object.
-
-3. Apache Hive
-
-For production environments, we recommend PostgreSQL back-end and for development, Derby.
-
-A generic connection can be configured as follows:
-
-[source,yaml]
-generic:
-  driver: org.postgresql.Driver # mandatory
-  uri: jdbc:postgresql://postgresql.us-west-2.rds.amazonaws.com:5432/mypgdb # mandatory
-  credentialsSecret: my-secret # mandatory (?). key username + password
-
-4. Apache Airflow
-
-A generic Airflow database connection can be configured in a similar fashion with Superset:
-
-[source,yaml]
-generic:
-  template: postgresql://{{AIRFLOW_DB_USER}}:{{AIRFLOW_DB_PASS}}@postgres.default.svc.local/superset&param1=value1&param2=value2 # mandatory
-  templateSecret: my-secret # optional
-    AIRFLOW_DB_USER: ...
-    AIRFLOW_DB_PASS: ...
-
-For the record, theese were some sample CRDs that where created during discussion:
-[source,yaml]
-----
----
-kind: DruidCluster
-spec:
-  metadataDB:
-    postgresql:
-      host: postgresql # mandatory
-      port: 5432 # defaults to some port number - depending on wether tls is enabled
-      schema: druid # mandatory
-      credentials: postgresql-credentials # mandatory. key username and password
-      parameters: {} # optional
-    mysql:
-      host: mysql # mandatory
-      port: XXXX # defaults to some port number - depending on wether tls is enabled
-      schema: druid # mandatory
-      credentials: mysql-credentials # mandatory. key username and password
-      parameters: {} # optional
-    derby:
-      location: /tmp/derby/ # optional, defaults to /tmp/derby-{metadata.name}/derby.db
-    generic:
-      driver: postgresql # mandatory
-      uri: jdbc:postgresql:///druid?foo;bar # mandatory
-      credentialsSecret: my-secret # mandatory. key username + password
-# druid.metadata.storage.type=postgresql
-# druid.metadata.storage.connector.connectURI=jdbc:postgresql:///druid
-# druid.metadata.storage.connector.user=druid
-# druid.metadata.storage.connector.password=diurd
----
-kind: SupersetCluster
-spec:
-  metadataDB:
-    postgresql:
-      host: postgresql # mandatory
-      port: 5432 # defaults to some port number - depending on wether tls is enabled
-      schema: superset # mandatory
-      credentials: postgresql-credentials # mandatory. key username and password
-      parameters: {} # optional
-    mysql:
-      host: mysql # mandatory
-      port: XXXX # defaults to some port number - depending on wether tls is enabled
-      schema: superset # mandatory
-      credentials: mysql-credentials # mandatory. key username and password
-      parameters: {} # optional
-    sqlite:
-      location: /tmp/sqlite/ # optional, defaults to /tmp/sqlite-{metadata.name}/derby.db
-    generic:
-      uriSecret: my-secret # mandatory. key uri
-      # ALTERNATIVE SOLUTION
-      uriTemplate: postgresql://$SUPERSET_DB_USER:$SUPERSET_DB_PASS@postgres.default.svc.local:$SUPERSET_DB_PORT/superset&param1=value1&param2=value2
-      templateSecret: my-secret # optional
-        SUPERSET_DB_USER: ...
-        SUPERSET_DB_PASS: ...
-        SUPERSET_DB_PORT: ...
-# postgresql://{username}:{password}@{host}:{port}/{database}?sslmode=require
-kind: HiveCluster
-spec:
-  metadataDB:
-    postgresql:
-      host: postgresql # mandatory
-      port: 5432 # defaults to some port number - depending on wether tls is enabled
-      schema: druid # mandatory
-      credentials: postgresql-credentials # mandatory. key username and password
-      parameters: {} # optional
-    derby:
-      location: /tmp/derby/ # optional, defaults to /tmp/derby-{metadata.name}/derby.db
-    # Missing: MS-SQL server, Oracle
-    generic:
-      driver: org.postgresql.Driver # mandatory
-      uri: jdbc:postgresql://postgresql.us-west-2.rds.amazonaws.com:5432/mypgdb # mandatory
-      credentialsSecret: my-secret # mandatory (?). key username + password
-      # <property>
-      #   <name>javax.jdo.option.ConnectionURL</name>
-      #   <value>jdbc:postgresql://postgresql.us-west-2.rds.amazonaws.com:5432/mypgdb</value>
-      #   <description>PostgreSQL JDBC driver connection URL</description>
-      # </property>
-      # <property>
-      #   <name>javax.jdo.option.ConnectionDriverName</name>
-      #   <value>org.postgresql.Driver</value>
-      #   <description>PostgreSQL metastore driver class name</description>
-      # </property>
-      # <property>
-      #   <name>javax.jdo.option.ConnectionUserName</name>
-      #   <value>database_username</value>
-      #   <description>the username for the DB instance</description>
-      # </property>
-      # <property>
-      #   <name>javax.jdo.option.ConnectionPassword</name>
-      #   <value>database_password</value>
-      #   <description>the password for the DB instance</description>
-      # </property>
-----
diff --git a/modules/contributor/partials/current_adrs.adoc b/modules/contributor/partials/current_adrs.adoc
index cbab65ee4..744ff8550 100644
--- a/modules/contributor/partials/current_adrs.adoc
+++ b/modules/contributor/partials/current_adrs.adoc
@@ -17,7 +17,7 @@
 **** xref:adr/ADR018-product_image_versioning.adoc[]
 **** xref:adr/ADR019-trino_catalog_definitions.adoc[]
 **** xref:adr/ADR020-trino_catalog_usage.adoc[]
-**** xref:adr/ADR021-stackablectl_stacks_inital_version.adoc[]
+**** xref:adr/ADR021-stackablectl_stacks_initial_version.adoc[]
 **** xref:adr/ADR022-spark-history-server.adoc[]
 **** xref:adr/ADR023-product-image-selection.adoc[]
 **** xref:adr/ADR024-out-of-cluster_access.adoc[]
@@ -25,3 +25,4 @@
 **** xref:adr/ADR026-affinities.adoc[]
 **** xref:adr/ADR027-status.adoc[]
 **** xref:adr/ADR028-automatic-stackable-version.adoc[]
+**** xref:adr/ADR029-database-connection.adoc[]

From b53a5e781216c8ff0e0e7006c86dd83b8cf92d89 Mon Sep 17 00:00:00 2001
From: Sebastian Bernauer
Date: Tue, 22 Aug 2023 16:01:02 +0200
Subject: [PATCH 19/19] fix

---
 ...1-stackablectl_stacks_initial_version.adoc | 101 +++++
 .../pages/adr/ADR029-database-connection.adoc | 384 ++++++++++++++++++
 2 files changed, 485 insertions(+)
 create mode 100644 modules/contributor/pages/adr/ADR021-stackablectl_stacks_initial_version.adoc
 create mode 100644 modules/contributor/pages/adr/ADR029-database-connection.adoc

diff --git a/modules/contributor/pages/adr/ADR021-stackablectl_stacks_initial_version.adoc b/modules/contributor/pages/adr/ADR021-stackablectl_stacks_initial_version.adoc
new file mode 100644
index 000000000..9bc600b67
--- /dev/null
+++ b/modules/contributor/pages/adr/ADR021-stackablectl_stacks_initial_version.adoc
@@ -0,0 +1,101 @@
+= ADR021: Initial Version of Stackable Stacks Functionality
+Sönke Liebau
+v0.1, 2022-06-07
+:status: accepted
+
+* Status: {status}
+* Deciders:
+** Rob Siwicki
+** Sebastian Bernauer
+** Sönke Liebau
+** Teo Klestrup-Röijezon
+* Date: 2022-06-07
+
+== Context and Problem Statement
+
+During the preparations for the first real release we noticed that the _create_test_cluster.py_ script is not really polished at all.
+We would very much like to include _stackablectl_ as the CLI tool in the first release as well as demo it on the website in the configurator.
+
+In principle _stackablectl_ is usable, but one main piece of functionality is missing: the ability to apply the examples and stand up products.
+While just applying the examples is not a huge problem, some of our tools have external dependencies that we need to supply via helm charts at the moment (for example Trino needs a Postgres database).
+We need to have a way to install helm charts as part of the stacks functionality in _stackablectl_ to make it viable for rolling out example setups.
+
+The scope of this ADR is to define a minimal solution that allows defining stacks and specifying Helm charts with properties as prerequisites before applying yaml files.
+This should be defined in a way that allows us as much flexibility as possible when further defining how _stackablectl stacks_ should behave and how stacks are defined down the road, as this is an ongoing discussion.
+
+All subsequent decisions will be documented in a separate ADR.
+
+== Decision Drivers
+
+* Implementation effort should be small so this can be included in release 1
+* Chosen solution should give flexibility to extend it without breaking changes in the future
+
+== Considered Options
+
+* Do nothing
+* Implement basic definition of stacks
+* Go all in on https://porter.sh/[Porter] / https://cnab.io/[CNAB] and use it to fully define our stacks
+
+== Decision Outcome
+
+Chosen option: "Implement basic definition of stacks", because it is a lightweight solution that can be implemented with limited effort and matches the expected overall direction of _stackablectl_ well.
+There was agreement between all deciders that we do not want to marry our solution too tightly to the as yet unproven CNAB standard or Porter as a concrete implementation.
+By defining our own thin abstraction layer we can isolate our users from the chosen implementation technologies in the backend (Porter, CNAB, Helm, ...).
+
+=== Positive Consequences
+
+* We can use _stackablectl_ as CLI tool in the first official release
+* This allows adding CNAB bundles as the preferred implementation in a non-breaking fashion later on
+
+=== Negative Consequences
+
+* Depending on the future direction we take with _stackablectl_ we have a risk of needing to break the API surface that we create with this implementation
+
+== Pros and Cons of the Options
+
+=== Do nothing
+
+We could do nothing right now and instead use the _create_test_cluster.py_ script as our entry point for the initial release.
+
+* Good, this gives us time to design a fully thought-out solution before implementing something
+* Bad, _create_test_cluster.py_ doesn't feel very polished and we do want to introduce _stackablectl_ to the world as our tool of choice
+
+=== Implement basic definition of stacks
+
+In order to provide a minimally invasive way to enable deploying Helm charts as prerequisites for our stacks we will introduce the initial definition of a stack roughly as shown below.
+The implementation details may vary; this snippet is provided more to show the overall structure, most specifically the _manifests_ attribute.
+
+Initial implementations here will be provided for applying a bunch of yaml files and installing Helm charts.
+
+[source,yaml]
+----
+  trino:
+    description: Simple stack only containing Trino
+    stackableRelease: 22.05-sbernauer
+    labels:
+      - trino
+    manifests:
+      - helmChart:
+          repository: https://charts.bitnami.com/bitnami
+          name: postgresql
+          properties:
+            - auth.username: superset
+            - auth.password: superset
+            - auth.database: superset
+      - plainYaml: stacks/trino.yaml
+----
+
+The `helmChart` entry above corresponds to the following Helm invocation:
+
+[source,bash]
+----
+helm install --repo https://charts.bitnami.com/bitnami --set auth.username=superset --set auth.password=superset --set auth.database=superset superset-postgresql postgresql
+----
+
+* Good, because it allows us to use _stackablectl_ in release 1 and the marketing campaign
+* Good, because it provides isolation between _stackablectl_ and bundle technologies
+* Bad, because it may require breaking changes down the line to the interface users get now
+
+=== Go all in on Porter/CNAB and use it to fully define our stacks
+
+CNAB in theory provides everything we'd need to install a stack: instead of allowing to define yaml files, helm charts and other things, we could simply bundle an entire stack as a CNAB bundle using Porter and have _stackablectl_ install this.
+
+CNAB bundles can be pushed to OCI compliant registries, so we would not need to provide our own method of listing stacks for _stackablectl_ either.
+
+* Good, because instead of reinventing the wheel we would use an existing technology
+* Bad, because we tightly couple _stackablectl_ to this technology, the adoption of which is yet to be proven
\ No newline at end of file
diff --git a/modules/contributor/pages/adr/ADR029-database-connection.adoc b/modules/contributor/pages/adr/ADR029-database-connection.adoc
new file mode 100644
index 000000000..d5d1d7975
--- /dev/null
+++ b/modules/contributor/pages/adr/ADR029-database-connection.adoc
@@ -0,0 +1,384 @@
+= ADR029: Standardize database connections
+Razvan Mihai
+v0.1, 2022-12-08
+:status: accepted
+
+* Status: {status}
+* Deciders:
+** Felix Hennig
+** Lukas Voetmand
+** Malte Sander
+** Razvan Mihai
+** Sascha Lautenschläger
+** Sebastian Bernauer
+* Date: 2022-12-08
+
+Technical Story: https://github.com/stackabletech/issues/issues/238
+
+== Context and Problem Statement
+
+Many products supported by the Stackable Data Platform require databases to store metadata. Currently there is no uniform, consistent way to define database connections. In addition, some Stackable operators define database credentials to be provided inline and in plain text in the cluster definitions.
+
+A quick analysis of the status-quo regarding database connection definitions shows how different operators handle them:
+
+* Apache Hive: the cluster custom resource defines a field called "database" with access credentials in clear text.
+* Apache Airflow and Apache Superset: use a field called "credentialSecret" that contains multiple different database connection definitions. Even worse, it contains credentials not related to a database, such as a secret to encrypt the cookies. In case of Airflow, this secret only supports the Celery executor.
+* Apache Druid: uses a field called "metadataStorageDatabase" where access credentials are expected to be inline and in plain text.
+
+== Decision Drivers
+
+Here we attempt to standardize the way database connections are defined across the Stackable platform in such a way that:
+
+* Different database systems are supported.
+* Access credentials are defined in Kubernetes `Secret` objects.
+* Product configuration only allows (product) supported databases ...
+* But there is a generic way to configure additional database systems.
+* Misconfigured connections should be rejected as early as possible in the product lifecycle.
+* Generated CRD documentation is easy for users to follow.
+
+Initially we thought that database connections should be implemented as stand-alone Kubernetes resources and should be referenced in cluster definitions. This idea was thrown away mostly because sharing database connections across products is not good practice and we shouldn't encourage it.
+
+== Considered Options
+
+1. (rejected) `DatabaseConnection`, a generic resource definition.
+2. (rejected) Database-driver-specific resource definitions.
+3. (accepted) Product-supported and generic DB specifications.
+
+=== 1. (rejected) `DatabaseConnection`, a generic resource definition
+
+The first idea was to introduce a new Kubernetes resource called `DatabaseConnection` with the following fields:
+
+[cols="1,1"]
+|===
+|Field name | Description
+|credentials
+|A string with the name of a `Secret` containing at least a user name and a password field. Additional fields are allowed.
+|driver
+|A string with the database driver name. This is a generic field that identifies the type of the database used.
+|protocol
+|The protocol prefix of the final connection string. Most Java-based products will use `jdbc:`.
+|host
+|A string with the host name to connect to.
+|instance
+|A string with the database instance to connect to. Optional.
+|port
+|A positive integer with the TCP port used for the connection. Optional.
+|properties
+|A dictionary of additional properties for driver tuning like number of client threads, various buffer sizes and so on. Some drivers, like `derby`, use this to define the database name and whether the DB should be automatically created or not. Optional
+|===
+
+The `Secret` object referenced by `credentials` must contain two fields named `USER_NAME` and `PASSWORD` but can contain additional fields like first name, last name, email, user role and so on.
+
+=== Examples
+
+These examples showcase the spec change required from the current status:
+
+The current Druid metadata database connection
+
+[source,yaml]
+---
+metadataStorageDatabase:
+  dbType: postgresql
+  connString: jdbc:postgresql://druid-postgresql/druid
+  host: druid-postgresql
+  port: 5432
+  user: druid
+  password: druid
+
+becomes
+
+[source,yaml]
+---
+metadataStorageDatabase: druid-metadata-connection
+
+where `druid-metadata-connection` is a standalone `DatabaseConnection` resource defined as follows
+
+[source,yaml]
+---
+apiVersion: db.stackable.tech/v1alpha1
+kind: DatabaseConnection
+metadata:
+  name: druid-metadata-connection
+spec:
+  driver: postgresql
+  host: druid-postgresql
+  port: 5432
+  protocol: jdbc:postgresql
+  instance: druid
+  credentials: druid-metadata-credentials
+
+and the credentials field contains the name of a Kubernetes `Secret` defined as:
+
+[source,yaml]
+---
+apiVersion: v1
+kind: Secret
+metadata:
+  name: druid-metadata-credentials
+type: Opaque
+data:
+  USER_NAME: druid
+  PASSWORD: druid
+
+NOTE: This idea was discarded because it didn't satisfy all acceptance criteria. In particular it wouldn't be possible to catch misconfigurations at cluster creation time.
+
+=== 2. (rejected) Database-driver-specific resource definitions
+
+In an attempt to address the issues of the first option above, a more detailed specification was necessary. Here, database-generic configurations are possible that can be better validated, as in the example below.
+
+[source,yaml]
+---
+apiVersion: databaseconnection.stackable.tech/v1alpha1
+kind: DatabaseConnection
+metadata:
+  name: druid-metadata-connection
+  namespace: default
+spec:
+  database:
+    postgresql:
+      host: druid-postgresql # mandatory
+      port: 5432 # defaults to some port number - depending on whether tls is enabled
+      schema: druid # defaults to druid
+      credentials: druid-postgresql-credentials # mandatory. key username and password
+      parameters: {} # optional
+    redis:
+      host: airflow-redis-master # mandatory
+      port: 6379 # defaults to some port number - depending on whether tls is enabled
+      schema: druid # defaults to druid
+      credentials: airflow-redis-credentials # optional. key password
+      parameters: {} # optional
+    derby:
+      location: /tmp/derby/ # optional, defaults to /tmp/derby-{metadata.name}/derby.db
+      parameters: # optional
+        create: "true"
+    genericConnectionString:
+      driver: postgresql
+      format: postgresql://$SUPERSET_DB_USER:$SUPERSET_DB_PASS@postgres.default.svc.local:$SUPERSET_DB_PORT/superset&param1=value1&param2=value2
+      secret: ... # optional
+        SUPERSET_DB_USER: ...
+        SUPERSET_DB_PASS: ...
+        SUPERSET_DB_PORT: ...
+    generic:
+      driver: postgresql
+      host: superset-postgresql.default.svc.cluster.local # optional
+      port: 5432 # optional
+      protocol: pgsql123 # optional
+      instance: superset # optional
+      credentials: name-of-secret-with-credentials # optional
+      parameters: {...} # optional
+      connectionStringFormat: "{protocol}://{credentials.user_name}:{credentials.credentials}@{host}:{port}/{instance}&[parameters,;]"
+  tls: # optional
+    verification:
+      ca_cert:
+        ...
+
+In addition, a second generic DB type (`genericConnectionString`) is introduced. This specification allows templating connection URLs with variables defined in secrets and is not restricted to user credentials only.
+
+NOTE: This proposal was rejected for the same reason as the first one. In addition, it fails to make the possible DB configurations product-specific.
+
+=== 3. (accepted) Product-supported and generic DB specifications
+
+It seems that a unique, platform-wide mechanism to describe database connections that also fulfills all acceptance criteria is not feasible. Database drivers and product configurations are too diverse and cannot be forced into a type-safe specification.
+
+Thus the single, global connection manifest needs to be split into two different categories, each covering a subset of the acceptance criteria:
+
+1. A database-specific mechanism. This allows misconfigurations to be caught early, and it promotes good documentation and uniformity inside the platform.
+2. An operator-specific mechanism. This is a wildcard that can be used to configure database connections that are not officially supported by the products but that can still be partially validated early.
+
+The first mechanism requires the operator framework to provide predefined structures and supporting functions for widely available database systems such as PostgreSQL, MySQL, MariaDB, Oracle, SQLite, Derby, Redis and so on. This doesn't mean that all products can be configured with all DB implementations. The product definitions will only allow the subset that is officially supported by the products.
+
+The second mechanism is operator/product specific and it contains mostly a pass-through list of relevant **product properties**. There is at least one exception, and that is the handling of user credentials, which still need to be provisioned in a secure way (as long as the product supports it).
+
+==== Database specific manifests
+
+Support for the following database systems is planned. Additional systems may be added in the future.
+
+1. PostgreSQL
+
+[source,yaml]
+postgresql:
+  host: postgresql # mandatory
+  port: 5432 # optional, default is 5432
+  instance: my-database # mandatory
+  credentials: my-application-credentials # mandatory. key username and password
+  parameters: {} # optional
+  tls: secure-connection-class-name # optional
+  auth: authentication-class-name # optional. authentication class to use.
+
+PostgreSQL supports multiple authentication mechanisms as described https://www.postgresql.org/docs/9.1/auth-pg-hba-conf.html[here].
+
+2. MySQL
+
+[source,yaml]
+mysql:
+  host: mysql # mandatory
+  port: 3306 # optional, default is 3306
+  instance: my-database # mandatory
+  credentials: my-application-credentials # mandatory. key username and password
+  parameters: {} # optional
+  tls: secure-connection-class-name # optional
+  auth: authentication-class-name # optional. authentication class to use.
+
+MySQL supports multiple authentication mechanisms as described https://dev.mysql.com/doc/refman/8.0/en/socket-pluggable-authentication.html[here].
+
+3. Derby
+
+Derby is often used as an embedded database for testing and prototyping ideas and implementations. It is not recommended for production use cases.
+
+[source,yaml]
+derby:
+  location: /tmp/my-database/ # optional, defaults to /tmp/derby-{metadata.name}/derby.db
+
+
+==== Product specific manifests
+
+1. Apache Druid
+
+Apache Druid clusters can be configured with any of the DB-specific manifests from above. In addition, a DB-generic configuration can be specified.
+
+The following example shows how to configure the metadata storage for a Druid cluster using either one of the supported back-ends or a generic system. In a production setting only the PostgreSQL or MySQL manifests should be used.
+
+[source,yaml]
+generic:
+  driver: postgresql # mandatory
+  uri: jdbc:postgresql:///druid?foo;bar # mandatory
+  credentialsSecret: my-secret # mandatory. key username + password
+
+The above is translated into the following Java properties:
+
+[source]
+druid.metadata.storage.type=postgresql
+druid.metadata.storage.connector.connectURI=jdbc:postgresql:///druid?foo;bar
+druid.metadata.storage.connector.user=druid
+druid.metadata.storage.connector.password=druid
+
+2. Apache Superset
+
+NOTE: Superset supports a very wide range of database systems as described https://superset.apache.org/docs/databases/installing-database-drivers[here]. Not all of them are suitable for metadata storage.
+
+Connections to Apache Hive, Apache Druid and Trino clusters deployed as part of the SDP platform can be automated by using discovery configuration maps. In this case, the only attribute to configure is the name of the discovery config map of the appropriate system.
+
+In addition, a generic way to configure a database connection looks as follows:
+
+[source,yaml]
+generic:
+  secret: superset-metadata-secret # mandatory. The name of a secret with one entry called "key". Used to encrypt metadata and session cookies.
+  template: postgresql://{{SUPERSET_DB_USER}}:{{SUPERSET_DB_PASS}}@postgres.default.svc.local/superset&param1=value1&param2=value2 # mandatory
+  templateSecret: my-secret # optional
+    SUPERSET_DB_USER: ...
+    SUPERSET_DB_PASS: ...
+
+The template attribute allows specifying the full connection string as required by Superset (and the underlying SQLAlchemy framework). Variables in the template are specified within `{{` and `}}` markers and their content is replaced with the corresponding field in the `templateSecret` object.
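+
+As a sketch of the intended substitution (the secret contents below are hypothetical), a `templateSecret` such as
+
+[source,yaml]
+----
+apiVersion: v1
+kind: Secret
+metadata:
+  name: my-secret
+type: Opaque
+stringData:
+  SUPERSET_DB_USER: superset
+  SUPERSET_DB_PASS: superset
+----
+
+would resolve the template above to `postgresql://superset:superset@postgres.default.svc.local/superset&param1=value1&param2=value2`.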
+
+3. Apache Hive
+
+For production environments we recommend the PostgreSQL back-end; for development, Derby.
+
+A generic connection can be configured as follows:
+
+[source,yaml]
+generic:
+  driver: org.postgresql.Driver # mandatory
+  uri: jdbc:postgresql://postgresql.us-west-2.rds.amazonaws.com:5432/mypgdb # mandatory
+  credentialsSecret: my-secret # mandatory (?). key username + password
+
+4. Apache Airflow
+
+A generic Airflow database connection can be configured in a similar fashion to Superset:
+
+[source,yaml]
+generic:
+  template: postgresql://{{AIRFLOW_DB_USER}}:{{AIRFLOW_DB_PASS}}@postgres.default.svc.local/airflow&param1=value1&param2=value2 # mandatory
+  templateSecret: my-secret # optional
+    AIRFLOW_DB_USER: ...
+    AIRFLOW_DB_PASS: ...
+
+The resulting CRDs look like:
+
+[source,yaml]
+----
+kind: DruidCluster
+spec:
+  clusterConfig:
+    metadataDatabase:
+      postgresql:
+        host: postgresql # mandatory
+        port: 5432 # defaults to some port number - depending on whether tls is enabled
+        database: druid # mandatory
+        credentials: postgresql-credentials # mandatory. key username and password
+        parameters: {} # optional BTreeMap
+      mysql:
+        host: mysql # mandatory
+        port: XXXX # defaults to some port number - depending on whether tls is enabled
+        database: druid # mandatory
+        credentials: mysql-credentials # mandatory. key username and password
+        parameters: {} # optional BTreeMap
+      derby:
+        location: /tmp/derby/ # optional, defaults to /tmp/derby-{metadata.name}/derby.db
+      generic:
+        driver: postgresql # mandatory
+        uri: jdbc:postgresql:///druid?foo;bar # mandatory
+        credentialsSecret: my-secret # mandatory. key username + password
+# druid.metadata.storage.type=postgresql
+# druid.metadata.storage.connector.connectURI=jdbc:postgresql:///druid
+# druid.metadata.storage.connector.user=druid
+# druid.metadata.storage.connector.password=druid
+---
+kind: SupersetCluster
+spec:
+  clusterConfig:
+    metadataDatabase:
+      postgresql:
+        host: postgresql # mandatory
+        port: 5432 # defaults to some port number - depending on whether tls is enabled
+        database: superset # mandatory
+        credentials: postgresql-credentials # mandatory. key username and password
+        parameters: {} # optional BTreeMap
+      mysql:
+        host: mysql # mandatory
+        port: XXXX # defaults to some port number - depending on whether tls is enabled
+        database: superset # mandatory
+        credentials: mysql-credentials # mandatory. key username and password
+        parameters: {} # optional BTreeMap
+      sqlite:
+        location: /tmp/sqlite/ # optional, defaults to /tmp/sqlite-{metadata.name}/sqlite.db
+      generic:
+        uriSecret: my-secret # mandatory. key uri
+# postgresql://{username}:{password}@{host}:{port}/{database}?sslmode=require
+---
+kind: HiveCluster
+spec:
+  clusterConfig:
+    metadataDatabase:
+      postgresql:
+        host: postgresql # mandatory
+        port: 5432 # defaults to some port number - depending on whether tls is enabled
+        database: hive # mandatory
+        credentials: postgresql-credentials # mandatory. key username and password
+        parameters: {} # optional BTreeMap
+      derby:
+        location: /tmp/derby/ # optional, defaults to /tmp/derby-{metadata.name}/derby.db
+      # Missing: MS-SQL server, Oracle
+      generic:
+        driver: org.postgresql.Driver # mandatory
+        uri: jdbc:postgresql://postgresql.us-west-2.rds.amazonaws.com:5432/mypgdb # mandatory
+        credentialsSecret: my-secret # mandatory (?). key username + password
+        # <property>
+        #   <name>javax.jdo.option.ConnectionURL</name>
+        #   <value>jdbc:postgresql://postgresql.us-west-2.rds.amazonaws.com:5432/mypgdb</value>
+        #   <description>PostgreSQL JDBC driver connection URL</description>
+        # </property>
+        # <property>
+        #   <name>javax.jdo.option.ConnectionDriverName</name>
+        #   <value>org.postgresql.Driver</value>
+        #   <description>PostgreSQL metastore driver class name</description>
+        # </property>
+        # <property>
+        #   <name>javax.jdo.option.ConnectionUserName</name>
+        #   <value>database_username</value>
+        #   <description>the username for the DB instance</description>
+        # </property>
+        # <property>
+        #   <name>javax.jdo.option.ConnectionPassword</name>
+        #   <value>database_password</value>
+        #   <description>the password for the DB instance</description>
+        # </property>
+----
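+
+To tie the pieces together, here is a minimal end-to-end sketch (all names and values are illustrative, not prescriptive) pairing one of the cluster definitions above with its credentials `Secret`:
+
+[source,yaml]
+----
+kind: HiveCluster
+spec:
+  clusterConfig:
+    metadataDatabase:
+      postgresql:
+        host: postgresql
+        port: 5432
+        database: hive
+        credentials: hive-metadata-credentials
+---
+apiVersion: v1
+kind: Secret
+metadata:
+  name: hive-metadata-credentials
+type: Opaque
+stringData:
+  username: hive
+  password: hive
+----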