Deploy ScalarDB Analytics Locally
This guide explains how to deploy ScalarDB Analytics to a local Kubernetes cluster, specifically designed for testing purposes, by using a Helm Chart.
Prerequisites
Before deploying ScalarDB Analytics to a local environment, ensure that you have the following tools installed:
- A local Kubernetes cluster, such as minikube
- kubectl
- Helm
You need to have a license key (trial license or commercial license) to use ScalarDB Analytics. If you don't have a license key, please contact us.
Example architecture
The following describes the example architecture used in this guide.
This guide assumes a Kubernetes cluster running on minikube. In this setup, PostgreSQL is treated as an external data source not managed by ScalarDB transactions, while MySQL is treated as a data source managed by ScalarDB transactions (ScalarDB-managed data source). The ScalarDB Analytics server is deployed as a Pod by using a Helm Chart. A separate Pod is also created to serve as the client for running Spark commands. Additionally, the CLI tool used to operate the ScalarDB Analytics server is provided as a container image and runs on a separate Pod.
Please set up each data source yourself, referring to resources such as How to Deploy ScalarDB Cluster Locally for guidance.
Step 1: Set up the Kubernetes environment
You first need to set up the Kubernetes environment where all components will be deployed.
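If you don't already have a local Kubernetes cluster running, you can start one with minikube. The following is a minimal sketch; the CPU and memory sizes are illustrative assumptions, so adjust them to your machine:
minikube start --cpus 4 --memory 8g
# Confirm that kubectl points at the minikube cluster
kubectl config current-context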
Create ServiceAccount and ClusterRoleBinding
Create a service account (ServiceAccount) and a cluster role binding (ClusterRoleBinding) to allow Spark jobs to manage resources within the Kubernetes cluster.
NAMESPACE=default
SERVICE_ACCOUNT_NAME=spark
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ${SERVICE_ACCOUNT_NAME}
  namespace: ${NAMESPACE}
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: spark-role
roleRef:
  kind: ClusterRole
  name: cluster-admin
  apiGroup: rbac.authorization.k8s.io
subjects:
- kind: ServiceAccount
  name: ${SERVICE_ACCOUNT_NAME}
  namespace: ${NAMESPACE}
EOF
You can change the NAMESPACE and SERVICE_ACCOUNT_NAME environment variables as needed.
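If you want to confirm that the resources were created, you can list them by running the following commands, which assume the default values shown above:
kubectl get serviceaccount ${SERVICE_ACCOUNT_NAME} -n ${NAMESPACE}
kubectl get clusterrolebinding spark-role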
Step 2: Deploy the ScalarDB Analytics server
In this step, you will deploy the ScalarDB Analytics server to the Kubernetes environment using the Helm Chart provided by Scalar.
Prepare a configuration file for the ScalarDB data source
When using a ScalarDB-managed data source, you must provide a configuration file accessible to both the ScalarDB Analytics server and the Spark driver.
This example uses only MySQL but is intentionally configured as a multi-storage setup to simplify adding other databases later. As configured, the storage name mysql is assigned to the MySQL data source, and the namespace nsmy is mapped to it. Replace <MYSQL_URL>, <MYSQL_USERNAME>, and <MYSQL_PASSWORD> in the following command with the connection details of your MySQL data source.
cat <<EOF > scalardb.properties
# Storage
scalar.db.storage=multi-storage
# Multi-storage settings
scalar.db.multi_storage.storages=mysql
# Namespace mapping
scalar.db.multi_storage.namespace_mapping=nsmy:mysql
# Default storage
scalar.db.multi_storage.default_storage=mysql
# Multi-storage: Define MySQL
scalar.db.multi_storage.storages.mysql.storage=jdbc
scalar.db.multi_storage.storages.mysql.contact_points=<MYSQL_URL>
scalar.db.multi_storage.storages.mysql.username=<MYSQL_USERNAME>
scalar.db.multi_storage.storages.mysql.password=<MYSQL_PASSWORD>
EOF
For details about multi-storage configurations, see Multi-Storage Transactions.
Then, store it as a ConfigMap in Kubernetes by running the following command:
kubectl create configmap scalardb-properties --from-file=./scalardb.properties
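To verify that the configuration file was stored correctly, you can inspect the ConfigMap by running the following command:
kubectl describe configmap scalardb-properties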
Add the Scalar Helm Charts repository
Add the Scalar Helm Charts repository by running the following command:
helm repo add scalar-labs https://scalar-labs.github.io/helm-charts
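If you have added the repository before, you can also refresh the local chart index so that the latest chart version is available:
helm repo update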
Create a custom values file for the ScalarDB Analytics server
Create a custom values file (analytics-server-custom-values.yaml) for the ScalarDB Analytics server Helm Chart.
The following is an example of a simple configuration.
cat <<EOF > analytics-server-custom-values.yaml
scalarDbAnalyticsServer:
  properties: |
    scalar.db.analytics.server.catalog.port=11051
    scalar.db.analytics.server.metering.port=11052
    scalar.db.analytics.server.db.url=<CATALOG_SERVER_BACKEND_DB_URL>
    scalar.db.analytics.server.db.username=<USERNAME_FOR_BACKEND_DB>
    scalar.db.analytics.server.db.password=<PASSWORD_FOR_BACKEND_DB>
    scalar.db.analytics.server.metering.storage.provider=filesystem
    scalar.db.analytics.server.metering.storage.container_name=metering
    scalar.db.analytics.server.metering.storage.path=/tmp
    scalar.db.analytics.server.licensing.license_key=<YOUR_LICENSE_KEY>
    scalar.db.analytics.server.licensing.license_check_cert_pem=<YOUR_LICENSE_CERT_PEM>
  service:
    type: LoadBalancer
  extraVolumes:
    - name: scalardb-properties
      configMap:
        name: scalardb-properties
  extraVolumeMounts:
    - name: scalardb-properties
      mountPath: /tmp/scalar
EOF
Replace the content in the angle brackets as follows:
- <CATALOG_SERVER_BACKEND_DB_URL>: The JDBC connection string for the backend database of the ScalarDB Analytics server.
- <USERNAME_FOR_BACKEND_DB>: The username of the backend database.
- <PASSWORD_FOR_BACKEND_DB>: The password of the backend database.
- <YOUR_LICENSE_KEY>: The license key for the ScalarDB Analytics server.
- <YOUR_LICENSE_CERT_PEM>: The PEM-encoded license certificate for the ScalarDB Analytics server.
The metering-related property values (scalar.db.analytics.server.metering.storage.*) can be used as shown in the example. For more details on metering configuration, see the Configuration reference.
Deploy the Analytics server
Deploy the Analytics server by running the following command:
helm install scalardb-analytics-server scalar-labs/scalardb-analytics-server -f analytics-server-custom-values.yaml
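After the installation completes, you can check that the ScalarDB Analytics server Pod is running. The exact Pod name depends on the release name, so the following commands show the release status and list all Pods in the namespace:
helm status scalardb-analytics-server
kubectl get pods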
Step 3: Configure the catalog and data sources by using the CLI tool
To create catalogs and register data sources on the ScalarDB Analytics server, use the CLI tool, which is provided as a container image. As an example, this section shows how to set up a Pod for the CLI tool and run commands from it.
Set up a Pod for the CLI tool
Create a manifest file for the CLI tool Pod.
cat <<EOF > analytics-server-cli.yaml
apiVersion: v1
kind: Pod
metadata:
  name: analytics-server-cli
spec:
  containers:
    - name: analytics-server-cli
      image: ghcr.io/scalar-labs/scalardb-analytics-cli:3.16.2
      command: ['sleep']
      args: ['inf']
  restartPolicy: Never
EOF
You can change metadata.name and spec.containers[*].name to any values you like.
Then, create the Pod for the CLI tool by running the following command:
kubectl apply -f analytics-server-cli.yaml
Once the Pod is deployed, open a shell in it by running the following command. All of the following steps in this section should be performed inside this Pod.
kubectl exec -it analytics-server-cli -- bash
Set up an alias for the CLI tool to simplify command execution by running the following command:
alias scalardb-analytics-cli="java -jar /scalardb-analytics-cli/scalardb-analytics-cli.jar"
Prepare data source definition files
You must define the data sources that ScalarDB Analytics accesses in JSON format.
The following is an example of defining a data source managed by ScalarDB. You must specify scalardb as the value of the type item when using a ScalarDB-managed data source.
The configPath value must be a common path accessible from both the ScalarDB Analytics server and the Spark driver. This example uses a ConfigMap in Kubernetes to share the configuration file and mounts it into the relevant Pods.
cat <<EOF > data_source_scalardb.json
{
  "catalog": "mycatalog",
  "name": "data_source_scalardb",
  "type": "scalardb",
  "provider": {
    "configPath": "/tmp/scalar/scalardb.properties"
  }
}
EOF
The following is an example of defining a PostgreSQL data source that is not managed by ScalarDB. You must specify postgres as the value of the type item when using PostgreSQL as the data source. Then, replace the contents in angle brackets in the command below with the connection details of your PostgreSQL data source and run the command:
cat <<EOF > data_source_postgres.json
{
  "catalog": "mycatalog",
  "name": "data_source_postgres",
  "type": "postgres",
  "provider": {
    "host": "<POSTGRES_HOST>",
    "port": <PORT>,
    "username": "<USERNAME>",
    "password": "<PASSWORD>",
    "database": "<DATABASE>"
  }
}
EOF
Create a configuration file for the CLI tool
Create a configuration file (client.properties) for the ScalarDB Analytics CLI tool by running the following command, replacing <ANALYTICS_SERVER_HOST> with the hostname or IP address of the ScalarDB Analytics server:
cat <<EOF > client.properties
scalar.db.analytics.client.server.host=<ANALYTICS_SERVER_HOST>
scalar.db.analytics.client.server.catalog.port=11051
EOF
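If you are unsure of the hostname, you can look up the Kubernetes Service created by the Helm chart by running the following command from your local machine (not inside the CLI Pod). The exact Service name depends on the chart and the release name:
kubectl get svc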
Register the catalog and data sources
This section describes how to register a catalog and data sources using the CLI tool.
Create a catalog
First, create a catalog by using the following command, replacing <CATALOG_NAME> with your desired catalog name. Note that the catalog name must match the catalog value in the data source definition files, which is mycatalog in this example.
scalardb-analytics-cli -c client.properties catalog create --catalog <CATALOG_NAME>
Register data sources
Next, register both the ScalarDB-managed and non-ScalarDB-managed data sources.
Register the ScalarDB-managed data source by using the following command:
scalardb-analytics-cli -c client.properties data-source register --data-source-json=./data_source_scalardb.json
Register the non-ScalarDB-managed data source by using the following command:
scalardb-analytics-cli -c client.properties data-source register --data-source-json=./data_source_postgres.json
Additional CLI commands
The CLI tool provides additional commands for managing catalogs and data sources. For detailed instructions, refer to the ScalarDB Analytics CLI tool documentation.
Step 4: Deploy a Spark client Pod
In this step, you will deploy a Spark client Pod and set it up to run Spark jobs.
Create a Spark client Pod
Create a manifest file for the Spark client Pod.
In the following example, the service account name is set to spark, and the ScalarDB data source configuration file (scalardb.properties), stored in the scalardb-properties ConfigMap, is mounted as a volume named scalardb-properties. Create the manifest by running the following command:
cat <<'EOF' > spark-client.yaml
apiVersion: v1
kind: Pod
metadata:
  name: "spark-client"
spec:
  serviceAccountName: spark
  containers:
    - name: spark-client
      image: eclipse-temurin:21
      command: ['sleep']
      args: ['inf']
      volumeMounts:
        - name: scalardb-properties
          mountPath: /tmp/scalar
  volumes:
    - name: scalardb-properties
      configMap:
        name: scalardb-properties
  restartPolicy: Never
  terminationGracePeriodSeconds: 0
EOF
Create the Spark client Pod by running the following command:
kubectl apply -f spark-client.yaml
Set up the Spark client Pod
Access the Spark client pod via a shell session by running the following command:
kubectl exec -it spark-client -- bash
Install the Spark binary files and navigate to their directory by running the following commands:
VERSION=3.5.6
curl -O https://dlcdn.apache.org/spark/spark-${VERSION}/spark-${VERSION}-bin-hadoop3.tgz
tar xzf spark-${VERSION}-bin-hadoop3.tgz
cd spark-${VERSION}-bin-hadoop3
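To confirm that the Spark binary files were installed correctly, you can print the Spark version by running the following command:
./bin/spark-submit --version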
Create a spark-defaults.conf file by changing the content in the angle brackets and then running the following command:
cat <<EOF > ./conf/spark-defaults.conf
spark.sql.extensions com.scalar.db.analytics.spark.extension.ScalarDbAnalyticsExtensions
spark.jars.packages com.scalar-labs:scalardb-analytics-spark-all-<SPARK_VERSION>_<SCALA_VERSION>:<SCALARDB_ANALYTICS_VERSION>
spark.sql.catalog.<CATALOG_NAME> com.scalar.db.analytics.spark.ScalarDbAnalyticsCatalog
spark.sql.catalog.<CATALOG_NAME>.server.host <ANALYTICS_SERVER_HOST>
spark.sql.catalog.<CATALOG_NAME>.server.catalog.port 11051
spark.sql.catalog.<CATALOG_NAME>.server.metering.port 11052
spark.extraListeners com.scalar.db.analytics.spark.metering.ScalarDbAnalyticsListener
EOF
Replace the content in the angle brackets as follows:
- <SPARK_VERSION>: The version of Spark.
- <SCALA_VERSION>: The version of Scala used to build Spark.
- <SCALARDB_ANALYTICS_VERSION>: The version of ScalarDB Analytics.
- <CATALOG_NAME>: The name of the catalog.
- <ANALYTICS_SERVER_HOST>: The hostname or IP address of the ScalarDB Analytics server.
For more details, refer to Set up ScalarDB Analytics in the Spark configuration.
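As a concrete illustration, a filled-in spark-defaults.conf might look like the following. The versions here are assumptions for this example (Spark 3.5 built with Scala 2.12 and ScalarDB Analytics 3.16.2, matching the CLI image tag used earlier), and mycatalog is the catalog name registered in Step 3, so replace these values with the ones you actually use:
cat <<EOF > ./conf/spark-defaults.conf
spark.sql.extensions com.scalar.db.analytics.spark.extension.ScalarDbAnalyticsExtensions
spark.jars.packages com.scalar-labs:scalardb-analytics-spark-all-3.5_2.12:3.16.2
spark.sql.catalog.mycatalog com.scalar.db.analytics.spark.ScalarDbAnalyticsCatalog
spark.sql.catalog.mycatalog.server.host <ANALYTICS_SERVER_HOST>
spark.sql.catalog.mycatalog.server.catalog.port 11051
spark.sql.catalog.mycatalog.server.metering.port 11052
spark.extraListeners com.scalar.db.analytics.spark.metering.ScalarDbAnalyticsListener
EOF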
Step 5: Run Spark jobs from the client Pod
At this point, the Spark client Pod has been set up and is ready to run Spark jobs. This step shows examples of how to run analytical queries as Spark jobs by using the following two methods:
- Using Spark SQL
- Submitting jobs by using the spark-submit command
ScalarDB Analytics currently uses Apache Spark as its query engine. It can leverage Spark's native Kubernetes deployment mode, which enables dynamic provisioning of Spark driver and executor Pods at runtime. To use the Kubernetes deployment mode, you need to specify the Kubernetes API server (k8s://...) in the --master option of the Spark commands.
Use the spark-sql command to run Spark SQL
You can run Spark SQL by running a command like the following:
./bin/spark-sql \
--master k8s://https://kubernetes.default.svc \
--conf spark.kubernetes.container.image=docker.io/bitnami/spark:3.5 \
--conf spark.driver.host=$(hostname -i)
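As a quick sanity check, you can also pass a single query with the -e option. The following assumes the catalog name mycatalog registered in Step 3 and simply lists the namespaces that the catalog exposes:
./bin/spark-sql \
  --master k8s://https://kubernetes.default.svc \
  --conf spark.kubernetes.container.image=docker.io/bitnami/spark:3.5 \
  --conf spark.driver.host=$(hostname -i) \
  -e "SHOW NAMESPACES IN mycatalog"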
Use the spark-submit command to run a Spark job
This section describes registering an application JAR, creating a temporary Pod, creating a Pod template, and executing spark-submit.
Register the application JAR to the PVC
To run an application as a Spark job, you need to prepare the application's JAR file at a path accessible from the Spark driver and then execute the spark-submit command. There are several ways to make the JAR file accessible; this guide demonstrates how to use a persistent volume claim (PVC). Create the PVC by running the following command:
PVC_NAME=spark-app-pvc
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ${PVC_NAME}
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 4Gi
EOF
Create a temporary Pod and copy the file
Create a temporary Pod to store the application JAR in the PVC by running the following command:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: spark-pvc-loader
spec:
  containers:
    - name: loader
      image: busybox
      command: ["sleep", "3600"]
      volumeMounts:
        - mountPath: /mnt
          name: spark-vol
  volumes:
    - name: spark-vol
      persistentVolumeClaim:
        claimName: ${PVC_NAME}
  restartPolicy: Never
EOF
Wait for the temporary Pod to become ready by running the following command:
kubectl wait --for=condition=Ready pod/spark-pvc-loader --timeout=60s
Copy the application JAR to the PVC by running the following commands, replacing /path/to/your/app.jar with the path to your application's JAR file:
export JAR_PATH=/path/to/your/app.jar
kubectl cp ${JAR_PATH} spark-pvc-loader:/mnt/app.jar
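Before deleting the temporary Pod, you can confirm that the JAR file was copied into the PVC by running the following command:
kubectl exec spark-pvc-loader -- ls -l /mnt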
Delete the temporary Pod by running the following command:
kubectl delete pod spark-pvc-loader
Create a Pod template
To create a Pod template for the dynamically generated Spark driver and executor Pods, log in to the Spark client pod and run the following command:
PVC_NAME=spark-app-pvc
cat <<EOF > spark-pod-template.yaml
apiVersion: v1
kind: Pod
metadata:
  name: spark-pod-template
spec:
  volumes:
    - name: spark-jar-volume
      persistentVolumeClaim:
        claimName: ${PVC_NAME}
    - name: scalardb-properties
      configMap:
        name: scalardb-properties
  containers:
    - name: spark-kubernetes-container
      volumeMounts:
        - mountPath: /opt/spark-jars
          name: spark-jar-volume
        - mountPath: /tmp/scalar
          name: scalardb-properties
EOF
As mentioned earlier, both the ScalarDB Analytics server and the Spark driver need to access the ScalarDB data source configuration file (scalardb.properties) at the same path. Therefore, in this example, the file stored in the ConfigMap is mounted at the /tmp/scalar path.
Execute spark-submit
Run the application as a Spark job by using a command like the following:
./bin/spark-submit \
--master k8s://https://kubernetes.default.svc \
--deploy-mode cluster \
--name analytics-sample-job \
--class com.example.TestApp \
--conf spark.kubernetes.container.image=docker.io/bitnami/spark:3.5 \
--conf spark.kubernetes.namespace=default \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.driver.podTemplateFile=./spark-pod-template.yaml \
--conf spark.kubernetes.executor.podTemplateFile=./spark-pod-template.yaml \
--conf spark.jars.ivy=/tmp/.ivy2 \
--conf spark.jars.repositories=https://repo1.maven.org/maven2,https://packages.confluent.io/maven/ \
--properties-file ./conf/spark-defaults.conf \
local:///opt/spark-jars/app.jar
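Because the job runs in cluster mode, the Spark driver runs in its own dynamically created Pod. From your local machine (not inside the Spark client Pod), you can watch the Pods and tail the driver logs; replace <DRIVER_POD_NAME> with the driver Pod name shown by the first command, which is derived from the application name:
kubectl get pods
kubectl logs -f <DRIVER_POD_NAME>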
Clean up deployed resources
This section shows how to clean up the resources you deployed in the Kubernetes environment.
Remove the ScalarDB Analytics server by running the following command:
helm uninstall scalardb-analytics-server
Additionally, you can remove the Pods you deployed by running the following command:
kubectl delete pod spark-client analytics-server-cli
Also, you can remove the other Kubernetes resources you created by running the following commands:
# Delete the `spark` service account
kubectl delete serviceaccount spark
# Delete `scalardb-properties` `ConfigMap`
kubectl delete configmap scalardb-properties
# Delete the `spark-app-pvc` PVC
kubectl delete pvc spark-app-pvc