Deploy ScalarDB Analytics Locally
This guide explains how to deploy ScalarDB Analytics to a local Kubernetes cluster, specifically designed for testing purposes, by using a Helm Chart.
Prerequisites
Before deploying ScalarDB Analytics to a local environment, ensure that you have the following tools installed:
- A local Kubernetes cluster (this guide uses minikube)
- kubectl
- Helm
You need to have a license key (trial license or commercial license) to use ScalarDB Analytics. If you don't have a license key, please contact us.
Example architecture
The following is an example architecture described in this guide.
This guide assumes a Kubernetes cluster running on minikube. In this setup, PostgreSQL is treated as an external data source not managed by ScalarDB transactions, while MySQL is treated as a data source managed by ScalarDB transactions (ScalarDB-managed data source). Both of these data sources store user data. The ScalarDB Analytics server is deployed as a Pod by using a Helm Chart. The ScalarDB Analytics server uses a dedicated backend database to store catalog data. A separate Pod is also created to serve as the client for running Spark commands. Additionally, the CLI tool used to operate the ScalarDB Analytics server is provided as a container image and runs on a separate Pod.
Please set up each data source yourself, referring to resources such as How to Deploy ScalarDB Cluster Locally for guidance.
Step 1: Set up the Kubernetes environment
You first need to set up the Kubernetes environment where all components will be deployed.
Create ServiceAccount and RoleBinding
Create a service account (ServiceAccount) and a role binding (RoleBinding) to allow Spark jobs to manage resources within the Kubernetes cluster.
NAMESPACE=default
SERVICE_ACCOUNT_NAME=spark
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ${SERVICE_ACCOUNT_NAME}
  namespace: ${NAMESPACE}
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-role
  namespace: ${NAMESPACE}
roleRef:
  kind: ClusterRole
  name: cluster-admin
  apiGroup: rbac.authorization.k8s.io
subjects:
- kind: ServiceAccount
  name: ${SERVICE_ACCOUNT_NAME}
  namespace: ${NAMESPACE}
EOF
You can change the NAMESPACE and SERVICE_ACCOUNT_NAME environment variables as needed.
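If you want to confirm that the resources were created, you can check them by running the following commands:
kubectl get serviceaccount ${SERVICE_ACCOUNT_NAME} -n ${NAMESPACE}
kubectl get rolebinding spark-role -n ${NAMESPACE}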
Step 2: Deploy the ScalarDB Analytics server
In this step, you will deploy the ScalarDB Analytics server to the Kubernetes environment using the Helm Chart provided by Scalar.
Deploy a backend database for the ScalarDB Analytics server
The ScalarDB Analytics server uses a dedicated backend database to manage catalog information. This guide uses a dedicated PostgreSQL instance for the backend database.
You can deploy PostgreSQL on the Kubernetes cluster by using a Bitnami Helm Chart. Add the Bitnami Helm Charts repository by running the following command:
helm repo add bitnami https://charts.bitnami.com/bitnami
Deploy PostgreSQL by running the following command:
helm install postgresql-scalardb-analytics bitnami/postgresql \
  --set auth.postgresPassword=postgres \
  --set primary.persistence.enabled=false
Check if the PostgreSQL Pod is running by running the following command:
kubectl get pod
You should see output similar to the following, with the PostgreSQL Pod in the Running state:
NAME                              READY   STATUS    RESTARTS   AGE
postgresql-scalardb-analytics-0   1/1     Running   0          10s
Add the Scalar Helm Charts repository
Add the Scalar Helm Charts repository by running the following command:
helm repo add scalar-labs https://scalar-labs.github.io/helm-charts
Create a custom values file for the ScalarDB Analytics server
Create a custom values file (analytics-server-custom-values.yaml) for the ScalarDB Analytics server Helm Chart.
The following is an example of a simple configuration.
cat <<EOF > analytics-server-custom-values.yaml
scalarDbAnalyticsServer:
  properties: |
    scalar.db.analytics.server.catalog.port=11051
    scalar.db.analytics.server.metering.port=11052
    scalar.db.analytics.server.db.contact_points=<CATALOG_SERVER_BACKEND_DB_URL>
    scalar.db.analytics.server.db.username=<USERNAME_FOR_BACKEND_DB>
    scalar.db.analytics.server.db.password=<PASSWORD_FOR_BACKEND_DB>
    scalar.db.analytics.server.metering.storage.provider=filesystem
    scalar.db.analytics.server.metering.storage.container_name=metering
    scalar.db.analytics.server.metering.storage.path=/tmp
    scalar.db.analytics.server.licensing.license_key=<YOUR_LICENSE_KEY>
    scalar.db.analytics.server.licensing.license_check_cert_pem=<YOUR_LICENSE_CERT_PEM>
EOF
Replace the content in the angle brackets as follows:
- <CATALOG_SERVER_BACKEND_DB_URL>: The JDBC connection string for the backend database of the ScalarDB Analytics server.
- <USERNAME_FOR_BACKEND_DB>: The username of the backend database.
- <PASSWORD_FOR_BACKEND_DB>: The password of the backend database.
- <YOUR_LICENSE_KEY>: The license key for the ScalarDB Analytics server.
- <YOUR_LICENSE_CERT_PEM>: The PEM-encoded license certificate for the ScalarDB Analytics server.
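For reference, if you use the PostgreSQL instance deployed earlier in this guide as the backend database, the backend database properties would look similar to the following. These values assume the Bitnami deployment from the previous step with the default postgres user and database; the license properties still need your own values.
scalar.db.analytics.server.db.contact_points=jdbc:postgresql://postgresql-scalardb-analytics.default.svc.cluster.local:5432/postgres
scalar.db.analytics.server.db.username=postgres
scalar.db.analytics.server.db.password=postgres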
The metering-related property values (scalar.db.analytics.server.metering.storage.*) can be used as shown in the example. For more details on metering configuration, see the Configuration reference.
Deploy the Analytics server
Deploy the ScalarDB Analytics server by running the following command:
helm install scalardb-analytics-server scalar-labs/scalardb-analytics-server -f analytics-server-custom-values.yaml
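You can check that the ScalarDB Analytics server Pod has started by listing the Pods again; the exact Pod name depends on the Helm release, but with the release name above it should start with scalardb-analytics-server and eventually reach the Running state:
kubectl get pod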
For more details on deploying the ScalarDB Analytics server, see Deploy a ScalarDB Analytics server.
Step 3: Configure the catalog and data sources by using the CLI tool
To create catalogs and register data sources on the ScalarDB Analytics server, use the CLI tool, which is provided as a container image. As an example, this section shows how to set up a Pod for the CLI tool and run commands from it.
Set up a Pod for the CLI tool
Create a manifest file for the CLI tool Pod.
cat <<EOF > analytics-server-cli.yaml
apiVersion: v1
kind: Pod
metadata:
  name: analytics-server-cli
spec:
  containers:
  - name: analytics-server-cli
    image: ghcr.io/scalar-labs/scalardb-analytics-cli:3.17.1
    command: ['sleep']
    args: ['inf']
  restartPolicy: Never
EOF
You can change metadata.name and spec.containers[*].name to any values you like.
Then, create the Pod for the CLI tool by running the following command:
kubectl apply -f analytics-server-cli.yaml
Once the Pod is deployed, access it through the shell by running the following command. All following steps in this section should be performed inside this Pod.
kubectl exec -it analytics-server-cli -- bash
Set up an alias for the CLI tool to simplify command execution by running the following command:
alias scalardb-analytics-cli="java -jar /scalardb-analytics-cli/scalardb-analytics-cli.jar"
Prepare a configuration file for the ScalarDB data source
When you use a data source under ScalarDB transaction management, you need to provide the properties for ScalarDB. This section explains how to register the ScalarDB properties file with the ScalarDB Analytics server.
This example uses only MySQL but is intentionally configured as a multi-storage setup to simplify adding other databases later. As configured, the storage name mysql is assigned to the MySQL data source, and the namespace nsmy is mapped to it.
cat <<EOF > scalardb.properties
# Storage
scalar.db.storage=multi-storage
# Multi-storage settings
scalar.db.multi_storage.storages=mysql
# Namespace mapping
scalar.db.multi_storage.namespace_mapping=nsmy:mysql
# Default storage
scalar.db.multi_storage.default_storage=mysql
# Multi-storage: Define MySQL
scalar.db.multi_storage.storages.mysql.storage=jdbc
scalar.db.multi_storage.storages.mysql.contact_points=<MYSQL_URL>
scalar.db.multi_storage.storages.mysql.username=<MYSQL_USERNAME>
scalar.db.multi_storage.storages.mysql.password=<MYSQL_PASSWORD>
EOF
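For reference, a filled-in MySQL section might look like the following. The hostname and credentials here are hypothetical example values and must be replaced with those of your own MySQL data source:
scalar.db.multi_storage.storages.mysql.contact_points=jdbc:mysql://mysql.default.svc.cluster.local:3306/
scalar.db.multi_storage.storages.mysql.username=root
scalar.db.multi_storage.storages.mysql.password=mysql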
For details about multi-storage configurations, see Multi-Storage Transactions.
Prepare data source definition files
You must define the data sources that ScalarDB Analytics accesses in JSON format. The following is an example of defining a data source managed by ScalarDB. When using a ScalarDB-managed data source, you must set the type item to scalardb and specify the path to the properties file by using the ${file:<PATH>} syntax. Replace <PATH> with the actual path to the ScalarDB properties file, as shown below:
cat <<"EOF" > data_source_scalardb.json
{
"type": "scalardb",
"configs": "${file:./scalardb.properties}"
}
EOF
You can specify <PATH> as either an absolute path or a relative path.
The following is an example of defining a PostgreSQL data source that is not managed by ScalarDB. You must specify postgresql as the value of the type item when using PostgreSQL as the data source. Then, replace each placeholder value enclosed in angle brackets in the command below with the appropriate value for your PostgreSQL data source and run the command:
cat <<EOF > data_source_postgres.json
{
  "type": "postgresql",
  "host": "<POSTGRES_HOST>",
  "port": "5432",
  "username": "<POSTGRES_USER_NAME>",
  "password": "<POSTGRES_PASSWORD>",
  "database": "<POSTGRES_DATABASE>"
}
EOF
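For reference, a filled-in definition might look like the following. The host, credentials, and database name here are hypothetical values for the user-data PostgreSQL instance from the example architecture (not the backend database used by the ScalarDB Analytics server) and must be replaced with the values for your own environment.
{
  "type": "postgresql",
  "host": "postgresql-user-data.default.svc.cluster.local",
  "port": "5432",
  "username": "postgres",
  "password": "postgres",
  "database": "userdata"
}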
Create a configuration file for the CLI tool
Create a configuration file (client.properties) for the ScalarDB Analytics CLI tool.
To connect to the ScalarDB Analytics server from the CLI tool, you need the hostname or IP address of the server. You can get the IP address by checking the CLUSTER-IP value of the ScalarDB Analytics server service by running the following command:
$ kubectl get svc scalardb-analytics-server
NAME                        TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)               AGE
scalardb-analytics-server   ClusterIP   10.97.81.28   <none>        11051/TCP,11052/TCP   40s
Then, create the configuration file by running the following command, replacing <ANALYTICS_SERVER_HOST> with the CLUSTER-IP value you retrieved:
cat <<EOF > client.properties
scalar.db.analytics.client.server.host=<ANALYTICS_SERVER_HOST>
scalar.db.analytics.client.server.catalog.port=11051
EOF
Register the catalog and data sources
This section describes how to register a catalog and data sources using the CLI tool.
Create a catalog
First, create a catalog by using the following command. Replace <CATALOG_NAME> with your desired catalog name.
scalardb-analytics-cli -c client.properties catalog create --catalog <CATALOG_NAME>
Register data sources
Next, register both the ScalarDB-managed and non-ScalarDB-managed data sources.
Register a ScalarDB-managed data source by using the following command. Replace <CATALOG_NAME> and <DATA_SOURCE_NAME> with your desired catalog and data source names.
scalardb-analytics-cli -c client.properties data-source register \
--catalog=<CATALOG_NAME> --data-source=<DATA_SOURCE_NAME> --provider-file=./data_source_scalardb.json
Register a non-ScalarDB-managed data source by using the following command. Replace <CATALOG_NAME> and <DATA_SOURCE_NAME> with your desired catalog and data source names.
scalardb-analytics-cli -c client.properties data-source register \
--catalog=<CATALOG_NAME> --data-source=<DATA_SOURCE_NAME> --provider-file=./data_source_postgres.json
Additional CLI commands
The CLI tool provides additional commands for managing catalogs and data sources. For detailed instructions, refer to the ScalarDB Analytics CLI tool documentation.
Step 4: Deploy a Spark client Pod
In this step, you will deploy a Spark client Pod and set it up to run Spark jobs.
Create a Spark client Pod
Create a manifest file for the Spark client Pod.
In the following example, the service account name is set to spark. Configure the Spark client Pod by running the following command:
cat <<'EOF' > spark-client.yaml
apiVersion: v1
kind: Pod
metadata:
  name: "spark-client"
spec:
  serviceAccountName: spark
  containers:
  - name: spark-client
    image: eclipse-temurin:21
    command: ['sleep']
    args: ['inf']
  restartPolicy: Never
  terminationGracePeriodSeconds: 0
EOF
Create the Spark client Pod by running the following command:
kubectl apply -f spark-client.yaml
Set up the Spark client Pod
Access the Spark client Pod via a shell session by running the following command:
kubectl exec -it spark-client -- bash
Install the Spark binary files and navigate to their directory by running the following commands:
VERSION=3.5.7
curl -O https://dlcdn.apache.org/spark/spark-${VERSION}/spark-${VERSION}-bin-hadoop3.tgz
tar xzf spark-${VERSION}-bin-hadoop3.tgz
cd spark-${VERSION}-bin-hadoop3
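You can optionally confirm that the Spark installation works by printing its version:
./bin/spark-submit --version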
Create a spark-defaults.conf file by changing the content in the angle brackets and then running the following command:
cat <<EOF > ./conf/spark-defaults.conf
spark.jars.packages com.scalar-labs:scalardb-analytics-spark-all-<SPARK_VERSION>_<SCALA_VERSION>:<SCALARDB_ANALYTICS_VERSION>
spark.sql.catalog.<CATALOG_NAME> com.scalar.db.analytics.spark.ScalarDbAnalyticsCatalog
spark.sql.catalog.<CATALOG_NAME>.server.host <ANALYTICS_SERVER_HOST>
spark.sql.catalog.<CATALOG_NAME>.server.catalog.port 11051
spark.sql.catalog.<CATALOG_NAME>.server.metering.port 11052
spark.extraListeners com.scalar.db.analytics.spark.metering.ScalarDbAnalyticsListener
EOF
Replace the content in the angle brackets as follows:
- <SPARK_VERSION>: The version of Spark.
- <SCALA_VERSION>: The version of Scala used to build Spark.
- <SCALARDB_ANALYTICS_VERSION>: The version of ScalarDB Analytics.
- <CATALOG_NAME>: The name of the catalog.
- <ANALYTICS_SERVER_HOST>: The CLUSTER-IP value of the ScalarDB Analytics server service.
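For reference, with the Spark 3.5 binary downloaded above, Scala 2.12, ScalarDB Analytics 3.17.1 (matching the CLI image used earlier), a catalog named mycatalog, and the example server address 10.97.81.28, the file might look like the following. These values are illustrative; verify the exact package coordinates and use the versions and address of your own environment.
spark.jars.packages com.scalar-labs:scalardb-analytics-spark-all-3.5_2.12:3.17.1
spark.sql.catalog.mycatalog com.scalar.db.analytics.spark.ScalarDbAnalyticsCatalog
spark.sql.catalog.mycatalog.server.host 10.97.81.28
spark.sql.catalog.mycatalog.server.catalog.port 11051
spark.sql.catalog.mycatalog.server.metering.port 11052
spark.extraListeners com.scalar.db.analytics.spark.metering.ScalarDbAnalyticsListener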
For more details, refer to Set up ScalarDB Analytics in the Spark configuration.
Step 5: Run Spark jobs from the client Pod
At this point, the Spark client Pod has been set up and is ready to run Spark jobs. This step shows examples of how to run analytical queries as Spark jobs by using the following two methods:
- Using Spark SQL
- Submitting jobs by using the spark-submit command
ScalarDB Analytics currently uses Apache Spark as its query engine. It can leverage Spark's native Kubernetes deployment mode, which enables dynamic provisioning of Spark driver and executor Pods at runtime. To use the Kubernetes deployment mode, you need to specify the Kubernetes API server (k8s://...) in the --master option of the spark commands.
Use the spark-sql command to run Spark SQL
You can run Spark SQL by running a command like the following:
./bin/spark-sql \
  --master k8s://https://kubernetes.default.svc \
  --conf spark.kubernetes.container.image=apache/spark:3.5.7-scala2.12-java11-python3-r-ubuntu \
  --conf spark.driver.host=$(hostname -i)
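Once the spark-sql shell starts, you can run queries against the tables registered in the catalog. In ScalarDB Analytics, tables registered through a data source are referenced as <CATALOG_NAME>.<DATA_SOURCE_NAME>.<NAMESPACE_NAME>.<TABLE_NAME>; the identifiers below are placeholders, so replace them with the names you registered:
SHOW NAMESPACES IN <CATALOG_NAME>;
SELECT * FROM <CATALOG_NAME>.<DATA_SOURCE_NAME>.<NAMESPACE_NAME>.<TABLE_NAME> LIMIT 10;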
Use the spark-submit command to run a Spark job
This section describes registering an application JAR, creating a temporary Pod, creating a Pod template, and executing spark-submit.
Register the application JAR to PVC
To run an application as a Spark job, you need to prepare the application's JAR file and execute the spark-submit command. The JAR file must be located at a path accessible from the Spark driver. There are several ways to achieve this; this guide demonstrates how to use a persistent volume claim (PVC). Create the PVC by running the following command:
PVC_NAME=spark-app-pvc
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ${PVC_NAME}
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 4Gi
EOF
Create a temporary Pod and copy the file
Create a temporary Pod to store the application JAR in the PVC by running the following command:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: spark-pvc-loader
spec:
  containers:
  - name: loader
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - mountPath: /mnt
      name: spark-vol
  volumes:
  - name: spark-vol
    persistentVolumeClaim:
      claimName: ${PVC_NAME}
  restartPolicy: Never
EOF
Wait for the temporary Pod to become ready by running the following command:
kubectl wait --for=condition=Ready pod/spark-pvc-loader --timeout=60s
Copy the application JAR to the PVC by running the following commands:
export JAR_PATH=/path/to/your/app.jar
kubectl cp ${JAR_PATH} spark-pvc-loader:/mnt/app.jar
Delete the temporary Pod by running the following command:
kubectl delete pod spark-pvc-loader
Create a Pod template
To create a Pod template for the dynamically generated Spark driver and executor Pods, log in to the Spark client Pod and run the following command:
PVC_NAME=spark-app-pvc
cat <<EOF > spark-pod-template.yaml
apiVersion: v1
kind: Pod
metadata:
  name: spark-pod-template
spec:
  volumes:
  - name: spark-jar-volume
    persistentVolumeClaim:
      claimName: ${PVC_NAME}
  containers:
  - name: spark-kubernetes-container
    volumeMounts:
    - mountPath: /opt/spark-jars
      name: spark-jar-volume
EOF
Execute spark-submit
Run the application as a Spark job by using a command like the following:
./bin/spark-submit \
  --master k8s://https://kubernetes.default.svc \
  --deploy-mode cluster \
  --name analytics-sample-job \
  --class com.example.TestApp \
  --conf spark.kubernetes.container.image=apache/spark:3.5.7-scala2.12-java11-python3-r-ubuntu \
  --conf spark.kubernetes.namespace=default \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.driver.podTemplateFile=./spark-pod-template.yaml \
  --conf spark.kubernetes.executor.podTemplateFile=./spark-pod-template.yaml \
  --conf spark.jars.ivy=/tmp/.ivy2 \
  --conf spark.jars.repositories=https://repo1.maven.org/maven2,https://packages.confluent.io/maven/ \
  --properties-file ./conf/spark-defaults.conf \
  local:///opt/spark-jars/app.jar
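In cluster mode, the Spark driver and executors run in dynamically created Pods. You can follow the job's progress with standard kubectl commands from your local machine; Spark on Kubernetes labels the Pods it creates with spark-role, so a check like the following should work (replace <DRIVER_POD_NAME> with the name of the driver Pod listed by the first command):
# List the driver and executor Pods created for the job
kubectl get pods -l spark-role
# Tail the logs of the driver Pod
kubectl logs -f <DRIVER_POD_NAME>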
Clean up deployed resources
This section shows how to clean up the resources you deployed in the Kubernetes environment.
Remove the ScalarDB Analytics server and the backend PostgreSQL database by running the following command:
helm uninstall scalardb-analytics-server postgresql-scalardb-analytics
Additionally, you can remove the Pods you deployed by running the following command:
kubectl delete pod spark-client analytics-server-cli
Also, you can remove the other Kubernetes resources you created by running the following commands:
# Delete the `spark` service account
kubectl delete serviceaccount spark
# Delete the `spark-app-pvc` PVC
kubectl delete pvc spark-app-pvc