Deploy ScalarDB Analytics Locally
This guide explains how to deploy ScalarDB Analytics to a local Kubernetes cluster, specifically designed for testing purposes, by using a Helm Chart.
Prerequisites
Before deploying ScalarDB Analytics to a local environment, ensure that you have the following tools installed:
- A local Kubernetes cluster (this guide uses minikube)
- kubectl
- Helm
You need to have a license key (trial license or commercial license) to use ScalarDB Analytics. If you don't have a license key, please contact us.
Example architecture
The following is an example architecture described in this guide.
This guide assumes a Kubernetes cluster running on minikube. In this setup, PostgreSQL is treated as an external data source not managed by ScalarDB transactions, while MySQL is treated as a data source managed by ScalarDB transactions (ScalarDB-managed data source). Both of these data sources store user data. The ScalarDB Analytics server is deployed as a Pod by using a Helm Chart. The ScalarDB Analytics server uses a dedicated backend database to store catalog data. A separate Pod is also created to serve as the client for running Spark commands. Additionally, the CLI tool used to operate the ScalarDB Analytics server is provided as a container image and runs on a separate Pod.
Please set up each data source yourself, referring to resources such as How to Deploy ScalarDB Cluster Locally for guidance.
Step 1: Set up the Kubernetes environment
You first need to set up the Kubernetes environment where all components will be deployed.
Create ServiceAccount and RoleBinding
Create a service account (ServiceAccount) and a role binding (RoleBinding) to allow Spark jobs to manage resources within the Kubernetes cluster.
NAMESPACE=default
SERVICE_ACCOUNT_NAME=spark
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ${SERVICE_ACCOUNT_NAME}
  namespace: ${NAMESPACE}
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-role
  namespace: ${NAMESPACE}
roleRef:
  kind: ClusterRole
  name: cluster-admin
  apiGroup: rbac.authorization.k8s.io
subjects:
- kind: ServiceAccount
  name: ${SERVICE_ACCOUNT_NAME}
  namespace: ${NAMESPACE}
EOF
You can change the NAMESPACE and SERVICE_ACCOUNT_NAME environment variables as needed.
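If you want to confirm that the resources were created, you can check them by running the following commands:
kubectl get serviceaccount ${SERVICE_ACCOUNT_NAME} -n ${NAMESPACE}
kubectl get rolebinding spark-role -n ${NAMESPACE}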
Step 2: Deploy the ScalarDB Analytics server
In this step, you will deploy the ScalarDB Analytics server to the Kubernetes environment using the Helm Chart provided by Scalar.
Deploy a backend database for the ScalarDB Analytics server
The ScalarDB Analytics server uses a dedicated backend database to manage catalog information. This guide uses a dedicated PostgreSQL instance for the backend database.
You can deploy PostgreSQL on the Kubernetes cluster by using a Bitnami Helm Chart. Add the Bitnami Helm Charts repository by running the following command:
helm repo add bitnami https://charts.bitnami.com/bitnami
Deploy PostgreSQL by running the following command:
helm install postgresql-scalardb-analytics bitnami/postgresql \
  --set auth.postgresPassword=postgres \
  --set primary.persistence.enabled=false
Check if the PostgreSQL Pod is running by running the following command:
kubectl get pod
You should see output similar to the following, with the PostgreSQL Pod in the Running state:
NAME                              READY   STATUS    RESTARTS   AGE
postgresql-scalardb-analytics-0   1/1     Running   0          10s
Add the Scalar Helm Charts repository
Add the Scalar Helm Charts repository by running the following command:
helm repo add scalar-labs https://scalar-labs.github.io/helm-charts
Create a custom values file for the ScalarDB Analytics server
Create a custom values file (analytics-server-custom-values.yaml) for the ScalarDB Analytics server Helm Chart.
The following is an example of a simple configuration.
cat <<EOF > analytics-server-custom-values.yaml
scalarDbAnalyticsServer:
  properties: |
    scalar.db.analytics.server.catalog.port=11051
    scalar.db.analytics.server.metering.port=11052
    scalar.db.analytics.server.db.contact_points=<CATALOG_SERVER_BACKEND_DB_URL>
    scalar.db.analytics.server.db.username=<USERNAME_FOR_BACKEND_DB>
    scalar.db.analytics.server.db.password=<PASSWORD_FOR_BACKEND_DB>
    scalar.db.analytics.server.metering.storage.provider=filesystem
    scalar.db.analytics.server.metering.storage.container_name=metering
    scalar.db.analytics.server.metering.storage.path=/tmp
    scalar.db.analytics.server.licensing.license_key=<YOUR_LICENSE_KEY>
    scalar.db.analytics.server.licensing.license_check_cert_pem=<YOUR_LICENSE_CERT_PEM>
EOF
Replace the content in the angle brackets as follows:
- <CATALOG_SERVER_BACKEND_DB_URL>: The JDBC connection string for the backend database of the ScalarDB Analytics server.
- <USERNAME_FOR_BACKEND_DB>: The username of the backend database.
- <PASSWORD_FOR_BACKEND_DB>: The password of the backend database.
- <YOUR_LICENSE_KEY>: The license key for the ScalarDB Analytics server.
- <YOUR_LICENSE_CERT_PEM>: The PEM-encoded license certificate for the ScalarDB Analytics server.
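For reference, if you use the PostgreSQL instance deployed earlier in this guide as the backend database, the backend database properties would look similar to the following. These values assume the Bitnami deployment from the previous step with the default postgres user and database; the license properties still need your own values.
scalar.db.analytics.server.db.contact_points=jdbc:postgresql://postgresql-scalardb-analytics.default.svc.cluster.local:5432/postgres
scalar.db.analytics.server.db.username=postgres
scalar.db.analytics.server.db.password=postgres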
The metering-related property values (scalar.db.analytics.server.metering.storage.*) can be used as shown in the example. For more details on metering configuration, see the Configuration reference.
Deploy the Analytics server
Deploy the ScalarDB Analytics server by running the following command:
helm install scalardb-analytics-server scalar-labs/scalardb-analytics-server -f analytics-server-custom-values.yaml
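You can check that the ScalarDB Analytics server Pod has started by listing the Pods again; the exact Pod name depends on the Helm release, but with the release name above it should start with scalardb-analytics-server and eventually reach the Running state:
kubectl get pod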
For more details on deploying the ScalarDB Analytics server, see Deploy a ScalarDB Analytics server.
Step 3: Configure the catalog and data sources by using the CLI tool
To create catalogs and register data sources on the ScalarDB Analytics server, use the CLI tool, which is provided as a container image. As an example, this section shows how to set up a Pod for the CLI tool and run commands from it.
Set up a Pod for the CLI tool
Create a manifest file for the CLI tool Pod.
cat <<EOF > analytics-server-cli.yaml
apiVersion: v1
kind: Pod
metadata:
  name: analytics-server-cli
spec:
  containers:
  - name: analytics-server-cli
    image: ghcr.io/scalar-labs/scalardb-analytics-cli:3.17.1
    command: ['sleep']
    args: ['inf']
  restartPolicy: Never
EOF
You can change metadata.name and spec.containers[*].name to any values you like.
Then, create the Pod for the CLI tool by running the following command:
kubectl apply -f analytics-server-cli.yaml
Once the Pod is deployed, access it through the shell by running the following command. All following steps in this section should be performed inside this Pod.
kubectl exec -it analytics-server-cli -- bash
Set up an alias for the CLI tool to simplify command execution by running the following command:
alias scalardb-analytics-cli="java -jar /scalardb-analytics-cli/scalardb-analytics-cli.jar"
Prepare a configuration file for the ScalarDB data source
When you use a data source under ScalarDB transaction management, you need to provide the properties for ScalarDB. This section explains how to register the ScalarDB properties file with the ScalarDB Analytics server.
This example uses only MySQL but is intentionally configured as a multi-storage setup to simplify adding other databases later. As configured, the storage name mysql is assigned to the MySQL data source, and the namespace nsmy is mapped to it.
cat <<EOF > scalardb.properties
# Storage
scalar.db.storage=multi-storage
# Multi-storage settings
scalar.db.multi_storage.storages=mysql
# Namespace mapping
scalar.db.multi_storage.namespace_mapping=nsmy:mysql
# Default storage
scalar.db.multi_storage.default_storage=mysql
# Multi-storage: Define MySQL
scalar.db.multi_storage.storages.mysql.storage=jdbc
scalar.db.multi_storage.storages.mysql.contact_points=<MYSQL_URL>
scalar.db.multi_storage.storages.mysql.username=<MYSQL_USERNAME>
scalar.db.multi_storage.storages.mysql.password=<MYSQL_PASSWORD>
EOF
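For reference, a filled-in MySQL section might look like the following. The hostname and credentials here are hypothetical example values and must be replaced with those of your own MySQL data source:
scalar.db.multi_storage.storages.mysql.contact_points=jdbc:mysql://mysql.default.svc.cluster.local:3306/
scalar.db.multi_storage.storages.mysql.username=root
scalar.db.multi_storage.storages.mysql.password=mysql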
For details about multi-storage configurations, see Multi-Storage Transactions.
Prepare data source definition files
You must define the data sources that ScalarDB Analytics accesses in JSON format. The following is an example of defining a data source managed by ScalarDB. When using a ScalarDB-managed data source, you must set the type item to scalardb and specify the path to the properties file by using the ${file:<PATH>} syntax. Replace <PATH> with the actual path to the ScalarDB properties file, as shown below:
cat <<"EOF" > data_source_scalardb.json
{
"type": "scalardb",
"configs": "${file:./scalardb.properties}"
}
EOF
You can specify <PATH> as either an absolute path or a relative path.
The following is an example of defining a PostgreSQL data source that is not managed by ScalarDB. You must specify postgresql as the value of the type item when using PostgreSQL as the data source. Then, replace each placeholder value enclosed in angle brackets in the command below with the appropriate value for your PostgreSQL data source and run the command:
cat <<EOF > data_source_postgres.json
{
  "type": "postgresql",
  "host": "<POSTGRES_HOST>",
  "port": "5432",
  "username": "<POSTGRES_USER_NAME>",
  "password": "<POSTGRES_PASSWORD>",
  "database": "<POSTGRES_DATABASE>"
}
EOF
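For reference, a filled-in definition might look like the following. The host, credentials, and database name here are hypothetical values for the user-data PostgreSQL instance from the example architecture (not the backend database used by the ScalarDB Analytics server) and must be replaced with the values for your own environment.
{
  "type": "postgresql",
  "host": "postgresql-user-data.default.svc.cluster.local",
  "port": "5432",
  "username": "postgres",
  "password": "postgres",
  "database": "userdata"
}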
Create a configuration file for the CLI tool
Create a configuration file (client.properties) for the ScalarDB Analytics CLI tool.
To connect to the ScalarDB Analytics server from the CLI tool, you need the hostname or IP address of the server. You can get the IP address by checking the CLUSTER-IP value of the ScalarDB Analytics server service by running the following command:
$ kubectl get svc scalardb-analytics-server
NAME                        TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)               AGE
scalardb-analytics-server   ClusterIP   10.97.81.28   <none>        11051/TCP,11052/TCP   40s
Then, create the configuration file by running the following command, replacing <ANALYTICS_SERVER_HOST> with the CLUSTER-IP value you retrieved:
cat <<EOF > client.properties
scalar.db.analytics.client.server.host=<ANALYTICS_SERVER_HOST>
scalar.db.analytics.client.server.catalog.port=11051
EOF
Register the catalog and data sources
This section describes how to register a catalog and data sources using the CLI tool.
Create a catalog
First, create a catalog by using the following command. Replace <CATALOG_NAME> with your desired catalog name.
scalardb-analytics-cli -c client.properties catalog create --catalog <CATALOG_NAME>
Register data sources
Next, register both the ScalarDB-managed and non-ScalarDB-managed data sources.
Register a ScalarDB-managed data source by using the following command. Replace <CATALOG_NAME> and <DATA_SOURCE_NAME> with your desired catalog and data source names.
scalardb-analytics-cli -c client.properties data-source register \
--catalog=<CATALOG_NAME> --data-source=<DATA_SOURCE_NAME> --provider-file=./data_source_scalardb.json
Register a non-ScalarDB-managed data source by using the following command. Replace <CATALOG_NAME> and <DATA_SOURCE_NAME> with your desired catalog and data source names.
scalardb-analytics-cli -c client.properties data-source register \
--catalog=<CATALOG_NAME> --data-source=<DATA_SOURCE_NAME> --provider-file=./data_source_postgres.json
Additional CLI commands
The CLI tool provides additional commands for managing catalogs and data sources. For detailed instructions, refer to the ScalarDB Analytics CLI tool documentation.
Step 4: Deploy a Spark client Pod
In this step, you will deploy a Spark client Pod and set it up to run Spark jobs.
Create a Spark client Pod
Create a manifest file for the Spark client Pod.
In the following example, the service account name is set to spark. Configure the Spark client Pod by running the following command:
cat <<'EOF' > spark-client.yaml
apiVersion: v1
kind: Pod
metadata:
  name: "spark-client"
spec:
  serviceAccountName: spark
  containers:
  - name: spark-client
    image: eclipse-temurin:21
    command: ['sleep']
    args: ['inf']
  restartPolicy: Never
  terminationGracePeriodSeconds: 0
EOF
Create the Spark client Pod by running the following command:
kubectl apply -f spark-client.yaml
Set up the Spark client Pod
Access the Spark client Pod via a shell session by running the following command:
kubectl exec -it spark-client -- bash
Install the Spark binary files and navigate to their directory by running the following commands:
VERSION=3.5.7
curl -O https://dlcdn.apache.org/spark/spark-${VERSION}/spark-${VERSION}-bin-hadoop3.tgz
tar xzf spark-${VERSION}-bin-hadoop3.tgz
cd spark-${VERSION}-bin-hadoop3
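You can optionally confirm that the Spark installation works by printing its version:
./bin/spark-submit --version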
Create a spark-defaults.conf file by changing the content in the angle brackets and then running the following command:
cat <<EOF > ./conf/spark-defaults.conf
spark.jars.packages com.scalar-labs:scalardb-analytics-spark-all-<SPARK_VERSION>_<SCALA_VERSION>:<SCALARDB_ANALYTICS_VERSION>
spark.sql.catalog.<CATALOG_NAME> com.scalar.db.analytics.spark.ScalarDbAnalyticsCatalog
spark.sql.catalog.<CATALOG_NAME>.server.host <ANALYTICS_SERVER_HOST>
spark.sql.catalog.<CATALOG_NAME>.server.catalog.port 11051
spark.sql.catalog.<CATALOG_NAME>.server.metering.port 11052
spark.extraListeners com.scalar.db.analytics.spark.metering.ScalarDbAnalyticsListener
EOF
Replace the content in the angle brackets as follows:
- <SPARK_VERSION>: The version of Spark.
- <SCALA_VERSION>: The version of Scala used to build Spark.
- <SCALARDB_ANALYTICS_VERSION>: The version of ScalarDB Analytics.
- <CATALOG_NAME>: The name of the catalog.
- <ANALYTICS_SERVER_HOST>: The CLUSTER-IP value of the ScalarDB Analytics server service.
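For reference, with the Spark 3.5 binary downloaded above, Scala 2.12, ScalarDB Analytics 3.17.1 (matching the CLI image used earlier), a catalog named mycatalog, and the example server address 10.97.81.28, the file might look like the following. These values are illustrative; verify the exact package coordinates and use the versions and address of your own environment.
spark.jars.packages com.scalar-labs:scalardb-analytics-spark-all-3.5_2.12:3.17.1
spark.sql.catalog.mycatalog com.scalar.db.analytics.spark.ScalarDbAnalyticsCatalog
spark.sql.catalog.mycatalog.server.host 10.97.81.28
spark.sql.catalog.mycatalog.server.catalog.port 11051
spark.sql.catalog.mycatalog.server.metering.port 11052
spark.extraListeners com.scalar.db.analytics.spark.metering.ScalarDbAnalyticsListener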
For more details, refer to Set up ScalarDB Analytics in the Spark configuration.
Step 5: Run Spark jobs from the client Pod
At this point, the Spark client Pod has been set up and is ready to run Spark jobs. This step shows examples of how to run analytical queries as Spark jobs by using the following two methods:
- Using Spark SQL
- Submitting jobs by using the spark-submit command
ScalarDB Analytics currently uses Apache Spark as its query engine. It can leverage Spark's native Kubernetes deployment mode, which enables dynamic provisioning of Spark driver and executor Pods at runtime. To use the Kubernetes deployment mode, you need to specify the Kubernetes API server (k8s://...) in the --master option of the spark commands.
Use the spark-sql command to run Spark SQL
You can run Spark SQL by running a command like the following:
./bin/spark-sql \
  --master k8s://https://kubernetes.default.svc \
  --conf spark.kubernetes.container.image=apache/spark:3.5.7-scala2.12-java11-python3-r-ubuntu \
  --conf spark.driver.host=$(hostname -i)
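Once the spark-sql shell starts, you can run queries against the tables registered in the catalog. In ScalarDB Analytics, tables registered through a data source are referenced as <CATALOG_NAME>.<DATA_SOURCE_NAME>.<NAMESPACE_NAME>.<TABLE_NAME>; the identifiers below are placeholders, so replace them with the names you registered:
SHOW NAMESPACES IN <CATALOG_NAME>;
SELECT * FROM <CATALOG_NAME>.<DATA_SOURCE_NAME>.<NAMESPACE_NAME>.<TABLE_NAME> LIMIT 10;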
Use the spark-submit command to run a Spark job
This section describes registering an application JAR, creating a temporary Pod, creating a Pod template, and executing spark-submit.
Register the application JAR to PVC
To run an application as a Spark job, you need to prepare the application's JAR file and execute the spark-submit command. The JAR file must be located at a path accessible from the Spark driver. There are several ways to achieve this; this guide demonstrates how to use a persistent volume claim (PVC). Create the PVC by running the following command:
PVC_NAME=spark-app-pvc
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ${PVC_NAME}
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 4Gi
EOF
Create a temporary Pod and copy the file
Create a temporary Pod to store the application JAR in the PVC by running the following command:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: spark-pvc-loader
spec:
  containers:
  - name: loader
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - mountPath: /mnt
      name: spark-vol
  volumes:
  - name: spark-vol
    persistentVolumeClaim:
      claimName: ${PVC_NAME}
  restartPolicy: Never
EOF
Wait for the temporary Pod to become ready by running the following command:
kubectl wait --for=condition=Ready pod/spark-pvc-loader --timeout=60s
Copy the application JAR to the PVC by running the following commands:
export JAR_PATH=/path/to/your/app.jar
kubectl cp ${JAR_PATH} spark-pvc-loader:/mnt/app.jar
Delete the temporary Pod by running the following command:
kubectl delete pod spark-pvc-loader
Create a Pod template
To create a Pod template for the dynamically generated Spark driver and executor Pods, log in to the Spark client Pod and run the following command:
PVC_NAME=spark-app-pvc
cat <<EOF > spark-pod-template.yaml
apiVersion: v1
kind: Pod
metadata:
  name: spark-pod-template
spec:
  volumes:
  - name: spark-jar-volume
    persistentVolumeClaim:
      claimName: ${PVC_NAME}
  containers:
  - name: spark-kubernetes-container
    volumeMounts:
    - mountPath: /opt/spark-jars
      name: spark-jar-volume
EOF
Execute spark-submit
Run the application as a Spark job by using a command like the following:
./bin/spark-submit \
  --master k8s://https://kubernetes.default.svc \
  --deploy-mode cluster \
  --name analytics-sample-job \
  --class com.example.TestApp \
  --conf spark.kubernetes.container.image=apache/spark:3.5.7-scala2.12-java11-python3-r-ubuntu \
  --conf spark.kubernetes.namespace=default \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.driver.podTemplateFile=./spark-pod-template.yaml \
  --conf spark.kubernetes.executor.podTemplateFile=./spark-pod-template.yaml \
  --conf spark.jars.ivy=/tmp/.ivy2 \
  --conf spark.jars.repositories=https://repo1.maven.org/maven2,https://packages.confluent.io/maven/ \
  --properties-file ./conf/spark-defaults.conf \
  local:///opt/spark-jars/app.jar
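In cluster mode, the Spark driver and executors run in dynamically created Pods. You can follow the job's progress with standard kubectl commands from your local machine; Spark on Kubernetes labels the Pods it creates with spark-role, so a check like the following should work (replace <DRIVER_POD_NAME> with the name of the driver Pod listed by the first command):
# List the driver and executor Pods created for the job
kubectl get pods -l spark-role
# Tail the logs of the driver Pod
kubectl logs -f <DRIVER_POD_NAME>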
Clean up deployed resources
This section shows how to clean up the resources you deployed in the Kubernetes environment.
Remove the ScalarDB Analytics server and the backend PostgreSQL database by running the following command:
helm uninstall scalardb-analytics-server postgresql-scalardb-analytics
Additionally, you can remove the Pods you deployed by running the following command:
kubectl delete pod spark-client analytics-server-cli
Also, you can remove the other Kubernetes resources you created by running the following commands:
# Delete the `spark` service account
kubectl delete serviceaccount spark
# Delete the `spark-app-pvc` PVC
kubectl delete pvc spark-app-pvc