Version: 3.15

Deploy ScalarDB Analytics in Public Cloud Environments

This guide explains how to deploy ScalarDB Analytics in a public cloud environment. ScalarDB Analytics currently uses Apache Spark as an execution engine and supports managed Spark services provided by public cloud providers, such as Amazon EMR and Databricks.

Supported managed Spark services and their application types

ScalarDB Analytics supports the following managed Spark services and application types.

Public Cloud Service | Spark Driver | Spark Connect | JDBC
Amazon EMR (EMR on EC2)
Databricks

Configure and deploy

Select your public cloud environment, and follow the instructions to set up and deploy ScalarDB Analytics.

Use Amazon EMR

You can use Amazon EMR (EMR on EC2) to run analytical queries through ScalarDB Analytics. For the basics of launching an EMR cluster, refer to the AWS EMR on EC2 documentation.

ScalarDB Analytics configuration

To enable ScalarDB Analytics, you need to add the following configuration to the Software setting when you launch an EMR cluster. Be sure to replace the content in the angle brackets:

[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.jars.packages": "com.scalar-labs:scalardb-analytics-spark-all-<SPARK_VERSION>_<SCALA_VERSION>:<SCALARDB_ANALYTICS_VERSION>",
      "spark.sql.catalog.<CATALOG_NAME>": "com.scalar.db.analytics.spark.ScalarDbAnalyticsCatalog",
      "spark.sql.extensions": "com.scalar.db.analytics.spark.extension.ScalarDbAnalyticsExtensions",
      "spark.sql.catalog.<CATALOG_NAME>.license.cert_pem": "<YOUR_LICENSE_CERT_PEM>",
      "spark.sql.catalog.<CATALOG_NAME>.license.key": "<YOUR_LICENSE_KEY>",

      // Add your data source configuration below
    }
  }
]

Replace the content in the angle brackets as follows:

  • <SPARK_VERSION>: The major and minor version of Spark (such as 3.5).
  • <SCALA_VERSION>: The major and minor version of Scala used to build Spark (such as 2.12 or 2.13).
  • <SCALARDB_ANALYTICS_VERSION>: The version of ScalarDB Analytics.
  • <CATALOG_NAME>: The name of the catalog.
  • <YOUR_LICENSE_CERT_PEM>: The PEM encoded license certificate.
  • <YOUR_LICENSE_KEY>: The license key.

For more details, refer to Set up ScalarDB Analytics in the Spark configuration.
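
Because every property in the configuration above is derived from the same few placeholders, it can help to generate the JSON rather than edit it by hand. The following is a minimal sketch, not part of ScalarDB Analytics: the function name `build_spark_defaults`, the `data_source_properties` parameter, and the example catalog name `my_catalog` are assumptions made for illustration. It also sidesteps the fact that a `//` comment is not valid JSON, since the generated output contains only properties.

```python
import json


def build_spark_defaults(spark_version, scala_version, analytics_version,
                         catalog_name, license_cert_pem, license_key,
                         data_source_properties=None):
    """Return the EMR "Software setting" configuration as a Python list.

    Hypothetical helper: it only assembles the property names shown in
    this guide from the placeholder values.
    """
    catalog_key = f"spark.sql.catalog.{catalog_name}"
    properties = {
        "spark.jars.packages": (
            "com.scalar-labs:scalardb-analytics-spark-all-"
            f"{spark_version}_{scala_version}:{analytics_version}"
        ),
        catalog_key: "com.scalar.db.analytics.spark.ScalarDbAnalyticsCatalog",
        "spark.sql.extensions":
            "com.scalar.db.analytics.spark.extension.ScalarDbAnalyticsExtensions",
        f"{catalog_key}.license.cert_pem": license_cert_pem,
        f"{catalog_key}.license.key": license_key,
    }
    # Data source settings are passed through unchanged; their keys depend
    # on your data sources and are not defined by this sketch.
    properties.update(data_source_properties or {})
    return [{"Classification": "spark-defaults", "Properties": properties}]


# Example call. The placeholder strings are left as-is on purpose;
# substitute your own values before using the output.
config = build_spark_defaults(
    spark_version="3.5",
    scala_version="2.12",
    analytics_version="<SCALARDB_ANALYTICS_VERSION>",
    catalog_name="my_catalog",  # hypothetical catalog name
    license_cert_pem="<YOUR_LICENSE_CERT_PEM>",
    license_key="<YOUR_LICENSE_KEY>",
)
print(json.dumps(config, indent=2))
```

The printed JSON can be pasted into the Software setting field, provided the remaining placeholders have been replaced.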

Run analytical queries via the Spark driver

After the EMR Spark cluster has launched, you can use SSH to connect to the primary node of the EMR cluster and run your Spark application. For details on how to create a Spark Driver application, refer to Spark Driver application.

Run analytical queries via Spark Connect

You can use Spark Connect to run your Spark application remotely by using the EMR cluster that you launched.

First, configure the Software setting in the same way as for the Spark Driver application. Then apply the following configuration to enable Spark Connect.

Allow inbound traffic for a Spark Connect server

  1. Create a security group to allow inbound traffic for a Spark Connect server (port 15001 is the default).
  2. Allow the "Amazon EMR service role" to attach the security group to the primary node of the EMR cluster.
  3. Add the security group to the primary node of the EMR cluster under "Additional security groups" when you launch the EMR cluster.

Launch the Spark Connect server via a bootstrap action

  1. Create a script file to launch the Spark Connect server as follows:

#!/usr/bin/env bash

set -eu -o pipefail

cd /var/lib/spark

sudo -u spark /usr/lib/spark/sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_<SCALA_VERSION>:<SPARK_FULL_VERSION>,com.scalar-labs:scalardb-analytics-spark-all-<SPARK_VERSION>_<SCALA_VERSION>:<SCALARDB_ANALYTICS_VERSION>

Replace the content in the angle brackets as follows:

  • <SCALA_VERSION>: The major and minor version of Scala that matches your Spark installation (such as 2.12 or 2.13).
  • <SPARK_FULL_VERSION>: The full version of Spark you are using (such as 3.5.3).
  • <SPARK_VERSION>: The major and minor version of Spark you are using (such as 3.5).
  • <SCALARDB_ANALYTICS_VERSION>: The version of ScalarDB Analytics.
  2. Upload the script file to S3.
  3. Allow the "EC2 instance profile for Amazon EMR" role to access the uploaded script file in S3.
  4. Add the uploaded script file to "Bootstrap actions" when you launch the EMR cluster.
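
The version placeholders in the bootstrap script are related: <SPARK_VERSION> is the major and minor part of <SPARK_FULL_VERSION>, and <SCALA_VERSION> appears in both package coordinates. The sketch below shows one way to derive the `--packages` value from a single set of inputs so the two coordinates cannot drift apart. The function names are hypothetical helpers for illustration, not part of Spark or ScalarDB Analytics.

```python
def spark_minor_version(spark_full_version):
    """Reduce a full Spark version such as "3.5.3" to "3.5"."""
    parts = spark_full_version.split(".")
    if len(parts) < 2 or not all(p.isdigit() for p in parts[:2]):
        raise ValueError(f"unexpected Spark version: {spark_full_version!r}")
    return ".".join(parts[:2])


def connect_server_packages(scala_version, spark_full_version, analytics_version):
    """Build the comma-separated value for the --packages option.

    The coordinate patterns are copied from the bootstrap script in this
    guide; only the version strings are filled in.
    """
    spark_version = spark_minor_version(spark_full_version)
    coordinates = [
        f"org.apache.spark:spark-connect_{scala_version}:{spark_full_version}",
        "com.scalar-labs:scalardb-analytics-spark-all-"
        f"{spark_version}_{scala_version}:{analytics_version}",
    ]
    return ",".join(coordinates)


# Example call. The ScalarDB Analytics version is left as a placeholder
# on purpose; substitute the version you are using.
print(connect_server_packages("2.12", "3.5.3", "<SCALARDB_ANALYTICS_VERSION>"))
```

The printed string can be substituted after `--packages` in the script before you upload it to S3.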

Run analytical queries

You can run your Spark application via Spark Connect from anywhere by using the remote URL of the Spark Connect server, which is sc://<PRIMARY_NODE_PUBLIC_HOSTNAME>:15001.

For details on how to create a Spark application by using Spark Connect, refer to Spark Connect application.
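
Before pointing an application at the server, it can be useful to confirm that the security group and bootstrap action took effect. The following is a minimal sketch that uses only the Python standard library: it builds the remote URL in the form given above and checks whether a TCP connection to the port can be opened. The function names and the host name `primary.example.internal` are assumptions for illustration, and a successful TCP connection shows only that the port is open, not that the Spark Connect server is healthy.

```python
import socket

# Port used in this guide; change it if your server listens elsewhere.
GUIDE_SPARK_CONNECT_PORT = 15001


def spark_connect_url(host, port=GUIDE_SPARK_CONNECT_PORT):
    """Return the remote URL in the sc://<host>:<port> form."""
    return f"sc://{host}:{port}"


def is_reachable(host, port=GUIDE_SPARK_CONNECT_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


# Hypothetical host name; use the public hostname of your primary node.
host = "primary.example.internal"
print(spark_connect_url(host))
print("reachable" if is_reachable(host, timeout=1.0) else "not reachable")
```

With PySpark 3.4 or later, a URL of this form can be passed to `SparkSession.builder.remote(...)` to create a session against the remote server.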