Version: 3.15

Run Analytical Queries Through ScalarDB Analytics

This guide explains how to develop ScalarDB Analytics applications. For details on the architecture and design, see ScalarDB Analytics Design.

ScalarDB Analytics currently uses Spark as its execution engine and provides a custom Spark catalog plugin that exposes both ScalarDB-managed and non-ScalarDB-managed data sources as Spark tables in a unified view. This allows you to execute arbitrary Spark SQL queries seamlessly.

Preparation

This section describes the prerequisites, how to set up ScalarDB Analytics in the Spark configuration, and how to add the ScalarDB Analytics dependency.

Prerequisites

ScalarDB Analytics works with Apache Spark 3.4 or later. If you don't have Spark installed yet, please download the Spark distribution from Apache's website.

Note

Apache Spark is built with either Scala 2.12 or Scala 2.13, and ScalarDB Analytics supports both versions. Make sure you know which Scala version you are using so that you can select the correct version of ScalarDB Analytics later. For more details, refer to Version Compatibility.

Set up ScalarDB Analytics in the Spark configuration

The following sections describe all available configuration options for ScalarDB Analytics. These configurations control:

  • How ScalarDB Analytics integrates with Spark
  • How data sources are connected and accessed
  • How license information is provided

For example configurations in a practical scenario, see the sample application configuration.

Spark plugin configurations

| Configuration Key | Required | Description |
|---|---|---|
| spark.jars.packages | No | A comma-separated list of Maven coordinates for the required dependencies. You need to include the ScalarDB Analytics package you are using; otherwise, specify it as a command-line argument when running the Spark application. For details about the Maven coordinates of ScalarDB Analytics, refer to Add the ScalarDB Analytics dependency. |
| spark.sql.extensions | Yes | Must be set to com.scalar.db.analytics.spark.Extensions. |
| spark.sql.catalog.<CATALOG_NAME> | Yes | Must be set to com.scalar.db.analytics.spark.ScalarCatalog. |

You can specify any name for <CATALOG_NAME>. Be sure to use the same catalog name throughout your configuration.
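
These settings are normally placed in the Spark configuration (for example, in spark-defaults.conf or via spark-submit --conf), but Spark also lets you set them programmatically when building the SparkSession. Below is a minimal, hypothetical sketch assuming a catalog named scalardb; the class name is illustrative, and the license and data source configurations described in the following sections are still required:

import org.apache.spark.sql.SparkSession;

public class ScalarDbAnalyticsConfigExample {
  public static void main(String[] args) {
    // Set the ScalarDB Analytics plugin configurations programmatically (sketch only).
    // The license and data source configurations described below are still required.
    SparkSession spark = SparkSession.builder()
        .appName("ScalarDbAnalyticsConfigExample")
        .config("spark.sql.extensions", "com.scalar.db.analytics.spark.Extensions")
        .config("spark.sql.catalog.scalardb", "com.scalar.db.analytics.spark.ScalarCatalog")
        .getOrCreate();

    spark.stop();
  }
}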

License configurations

| Configuration Key | Required | Description |
|---|---|---|
| spark.sql.catalog.<CATALOG_NAME>.license.key | Yes | A JSON string of the license key for ScalarDB Analytics. |
| spark.sql.catalog.<CATALOG_NAME>.license.cert_pem | Yes | A string of the PEM-encoded certificate of the ScalarDB Analytics license. Either cert_pem or cert_path must be set. |
| spark.sql.catalog.<CATALOG_NAME>.license.cert_path | Yes | A path to the PEM-encoded certificate of the ScalarDB Analytics license. Either cert_pem or cert_path must be set. |

Data source configurations

ScalarDB Analytics supports multiple types of data sources. Each type requires specific configuration parameters:

Note

ScalarDB Analytics supports ScalarDB as a data source. This table describes how to configure ScalarDB as a data source.

| Configuration Key | Required | Description |
|---|---|---|
| spark.sql.catalog.<CATALOG_NAME>.data_source.<DATA_SOURCE_NAME>.type | Yes | Always set to scalardb. |
| spark.sql.catalog.<CATALOG_NAME>.data_source.<DATA_SOURCE_NAME>.config_path | Yes | The path to the configuration file for ScalarDB. |

Tip

You can use an arbitrary name for <DATA_SOURCE_NAME>.
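
The file referenced by config_path is a standard ScalarDB configuration file. Below is a minimal, hypothetical sketch assuming a JDBC-backed ScalarDB deployment; the actual properties depend on your ScalarDB setup:

# Hypothetical example of /path/to/scalardb.properties for a JDBC backend
scalar.db.storage=jdbc
scalar.db.contact_points=jdbc:postgresql://localhost:5432/scalardb
scalar.db.username=scalardb_user
scalar.db.password=scalardb_password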

Example configuration

Below is an example configuration for ScalarDB Analytics that demonstrates how to set up a catalog named scalardb with multiple data sources:

# Spark plugin configurations
spark.jars.packages com.scalar-labs:scalardb-analytics-spark-all-<SPARK_VERSION>_<SCALA_VERSION>:<SCALARDB_ANALYTICS_VERSION>
spark.sql.extensions com.scalar.db.analytics.spark.Extensions
spark.sql.catalog.scalardb com.scalar.db.analytics.spark.ScalarCatalog

# License configurations
spark.sql.catalog.scalardb.license.key <LICENSE_KEY>
spark.sql.catalog.scalardb.license.cert_pem <LICENSE_PEM_ENCODED_CERTIFICATE>

# Data source configurations
spark.sql.catalog.scalardb.data_source.scalardb.type scalardb
spark.sql.catalog.scalardb.data_source.scalardb.config_path /path/to/scalardb.properties

spark.sql.catalog.scalardb.data_source.mysql_source.type mysql
spark.sql.catalog.scalardb.data_source.mysql_source.host localhost
spark.sql.catalog.scalardb.data_source.mysql_source.port 3306
spark.sql.catalog.scalardb.data_source.mysql_source.username root
spark.sql.catalog.scalardb.data_source.mysql_source.password password
spark.sql.catalog.scalardb.data_source.mysql_source.database mydb

Replace the content in the angle brackets as follows:

  • <LICENSE_KEY>: The license key for ScalarDB Analytics
  • <LICENSE_PEM_ENCODED_CERTIFICATE>: The PEM-encoded certificate of ScalarDB Analytics license
  • <SPARK_VERSION>: The major and minor version of Spark you are using (such as 3.4)
  • <SCALA_VERSION>: The major and minor version of Scala that matches your Spark installation (such as 2.12 or 2.13)
  • <SCALARDB_ANALYTICS_VERSION>: The version of ScalarDB Analytics

Add the ScalarDB Analytics dependency

ScalarDB Analytics is hosted in the Maven Central Repository. The name of the package is scalardb-analytics-spark-all-<SPARK_VERSION>_<SCALA_VERSION>:<SCALARDB_ANALYTICS_VERSION>, where:

  • <SPARK_VERSION>: The major and minor version of Spark you are using (such as 3.4)
  • <SCALA_VERSION>: The major and minor version of Scala that matches your Spark installation (such as 2.12 or 2.13)
  • <SCALARDB_ANALYTICS_VERSION>: The version of ScalarDB Analytics

For details about version compatibility, refer to Version Compatibility.

You can add this dependency by configuring the build settings of your project. For example, if you are using Gradle, you can add the following to your build.gradle file:

dependencies {
    implementation 'com.scalar-labs:scalardb-analytics-spark-all-<SPARK_VERSION>_<SCALA_VERSION>:<SCALARDB_ANALYTICS_VERSION>'
}
Note

If you want to bundle your application into a single fat JAR file by using plugins like the Gradle Shadow plugin or the Maven Shade plugin, you need to exclude ScalarDB Analytics from the fat JAR file by choosing the appropriate configuration, such as provided or shadow, depending on the plugin you are using.
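
For example, with the Gradle Shadow plugin, one way to keep ScalarDB Analytics out of the fat JAR is to declare it as a compileOnly dependency, since compileOnly dependencies are not bundled by the shadowJar task. This is only a sketch; the plugin version is illustrative, and your build may require a different approach:

plugins {
    id 'java'
    id 'com.github.johnrengelman.shadow' version '8.1.1' // illustrative version
}

dependencies {
    // Compile against ScalarDB Analytics, but keep it out of the fat JAR
    // because it is already provided to Spark via spark.jars.packages.
    compileOnly 'com.scalar-labs:scalardb-analytics-spark-all-<SPARK_VERSION>_<SCALA_VERSION>:<SCALARDB_ANALYTICS_VERSION>'
}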

Develop a Spark application

In this section, you will learn how to develop a Spark application that uses ScalarDB Analytics in Java.

There are three ways to develop Spark applications with ScalarDB Analytics:

  1. Spark driver application: A traditional Spark application that runs within the cluster
  2. Spark Connect application: A remote application that uses the Spark Connect protocol
  3. JDBC application: A remote application that uses the JDBC interface
Note

Depending on your environment, you may not be able to use all the methods mentioned above. For details about supported features and deployment options, refer to Supported managed Spark services and their application types.

With all these methods, you can refer to tables in ScalarDB Analytics using the same table identifier format. For details about how ScalarDB Analytics maps catalog information from data sources, refer to Catalog information mappings by data source.

You can use the standard SparkSession class with ScalarDB Analytics. Additionally, you can use any type of cluster deployment that Spark supports, such as YARN, Kubernetes, standalone, or local mode.

To read data from tables in ScalarDB Analytics, you can use the spark.sql or spark.read.table function in the same way as when reading a normal Spark table.

First, you need to set up your Java project. For example, if you are using Gradle, you can add the following to your build.gradle file:

dependencies {
    implementation 'com.scalar-labs:scalardb-analytics-spark-<SPARK_VERSION>_<SCALA_VERSION>:<SCALARDB_ANALYTICS_VERSION>'
}

Below is an example of a Spark Driver application:

import org.apache.spark.sql.SparkSession;

public class MyApp {
  public static void main(String[] args) {
    // Create a SparkSession
    try (SparkSession spark = SparkSession.builder().getOrCreate()) {
      // Read data from a table in ScalarDB Analytics
      spark.sql("SELECT * FROM my_catalog.my_data_source.my_namespace.my_table").show();
    }
  }
}
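
As noted above, you can also read the same table through the DataFrame API with spark.read.table instead of SQL. The following is a minimal sketch using the same hypothetical table name:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MyDataFrameApp {
  public static void main(String[] args) {
    try (SparkSession spark = SparkSession.builder().getOrCreate()) {
      // Read data from a table in ScalarDB Analytics by using the DataFrame API
      Dataset<Row> df = spark.read().table("my_catalog.my_data_source.my_namespace.my_table");
      df.show();
    }
  }
}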

Then, you can build and run your application by using the spark-submit command.

Note

You may need to build a fat JAR file for your application, as is usual for normal Spark applications.

spark-submit --class MyApp --master local[*] my-spark-application-all.jar
Tip

You can also use other CLI tools that Spark provides, such as spark-sql and spark-shell, to interact with ScalarDB Analytics tables.

Catalog information mapping

ScalarDB Analytics manages its own catalog, containing data sources, namespaces, tables, and columns. That information is automatically mapped to the Spark catalog. In this section, you will learn how ScalarDB Analytics maps its catalog information to the Spark catalog.

For details about how information in the raw data sources is mapped to the ScalarDB Analytics catalog, refer to Catalog information mappings by data source.

Catalog level mapping

Each catalog level object in the ScalarDB Analytics catalog is mapped to a Spark catalog. The following table shows how the catalog levels are mapped:

| ScalarDB Analytics Catalog Level | Spark Catalog Object |
|---|---|
| Catalog | Catalog |
| Data source | Part of the namespace |
| Namespace | Part of the namespace |
| Table | Table |

Data source tables

Tables from data sources in the ScalarDB Analytics catalog are mapped to Spark tables. The following format is used to represent the identity of the Spark tables that correspond to ScalarDB Analytics tables:

<CATALOG_NAME>.<DATA_SOURCE_NAME>.<NAMESPACE_NAMES>.<TABLE_NAME>

Replace the content in the angle brackets as follows:

  • <CATALOG_NAME>: The name of the catalog.
  • <DATA_SOURCE_NAME>: The name of the data source.
  • <NAMESPACE_NAMES>: The names of the namespaces. If the namespace names are multi-level, they are concatenated with a dot (.) as the separator.
  • <TABLE_NAME>: The name of the table.

For example, if you have a ScalarDB catalog named my_catalog that contains a data source named my_data_source and a schema named my_schema, you can refer to the table named my_table in that schema as my_catalog.my_data_source.my_schema.my_table.
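
Because all data sources are exposed under the same Spark catalog, a single query can also join tables from different data sources. The following sketch assumes the driver application above (a SparkSession named spark) and hypothetical table and column names in a catalog named my_catalog with data sources named my_data_source and mysql_source:

// Inside the try block of the driver application above (hypothetical table and column names)
spark.sql(
    "SELECT o.order_id, c.customer_name " +
    "FROM my_catalog.my_data_source.my_namespace.orders o " +
    "JOIN my_catalog.mysql_source.mydb.customers c ON o.customer_id = c.customer_id"
).show();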

Views

Views in ScalarDB Analytics are provided as tables in the Spark catalog, not views. The following format is used to represent the identity of the Spark tables that correspond to ScalarDB Analytics views:

<CATALOG_NAME>.view.<VIEW_NAMESPACE_NAMES>.<VIEW_NAME>

Replace the content in the angle brackets as follows:

  • <CATALOG_NAME>: The name of the catalog.
  • <VIEW_NAMESPACE_NAMES>: The names of the view namespaces. If the view namespace names are multi-level, they are concatenated with a dot (.) as the separator.
  • <VIEW_NAME>: The name of the view.

For example, if you have a ScalarDB catalog named my_catalog and a view namespace named my_view_namespace, you can refer to the view named my_view in that namespace as my_catalog.view.my_view_namespace.my_view.

Note

The view prefix is used to avoid conflicts with data source table identifiers.

WAL-interpreted views

As explained in ScalarDB Analytics Design, ScalarDB Analytics provides a functionality called WAL-interpreted views, which are a special type of view. These views are automatically created for the tables of ScalarDB data sources to provide a user-friendly view of the data by interpreting the WAL metadata in the tables.

Since the data source name and the namespace names of the original ScalarDB tables are used as the view namespace names for WAL-interpreted views, if you have a ScalarDB table named my_table in a namespace named my_namespace of a data source named my_data_source, you can refer to the WAL-interpreted view of the table as my_catalog.view.my_data_source.my_namespace.my_table.
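
Querying a WAL-interpreted view works the same way as querying any other table. The following sketch assumes the hypothetical names above and a SparkSession named spark:

// Read the WAL-interpreted view of a ScalarDB table instead of the raw table
spark.sql("SELECT * FROM my_catalog.view.my_data_source.my_namespace.my_table").show();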

Data-type mapping

ScalarDB Analytics maps data types in its catalog to Spark data types. The following table shows how the data types are mapped:

| ScalarDB Data Type | Spark Data Type |
|---|---|
| BYTE | Byte |
| SMALLINT | Short |
| INT | Integer |
| BIGINT | Long |
| FLOAT | Float |
| DOUBLE | Double |
| DECIMAL | Decimal |
| TEXT | String |
| BLOB | Binary |
| BOOLEAN | Boolean |
| DATE | Date |
| TIME | TimestampNTZ |
| DATETIME | TimestampNTZ |
| TIMESTAMP | Timestamp |
| DURATION | CalendarInterval |
| INTERVAL | CalendarInterval |
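
To check how the columns of a specific table were mapped, you can inspect the schema that Spark reports for the table. The following sketch assumes the hypothetical table name used earlier and a SparkSession named spark:

// Print the Spark schema of a ScalarDB Analytics table to see the mapped column types
spark.table("my_catalog.my_data_source.my_namespace.my_table").printSchema();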

Version compatibility

Since Spark and Scala may be incompatible among different minor versions, ScalarDB Analytics offers different artifacts for various Spark and Scala versions, named in the format scalardb-analytics-spark-all-<SPARK_VERSION>_<SCALA_VERSION>. Make sure that you select the artifact matching the Spark and Scala versions you're using. For example, if you're using Spark 3.5 with Scala 2.13, you must specify scalardb-analytics-spark-all-3.5_2.13.

Regarding the Java version, ScalarDB Analytics supports Java 8 or later.

The following is a list of Spark and Scala versions supported by each version of ScalarDB Analytics.

| ScalarDB Analytics Version | ScalarDB Version | Spark Versions Supported | Scala Versions Supported | Minimum Java Version |
|---|---|---|---|---|
| 3.15 | 3.15 | 3.5, 3.4 | 2.13, 2.12 | 8 |
| 3.14 | 3.14 | 3.5, 3.4 | 2.13, 2.12 | 8 |
| 3.12 | 3.12 | 3.5, 3.4 | 2.13, 2.12 | 8 |