Run Analytical Queries Through ScalarDB Analytics
This guide explains how to develop ScalarDB Analytics applications. For details on the architecture and design, see ScalarDB Analytics Design.
ScalarDB Analytics currently uses Apache Spark as its execution engine and provides a custom Spark catalog plugin that exposes both ScalarDB-managed and non-ScalarDB-managed data sources as Spark tables in a unified view. This allows you to run arbitrary Spark SQL queries against them seamlessly.
Preparation
This section describes the prerequisites, setting up ScalarDB Analytics in the Spark configuration, and adding the ScalarDB Analytics dependency.
Prerequisites
ScalarDB Analytics works with Apache Spark 3.4 or later. If you don't have Spark installed yet, please download the Spark distribution from Apache's website.
Apache Spark is built with either Scala 2.12 or Scala 2.13, and ScalarDB Analytics supports both versions. Be sure to check which Scala version your Spark distribution uses so that you can select the matching version of ScalarDB Analytics later. For details, see Version Compatibility.
Set up ScalarDB Analytics in the Spark configuration
The following sections describe all available configuration options for ScalarDB Analytics. These configurations control:
- How ScalarDB Analytics integrates with Spark
- How data sources are connected and accessed
- How license information is provided
For example configurations in a practical scenario, see the sample application configuration.
Spark plugin configurations
| Configuration Key | Required | Description |
|---|---|---|
| `spark.jars.packages` | No | A comma-separated list of Maven coordinates for the required dependencies. You must include the ScalarDB Analytics package you are using; otherwise, specify it as a command-line argument when running the Spark application. For details about the Maven coordinates of ScalarDB Analytics, refer to Add ScalarDB Analytics dependency. |
| `spark.sql.extensions` | Yes | Must be set to `com.scalar.db.analytics.spark.Extensions` |
| `spark.sql.catalog.<CATALOG_NAME>` | Yes | Must be set to `com.scalar.db.analytics.spark.ScalarCatalog` |

You can specify any name for `<CATALOG_NAME>`. Be sure to use the same catalog name throughout your configuration.
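If you prefer not to edit `spark-defaults.conf`, the same settings can be supplied when launching the application. A sketch using `spark-submit` (the catalog name `scalardb`, the version placeholders, and `your-application.jar` are examples to substitute with your own values):

```
spark-submit \
  --packages com.scalar-labs:scalardb-analytics-spark-all-<SPARK_VERSION>_<SCALA_VERSION>:<SCALARDB_ANALYTICS_VERSION> \
  --conf spark.sql.extensions=com.scalar.db.analytics.spark.Extensions \
  --conf spark.sql.catalog.scalardb=com.scalar.db.analytics.spark.ScalarCatalog \
  your-application.jar
```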
License configurations
| Configuration Key | Required | Description |
|---|---|---|
| `spark.sql.catalog.<CATALOG_NAME>.license.key` | Yes | A JSON string of the license key for ScalarDB Analytics |
| `spark.sql.catalog.<CATALOG_NAME>.license.cert_pem` | No | A string of the PEM-encoded certificate of the ScalarDB Analytics license. Either `cert_pem` or `cert_path` must be set. |
| `spark.sql.catalog.<CATALOG_NAME>.license.cert_path` | No | A path to the PEM-encoded certificate of the ScalarDB Analytics license. Either `cert_pem` or `cert_path` must be set. |
Data source configurations
ScalarDB Analytics supports multiple types of data sources. Each type requires specific configuration parameters:
- ScalarDB
- MySQL
- PostgreSQL
- Oracle
- SQL Server
ScalarDB Analytics supports ScalarDB itself as a data source. The following table describes how to configure ScalarDB as a data source.
| Configuration Key | Required | Description |
|---|---|---|
| `spark.sql.catalog.<CATALOG_NAME>.data_source.<DATA_SOURCE_NAME>.type` | Yes | Always set to `scalardb` |
| `spark.sql.catalog.<CATALOG_NAME>.data_source.<DATA_SOURCE_NAME>.config_path` | Yes | The path to the configuration file for ScalarDB |

You can use an arbitrary name for `<DATA_SOURCE_NAME>`.
The following table describes how to configure MySQL as a data source.

| Configuration Key | Required | Description |
|---|---|---|
| `spark.sql.catalog.<CATALOG_NAME>.data_source.<DATA_SOURCE_NAME>.type` | Yes | Always set to `mysql` |
| `spark.sql.catalog.<CATALOG_NAME>.data_source.<DATA_SOURCE_NAME>.host` | Yes | The host name of the MySQL server |
| `spark.sql.catalog.<CATALOG_NAME>.data_source.<DATA_SOURCE_NAME>.port` | Yes | The port number of the MySQL server |
| `spark.sql.catalog.<CATALOG_NAME>.data_source.<DATA_SOURCE_NAME>.username` | Yes | The username for the MySQL server |
| `spark.sql.catalog.<CATALOG_NAME>.data_source.<DATA_SOURCE_NAME>.password` | Yes | The password for the MySQL server |
| `spark.sql.catalog.<CATALOG_NAME>.data_source.<DATA_SOURCE_NAME>.database` | No | The name of the database to connect to |

You can use an arbitrary name for `<DATA_SOURCE_NAME>`.
The following table describes how to configure PostgreSQL as a data source.

| Configuration Key | Required | Description |
|---|---|---|
| `spark.sql.catalog.<CATALOG_NAME>.data_source.<DATA_SOURCE_NAME>.type` | Yes | Always set to `postgresql` or `postgres` |
| `spark.sql.catalog.<CATALOG_NAME>.data_source.<DATA_SOURCE_NAME>.host` | Yes | The host name of the PostgreSQL server |
| `spark.sql.catalog.<CATALOG_NAME>.data_source.<DATA_SOURCE_NAME>.port` | Yes | The port number of the PostgreSQL server |
| `spark.sql.catalog.<CATALOG_NAME>.data_source.<DATA_SOURCE_NAME>.username` | Yes | The username for the PostgreSQL server |
| `spark.sql.catalog.<CATALOG_NAME>.data_source.<DATA_SOURCE_NAME>.password` | Yes | The password for the PostgreSQL server |
| `spark.sql.catalog.<CATALOG_NAME>.data_source.<DATA_SOURCE_NAME>.database` | Yes | The name of the database to connect to |

You can use an arbitrary name for `<DATA_SOURCE_NAME>`.
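As a concrete illustration, a PostgreSQL data source named `postgres_source` under a catalog named `scalardb` could be configured as follows (the host, credentials, and database name are placeholder values to replace with your own):

```
spark.sql.catalog.scalardb.data_source.postgres_source.type postgresql
spark.sql.catalog.scalardb.data_source.postgres_source.host localhost
spark.sql.catalog.scalardb.data_source.postgres_source.port 5432
spark.sql.catalog.scalardb.data_source.postgres_source.username postgres
spark.sql.catalog.scalardb.data_source.postgres_source.password password
spark.sql.catalog.scalardb.data_source.postgres_source.database mydb
```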
The following table describes how to configure Oracle as a data source.

| Configuration Key | Required | Description |
|---|---|---|
| `spark.sql.catalog.<CATALOG_NAME>.data_source.<DATA_SOURCE_NAME>.type` | Yes | Always set to `oracle` |
| `spark.sql.catalog.<CATALOG_NAME>.data_source.<DATA_SOURCE_NAME>.host` | Yes | The host name of the Oracle server |
| `spark.sql.catalog.<CATALOG_NAME>.data_source.<DATA_SOURCE_NAME>.port` | Yes | The port number of the Oracle server |
| `spark.sql.catalog.<CATALOG_NAME>.data_source.<DATA_SOURCE_NAME>.username` | Yes | The username for the Oracle server |
| `spark.sql.catalog.<CATALOG_NAME>.data_source.<DATA_SOURCE_NAME>.password` | Yes | The password for the Oracle server |
| `spark.sql.catalog.<CATALOG_NAME>.data_source.<DATA_SOURCE_NAME>.service_name` | Yes | The service name of the Oracle server |

You can use an arbitrary name for `<DATA_SOURCE_NAME>`.
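For example, an Oracle data source named `oracle_source` under a catalog named `scalardb` might look like the following (the host, credentials, and service name are placeholder values to replace with your own):

```
spark.sql.catalog.scalardb.data_source.oracle_source.type oracle
spark.sql.catalog.scalardb.data_source.oracle_source.host localhost
spark.sql.catalog.scalardb.data_source.oracle_source.port 1521
spark.sql.catalog.scalardb.data_source.oracle_source.username system
spark.sql.catalog.scalardb.data_source.oracle_source.password password
spark.sql.catalog.scalardb.data_source.oracle_source.service_name MYSERVICE
```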
The following table describes how to configure SQL Server as a data source.

| Configuration Key | Required | Description |
|---|---|---|
| `spark.sql.catalog.<CATALOG_NAME>.data_source.<DATA_SOURCE_NAME>.type` | Yes | Always set to `sqlserver` or `mssql` |
| `spark.sql.catalog.<CATALOG_NAME>.data_source.<DATA_SOURCE_NAME>.host` | Yes | The host name of the SQL Server instance |
| `spark.sql.catalog.<CATALOG_NAME>.data_source.<DATA_SOURCE_NAME>.port` | Yes | The port number of the SQL Server instance |
| `spark.sql.catalog.<CATALOG_NAME>.data_source.<DATA_SOURCE_NAME>.username` | Yes | The username for the SQL Server instance |
| `spark.sql.catalog.<CATALOG_NAME>.data_source.<DATA_SOURCE_NAME>.password` | Yes | The password for the SQL Server instance |
| `spark.sql.catalog.<CATALOG_NAME>.data_source.<DATA_SOURCE_NAME>.database` | No | The name of the database to connect to |
| `spark.sql.catalog.<CATALOG_NAME>.data_source.<DATA_SOURCE_NAME>.secure` | No | Whether to use a secure connection to the SQL Server instance. Set to `true` to use a secure connection. |

You can use an arbitrary name for `<DATA_SOURCE_NAME>`.
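For example, a SQL Server data source named `sqlserver_source` under a catalog named `scalardb` might look like the following (the host, credentials, and database name are placeholder values to replace with your own; `secure` is optional and shown here for illustration):

```
spark.sql.catalog.scalardb.data_source.sqlserver_source.type sqlserver
spark.sql.catalog.scalardb.data_source.sqlserver_source.host localhost
spark.sql.catalog.scalardb.data_source.sqlserver_source.port 1433
spark.sql.catalog.scalardb.data_source.sqlserver_source.username sa
spark.sql.catalog.scalardb.data_source.sqlserver_source.password password
spark.sql.catalog.scalardb.data_source.sqlserver_source.database mydb
spark.sql.catalog.scalardb.data_source.sqlserver_source.secure true
```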
Example configuration
Below is an example configuration for ScalarDB Analytics that demonstrates how to set up a catalog named `scalardb` with multiple data sources:

```
# Spark plugin configurations
spark.jars.packages com.scalar-labs:scalardb-analytics-spark-all-<SPARK_VERSION>_<SCALA_VERSION>:<SCALARDB_ANALYTICS_VERSION>
spark.sql.extensions com.scalar.db.analytics.spark.Extensions
spark.sql.catalog.scalardb com.scalar.db.analytics.spark.ScalarCatalog

# License configurations
spark.sql.catalog.scalardb.license.key <LICENSE_KEY>
spark.sql.catalog.scalardb.license.cert_pem <LICENSE_PEM_ENCODED_CERTIFICATE>

# Data source configurations
spark.sql.catalog.scalardb.data_source.scalardb.type scalardb
spark.sql.catalog.scalardb.data_source.scalardb.config_path /path/to/scalardb.properties
spark.sql.catalog.scalardb.data_source.mysql_source.type mysql
spark.sql.catalog.scalardb.data_source.mysql_source.host localhost
spark.sql.catalog.scalardb.data_source.mysql_source.port 3306
spark.sql.catalog.scalardb.data_source.mysql_source.username root
spark.sql.catalog.scalardb.data_source.mysql_source.password password
spark.sql.catalog.scalardb.data_source.mysql_source.database mydb
```
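With a configuration like the one above in place, tables from the data sources can be queried with standard Spark SQL. The sketch below assumes a table identifier format of `<CATALOG_NAME>.<DATA_SOURCE_NAME>.<NAMESPACE>.<TABLE>`, and the namespace and table names are hypothetical; check the identifiers ScalarDB Analytics exposes in your environment (for example, with `SHOW TABLES`):

```sql
-- Query a table from the ScalarDB data source
-- (namespace "sales" and table "orders" are hypothetical)
SELECT * FROM scalardb.scalardb.sales.orders LIMIT 10;

-- Join data across two data sources (ScalarDB and MySQL)
SELECT o.order_id, c.name
FROM scalardb.scalardb.sales.orders o
JOIN scalardb.mysql_source.mydb.customers c
  ON o.customer_id = c.id;
```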