Getting Started with ScalarDB Analytics
This tutorial describes how to run analytical queries on sample data by using ScalarDB Analytics. The source code is available at https://github.com/scalar-labs/scalardb-samples/tree/main/scalardb-analytics-spark-sample.
ScalarDB Analytics in its current version leverages Apache Spark as its execution engine. It provides a unified view of ScalarDB-managed and non-ScalarDB-managed data sources by using a Spark custom catalog. By using ScalarDB Analytics, you can treat tables from these data sources as native Spark tables. This allows you to execute arbitrary Spark SQL queries seamlessly. For example, you can join a table stored in Cassandra with a table in PostgreSQL to perform cross-database analysis with ease.
Overview of the sample applicationβ
This sample tutorial demonstrates how to configure Spark to enable ScalarDB Analytics and perform interactive analyses using spark-sql on tables provided by ScalarDB Analytics.
Prerequisites for this sample applicationβ
- Docker 20.10 or later with Docker Compose V2 or later
You need to have a license key (trial license or commercial license) to use ScalarDB Analytics. If you don't have a license key, please contact us.
Step 1: Set up ScalarDB Analyticsβ
Clone the ScalarDB samples repositoryβ
Open Terminal, and clone the ScalarDB samples repository by running the following command:
git clone https://github.com/scalar-labs/scalardb-samples
Then, go to the directory that contains the sample application by running the following command:
cd scalardb-samples/scalardb-analytics-spark-sample
Set your license informationβ
ScalarDB Analytics requires valid license information to be specified in the Spark configuration. You can provide your license details in the spark-defaults.conf file.
Open the spark-defaults.conf file located in the conf directory of your Spark installation. Then, replace <REPLACE_THIS_WITH_YOUR_LICENSE> with your license key and <REPLACE_THIS_WITH_YOUR_LICENSE_CERTIFICATE_PEM_CONTENTS> with the PEM-encoded contents of your license certificate.
spark.sql.catalog.test_catalog.license.key <REPLACE_THIS_WITH_YOUR_LICENSE>
spark.sql.catalog.test_catalog.license.cert_pem <REPLACE_THIS_WITH_YOUR_LICENSE_CERTIFICATE_PEM_CONTENTS>
For additional configuration details required in the spark-defaults.conf file for setting up ScalarDB Analytics, refer to ScalarDB Analytics configuration.
Step 2: Set up the sample databasesβ
To set up the sample databases, run the following command:
docker compose up -d --wait
This command starts three services locally: PostgreSQL, Cassandra, and MySQL.
- PostgreSQL: Used independently (non-ScalarDB-managed).
- Cassandra and MySQL: Used as backend databases for ScalarDB (ScalarDB-managed).
In this guide, PostgreSQL is referred to as a non-ScalarDB-managed database, which is not managed by ScalarDB transactions, while Cassandra and DynamoDB are referred to as ScalarDB-managed databases, which are managed by ScalarDB transactions.
For non-ScalarDB-managed databases, sample data is automatically loaded when the Docker container is initialized, so no additional steps are required. For ScalarDB-managed databases, run the following command to load the sample data after starting the containers:
docker compose run --rm sample-data-loader
After completing the setup, the following tables should be available:
- In PostgreSQL:
sample_ns.customer
- In ScalarDB (backed by Cassandra):
cassandrans.lineitem
- In ScalarDB (backed by MySQL):
mysqlns.order
According to the above, within ScalarDB, cassandrans and mysqlns are mapped to Cassandra and MySQL, respectively.
For details about the table schema, including column definitions and data types, refer to Schema details. Ensure that the sample data has been successfully loaded into these tables.
Step 3: Launch the Spark SQL consoleβ
To launch the Spark SQL console, run the following command:
docker compose run --rm spark-sql
While launching the Spark SQL console, the ScalarDB Analytics catalog is initialized with the configuration in spark-defaults.conf and is registered as a Spark catalog named test_catalog.
Namespace mappingβ
The following tables in the configured data sources are mapped to Spark SQL tables, allowing seamless querying across different data sources:
- For PostgreSQL:
test_catalog.postgresql.sample_ns.customer
- For ScalarDB (backed by Cassandra):
test_catalog.scalardb.cassandrans.lineitem
- For ScalarDB (backed by MySQL):
test_catalog.scalardb.mysqlns.orders
For more details about how tables are mapped to Spark SQL tables, refer to Namespace-mapping details.
Additionally, ScalarDB Analytics offers WAL-interpreted views for ScalarDB tables, simplifying common use cases. In this sample application, you have the following WAL-interpreted views available:
- For ScalarDB (backed by Cassandra):
test_catalog.view.scalardb.cassandrans.lineitem
- For ScalarDB (backed by MySQL):
test_catalog.view.scalardb.mysqlns.orders
In most cases, WAL-interpreted views are preferred over raw tables. In this tutorial, we will use the WAL-interpreted views for the ScalarDB tables. For detailed information on WAL-interpreted views, including their use cases and benefits, see WAL-interpreted views for ScalarDB tables.
Step 4: Run analytical queriesβ
Now, everything is set up, and you can run analytical queries on the sample data using the Spark SQL console.
Read data and calculate summariesβ
You can run the following query to retrieve data from test_catalog.scalardb.cassandrans.lineitem in Cassandra and calculate aggregated metrics, including total quantity, average price, and total revenue for line items grouped by their return flag and line status.
SELECT
l_returnflag,
l_linestatus,
sum(l_quantity) AS sum_qty,
sum(l_extendedprice) AS sum_base_price,
sum(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
avg(l_quantity) AS avg_qty,
avg(l_extendedprice) AS avg_price,
avg(l_discount) AS avg_disc,
count(*) AS count_order
FROM
test_catalog.view.scalardb.cassandrans.lineitem
WHERE
to_date(l_shipdate, 'yyyy-MM-dd') <= date '1998-12-01' - 3
GROUP BY
l_returnflag,
l_linestatus
ORDER BY
l_returnflag,
l_linestatus;
You should see the following output:
A F 1519 2374824.6560278563 1387364.2207725341 1962763.4654265852 26.649122807017545 41663.590456629056 0.41501802923479575 57
N F 98 146371.2295412012 85593.96776336085 121041.55837332775 32.666666666666664 48790.409847067065 0.40984706454007996 3
N O 5374 8007373.247086477 4685647.785126835 6624210.945739046 24.427272727272726 36397.15112312035 0.4147594809559689 220
R F 1461 2190869.9676265526 1284178.4378283697 1814151.2807494882 25.189655172413794 37773.62013149229 0.41323493790730753 58