Version: 3.14

ScalarDB Analytics Design

ScalarDB Analytics is the analytical component of ScalarDB. Similar to ScalarDB, it unifies diverse data sources—ranging from RDBMSs like PostgreSQL and MySQL to NoSQL databases like Cassandra and DynamoDB—into a single logical database. This enables you to perform analytical queries across multiple databases seamlessly.

ScalarDB Analytics consists of two main components, a universal data catalog and a query engine:

  • Universal data catalog. The universal data catalog is a flexible metadata management system that handles multiple catalog spaces. Each catalog space provides an independent logical grouping of data sources and views, enabling organized management of diverse data environments.
  • Query engine. The query engine executes queries against the universal data catalog. ScalarDB Analytics provides appropriate data connectors to interface with the underlying data sources.

ScalarDB Analytics employs a decoupled architecture where the data catalog and query engine are separate components. This design allows for integration with various existing query engines through an extensible architecture. As a result, you can select different query engines to execute queries against the same data catalog based on your specific requirements.

Universal data catalog

The universal data catalog is composed of several levels: catalog, data source, namespace, table, view namespace, and view. The following are definitions for those levels:

  • Catalog is a folder that contains all your data source information. For example, you might have one catalog called analytics_catalog for your analytics data and another called operational_catalog for your day-to-day operations.
  • Data source represents each data source you connect to. For each data source, we store important information like:
    • What kind of data source it is (PostgreSQL, Cassandra, etc.)
    • How to connect to it (connection details and passwords)
    • Special features the data source supports (like transactions)
  • Namespace is like a subfolder within your data source that groups related tables together. In PostgreSQL these are called schemas; in Cassandra they're called keyspaces. You can have multiple levels of namespaces, similar to having folders within folders.
  • Table is where your actual data lives. For each table, we keep track of:
    • What columns it has
    • What type of data each column can store
    • Whether columns can be empty (null)
  • View namespace is a special folder for views. Unlike regular namespaces that are tied to one data source, view namespaces can work with multiple data sources at once.
  • View is like a virtual table that can:
    • Show your data in a simpler way (like hiding technical columns in ScalarDB tables)
    • Combine data from different sources using SQL queries
    Like a table, each view has its own columns with specific data types and rules about empty (null) values.
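
The levels above can be pictured as nested containers. The following is a minimal sketch of that nesting, not the actual ScalarDB Analytics data model: every class name, field name, and the `find_table` helper are hypothetical and chosen only for illustration. View namespaces and views are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Column:
    name: str
    data_type: str          # a common catalog type name, for example "INT"
    nullable: bool = True   # whether the column can be empty (null)


@dataclass
class Table:
    name: str
    columns: List[Column] = field(default_factory=list)


@dataclass
class Namespace:
    name: str
    tables: Dict[str, Table] = field(default_factory=dict)
    # Nested namespaces model "folders within folders".
    children: Dict[str, "Namespace"] = field(default_factory=dict)


@dataclass
class DataSource:
    name: str
    kind: str               # for example "postgresql" or "cassandra"
    namespaces: Dict[str, Namespace] = field(default_factory=dict)


@dataclass
class Catalog:
    name: str
    data_sources: Dict[str, DataSource] = field(default_factory=dict)

    def find_table(self, data_source: str, namespace_path: List[str], table: str) -> Table:
        """Walk data source -> namespace level(s) -> table."""
        ns = self.data_sources[data_source].namespaces[namespace_path[0]]
        for part in namespace_path[1:]:
            ns = ns.children[part]
        return ns.tables[table]


# Hypothetical example data: one PostgreSQL source with one schema and one table.
orders = Table("orders", [Column("id", "INT", nullable=False), Column("note", "TEXT")])
public = Namespace("public", tables={"orders": orders})
catalog = Catalog(
    "analytics_catalog",
    {"pg": DataSource("pg", "postgresql", {"public": public})},
)
```

With this sketch, `catalog.find_table("pg", ["public"], "orders")` returns the `orders` table object, whose columns carry the type and nullability information described above.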

Supported data types

ScalarDB Analytics supports a wide range of data types across different data sources. The universal data catalog maps these data types to a common set of types to ensure compatibility and consistency across sources. The following list shows the supported data types in ScalarDB Analytics:

  • BYTE
  • SMALLINT
  • INT
  • BIGINT
  • FLOAT
  • DOUBLE
  • DECIMAL
  • TEXT
  • BLOB
  • BOOLEAN
  • DATE
  • TIME
  • DATETIME
  • TIMESTAMP
  • DURATION
  • INTERVAL
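
As an illustration of how a common type set can be used, the following sketch lists the types above in an enumeration and checks a schema against it. The `AnalyticsType` enum and the `unsupported_columns` helper are hypothetical names introduced here and are not part of any ScalarDB Analytics API.

```python
from enum import Enum
from typing import Dict, List


class AnalyticsType(Enum):
    """The common data types listed above (names copied from the list)."""
    BYTE = "BYTE"
    SMALLINT = "SMALLINT"
    INT = "INT"
    BIGINT = "BIGINT"
    FLOAT = "FLOAT"
    DOUBLE = "DOUBLE"
    DECIMAL = "DECIMAL"
    TEXT = "TEXT"
    BLOB = "BLOB"
    BOOLEAN = "BOOLEAN"
    DATE = "DATE"
    TIME = "TIME"
    DATETIME = "DATETIME"
    TIMESTAMP = "TIMESTAMP"
    DURATION = "DURATION"
    INTERVAL = "INTERVAL"


def unsupported_columns(schema: Dict[str, str]) -> List[str]:
    """Return the column names whose declared type is not in the common set."""
    supported = {t.value for t in AnalyticsType}
    return [name for name, type_name in schema.items()
            if type_name.upper() not in supported]
```

For example, `unsupported_columns({"id": "INT", "payload": "JSON"})` reports `payload`, because `JSON` is not in the list.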

Catalog information mappings by data source

When you register a data source with ScalarDB Analytics, the catalog information of the data source (that is, its namespaces, tables, and columns) is resolved and registered in the universal data catalog. To resolve this information, each object on the data source side is mapped to an object in the universal data catalog. This mapping consists of two parts: catalog-level mappings and data-type mappings. The following sections describe how ScalarDB Analytics maps the catalog levels and data types of each data source into the universal data catalog.

Catalog-level mappings

The catalog-level mappings map the namespace names, table names, and column names of a data source to the universal data catalog. The mappings differ by data source; the mappings for ScalarDB are shown here.

The catalog information of ScalarDB is automatically resolved by ScalarDB Analytics. The catalog-level objects are mapped as follows:

  • The ScalarDB namespace is mapped to the namespace. Therefore, the namespace of the ScalarDB data source is always single level, consisting of only the namespace name.
  • The ScalarDB table is mapped to the table.
  • The ScalarDB column is mapped to the column.
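
The mapping above is one-to-one, so it can be sketched as a small function. This is an illustration only: `CatalogColumnPath` and `map_scalardb_column` are hypothetical names, and the tuple representation of the namespace path is an assumption made to show that a ScalarDB namespace always yields exactly one namespace level.

```python
from typing import NamedTuple, Tuple


class CatalogColumnPath(NamedTuple):
    namespace: Tuple[str, ...]  # namespace levels; always one level for ScalarDB
    table: str
    column: str


def map_scalardb_column(namespace: str, table: str, column: str) -> CatalogColumnPath:
    """Map a ScalarDB namespace, table, and column to catalog-level names unchanged."""
    return CatalogColumnPath((namespace,), table, column)
```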

Data-type mappings

The native data types of the underlying data sources are mapped to the data types in ScalarDB Analytics. The mappings differ by data source; the mappings for ScalarDB are shown here.

ScalarDB Data Type    ScalarDB Analytics Data Type
BOOLEAN               BOOLEAN
INT                   INT
BIGINT                BIGINT
FLOAT                 FLOAT
DOUBLE                DOUBLE
TEXT                  TEXT
BLOB                  BLOB
DATE                  DATE
TIME                  TIME
DATETIME              DATETIME
TIMESTAMP             TIMESTAMP
TIMESTAMPTZ           TIMESTAMPTZ
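
The table above can be expressed as a lookup. The following is a minimal sketch under the assumption that type names are compared case-insensitively; the dictionary and the `to_analytics_type` function are hypothetical and only mirror the rows of the table.

```python
# Rows copied from the ScalarDB mapping table above (ScalarDB type -> Analytics type).
SCALARDB_TO_ANALYTICS_TYPE = {
    "BOOLEAN": "BOOLEAN",
    "INT": "INT",
    "BIGINT": "BIGINT",
    "FLOAT": "FLOAT",
    "DOUBLE": "DOUBLE",
    "TEXT": "TEXT",
    "BLOB": "BLOB",
    "DATE": "DATE",
    "TIME": "TIME",
    "DATETIME": "DATETIME",
    "TIMESTAMP": "TIMESTAMP",
    "TIMESTAMPTZ": "TIMESTAMPTZ",
}


def to_analytics_type(scalardb_type: str) -> str:
    """Look up the ScalarDB Analytics type for a ScalarDB type name."""
    key = scalardb_type.strip().upper()
    if key not in SCALARDB_TO_ANALYTICS_TYPE:
        raise ValueError(f"no mapping listed for ScalarDB type: {scalardb_type!r}")
    return SCALARDB_TO_ANALYTICS_TYPE[key]
```

Raising an error for an unlisted type, rather than passing the name through, keeps a missing mapping visible at registration time instead of at query time.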

Query engine

The query engine is a component that is independent of the universal data catalog. It is responsible for executing queries against the data sources registered in the universal data catalog and returning the results to the user. ScalarDB Analytics does not currently provide a built-in query engine. Instead, it is designed to integrate with existing query engines, normally as a plugin for the query engine.

When you run a query, the ScalarDB Analytics query engine plugin works as follows:

  1. Fetches the catalog metadata, such as the data source location, the table object identifier, and the table schema, by calling the universal data catalog API.
  2. Sets up the data source connectors to the data sources by using the catalog metadata.
  3. Provides the query optimization information to the query engine based on the catalog metadata.
  4. Reads the data from the data sources by using the data source connectors.
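
The four steps above can be sketched with in-memory stand-ins. This is not the plugin's implementation: the `CATALOG` and `REMOTE_DATA` dictionaries, the `pg://example` location, and all function names are hypothetical, and the real plugin talks to the universal data catalog API and to actual data sources.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for the universal data catalog API.
CATALOG: Dict[str, dict] = {
    "orders": {"location": "pg://example", "schema": ["id", "amount"]},
}

# Hypothetical stand-in for the data held by a data source, keyed by location.
REMOTE_DATA: Dict[str, List[dict]] = {
    "pg://example": [
        {"id": 1, "amount": 10.0, "internal_flag": True},
        {"id": 2, "amount": 5.5, "internal_flag": False},
    ],
}


def fetch_metadata(table: str) -> dict:
    """Step 1: look up the data source location and table schema."""
    return CATALOG[table]


def make_connector(metadata: dict) -> Callable[[], List[dict]]:
    """Step 2: build a reader bound to the data source location."""
    location = metadata["location"]
    return lambda: REMOTE_DATA[location]


def optimization_hints(metadata: dict) -> dict:
    """Step 3: expose schema information the engine could use for planning."""
    return {"columns": metadata["schema"]}


def scan(table: str) -> List[dict]:
    metadata = fetch_metadata(table)
    connector = make_connector(metadata)
    hints = optimization_hints(metadata)
    # Step 4: read the rows, keeping only the columns named in the schema.
    return [{c: row[c] for c in hints["columns"]} for row in connector()]
```

Calling `scan("orders")` walks all four steps and returns only the `id` and `amount` fields of each row, since those are the columns the catalog metadata declares.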

ScalarDB Analytics manages these processes internally. You can simply run a query against the universal data catalog by using the query engine API in the same way that you would normally run a query.

ScalarDB Analytics currently supports Apache Spark as its query engine. For details on how to use ScalarDB Analytics with Spark, see Run Analytical Queries Through ScalarDB Analytics.