Version: 3.18

Consensus Commit Protocol

Consensus Commit is the transaction protocol used in ScalarDB and is designed for executing transactions spanning multiple diverse databases. Its uniqueness is that the protocol achieves ACID transactions without relying on the transaction capabilities of the underlying databases, unlike X/Open XA-based solutions. This document explains the details of the protocol, including how it works, the guaranteed isolation levels, the interfaces, the performance optimization that it employs, and its limitations.

The protocol

This section explains how the Consensus Commit protocol works. The Consensus Commit protocol uses a concurrency control protocol to guarantee isolation and an atomic commitment protocol to guarantee atomicity and durability.

Concurrency control protocol

The Consensus Commit protocol employs optimistic concurrency control (OCC) as its concurrency control protocol. OCC operates under the assumption that conflicts are rare, allowing transactions to proceed without the need for locks and resolving conflicts only when they actually occur. Therefore, OCC performs great in low-contention environments. It is also particularly beneficial in distributed environments, where managing locks is tricky.

note

Pessimistic concurrency control (PCC), on the other hand, assumes conflicts are common and takes locks on resources when they are used to avoid interference. Therefore, PCC performs great in high-contention environments.

The OCC protocol of ScalarDB has three phases, as the commonly used OCC protocols, each of which does the following:

Read phase:
- ScalarDB tracks the read and write sets of transactions. ScalarDB copies every record that a transaction accesses from databases to its local workspace and stores its writes in the local workspace.
Validation phase:
- ScalarDB checks if the committing transaction conflicts with other transactions. ScalarDB uses backward validation; it goes to the write phase only if other transactions have not written what the transaction reads and writes, which are called read validation and write validation, respectively.
Write phase:
- ScalarDB propagates the changes in the transaction's write set to the database and makes them visible to other transactions.

As described next, ScalarDB provides an isolation mode (isolation level) where it skips the read validation in the validation phase to allow for more performance for some workloads that don't require the read validation for correctness.

note

The OCC of ScalarDB without the read validation works similarly to snapshot isolation. However, it works with a single version and causes read-skew anomalies because it does not create global snapshots.

Atomic commitment protocol

The Consensus Commit protocol employs a variant of the two-phase commit protocol as an atomic commitment protocol (ACP). The ACP of ScalarDB comprises two phases, each of which has two sub-phases, and briefly works as follows:

Prepare phase (prepare-records phase + validate-records phase):
- In the prepare-records phase, ScalarDB runs the write validation of the OCC protocol for all the records written by the transaction by updating the statuses of the records to PREPARED and moves on to the next phase if all the records are successfully validated.
- In the validate-records phase, ScalarDB runs the read validation of the OCC protocol for all the records read by the transaction and moves on to the next phase if all the records are successfully validated.
Commit phase (commit-state phase + commit-records phase):
- In the commit-state phase, ScalarDB commits the transaction by writing a COMMITTED state to a special table called a coordinator table.
- In the commit-records phase, ScalarDB runs the write phase of the OCC protocol for all the records written by the transaction by updating the statuses of the records to COMMITTED.

note

In case of deleting records, the statuses of the records are first changed to DELETED in the prepare phase and later physically deleted in the commit phase.

How it works in more detail

Let's see how the protocol works in each phase in more detail.

Before the prepare phase

First, a transaction begins when a client accesses ScalarDB (or a ScalarDB Cluster node) and issues a begin command. When a transaction begins, ScalarDB acts as a transaction coordinator, accessing the underlying databases, and first generates a transaction ID (TxID) with UUID version 4. Then, when the client is ready to commit the transaction after performing operations such as reading and writing records, it calls a commit command to request ScalarDB to commit the transaction and enters the prepare phase. As described previously, ScalarDB holds the read set (readSet) and write set (writeSet) of the transaction in its local workspace at the time of committing.

Prepare phase

ScalarDB first prepares the records of the write set by propagating the records, including transaction logs like TxID as described later, with PREPARED states to the underlying databases as the prepare-records phase. Here, we assume a write set maintains updated records composed of the original records and updated columns. If any preparation fails, it aborts the transaction by writing an ABORTED state record to a Coordinator table, where all the transactions’ final states are determined and managed. We explain the Coordinator table in more detail later in this section.

note

ScalarDB checks conflicting preparations by using linearizable conditional writes. A transaction updates a record if the record has not been updated by another transaction since the transaction read it by checking if the TxID of the record has not been changed.

ScalarDB then moves on to the validate-records phase as necessary. The validate-records phase is only necessary if the isolation level is set to SERIALIZABLE. In this phase, ScalarDB re-reads all the records in the read set to see if other transactions have written the records that the transaction has read before. If the read set has not been changed, the transaction can go to the commit-state phase since there are no anti-dependencies; otherwise, it aborts the transaction.

Commit phase

If all the validations in the prepare phase are done successfully, ScalarDB commits the transaction by writing a COMMITTED state record to the Coordinator table as the commit-state phase.

note

ScalarDB uses linearizable conditional writes to coordinate concurrent writes to the Coordinator table, creating a state record with a TxID if there is no record for the TxID. Once the COMMITTED state is correctly written to the Coordinator table, the transaction is regarded as committed.
By default, if a transaction contains only read operations, ScalarDB skips the commit-state phase. However, you can configure ScalarDB to write a COMMITTED state record to the Coordinator table even for read-only transactions by setting the following parameter to false:
- scalar.db.consensus_commit.coordinator.write_omission_on_read_only.enabled

Then, ScalarDB commits all the validated (prepared) records by changing the states of the records to COMMITTED as the commit-records phase.

Distributed WAL

ScalarDB stores transaction logs (transaction metadata), which are for write-ahead logging (WAL), in the underlying database records that it manages. Specifically, as shown in the following figure, ScalarDB manages special columns for the log information in a record in addition to the columns that an application manages. The log information comprises, for example, a transaction ID (TxID) that has updated the corresponding record most recently, a record version number (Version), a record state (TxState) (for example, COMMITTED or PREPARED), timestamps (not shown in the diagram), and a before image that comprises the previous version's application data and its metadata.

ScalarDB also manages transaction states separately from the application records in the Coordinator table. The Coordinator table determines and manages transaction states as a single source of truth. The Coordinator table can be collocated with application-managed tables or located in a separate dedicated database.

Distributed WAL

note

The Coordinator table can be replicated for high availability by using the replication and consensus capabilities of the underlying databases. For example, if you manage the Coordinator table by using Cassandra with a replication factor of three, you can make the transaction coordination of ScalarDB tolerate one replica crash. Hence, you can make the atomic commitment protocol of ScalarDB perform like the Paxos Commit protocol; it could mitigate liveness issues (for example, blocking problems) without sacrificing safety.

Transaction metadata decoupling

ScalarDB manages transaction metadata in the same table as application data by default. However, ScalarDB also provides an option to enable managing transaction metadata separately from the application data, which we call transaction metadata decoupling. This capability is particularly useful when importing existing database tables into ScalarDB, as it allows users to maintain the original schema without any modifications.

note

Transaction metadata decoupling is currently in Private Preview. For details, please contact us or wait for this feature to become publicly available in a future version.

Currently, transaction metadata decoupling is supported only for JDBC databases.

If transaction metadata decoupling is enabled, ScalarDB internally creates a virtual table that combines the application data table with the transaction metadata table. The Consensus Commit protocol operates on this virtual table, functioning in the same manner as it does when transaction metadata decoupling is not enabled. However, ScalarDB requires twice as many accesses to the underlying databases when reading and writing records, which could result in performance overhead. ScalarDB minimizes the performance overhead by grouping and pushing down as many operations as possible when accessing the underlying databases.

Lazy recovery

Transactions can crash at any time and could leave records in an uncommitted state. ScalarDB recovers uncommitted records lazily when it reads them, depending on the transaction states of the Coordinator table. Specifically, if a record is in the PREPARED state, but the transaction that updated the record has expired or been aborted, the record will be rolled back. If a record is in the PREPARED state and the transaction that updated the record is committed, the record will be rolled forward.

A transaction expires after a certain amount of time (currently 15 seconds). When ScalarDB observes a record that has been prepared by an expired transaction, ScalarDB writes the ABORTED state for the transaction to the Coordinator table (with retries). If ScalarDB successfully writes the ABORTED state to the Coordinator table, the transaction is aborted. Otherwise, the transaction will be committed by the original process that is slow but still alive for some reason, or it will remain in the UNKNOWN state until it is either aborted or committed.

Correctness of index-based reads

To ensure the correctness of index-based Get, Scan, and ScanAll operations, ScalarDB performs an additional check called a before-image index check when reading records through a secondary index. For every user-defined secondary index on a non-primary-key column, ScalarDB stores the before-image in an internal column named before_<column> and maintains a companion secondary index on that before-image column in the underlying storage. The check uses this companion index to find records whose committed indexed-column value matches the query but whose current indexed value has been changed by another transaction that is not yet committed (that is, records in the PREPARED or DELETED state). Any such record is then recovered (rolled back or rolled forward) before results are returned; therefore, ScalarDB throws an exception if the record hasn't expired, otherwise it recovers the transaction.

Under the SERIALIZABLE isolation level, ScalarDB also uses the same before-image index during the read validation phase so that concurrent updates on indexed columns are correctly detected as anti-dependencies.

As a result, index-based Get, Scan, and ScanAll operations return every record whose committed indexed-column value matches the query.

warning

Tables created by a ScalarDB version that predates this check do not have the companion before-image indexes required for it. The check is available starting from ScalarDB 3.16.5, 3.17.3, and 3.18.0. After upgrading from an earlier version, you must run repairTable() on each existing table to create the missing companion indexes. Until repairTable() is run, ScalarDB will log a warning at startup, and index-based reads may miss records whose indexed column is being concurrently updated.
Setting scalar.db.consensus_commit.index.eventually_consistent_read.enabled to true skips the before-image index check. As a result, index-based reads may miss records whose indexed column is being concurrently updated. This option exists for backward compatibility and is not recommended for new workloads.

Isolation levels

The Consensus Commit protocol supports three isolation levels: read-committed snapshot isolation (a weaker variant of snapshot isolation), serializable, and read-committed, each of which has the following characteristics:

Read-committed snapshot isolation (SNAPSHOT - default)
- Possible anomalies: read skew, write skew, read only
- Faster than serializable, but guarantees weaker correctness.
Serializable (SERIALIZABLE)
- Possible anomalies: None
- Slower than read-committed snapshot isolation, but guarantees stronger (strongest) correctness.
Read-committed (READ_COMMITTED)
- Possible anomalies: read skew, write skew, read only
- Faster than read-committed snapshot isolation because it could return non-latest committed records.

As described above, serializable is preferable from a correctness perspective, but slower than read-committed snapshot isolation. Choose the appropriate one based on your application requirements and workload. For details on how to configure read-committed snapshot isolation, serializable, and read-committed, see ScalarDB Configuration. You can also override the configured default isolation level on a per-transaction basis by using the cc-transaction-isolation attribute when beginning or starting a transaction. For details, see Consensus Commit attributes.

note

The Consensus Commit protocol of ScalarDB requires each underlying database to provide linearizable operations, as described in Configurations for the Underlying Databases of ScalarDB; thus, it guarantees strict serializability.

warning

Scanning records without specifying a partition key (for example, ScanAll or SELECT * FROM table) for non-JDBC databases does not always guarantee serializability, even if SERIALIZABLE is specified. Therefore, you should do so at your own discretion and consider updating the schemas if possible. For more details, refer to Cross-partition scan configurations.

Interfaces

The Consensus Commit protocol provides two interfaces: a one-phase commit interface and a two-phase commit interface.

The one-phase commit interface is a simple interface that provides only a single commit method, where all the phases of the atomic commitment protocol are executed in the method. On the other hand, the two-phase commit interface exposes each phase of the protocol with prepare, validate, and commit methods.

note

The prepare method is for the prepare-records phase, and the validate method is for the validate-records phase.

In most cases, using the one-phase commit interface is recommended since it is easier to use and handle errors. But the two-phase commit interface is useful when running a transaction across multiple applications or services without directly accessing databases from ScalarDB, such as maintaining the consistency of databases in microservices.

Performance optimization

The Consensus Commit protocol employs several performance optimizations.

Parallel execution

Consensus Commit executes each phase of the atomic commitment protocol in parallel, using intra-transaction parallelism without sacrificing correctness. Specifically, it tries to execute the prepare-records phase by writing records with PREPARED status in parallel. Likewise, it uses a similar parallel execution for the validate-records phase, the commit-records phase, and the rollback process.

You can enable respective parallel execution by using the following parameters:

Prepare-records phase
- scalar.db.consensus_commit.parallel_preparation.enabled
Validate-records phase
- scalar.db.consensus_commit.parallel_validation.enabled
Commit-records phase
- scalar.db.consensus_commit.parallel_commit.enabled
Rollback processing
- scalar.db.consensus_commit.parallel_rollback.enabled

You can also configure the execution parallelism by using the following parameter:

scalar.db.consensus_commit.parallel_executor_count

For details about the configuration, refer to Performance-related configurations.

Asynchronous execution

Since a transaction is regarded as committed if the commit-state phase completes successfully, it can also return to the client without waiting for the completion of the commit-records phase, executing the phase asynchronously. Likewise, when a transaction fails for some reason and does a rollback, the rollback process can also be executed asynchronously without waiting for its completion.

You can enable respective asynchronous execution by using the following parameters:

Commit-records phase
- scalar.db.consensus_commit.async_commit.enabled
Rollback processing
- scalar.db.consensus_commit.async_rollback.enabled

One-phase commit

With one-phase commit optimization, ScalarDB can omit the prepare-records and commit-state phases without sacrificing correctness, provided that the transaction only updates records that the underlying database can atomically update.

You can enable one-phase commit optimization by using the following parameter:

scalar.db.consensus_commit.one_phase_commit.enabled

Group commit

Consensus Commit provides a group-commit feature to execute the commit-state phase of multiple transactions in a batch, reducing the number of writes for the commit-state phase. It is especially useful when writing to a Coordinator table is slow, for example, when the Coordinator table is deployed in a multi-region environment for high availability.

You can enable group commit by using the following parameter:

scalar.db.consensus_commit.coordinator.group_commit.enabled

Group commit has several other parameters. For more details, refer to Performance-related configurations.

Limitations

ScalarDB has several limitations in achieving database-agnostic transactions.

Applications must access ScalarDB to access the underlying databases

Since ScalarDB with the Consensus Commit protocol handles transactions in its layer without depending on the transactional capability of the underlying databases, your applications cannot bypass ScalarDB. Bypassing it will cause unexpected behavior, mostly resulting in facing some database anomalies. Even for read operations, accessing the underlying databases of ScalarDB directly will give you inconsistent data with the transaction metadata, so it is not allowed.

However, for tables that are not managed or touched by ScalarDB transactions, you can read from and write to the tables. For example, it is OK to check tables' metadata information, such as information schema, by directly accessing the tables without going through ScalarDB. Also, there are several other cases where you can access databases directly without going through ScalarDB. The basic criterion is whether or not you update the data of the underlying databases. If you are sure that you do not write to the databases, you can access the databases directly. For example, it is OK to take a backup of databases by using database-native tools.

note

If you take backups from multiple databases or from non-transactional databases, you need to pause your applications or ScalarDB Cluster. For more details, refer to How to Back Up and Restore Databases Used Through ScalarDB.

Executing particular operations in a certain sequence is prohibited for correctness

In the current implementation, ScalarDB throws an exception when executing scan operations after write (Put, Insert, Update, Upsert, Delete) operations for the same record in a transaction.

The protocol​

Concurrency control protocol​

Atomic commitment protocol​

How it works in more detail​

Before the prepare phase​

Prepare phase​

Commit phase​

Distributed WAL​

Transaction metadata decoupling​

Lazy recovery​

Correctness of index-based reads​

Isolation levels​

Interfaces​

Performance optimization​

Parallel execution​

Asynchronous execution​

One-phase commit​

Group commit​

Limitations​

Applications must access ScalarDB to access the underlying databases​

Executing particular operations in a certain sequence is prohibited for correctness​

See also​

The protocol

Concurrency control protocol

Atomic commitment protocol

How it works in more detail

Before the prepare phase

Prepare phase

Commit phase

Distributed WAL

Transaction metadata decoupling

Lazy recovery

Correctness of index-based reads

Isolation levels

Interfaces

Performance optimization

Parallel execution

Asynchronous execution

One-phase commit

Group commit

Limitations

Applications must access ScalarDB to access the underlying databases

Executing particular operations in a certain sequence is prohibited for correctness

See also