Version: 3.15

Getting started with Import

This document explains how you can get started with the ScalarDB Data Loader Import function.

Features

Import data from JSON or JSON Lines files
Automatic data mapping based on source field name mapping
Custom Data mapping via a JSON control file
Import data from one record or line into multiple tables
Support for INSERT, UPDATE and UPSERT

Usage

The Data Loader import function can be started with the following minimal configuration:

./scalardb-data-loader import --config scalardb.properties --namespace namespace --table tableName

The above configuration starts an import process where no control file is used and the data mapping is applied automatically.

Execute the following steps to successfully import new or existing data

Prepare a source file containing data that needs to be imported.
Choose the right import mode. By default, the import is done in upsert mode which means that data will be inserted if new or updated if the partition key and/or clustering key is found. Other options are insert mode or update mode.
Find the correct namespace and table name to import data to.
Determine if you want to run an all required columns check for each data row. If enabled, data rows with missing columns will be treated as failed and not imported.
Specify the path names for the success and failed output files. By default, Data Loader creates the files in the working directory.
When dealing with JSON data, determine if you want the JSON output for the success or failed log files to be in pretty print or not. By default, this option is disabled for performance
Optionally specify the threads argument to tweak performance
Run the import from the command line to start importing your data. Make sure to run the ScalarDB Data Loader in the correct storage or transaction mode depending on your running ScalarDB instance.

Command-line flags

Here is a list of flags (options) that can be used with Data Loader:

Flag	Description	Usage
--mode	The mode in which ScalarDB is running. If omitted, the default value is `storage`	`scalardb-data-loader --mode transaction`
--config	the path to the scalardb.properties file. If omitted the tool looks for a file named `scalardb.properties` in the current folder	`scalardb-data-loader --config scalardb.properties`
--namespace	Namespace to export table data from. Required when no control file is provided.	`scalardb-data-loader --namespace namespace`
--table	name of the table to export data from. Required when no control file is provided.	`scalardb-data-loader --table tableName`
--import-mode	Mode to import the data into the ScalarDB table. Supported modes are `insert`, `update` and `upsert`. Optional. Default the value is set to `upsert`	`scalardb-data-loader --import-mode=upsert`
--all-columns-required	If set, data rows cannot be imported if they are missing columns. Optional. By default, the check is not executed.	`scalardb-data-loader --all-columns-required`
--file	Specify the path to the file that will be imported. Required	`scalardb-data-loader --file <pathToFile>`
--success	The path to the file that is created to write the succeed import results to. Both succeed and failed import results will be written to a different file. Optional. By default, the a new file will be created in the current working directory. Note: if the file already exists, it will be overridden.	`scalardb-data-loader --success <path to file>`
--failed	The path to the file that will be created to write the failed import results to. Optional. By default, the a new file will be created in the current working directory. Note: if the file already exists, it will be overridden.	`scalardb-data-loader --failed <path to file>`
--threads	Thread count for concurrent processing. The default value is the number of available processors.	`scalardb-data-loader --threads 500`
--format	The format of the import file. `json` and `jsonl` files are supported. Optional, default the value `json` is selected.	`scalardb-data-loader --format json`
--ignore-null	The null values in the source file will be ignored, which means that the existing data will not be overwritten. Optional, default the value is `false`.	`scalardb-data-loader --ignore-null`
--pretty	When set, the output to the success and failed files is done in `pretty print` mode. By default the option is not enabled.	`scalardb-data-loader --pretty`
--control-file	The path to the JSON control file specifying the rules for the custom data mapping and/or multi-table import.	`scalardb-data-loader --control-file control.json`
--control-file-validation-level	The validation level for the control file. `MAPPED`, `KEYS` or `FULL`. Optional and by default the level is set to `MAPPED`	`scalardb-data-loader --control-file-validation-level FULL`
--log-put-value	Whether the value that was used in the ScalarDB `PUT` operation is included in the log files or not. Optional and disabled by default.	`scalardb-data-loader --log-put-value`
--error-file-required	To export an optional error file of type JSON when the import file contains CSV data. By default, this option is disabled.	`scalardb-data-loader --error-file-required`
--error	To specify an optional error file when the import file contains CSV data.	`scalardb-data-loader --error <pathToFile>`
--delimiter	To specify a custom delimiter if the import file contains CSV data.	`scalardb-data-loader --delimiter <value>`
--header	To specify the header row data if the import file contains CSV data and does not have a header row.	`scalardb-data-loader --header <value>`

Import mode

Data Loader supports the following import modes:

Mode	Description
INSERT	Each source record is treated as new data. If the data already exists in the ScalarDB table, based on the partition and clustering key, the import for this source data will fail.
UPDATE	Each source record is treated as an update for existing data in the ScalarDB table. If the data does not exist in the table, based on the partition key and clustering key, the import for this source data will fail.
UPSERT	If the target ScalarDB table already contains the data, the import will be done via an UPDATE. If the target data is missing, it will be treated as an INSERT.

Note:

In the case of INSERT, it is required to have matching fields in the source files for each target column via automatic or custom mapping via the control file. This also applies to an UPSERT that turns into an INSERT.

Data mapping

Automatic mapping

When no control file is provided, Data Loader will automatically map the fields in the source JSON data to the available columns in the ScalarDB table. If the name does not match, and if all columns are required, it will be treated as a validation error. In this case, the import for this record will fail and the result will be added to the failed output log.

Custom mapping

When the source fields do not match the target column name, it is necessary to use a control file. In this control, file you can specify the custom mapping rules for the field names.

e.g. the following control file to map the field custom_id in the source file to id in the target table.

{
	"tables": [{
			"namespace": "sample",
			"table_name": "table1",
			"mappings": [{
				"source_field": "custom_id",
				"target_column": "id"
			}]
		}
	]
}

Control file

To allow for custom data mapping or multi-table importing, Data Loader supports configuration via a JSON control file. This file needs to be passed in via the --control-file argument when starting Data Loader.

Control file validation levels

To enforce validation on the control file, Data Loader allows you to specify the validation level. Based on the set level, Data Loader will run a pre-check and validate the control file based on the level rules.

The following levels are supported:

Level	Description
FULL	This validation makes sure that the control file has mappings for each column in the target ScalarDB table.
KEYS	This validation makes sure that mappings are available for each ScalarDB table partition and, if available, clustering keys columns in the control file mappings.
MAPPED	The validation makes sure that the provided source fields and target columns exist for only the mappings that are provided in the control file. No other fields are checked.

The validation level is optional and can be set via the --control-file-validation-level argument when starting Data Loader.

Note:

This validation is run as a pre-check and does not mean the import process will automatically succeed.

e.g. If the level is set to mapped and the control file does not contain mappings for each column for an INSERT, the import process will still fail as all columns are required to be mapped for an INSERT.

Multi-table import

Data Loader supports multi-table target importing.

One single row in a JSON or JSON Lines file can be imported into multiple tables by specifying table mapping rules in the control file. Currently, multi-table import is not supported without a control file.

When using multi-table import in ScalarDB transaction mode, a transaction is created for each table import. e.g. If the source record is mapped to 3 tables in the control file, 3 separate transactions are created.

e.g. The import the following source record into table1 and table2 we execute the following steps:

Table1	Table2
Id	Id
price	amount

Source record

[{
	"custom_id": 1,
	"custom_price": 1000,
	"custom_amount": 100
	
}]

Control file

{
	"tables": [{
			"namespace": "sample",
			"table_name": "table1",
			"mappings": [{
				"source_field": "custom_id",
				"target_column": "id"
			}, {
				"source_field": "custom_price",
				"target_column": "price"
			}]
		},
		{
			"namespace": "sample",
			"table_name": "table2",
			"mappings": [{
				"source_field": "custom_id",
				"target_column": "id"
			}, {
				"source_field": "custom_amount",
				"target_column": "amount"
			}]
		}
	]
}

Output logs

When starting an import task, Data Loader logs the import results in two files. One file contains the import data that is successfully imported and one file contains the data that cannot be imported. The failed data will contain an added field that explains why the data could not be imported. This field is called data_loader_import_status.

The file containing the failed imports can be edited to correct the problems and used as the source file for a new import task as is. It is not required to first remove the data_loader_import_status field containing the error. This field will be ignored during the import process and the original value will not be included in the new version of the success or failed output file.

The file with the successfully imported data also contains the data_loader_import_status field. In this file, each imported data row has a status message for the data. Whether a new row was created or existing data was updated.

Log format

Field	Description
action	The result of the import process for the data record. UPDATE, INSERT or FAILED_DURING_VALIDATION
namespace	The name of the namespace of the table that the data is imported in
table_name	The name of the table that the data is imported in
is_data_mapped	Whether custom data mapping was applied or not based on an available control file.
tx_id	The transaction ID. Only available if Data Loader is run in `transaction` mode.
value	The final value, after optional data mapping, that Data Loader uses in the `PUT` operation.
row_number	The line number or record number of the source data.
Errors	A list of validation or other errors for things that went wrong during the import process.

Example of a JSON formatted success log file:

[{
	"column_1": 1,
	"column_2": 2,
	"column_n": 3,
	"data_loader_import_status": {
		"results": [{
		  "action": "UPDATE",
			"namespace": "namespace1",
			"table_name": "table1",
			"is_data_mapped": true,
			"tx_id": "value",
			"value": "value",
			"row_number": "value"
		}]
	}
}]

Example of a JSON formatted failed log file:

[{
	"column_1": 1,
	"column_2": 2,
	"column_n": 3,
	"data_loader_import_status": {
		"results": [{
		  "action": "FAILED_DURING_VALIDATION",
			"namespace": "namespace1",
			"table_name": "table1",
			"is_data_mapped": false,
			"value": "value",
			"row_number": "value",
			"errors": [
			   "missing columns found during validation"
			]
		}]
	}
}]

Data duplicates

Data Loader does not handle duplicates by itself. In ScalarDB transaction mode, trying to update the same target data in fast succession will cause No Mutation errors and these are not handled by Data Loader. These failed data rows will be added to the failed import result output file and can be re-tried for import later.

However, it is recommended to make sure the import file does not contain updates or inserts on the same partition keys and/or clustering keys as the correct state cannot be guaranteed by the data loader.

Storage vs transaction mode

ScalarDB supports both storage and transaction mode and this support is included in Data Loader import process.

When the loader is started in storage mode, each import is executed in a non-transactional way.

Starting the loader in transaction mode will use transactions to import the data. Currently, each row is imported via a separate transaction. When importing a single record into multiple tables, a separate transaction is created for each table import.

Features​

Usage​

Command-line flags​

Import mode​

Data mapping​

Automatic mapping​

Custom mapping​

Control file​

Control file validation levels​

Multi-table import​

Output logs​

Log format​

Data duplicates​

Storage vs transaction mode​