# Getting started with Import
This document explains how you can get started with the ScalarDB Data Loader Import function.
## Features

- Import data from JSON or JSONLines files
- Automatic data mapping based on matching source field names
- Custom data mapping via a JSON control file
- Import data from one record or line into multiple tables
- Support for INSERT, UPDATE, and UPSERT
## Usage
The data loader import function can be started with the following minimal configuration:
./scalardb-data-loader import --config scalardb.properties --namespace namespace --table tableName
The above configuration starts an import process where no control file is used and the data mapping is applied automatically.
Execute the following steps to successfully import new or existing data:

- Prepare a source file containing the data that needs to be imported.
- Choose the right import mode. By default, the import is done in `upsert` mode, which means that data is inserted if new or updated if the partition key and/or clustering key is found. The other options are `insert` mode and `update` mode.
- Find the correct `namespace` and `table` name to import data into.
- Determine whether you want to run an `all required columns` check for each data row. If enabled, data rows with missing columns are treated as failed and not imported.
- Specify the pathnames for the `success` and `failed` output files. By default, the data loader creates the files in the working directory.
- When dealing with JSON data, determine whether you want the JSON output for the success or failed log files to be in `pretty print` format or not. By default, this option is disabled for performance reasons.
- Optionally, specify the `threads` argument to tweak performance.
- Run the import from the command line to start importing your data, as shown in the example below. Make sure to run the ScalarDB Data Loader in the correct `storage` or `transaction` mode, depending on your running ScalarDB instance.
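For example, a run that sets the import mode, enables the required-columns check, chooses the output file paths, and tweaks the thread count could look like the following. The namespace, table, and file names here are placeholders; adjust them to your own environment.

```
./scalardb-data-loader import \
  --config scalardb.properties \
  --namespace sample \
  --table customers \
  --import-mode=insert \
  --all-columns-required \
  --file customers.json \
  --success success.json \
  --failed failed.json \
  --threads 8
```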
## Command-line flags

Here is a list of flags (options) that can be used with the data loader:

Flag | Description | Usage |
---|---|---|
--mode | The mode in which ScalarDB is running. If omitted, the default value is `storage`. | scalardb-data-loader --mode transaction |
--config | The path to the scalardb.properties file. If omitted, the tool looks for a file named scalardb.properties in the current folder. | scalardb-data-loader --config scalardb.properties |
--namespace | Namespace of the table to import data into. Required when no control file is provided. | scalardb-data-loader --namespace namespace |
--table | Name of the table to import data into. Required when no control file is provided. | scalardb-data-loader --table tableName |
--import-mode | Mode to import the data into the ScalarDB table. Supported modes are `insert`, `update`, and `upsert`. Optional. By default, the value is set to `upsert`. | scalardb-data-loader --import-mode=upsert |
--all-columns-required | If set, data rows cannot be imported if they are missing columns. Optional. By default, the check is not executed. | scalardb-data-loader --all-columns-required |
--file | The path to the file that will be imported. Required. | scalardb-data-loader --file <pathToFile> |
--success | The path to the file that is created to write the successful import results to. Successful and failed import results are written to separate files. Optional. By default, a new file is created in the current working directory. Note: if the file already exists, it will be overwritten. | scalardb-data-loader --success <path to file> |
--failed | The path to the file that is created to write the failed import results to. Optional. By default, a new file is created in the current working directory. Note: if the file already exists, it will be overwritten. | scalardb-data-loader --failed <path to file> |
--threads | Thread count for concurrent processing. | scalardb-data-loader --threads 500 |
--format | The format of the import file. `json` and `jsonl` files are supported. Optional. By default, `json` is selected. | scalardb-data-loader --format json |
--ignore-null | Ignore null values in the source file, which means that existing data will not be overwritten. Optional. By default, the value is `false`. | scalardb-data-loader --ignore-null |
--pretty | When set, the output to the success and failed files is done in pretty print mode. Optional. By default, the option is not enabled. | scalardb-data-loader --pretty |
--control-file | The path to the JSON control file specifying the rules for custom data mapping and/or multi-table import. | scalardb-data-loader --control-file control.json |
--control-file-validation-level | The validation level for the control file: `MAPPED`, `KEYS`, or `FULL`. Optional. By default, the level is set to `MAPPED`. | scalardb-data-loader --control-file-validation-level FULL |
--log-put-value | Whether the value that was used in the ScalarDB PUT operation is included in the log files or not. Optional and disabled by default. | scalardb-data-loader --log-put-value |
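As another illustration, the following invocation imports a JSONLines file, keeps existing column values where the source contains nulls, and writes the success and failed output files in pretty print mode. The namespace, table, and file names are placeholders.

```
./scalardb-data-loader import \
  --config scalardb.properties \
  --namespace sample \
  --table customers \
  --file customers.jsonl \
  --format jsonl \
  --ignore-null \
  --pretty
```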
## Import mode
The data loader supports the following import modes:
Mode | Description |
---|---|
INSERT | Each source record is treated as new data. If the data already exists in the ScalarDB table, based on the partition and clustering key, the import for this source data will fail. |
UPDATE | Each source record is treated as an update for existing data in the ScalarDB table. If the data does not exist in the table, based on the partition key and clustering key, the import for this source data will fail. |
UPSERT | If the target ScalarDB table already contains the data, the import will be done via an UPDATE. If the target data is missing, it will be treated as an INSERT. |
Note:

In the case of INSERT, it is required to have matching fields in the source file for each target column, either via automatic mapping or via custom mapping in the control file. This also applies to an UPSERT that turns into an INSERT.
## Data mapping

### Automatic mapping

When no control file is provided, the data loader will automatically map the fields in the source JSON data to the available columns in the ScalarDB table. If a field name does not match a column name, and all columns are required, it is treated as a validation error. In this case, the import for this record will fail and the result will be added to the failed output log.
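As a sketch, assume a hypothetical table sample.customers with the columns id, name, and credits. With automatic mapping, the source file only needs to use field names that match those columns, and no control file is required:

```
# Hypothetical source file whose field names match the table columns exactly
cat > customers.json <<'EOF'
[
  { "id": 1, "name": "Alice", "credits": 100 },
  { "id": 2, "name": "Bob", "credits": 250 }
]
EOF

# Import with automatic data mapping (no control file)
./scalardb-data-loader import --config scalardb.properties --namespace sample --table customers --file customers.json
```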
### Custom mapping

When the source field names do not match the target column names, it is necessary to use a control file. In this control file, you can specify the custom mapping rules for the field names.

For example, the following control file maps the field `custom_id` in the source file to `id` in the target table:
{
"tables": [{
"namespace": "sample",
"table_name": "table1",
"mappings": [{
"source_field": "custom_id",
"target_column": "id"
}]
}
]
}
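Assuming the control file above is saved as control.json, the import can then be started as follows. Because a control file is provided, the --namespace and --table flags are not needed; the source file name is a placeholder.

```
./scalardb-data-loader import \
  --config scalardb.properties \
  --control-file control.json \
  --file data.json
```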
## Control file

To allow for custom data mapping or multi-table importing, the data loader supports configuration via a JSON control file. This file needs to be passed in via the `--control-file` argument when starting the data loader.
### Control file validation levels

To enforce validation on the control file, the data loader allows you to specify the validation level. Based on the set level, the data loader will run a pre-check and validate the control file based on the level rules.

The following levels are supported:

Level | Description |
---|---|
FULL | This validation makes sure that the control file has mappings for each column in the target ScalarDB table. |
KEYS | This validation makes sure that mappings are available for the partition key and, if available, the clustering key columns of each ScalarDB table in the control file mappings. |
MAPPED | This validation makes sure that the provided source fields and target columns exist, but only for the mappings that are provided in the control file. No other fields are checked. |

The validation level is optional and can be set via the `--control-file-validation-level` argument when starting the data loader.

Note:

This validation is run as a pre-check and does not mean the import process will automatically succeed.

For example, if the level is set to MAPPED and the control file does not contain mappings for each column for an INSERT, the import process will still fail, as all columns are required to be mapped for an INSERT.
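For instance, to run the strictest pre-check and require the control file to map every column of the target table, the level can be raised to FULL; the file names here are placeholders.

```
./scalardb-data-loader import \
  --config scalardb.properties \
  --control-file control.json \
  --control-file-validation-level FULL \
  --file data.json
```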
## Multi-table import

The data loader supports multi-table target importing.

One single row in a JSON or JSONLines file can be imported into multiple tables by specifying table mapping rules in the control file. Currently, multi-table import is not supported without a control file.

When using multi-table import in ScalarDB transaction mode, a transaction is created for each table import. For example, if the source record is mapped to 3 tables in the control file, 3 separate transactions are created.

As an example, to import the following source record into `table1` and `table2`, which have the columns listed below, we use the source record and control file that follow:

table1 columns | table2 columns |
---|---|
id | id |
price | amount |
Source record
[{
"custom_id": 1,
"custom_price": 1000,
"custom_amount": 100
}]
Control file
{
"tables": [{
"namespace": "sample",
"table_name": "table1",
"mappings": [{
"source_field": "custom_id",
"target_column": "id"
}, {
"source_field": "custom_price",
"target_column": "price"
}]
},
{
"namespace": "sample",
"table_name": "table2",
"mappings": [{
"source_field": "custom_id",
"target_column": "id"
}, {
"source_field": "custom_amount",
"target_column": "amount"
}]
}
]
}
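Running the multi-table example above in transaction mode then creates two transactions per source record, one for table1 and one for table2, as described earlier. A possible invocation, with placeholder file names, is:

```
./scalardb-data-loader import \
  --mode transaction \
  --config scalardb.properties \
  --control-file control.json \
  --file source.json
```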
## Output logs

When starting an import task, the data loader logs the import results in two files. One file contains the data that is successfully imported, and the other contains the data that could not be imported. The failed data will contain an added field that explains why the data could not be imported. This field is called `data_loader_import_status`.

The file containing the failed imports can be edited to correct the problems and used as-is as the source file for a new import task. It is not required to first remove the `data_loader_import_status` field containing the error. This field is ignored during the import process, and the original value is not included in the new version of the success or failed output file.

The file with the successfully imported data also contains the `data_loader_import_status` field. In this file, each imported data row has a status message stating whether a new row was created or existing data was updated.
### Log format

Field | Description |
---|---|
action | The result of the import process for the data record: UPDATE, INSERT, or FAILED_DURING_VALIDATION. |
namespace | The name of the namespace of the table that the data is imported into. |
tableName | The name of the table that the data is imported into. |
is_data_mapped | Whether custom data mapping was applied or not, based on an available control file. |
tx_id | The ID of the transaction. Only available if the data loader is run in transaction mode. |
value | The final value, after optional data mapping, that the data loader uses in the PUT operation. |
row_number | The line number or record number of the source data. |
errors | A list of validation or other errors for things that went wrong during the import process. |
Example of a JSON formatted success log file:
[{
"column_1": 1,
"column_2": 2,
"column_n": 3,
"data_loader_import_status": {
"results": [{
"action": "UPDATE",
"namespace": "namespace1",
"tableName": "table1",
"is_data_mapped": true,
"tx_id": "value",
"value": "value",
"row_number": "value"
}]
}
}]
Example of a JSON formatted failed log file:
[{
"column_1": 1,
"column_2": 2,
"column_n": 3,
"data_loader_import_status": {
"results": [{
"action": "FAILED_DURING_VALIDATION",
"namespace": "namespace1",
"tableName": "table1",
"is_data_mapped": false,
"value": "value",
"row_number": "value",
"errors": [
"missing columns found during validation"
]
}]
}
}]
## Data duplicates

The data loader does not handle duplicates by itself. In ScalarDB transaction mode, trying to update the same target data in fast succession will cause `No Mutation` errors, and these are not handled by the data loader. These failed data rows will be added to the failed import result output file and can be retried for import later.

However, it is recommended to make sure the import file does not contain updates or inserts on the same partition keys and/or clustering keys, as the correct state cannot be guaranteed by the data loader.
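Rows that fail in this way end up in the failed output file, which, as noted in the output logs section, can be used directly as the source for a retry. A sketch of such a retry run, assuming the failed output of a previous run was written to failed.json:

```
# Retry previously failed rows by using the failed output file as the new source file
./scalardb-data-loader import --config scalardb.properties --namespace namespace --table tableName --file failed.json
```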
## Storage vs transaction mode

ScalarDB supports both storage and transaction mode, and this support is included in the data loader import process.
When the loader is started in storage mode, each import is executed in a non-transactional way.
Starting the loader in transaction mode will use transactions to import the data. Currently, each row is imported via a separate transaction. When importing a single record into multiple tables, a separate transaction is created for each table import.
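For example, to run the loader against a ScalarDB instance configured for transactions, pass the mode explicitly; if the flag is omitted, the loader falls back to the default storage mode. The namespace, table, and file names are placeholders.

```
# Transaction mode: each row (and each table in a multi-table import) is imported in its own transaction
./scalardb-data-loader import --mode transaction --config scalardb.properties --namespace namespace --table tableName --file data.json
```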