Getting started with Import
This document explains how you can get started with the ScalarDB Data Loader Import function.
Features
- Import data from JSON or JSON Lines files
- Automatic data mapping based on source field name mapping
- Custom data mapping via a JSON control file
- Import data from one record or line into multiple tables
- Support for INSERT, UPDATE and UPSERT
Usage
The Data Loader import function can be started with the following minimal configuration:
./scalardb-data-loader import --config scalardb.properties --namespace namespace --table tableName
The above configuration starts an import process where no control file is used and the data mapping is applied automatically.
Execute the following steps to import new or existing data:
- Prepare a source file containing the data that needs to be imported.
- Choose the right import mode. By default, the import is done in upsert mode, which means that data will be inserted if new or updated if the partition key and/or clustering key is found. Other options are insert mode or update mode.
- Find the correct namespace and table name to import data to.
- Determine if you want to run an all required columns check for each data row. If enabled, data rows with missing columns will be treated as failed and not imported.
- Specify the path names for the success and failed output files. By default, Data Loader creates the files in the working directory.
- When dealing with JSON data, determine if you want the JSON output for the success or failed log files to be in pretty print or not. By default, this option is disabled for performance.
- Optionally specify the threads argument to tweak performance.
- Run the import from the command line to start importing your data. Make sure to run the ScalarDB Data Loader in the correct storage or transaction mode depending on your running ScalarDB instance.
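The steps above can be sketched as a small shell script. This is an illustration under assumptions, not an official recipe: the column names in the sample file, the output file names, and the thread count are hypothetical, and placing every flag after the import subcommand simply mirrors the minimal example. The script prints the assembled command instead of running it, so it can be tried without a ScalarDB instance.

```shell
# Source file: two JSON Lines records.
# The column names (id, name, price) are hypothetical.
cat > sample_import.jsonl <<'EOF'
{"id": 1, "name": "apple", "price": 120}
{"id": 2, "name": "banana", "price": 80}
EOF

# Collect the choices from the list above as command arguments.
# Flag spellings are taken from the command-line flags table;
# namespace and tableName are the placeholders from the minimal example.
# success.json, failed.json, and the thread count are assumed values.
set -- ./scalardb-data-loader import \
  --config scalardb.properties \
  --mode transaction \
  --namespace namespace \
  --table tableName \
  --file sample_import.jsonl \
  --format jsonl \
  --import-mode=insert \
  --all-columns-required \
  --success success.json \
  --failed failed.json \
  --threads 4

# Print the command rather than executing it.
# Replace the two lines below with "$@" to start the import for real.
IMPORT_CMD="$*"
echo "$IMPORT_CMD"
```

Keeping the arguments in the positional parameters rather than in one long string means that paths containing spaces would still be passed correctly once the command is actually executed with "$@".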
Command-line flags
Here is a list of flags (options) that can be used with Data Loader:
| Flag | Description | Usage |
|---|---|---|
| --mode | The mode in which ScalarDB is running. If omitted, the default value is storage. | scalardb-data-loader --mode transaction |
| --config | The path to the scalardb.properties file. If omitted, the tool looks for a file named scalardb.properties in the current folder. | scalardb-data-loader --config scalardb.properties |
| --namespace | Namespace of the table to import data into. Required when no control file is provided. | scalardb-data-loader --namespace namespace |
| --table | Name of the table to import data into. Required when no control file is provided. | scalardb-data-loader --table tableName |
| --import-mode | Mode to import the data into the ScalarDB table. Supported modes are insert, update, and upsert. Optional. The default value is upsert. | scalardb-data-loader --import-mode=upsert |
| --all-columns-required | If set, data rows cannot be imported if they are missing columns. Optional. By default, the check is not executed. | scalardb-data-loader --all-columns-required |
| --file | The path to the file that will be imported. Required. | scalardb-data-loader --file <pathToFile> |
| --success | The path to the file to which the successful import results are written. Successful and failed import results are written to separate files. Optional. By default, a new file is created in the current working directory. Note: if the file already exists, it will be overwritten. | scalardb-data-loader --success <path to file> |
| --failed | The path to the file to which the failed import results are written. Optional. By default, a new file is created in the current working directory. Note: if the file already exists, it will be overwritten. | scalardb-data-loader --failed <path to file> |
| --threads | Thread count for concurrent processing. The default value is the number of available processors. | scalardb-data-loader --threads 500 |
| --format | The format of the import file. json and jsonl files are supported. Optional. The default value is json. | scalardb-data-loader --format json |
| --ignore-null | If set, null values in the source file are ignored, which means that the existing data will not be overwritten by them. Optional. The default value is false. | scalardb-data-loader --ignore-null |
| --pretty | If set, the output to the success and failed files is written in pretty print mode. Optional. By default, this option is disabled. | scalardb-data-loader --pretty |
| --control-file | The path to the JSON control file specifying the rules for the custom data mapping and/or multi-table import. | scalardb-data-loader --control-file control.json |
| --control-file-validation-level | The validation level for the control file: MAPPED, KEYS, or FULL. Optional. The default level is MAPPED. | scalardb-data-loader --control-file-validation-level FULL |
| --log-put-value | Whether the value that was used in the ScalarDB PUT operation is included in the log files. Optional. Disabled by default. | scalardb-data-loader --log-put-value |
| --error-file-required | To export an optional error file of type JSON when the import file contains CSV data. By default, this option is disabled. | scalardb-data-loader --error-file-required |
| --error | To specify an optional error file when the import file contains CSV data. | scalardb-data-loader --error <pathToFile> |
| --delimiter | To specify a custom delimiter if the import file contains CSV data. | scalardb-data-loader --delimiter <value> |
| --header | To specify the header row data if the import file contains CSV data and does not have a header row. | scalardb-data-loader --header <value> |
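Because the files given to --success and --failed are overwritten if they already exist, it can help to pick unused paths before each run. The sketch below is one possible approach, not part of Data Loader: the helper name unique_path and the base names success.json and failed.json are hypothetical.

```shell
# Hypothetical helper: print the given path if nothing exists there yet,
# otherwise append .1, .2, ... until an unused path is found.
unique_path() {
  base=$1
  candidate=$base
  n=1
  while [ -e "$candidate" ]; do
    candidate="$base.$n"
    n=$((n + 1))
  done
  printf '%s\n' "$candidate"
}

# Assumed base names for the two output files.
SUCCESS_FILE=$(unique_path success.json)
FAILED_FILE=$(unique_path failed.json)

# These two options can then be appended to the import command.
echo "--success $SUCCESS_FILE --failed $FAILED_FILE"
```

The check and the later file creation are not atomic, so this only guards against overwriting results from earlier runs, not against two imports started at the same moment.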
Import mode
Data Loader supports the following import modes:
| Mode | Description |
|---|---|
| INSERT | Each source record is treated as new data. If the data already exists in the ScalarDB table, based on the partition and clustering key, the import for this source data will fail. |
| UPDATE | Each source record is treated as an update for existing data in the ScalarDB table. If the data does not exist in the table, based on the partition key and clustering key, the import for this source data will fail. |
| UPSERT | If the target ScalarDB table already contains the data, the import will be done via an UPDATE. If the target data is missing, it will be treated as an INSERT. |
Note:
For an INSERT, the source file must contain a matching field for every target column, mapped either automatically or through the control file. This also applies to an UPSERT that results in an INSERT.
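The behavior of the three modes can be illustrated with a toy model in shell. This is only a sketch of the rules in the table above, not how Data Loader is implemented: the function name decide_action is hypothetical, and a text file with one key per line stands in for the partition and clustering keys already present in the table.

```shell
# Toy model: decide what happens to one source record.
# $1 = import mode, $2 = record key, $3 = file listing existing keys.
decide_action() {
  mode=$1
  key=$2
  table=$3
  # -x matches the whole line, -F treats the key as a literal string.
  if grep -qxF -- "$key" "$table"; then
    exists=yes
  else
    exists=no
  fi
  case "$mode:$exists" in
    insert:no)  echo "INSERT" ;;
    insert:yes) echo "FAIL" ;;    # data already exists
    update:yes) echo "UPDATE" ;;
    update:no)  echo "FAIL" ;;    # nothing to update
    upsert:yes) echo "UPDATE" ;;
    upsert:no)  echo "INSERT" ;;
    *) echo "unknown mode: $mode" >&2; return 2 ;;
  esac
}

# Stand-in for a table that already holds the keys 1 and 2.
printf '%s\n' 1 2 > existing_keys.txt

decide_action insert 1 existing_keys.txt   # prints FAIL
decide_action update 3 existing_keys.txt   # prints FAIL
decide_action upsert 1 existing_keys.txt   # prints UPDATE
decide_action upsert 3 existing_keys.txt   # prints INSERT
```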