Overview
The EDP specializes in harmonizing a variety of data sources through its robust import system. Importing data is the process of converting a flat file into a dataset that can be queried in real time. The EDP supports data in the most common formats, including JSONL, VCF, CSV, TSV, XML, GTF, and GFF3. Users can contact QuartzBio Support for assistance with importing many other formats (including custom, proprietary formats, and unstructured data).
The EDP’s import system automates the traditional ETL (Extract, Transform, Load) process. The process typically starts by uploading files into a vault. An import task can then be configured and launched. The import system automatically handles data extraction (file parsing), data transformation, data validation, and finally data loading. Users can refer to the Import Parameters documentation for more information about configuring optional parameters for data parsing, entity detection, validation, and annotation.
Supported Formats
The following file formats and extensions are supported:
Name | File Extension | Previewable in EDP? | Transformable into a Dataset? |
---|---|---|---|
Comma Separated Values | .csv | Y | Y |
General Feature Format | .gff3.gz | Y | Y |
Gene Transfer Format | .gtf | Y | Y |
Hyper Text Markup Language | .html | Y | N |
JavaScript Object Notation (in JSON Lines format) | .json | Y | Y |
Mutation Annotation Format | .maf | Y | Y |
Portable Document Format | .pdf | Y | N |
Tab Separated Values | .tsv | Y | Y |
Unformatted text file | .txt | Y | Y |
Variant Call Format | .vcf | Y | Y |
Extensible Markup Language | .xml | Y | Y (requires a template) |
Excel | .xlsx, .xls | Y | Y |
Reader Parameters
EDP automatically detects the file format based on the file extension (except for Nirvana JSON files) and parses the file using a specialized “reader”. It is possible to manually specify a reader and modify reader parameters using the reader_params attribute of the DatasetImport resource.
Reader | Reader name | Extension |
---|---|---|
VCF | vcf | .vcf |
JSONL | json | .json |
CSV | csv | .csv |
TSV | tsv | .tsv, .txt, .maf |
XML | xml | .xml |
GTF | gtf | .gtf |
GFF3 | gff3 | .gff3 |
Nirvana | json | .json (Nirvana format) |
Excel | xlsx | .xlsx |
Excel | xls | .xls |
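As an illustration, a comma-delimited file that was saved with a .txt extension could be imported with the csv reader instead of the tsv reader implied by its extension. The snippet below is a minimal sketch: the file path is hypothetical, and the exact keys accepted inside reader_params (including the reader name key shown here) should be confirmed in the Import Parameters documentation.
library(quartzbio.edp)
vault <- Vault.get_personal_vault()
dataset_full_path <- paste(vault$full_path, "/r_examples/reader_dataset", sep = ":")
dataset <- Dataset.get_or_create_by_full_path(dataset_full_path)
# Hypothetical file: comma-delimited data saved with a .txt extension
uploaded_file <- Object.get_by_full_path("~/comma_delimited.txt")
# Force the csv reader rather than relying on extension-based detection
# (the "reader" key name is an assumption; see the Import Parameters documentation)
imp <- DatasetImport.create(
  dataset_id = dataset$id,
  object_id = uploaded_file$id,
  reader_params = list(
    reader = "csv"
  )
)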
EDP supports GZip compression for all file types. Gzipped files must have the .gz extension in addition to their format extension (e.g., file.vcf.gz). Compressing files with GZip is recommended for faster uploads and imports.
Importing from Files
The first step in getting data onto the EDP is uploading files into a vault. Users can refer to the Vaults documentation for more information.
library(quartzbio.edp)
vault <- Vault.get_personal_vault()
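# Upload a local file to the root of the personal vault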
uploaded_file <- File_upload(vault$id, "local/path/file.vcf.gz", "/")
# Retrieve the file by its full path:
uploaded_file <- Object.get_by_full_path("~/file.vcf.gz")
Once the files have been uploaded, they can be imported into any new or existing dataset (learn how to create a dataset). To launch an import, users can utilize the DatasetImport resource, providing the uploaded file and target dataset as inputs. Once the import has been launched, it is possible to track the progress through the API or on the web interface via the Activity tab.
library(quartzbio.edp)
vault <- Vault.get_personal_vault()
dataset_full_path <- paste(vault$full_path, "/r_examples/test_dataset", sep = ":")
dataset <- Dataset.get_or_create_by_full_path(dataset_full_path)
# Launch the import
imp <- DatasetImport.create(
dataset_id = dataset$id,
object_id = uploaded_file$id,
commit_mode = "append"
)
# Wait for the import to complete
Dataset.activity(dataset$id)
Importing from URLs
If the files are on a remote server and accessible by URL, they can be imported using a manifest. A manifest is simply a list of files (URLs and other attributes) to import:
source_url <- "https://s3.amazonaws.com/downloads.solvebio.com/demo/interesting-variants.json.gz"
manifest <- list(
files = list(
list(url = source_url)
)
)
Once the manifest has been created, it can be imported into any new or existing dataset. To launch an import, users can employ the DatasetImport resource, providing the manifest and target dataset as inputs. Once the import has been launched, it is possible to track the progress through the API or on the web interface.
library(quartzbio.edp)
vault <- Vault.get_personal_vault()
dataset_full_path <- paste(vault$full_path, "/r_examples/manifest_dataset", sep = ":")
dataset <- Dataset.get_or_create_by_full_path(dataset_full_path)
# Launch the import
imp <- DatasetImport.create(
dataset_id = dataset$id,
manifest = manifest,
commit_mode = "append"
)
# Wait for the import to complete
Dataset.activity(dataset$id)
The EDP can also pull data from DNAnexus, SevenBridges, and many other pipelines. Users can contact QuartzBio Support for more information.
Importing from Records
The EDP can also import data as a list of records, i.e. a list of Python dictionaries or R lists. Note that the EDP supports importing only up to 5,000 records at a time through this method. Importing from records is best suited for small datasets and for making edits to existing datasets. For larger imports and transforms, importing from compressed JSONL files is recommended.
library(quartzbio.edp)
vault <- Vault.get_personal_vault()
dataset_full_path <- paste(vault$full_path, "/r_examples/records_dataset", sep = ":")
dataset <- Dataset.get_or_create_by_full_path(dataset_full_path)
records <- list(
list(gene = "CFTR", importance = 1, sample_count = 2104),
list(gene = "BRCA2", importance = 1, sample_count = 1391),
list(gene = "CLIC2", importance = 5, sample_count = 14)
)
imp <- DatasetImport.create(
dataset_id = dataset$id,
data_records = records
)
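Once the import completes, the dataset can be queried to confirm the records were committed:
# Wait for the import to complete, then inspect the records
Dataset.activity(dataset$id)
Dataset_query(dataset$id, exclude_fields = c("_id", "_commit"))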
Transforming Imported Data
Imported data can be transformed (fields added or edited) by providing a list of fields to the target_fields parameter. Expressions can be used to dynamically modify the data as it is imported, making it possible to:
- Modify data types (numbers to strings or vice-versa)
- Add new fields with static or dynamic content
- Format strings and dates to clean the data
- Merge data from other datasets
The following example imports a list of records and transforms the contents in a single step:
library(quartzbio.edp)
vault <- Vault.get_personal_vault()
dataset_full_path <- paste(vault$full_path, "/r_examples/transform_dataset", sep = ":")
dataset <- Dataset.get_or_create_by_full_path(dataset_full_path)
# The original records
records <- list(
list(name = "Francis Crick"),
list(name = "James Watson"),
list(name = "Rosalind Franklin")
)
# The transforms to apply through "target_fields"
# Compute the first and last names.
target_fields <- list(
list(
name = "first_name",
description = "Adds a first name column based on name column",
data_type = "string",
expression = "record.name.split(' ')[0]"
),
list(
name = "last_name",
description = "Adds a last name column based on name column",
data_type = "string",
expression = "record.name.split(' ')[-1]"
)
)
# Import and transform the records
imp <- DatasetImport.create(
dataset_id = dataset$id,
data_records = records,
target_fields = target_fields
)
# Wait until import is completed
Dataset.activity(dataset$id)
Dataset_query(dataset$id, exclude_fields = c("_id", "_commit"))
# Output:
# first_name last_name name
# 1 Francis Crick Francis Crick
# 2 James Watson James Watson
# 3 Rosalind Franklin Rosalind Franklin
Existing imported data can also be modified by using migrations. This allows a user to add a column, modify data within a column, or remove a column.
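As a rough illustration, a migration could add a computed column to the dataset created above. This is only a sketch: it assumes the client exposes a DatasetMigration.create method with source_id, target_id, target_fields, and commit_mode parameters; consult the Migrations documentation for the exact interface.
# Hypothetical sketch: add an uppercased name column via a migration
# (assumes DatasetMigration.create accepts these parameters)
migration <- DatasetMigration.create(
  source_id = dataset$id,
  target_id = dataset$id,
  target_fields = list(
    list(
      name = "name_upper",
      data_type = "string",
      expression = "record.name.upper()"
    )
  ),
  commit_mode = "overwrite"
)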
Validating Imported Data
When importing data, every record is validated to ensure it can be committed into a Dataset. Validation compares the schema of existing Dataset fields with the values of incoming data and issues validation errors if the Dataset field schema does not match the incoming value. Validation can also issue warnings.
During validation, a field’s data_type and is_list values are checked. All records are evaluated (although users may override this to fail fast on the first error). A commit will not be created if there are any validation errors.
The following settings can be passed to the validation_params field.
- disable - (boolean) default False - Disables validation completely.
- raise_on_errors - (boolean) default False - Fails the import on the first validation error encountered.
- strict_validation - (boolean) default False - Upgrades all validation warnings to errors.
- allow_new_fields - (boolean) default False - If strict_validation is True, still allows new fields to be added.
The following example fails an import as soon as invalid data is detected:
imp <- DatasetImport.create(
dataset_id = dataset$id,
object_id = uploaded_file$id,
validation_params = list(
raise_on_errors = TRUE
)
)
The following example disables validation from running, which can improve import performance.
imp <- DatasetImport.create(
dataset_id = dataset$id,
object_id = uploaded_file$id,
validation_params = list(
disable = TRUE
)
)
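The following example combines strict_validation with allow_new_fields so that warnings are treated as errors while new fields can still be added to the dataset schema:
imp <- DatasetImport.create(
  dataset_id = dataset$id,
  object_id = uploaded_file$id,
  validation_params = list(
    strict_validation = TRUE,
    allow_new_fields = TRUE
  )
)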
Committing Imported Data
Once data has been extracted from files, transformed, and validated, it will be automatically indexed (“committed”) into EDP’s datastore. Dataset commits represent all changes made to the target dataset by the import process. Four commit modes can be selected depending on the scenario: append (default), overwrite, upsert, and delete. The commit mode can be specified when creating the DatasetImport using the commit_mode parameter.
append (default)
Append mode always adds records to the dataset. Imported record IDs (the _id field) will be overwritten with unique values. Only append commits can be rolled back at this time.
overwrite
Overwrite mode requires that each record have a value in the _id field. Existing records with the same _id are overwritten completely.
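For example, records that carry their own _id values can be re-imported in overwrite mode to replace the existing versions. This is a minimal sketch; the _id values shown are placeholders for real record IDs returned by a query.
# Overwrite existing records by _id (placeholder IDs for illustration)
records <- list(
  list(`_id` = "record-1", gene = "CFTR", importance = 2),
  list(`_id` = "record-2", gene = "BRCA2", importance = 3)
)
imp <- DatasetImport.create(
  dataset_id = dataset$id,
  data_records = records,
  commit_mode = "overwrite"
)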
Performance Tips
Below are some tips for improving the performance of dataset imports.
Disable Data Validation
Data validation is enabled by default when running imports or migrations and performs data type checking on each record that is processed. Disabling it provides a per-record performance improvement, which translates to substantial time savings for large datasets.
Dataset Capacity
For many simultaneous imports, use a larger dataset capacity. Simultaneous imports have a high upper limit (50+), but simultaneous commits are throttled. Every import spawns a commit that does the actual indexing of the data. Small-capacity datasets allow a single running commit per dataset at a time, medium-capacity datasets allow 2 simultaneous commits, and large-capacity datasets allow 3. Commits remain queued until the running ones complete.
Indexing and query operations are also faster for larger-capacity datasets. If a dataset is expected to be queried at high frequency, we recommend using a larger capacity. If the dataset already exists, copy it into a medium- or large-capacity dataset.
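As a sketch, a higher-capacity dataset can be requested when the dataset is created. The example below assumes that Dataset.get_or_create_by_full_path forwards a capacity argument accepting values such as "small", "medium", or "large"; check the Datasets documentation for the supported options.
library(quartzbio.edp)
vault <- Vault.get_personal_vault()
dataset_full_path <- paste(vault$full_path, "/r_examples/large_dataset", sep = ":")
# "capacity" is assumed to be forwarded to dataset creation
dataset <- Dataset.get_or_create_by_full_path(dataset_full_path, capacity = "large")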
API Endpoints
Methods do not accept URL parameters or request bodies unless specified. Please note that if your EDP endpoint is sponsor.edp.aws.quartz.bio, the corresponding API endpoint is sponsor.api.edp.aws.quartz.bio.
Dataset Imports
Method | HTTP Request | Description | Authorization | Response |
---|---|---|---|---|
create | POST https:// | Create a dataset import for a dataset. | This request requires an authorized user with write permission on the dataset. | The response returns “HTTP 201 Created”, along with the DatasetImport resource, when successful. |
Request Body:
In the request body, provide an object with the following properties:
Property | Value | Description |
---|---|---|
commit_mode | string | A valid commit mode. |
dataset_id | integer | The ID of the dataset to import data into. |
object_id | integer | (Optional) The ID of an existing object (file) on EDP. |
manifest | object | (Optional) A file manifest (see below). |
data_records | objects | (Optional) A list of records to import synchronously. |
description | string | (Optional) A description of this import. |
entity_params | object | (Optional) Configuration parameters for entity detection. |
reader_params | object | (Optional) Configuration parameters for readers. |
validation_params | object | (Optional) Configuration parameters for validation. |
annotator_params | object | (Optional) Configuration parameters for the Annotator. |
include_errors | boolean | If True, a new field (_errors) will be added to each record containing expression evaluation errors (default: True). |
target_fields | objects | A list of valid dataset fields to create or override in the import. |
priority | integer | A priority to assign to this task. |
When creating a new import, either a manifest, an object_id, or data_records must be provided. Using a manifest allows you to import a remote file accessible by HTTP(S), for example:
# Example Manifest
{
"files": [{
"url": "https://example.com/file.json.gz",
"name": "file.json.gz",
"format": "json",
"size": 100,
"md5": "",
"base64_md5": ""
}]
}
Manifests can include the following parameters:
Property | Value | Description |
---|---|---|
url | string | A publicly accessible URL pointing to a file to import into EDP. You must pass a URL or an object_id. |
object_id | long | The ID of an existing object on EDP. You must pass an object_id or a URL. |
name | string | (Optional) The name of the file. If not passed, EDP will take it from the URL or object. |
format | string | (Optional) The file format of the file. If not passed, EDP will take it from the URL or object. |
md5 | string | (Optional) The md5 hash of the file contents. If passed, EDP will validate the file after downloading and fail if mismatched. |
entity_params | object | (Optional) Configuration parameters for entity detection. |
reader_params | object | (Optional) Configuration parameters for readers. |
validation_params | object | (Optional) Configuration parameters for validation. |
Method | HTTP Request | Description | Authorization | Response |
---|---|---|---|---|
delete | DELETE https:// | Delete a dataset import. | This request requires an authorized user with write permission on the dataset. | The response returns “HTTP 200 OK” when successful. |
Deleting dataset imports is not recommended as data provenance will be lost.
Method | HTTP Request | Description | Authorization | Response |
---|---|---|---|---|
get | GET https:// | Retrieve metadata about an import. | This request requires an authorized user with read permission on the dataset. | The response contains a DatasetImport resource. |
Method | HTTP Request | Description | Authorization | Response |
---|---|---|---|---|
list | GET https:// | List the imports associated with a dataset. | This request requires an authorized user with read permission on the dataset. | The response contains a list of DatasetImport resources. |
Method | HTTP Request | Description | Authorization | Response |
---|---|---|---|---|
cancel | PUT https:// | Cancel a dataset import. | This request requires an authorized user with write permission on the dataset. | The response contains a DatasetImport resource with the status canceled. |
Request Body
In the request body, provide a valid DatasetImport resource (see create above) with status = canceled.