Overview
Dataset templates describe how data should be transformed. A template is a collection of fields (columns) that describe the desired format of some input data. Templates are used to import files, export, query or migrate data. They allow for field normalization and transformation, and also for the addition of fields and annotations. In this article, users will learn how to view, create, update and apply templates for transforming datasets.
Retrieving Templates
To list all templates:
all_templates <- DatasetTemplate.all(template_type = "dataset")
It will return all available dataset templates with their names, template id, organization, account id and status.
To retrieve a template by known ID:
template <- DatasetTemplate.retrieve(id = "template_id")
Template Fields
The list of fields is the most important part of a template. Each field describes a DatasetField, an object that defines properties of the field such as name, title, and data type as key-value pairs. Fields are written as lists in R.
Example:
Template Permissions
By default, all newly created templates are accessible by everyone at the user’s organization.
If users want the template to be shown in the UI (in the modal used for transforming files), then they can add the import tag to the template.
DatasetTemplate.update("id of your template", tags = list("import"))
If users would like to make the template private (accessible only to your user), then they can set the account_id parameter to None.
DatasetTemplate.update("id of your template", account = NULL)
If users would like to make the template accessible to all users on EDP (including those outside of your organization) then they can set the is_public parameter to True. As always, users should not share anything sensitive outside of their organization.
DatasetTemplate.update("id of your template", is_public = TRUE)
Create a Template
To create a template, users should prepare the list of DatasetFields with information about data types, expressions, entities, etc.
Example of list of fields:
fields <- list(
list(
name = "sample",
depends_on = list("subject"),
entity_type = "sample",
description = "Sample ID from SUBJECT",
data_type = "string",
ordering = 1,
expression = "record.subject"
),
list(
name = "study",
title = "STUDY",
description = "Study Code",
ordering = 2,
expression = "None if value == 'UNASSIGNED' else value",
data_type = "string"
),
list(
data_type = "string",
depends_on = list("hgvs_c"),
description = "EDP variant entity, computed from the short variant CDS change",
expression = "entity_ids('variant', record.hgvs_c) if record.hgvs_c else None",
is_transient = TRUE,
name = "variant_cdna_grch38"
)
)
The following attributes should be added: - name - the name of the field - data_type - the data type of the field - entity_type – the entity type (only necessary for entity querying)
The following attributes are optional, but responsible for much of the data transformation:
expression - The expression that will be evaluated to populate this field’s value. Put “value” to use the current value. Users can refer to the Expressions documentation for more information. In order to use data from another field (for comparison, splits, etc), users should make sure that they also add it to the list of fields, which will allow them to retrieve it using expression context variables: record.name_of_field. Users should also add the field in the depends_on parameter.
depends_on - This is a list of fields that the expression depends on. Users can add any field names here. This will ensure that those fields expressions are evaluated before its dependents. The template creation will fail if there is a circular dependency.
is_transient - A transient field is a field that is not indexed into the dataset, but calculated only while the template annotation is running. This is useful for temporary fields/variables for complex templates (default is False)
The following attributes are optional, and informational only, but encouraged:
- title - The field’s display name, shown in the UI and in CSV/Excel exports.
- description - Describes the contents of the field, shown in the UI.
- ordering - The order in which this column appears when retrieving data from the dataset. Order is 0-based. Default is 0
- is_hidden - Set to True if the field should be excluded by default from the UI.
After the list of the fields is prepared, other information about a template can be added:
template <- list(
name = "My Variant Template",
version = "1.2.0",
description = "Import a special CSV file. Genome is assumed to be GRCh38, also has variant entity for GRCh37.", template_type = "dataset",
is_public = FALSE,
entity_params = list("disable" = TRUE),
fields = fields
)
The template_type should be set to “dataset”.
my_template <- DatasetTemplate.create(
name = "My Variant Template",
version = "1.2.0",
description = "Import a special CSV file. Genome is assumed to be GRCh38, also has variant entity for GRCh37.", template_type = "dataset",
is_public = FALSE,
fields = fields
)
Printing the template object will show the template’s ID and contents.
Create a Dataset with a Template
Users can create a dataset and set the structure with a template.
After that, users can create the template:
template <- DatasetTemplate.retrieve("id of your template")
# Specify where you want to create your new dataset
vault <- Vault.get_personal_vault()
dataset_full_path <- paste(vault$full_path, "my_fields_dataset", sep = "/")
# Dataset will now have the non-transient fields from the template
# with desired titles/descriptions and expressions, but no records
dataset <- Dataset.get_or_create_by_full_path(dataset_full_path, fields = template$fields)
Users can also create and a dataset and add the fields during file import:
template <- DatasetTemplate.retrieve("id of your template")
# Specify where you want to create your new dataset
vault <- Vault.get_personal_vault()
dataset_full_path <- paste(vault$full_path, "my_fields_dataset", sep = "/")
dataset <- Dataset.get_or_create_by_full_path(dataset_full_path, fields = template$fields)
file_object <- Object.retrieve("id of file uploaded to EDP")
DatasetImport.create(
dataset_id = dataset$id,
object_id = file_object$id,
target_fields = template$fields,
commit_mode = "append"
)
# Wait for import to finish
Dataset.activity(dataset$id, follow = TRUE)
Update a Template:
Template attributes such as the list of fields can be edited. For example, new fields can be added to an existing template. In this example, a new field called “phase_numeric” is added to the template to transform roman numerals to numbers in the phase field of the dataset.
library(quartzbio.edp)
library(jsonlite)
# List all templates
all_templates <- DatasetTemplate.all(template_type = "dataset", tags = "import")
# Print ID & name of each
data.frame(all_templates$data$id, all_templates$data$name)
# Retrieve a template by ID
template <- DatasetTemplate.retrieve("template_id")
# Write to a JSON file for editing
fields_json <- toJSON(template$fields, pretty = TRUE)
writeLines(fields_json, "template.json")
# Edit the template.json file as needed
# Add a field to file:
# {
# "title": "phase_numeric",
# "name": "phase_numeric",
# "data_type": "string",
# "depends_on": ["phase"],
# "expression": "record.phase.replace('III', '3').replace('II', '2').replace('I', '1')"
# }
# Read the updated JSON file and convert it to a list
updated_fields <- fromJSON(readLines("template.json"))
# Test the template by applying it to a few records
dataset <- Dataset.get_by_full_path("vault:/my/dataset/")
records <- Dataset.query(dataset$id, limit = 10)
df <- Annotator.annotate(records, updated_fields)
head(df)
# Save the template to the API
template <- DatasetTemplate.update(template$id, fields = updated_fields)
# Apply it to the existing dataset
DatasetMigration.create(
source_id = dataset$id,
target_id = dataset$id,
target_fields = template$fields,
commit_mode = "upsert",
source_params = list(filters = list())
)
# Follow the task, wait for it to complete
Dataset.activity(dataset$id, follow = TRUE)
Building and Testing Templates with the Annotator
When creating new templates it it is useful to use the annotator to test and validate the fields and their expressions. The below snippet will use the annotator to process records in real time with the template fields.
require(quartzbio.edp)
# Get records from dataset
dataset <- Dataset.get_or_create_by_full_path(dataset_full_path)
records <- dataset.query()
# Load files from a template
template <- DatasetTemplate.retrieve(template_id)
fields <- template$fieldsAnnotator.annotate(records = records, fields = fields)
API Endpoints
Methods do not accept URL parameters or request bodies unless specified. Please note that if your EDP endpoint is sponsor.edp.aws.quartz.bio, you would use sponsor.api.edp.aws.quartz.bio.
Dataset Templates
Method | HTTP Request | Description | Authorization | Response |
---|---|---|---|---|
create | POST https:// |
Create a dataset template. | This request requires an authorized user with appropriate permissions. | The response contains the new DatasetTemplate resource. |
Request Body: In the request body, provide an object with the following properties:
Property | Value | Description |
---|---|---|
name | string | A short name for the new template. |
description | string | A description for the template. |
fields | objects | A list of valid dataset fields. |
version | string | A string representing a template version (no spaces). |
template_type | string | The type of template: dataset, recipe or search. |
is_public | boolean | True if visible to anyone in a user’s organization (default False). |
annotator_params | object | (optional) Configuration parameters for the Annotator. |
entity_params | object | (optional) Configuration parameters for entity detection. |
reader_params | object | (optional) Configuration parameters for readers. |
validation_params | object | (optional) Configuration parameters for validation. |
Method | HTTP Request | Description | Authorization | Response |
---|---|---|---|---|
delete | DELETE https:// |
Delete a dataset template. | This request requires an authorized user with write permissions on the resource. | The response returns “HTTP 200 OK” when successful. |
Method | HTTP Request | Description | Authorization | Response |
---|---|---|---|---|
generate | GET https:// |
Create a dataset template from an existing dataset. | This request requires an authorized user with permission. | The response contains an unsaved DatasetTemplate object for the specified dataset. |
Method | HTTP Request | Description | Authorization | Response |
---|---|---|---|---|
get | GET https:// |
Retrieve a dataset template. | This request requires an authorized user with permission. | The response contains a DatasetTemplate resource. |
Method | HTTP Request | Description | Authorization | Response |
---|---|---|---|---|
list | GET https:// |
Retrieve a list of available dataset templates. | This request requires an authorized user with read permission. | The response contains a list of resources visible to the requesting user. |
Parameters This request accepts the following parameters:
Property | Value | Description |
---|---|---|
limit | integer | The number of objects to return per page. |
offset | integer | The offset within the list of available objects. |