Exporting Data via API

Overview

The EDP provides data accessibility and portability tools for exporting data to downstream molecular analysis tools.

Datasets can be exported in multiple formats:

JSON: JSON Lines format (gzipped).
CSV: Comma Separated Value format (flattened, gzipped).
TSV: Tab Separated Value format (flattened, gzipped).
Excel (XLSX): Microsoft Excel format (flattened).

Exporting data can take anywhere from a few seconds to tens of minutes, depending on the number of records and selected format. Exports are processed server-side, and the output is a downloadable file. An exported JSON file can be re-imported into EDP without any modification.

Export Limits

Different export formats have different limits:

Format	Max Records
Excel	1,048,576 records
JSON	500,000,000 records
TSV	500,000,000 records
CSV	500,000,000 records
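
The limits in the table can be checked before an export is started. The following is a minimal sketch rather than part of the EDP client: the names `MAX_RECORDS`, `fits_export_limit`, and `formats_supporting` are hypothetical, and the numbers are copied from the table.

```python
# Hypothetical helper, not part of the EDP client.
# Limits are copied from the "Export Limits" table.
MAX_RECORDS = {
    "excel": 1_048_576,
    "json": 500_000_000,
    "tsv": 500_000_000,
    "csv": 500_000_000,
}


def fits_export_limit(export_format, record_count):
    """Return True if record_count is within the documented limit for the format."""
    try:
        limit = MAX_RECORDS[export_format]
    except KeyError:
        raise ValueError("Unknown export format: {!r}".format(export_format))
    return record_count <= limit


def formats_supporting(record_count):
    """List the formats whose documented limit can hold record_count records."""
    return sorted(fmt for fmt, limit in MAX_RECORDS.items() if record_count <= limit)


# A 2,000,000-record dataset is too large for Excel but fits the other formats.
print(fits_export_limit("excel", 2_000_000))
print(formats_supporting(2_000_000))
```

A check like this is most useful when the export format is chosen by the user at run time.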

Flattened Fields (CSV/XLSX)

CSV and XLSX exports are processed by a flattening algorithm during export. Flattening handles list fields, which are not well supported by Excel and other CSV readers: each list element is written to its own column named after the field and the element's position, and records with shorter lists leave the remaining columns empty. For example:

The following dataset records:

{"a": "a", "b": ["x"]}
{"a": "a", "b": ["x", "y"]}
{"a": "a", "b": ["x", "y", "z"]}

will be exported to the following CSV:

a,b.0,b.1,b.2
a,x,,
a,x,y,
a,x,y,z
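
The example above can be reproduced locally with a few lines of Python. This is an illustrative sketch only, not the server-side implementation: the function names `flatten_record` and `records_to_csv` are hypothetical, and the sketch assumes lists contain only scalar values.

```python
import csv
import io


def flatten_record(record):
    """Expand list values into '<field>.<index>' keys; leave scalars unchanged."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, list):
            for index, item in enumerate(value):
                flat["{}.{}".format(key, index)] = item
        else:
            flat[key] = value
    return flat


def records_to_csv(records):
    """Flatten every record and write them under the union of all columns."""
    flat_records = [flatten_record(record) for record in records]

    # Collect column names in first-seen order across all records.
    columns = []
    for flat in flat_records:
        for key in flat:
            if key not in columns:
                columns.append(key)

    buffer = io.StringIO()
    # restval="" leaves a cell empty when a record has no value for a column.
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(flat_records)
    return buffer.getvalue()


records = [
    {"a": "a", "b": ["x"]},
    {"a": "a", "b": ["x", "y"]},
    {"a": "a", "b": ["x", "y", "z"]},
]
print(records_to_csv(records))
```

The number of `b.N` columns is determined by the longest list in the exported records, which is why shorter records end in empty cells.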

Export a Dataset

To export a dataset, retrieve it by name or ID and start the export. Exports of large datasets can take a few minutes, so users can start a large export and check the Activity tab of the EDP web interface to see when it finishes. Exports can also be saved directly into a vault (using the target_full_path keyword argument) and accessed from there.

In Python:

from solvebio import Dataset

dataset = Dataset.get_by_full_path('quartzbio:Public:/HGNC/3.3.1-2021-08-25/HGNC')

# Export the entire dataset (~40k records), this may take a minute...
# NOTE: `format` can be: json, tsv, csv, or excel
#       `send_email_on_completion`: enable/disable sending an email when the export is ready
export = dataset.export(format='json', follow=True, send_email_on_completion=True)

# Save the exported file to the current directory
export.download('./')

# Exports can also be saved to a path in a vault
dataset.export(target_full_path='my_vault:/path/to/json_files_folder/my_export')

Exporting Large Amounts of Data

As a reference point, a CSV file with 150M rows and 50 columns populated with floats and relatively short strings is about 50 GB. Working with files of this size directly is generally not recommended. Instead, shrink the export by applying filters or selecting only specific columns. If necessary, users can also export in batches (e.g. by chromosome or by sample).
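
The reference figure above works out to roughly 6.7 bytes per cell, which gives a rough way to estimate other CSV export sizes. The sketch below is back-of-the-envelope arithmetic only: it assumes 1 GB = 10^9 bytes and that the data resembles the reference case (floats and short strings), and the name `estimate_csv_size_gb` is hypothetical.

```python
# Reference point from the text: 150M rows x 50 columns is about 50 GB.
REFERENCE_ROWS = 150_000_000
REFERENCE_COLUMNS = 50
REFERENCE_BYTES = 50 * 10**9  # assumption: 1 GB = 10^9 bytes

# Average bytes per cell implied by the reference point (about 6.67).
BYTES_PER_CELL = REFERENCE_BYTES / (REFERENCE_ROWS * REFERENCE_COLUMNS)


def estimate_csv_size_gb(rows, columns):
    """Rough uncompressed CSV size in GB, scaled linearly from the reference point."""
    return rows * columns * BYTES_PER_CELL / 10**9


print(round(BYTES_PER_CELL, 2))
# Selecting 10 columns and filtering down to 10M rows shrinks the estimate considerably.
print(round(estimate_csv_size_gb(10_000_000, 10), 2))
```

Actual sizes depend on the field contents, so the estimate is only a guide for deciding whether to filter, select columns, or batch.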

Export a Filtered Dataset

Users can leverage the dataset filtering system to export a slice of a dataset:

In Python:

from solvebio import Dataset

dataset = Dataset.get_by_full_path('quartzbio:Public:/ClinVar/5.2.0-20221105/Variants-GRCH37')

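# Filter the dataset by field values, limit the number of results, select a subset of fields
# (a field name containing a dot is not a valid keyword argument, so the filter is passed as an unpacked dict)
query = dataset.query(limit=100, fields=["variant", "info.ORIGIN", "gene"]).filter(**{"info.ORIGIN__gte": 3})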

# Export the query (100 records, filtered on a field)
# NOTE: `format` can be: json, tsv, csv, or excel
#       `send_email_on_completion`: enable/disable sending an email when the export is ready
export = query.export(format='json', follow=True, send_email_on_completion=True)

# Save the exported file to a specific location (optionally with a specific name)
export.download(path='./my_variants.json.gz')

Export in Batches

Users can export in batches using the Python client library, for example to export data by chromosome:

from solvebio import Dataset

dataset = Dataset.get_by_full_path('quartzbio:Public:/ClinVar/5.2.0-20221105/Variants-GRCH37')
query = dataset.query()

# Get available chromosomes
facets = query.facets(**{'genomic_coordinates.chromosome': {'facet_type': 'terms', 'limit': 100}})
print("Found {} chromosomes".format(len(facets['genomic_coordinates.chromosome'])))

for chromosome, records_count in facets['genomic_coordinates.chromosome']:
    # Defines a location on EDP to export to ("~/" represents a shortcut to the user's personal vault)
    # Appends chromosome to the filename
    target_full_path = "~/clinvar_{}.csv.gz".format(chromosome)

    # Filter the query by chromosome
    filtered_query = query.filter(**{'genomic_coordinates.chromosome': chromosome})

    # Export without waiting for completion (follow=False), so the next batch starts immediately
    filtered_query.export(format='csv-gz', follow=False, send_email_on_completion=True, target_full_path=target_full_path)

API Endpoints

Methods do not accept URL parameters or request bodies unless specified. In the requests below, <EDP_API_HOST> is the API host of your EDP instance. For example, if your EDP endpoint is sponsor.edp.aws.quartz.bio, you would use sponsor.api.edp.aws.quartz.bio.
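
The host name mapping in the example can be sketched in Python. This assumes that every EDP host follows the same pattern as the single example given here (an `api` label inserted after the first label), which the text does not guarantee, and `api_host_for` is a hypothetical name. Confirm the API host for your own instance before relying on it.

```python
def api_host_for(edp_host):
    """Insert an 'api' label after the first label of the EDP host name.

    Assumption: the host follows the same pattern as the documented example
    (sponsor.edp.aws.quartz.bio -> sponsor.api.edp.aws.quartz.bio).
    """
    first_label, separator, rest = edp_host.partition(".")
    if not separator or not rest:
        raise ValueError("Expected a fully qualified host name: {!r}".format(edp_host))
    return "{}.api.{}".format(first_label, rest)


print(api_host_for("sponsor.edp.aws.quartz.bio"))
```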

Dataset Exports

Method	HTTP Request	Description	Authorization	Response
create	POST https://<EDP_API_HOST>/v2/dataset_exports	Create a dataset export for a dataset.	This request requires an authorized user with read permission for the dataset.	The response contains a single DatasetExport resource.

Request Body:

In the request body, provide an object with the following properties:

Property	Value	Description
dataset_id	integer	A valid dataset ID.
format	string	The export file format.
params	object	Dataset query parameters.
target_full_path	string	(Optional) A vault location to store the export output (must be an EDP full path).
priority	integer	(Optional) A priority to assign to this task.
send_email_on_completion	boolean	(Optional) Whether to send an email when the export is ready (default: false).

The following export formats (format property) are available:

Format	Extension	Description
json	.json.gz	JSONL format, gzipped.
csv	.csv	Comma-separated format.
csv-expand	.csv	Comma-separated format, with expanded list values.
excel	.xlsx	Excel (XLSX) format.
excel-expand	.xlsx	Excel (XLSX) format, with list values expanded.

When using an “expanded” mode, fields containing list values (multiple distinct values) will be expanded into independent columns in the output. This is useful for downstream applications that do not natively support lists within columns.

The following query parameters (params property) are supported for exports:

Property	Value	Description
limit	integer	The number of records to export (between 1 and 1,000,000).
filters	objects	A valid filter object.
fields	string	A list of fields to include in the results.
exclude_fields	string	A list of fields to exclude in the results.
query	string	A valid query string.
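
As an illustration, the request body for the create method could be assembled as follows. This is a sketch under stated assumptions, not the client library's implementation: `build_export_request` is a hypothetical helper, the dataset ID `1234` is a placeholder, `fields` is assumed to be sent as a JSON list, and sending the request (including authentication, which this section does not cover) is left out.

```python
import json

# Format names copied from the table of export formats.
EXPORT_FORMATS = {"json", "csv", "csv-expand", "excel", "excel-expand"}


def build_export_request(dataset_id, export_format, limit=None, fields=None,
                         target_full_path=None, send_email_on_completion=False):
    """Build the JSON-serializable body for POST /v2/dataset_exports."""
    if export_format not in EXPORT_FORMATS:
        raise ValueError("Unsupported export format: {!r}".format(export_format))

    params = {}
    if limit is not None:
        # Documented range for the `limit` query parameter.
        if not 1 <= limit <= 1_000_000:
            raise ValueError("limit must be between 1 and 1,000,000")
        params["limit"] = limit
    if fields:
        # Assumption: the list of fields is sent as a JSON array.
        params["fields"] = list(fields)

    body = {
        "dataset_id": dataset_id,
        "format": export_format,
        "params": params,
        "send_email_on_completion": send_email_on_completion,
    }
    if target_full_path is not None:
        body["target_full_path"] = target_full_path
    return body


# 1234 is a placeholder dataset ID.
body = build_export_request(1234, "csv-expand", limit=100, fields=["variant", "gene"])
print(json.dumps(body, indent=2))
```

Validating `format` and `limit` before sending keeps obviously invalid requests from reaching the server.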

Method	HTTP Request	Description	Authorization	Response
delete	DELETE https://<EDP_API_HOST>/v2/dataset_exports/{ID}	Delete a dataset export.	This request requires an authorized user with write permissions on the dataset.	The response returns “HTTP 200 OK” when successful.

Method	HTTP Request	Description	Authorization	Response
download	GET https://<EDP_API_HOST>/v2/dataset_exports/{ID}/download	Download a dataset export.	This request requires an authorized user with read permissions on the dataset.	The default response is a 302 redirect. When redirect mode is disabled, the response contains a URL to the file.

Parameters

This request accepts the following parameter:

Property	Value	Description
redirect	boolean	Return a 302 redirect to the download location (default: true).

Dataset exports may expire after 24 hours, after which the download URL will not work. Please re-run the export if necessary.
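
A download URL with redirect mode disabled could be built as in the sketch below. It assumes that `redirect` is passed as a query-string parameter with the literal value `false`, which this section does not state explicitly. `build_download_url` is a hypothetical helper, and the host and export ID are placeholders.

```python
from urllib.parse import urlencode


def build_download_url(api_host, export_id, redirect=True):
    """Return the download URL for an export, optionally disabling the 302 redirect."""
    url = "https://{}/v2/dataset_exports/{}/download".format(api_host, export_id)
    if not redirect:
        # Assumption: the boolean is encoded as the string "false" in the query string.
        url += "?" + urlencode({"redirect": "false"})
    return url


# Placeholder host and export ID.
print(build_download_url("sponsor.api.edp.aws.quartz.bio", 42, redirect=False))
```

Disabling the redirect is convenient when the file URL needs to be handed to another tool instead of being followed by the HTTP client.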

Method	HTTP Request	Description	Authorization	Response
get	GET https://<EDP_API_HOST>/v2/dataset_exports/{ID}	Retrieve metadata about an export.	This request requires an authorized user with read permissions on the dataset.	The response contains a DatasetExport resource.

Method	HTTP Request	Description	Authorization	Response
list	GET https://<EDP_API_HOST>/v2/datasets/{DATASET_ID}/exports	List the exports associated with a dataset.	This request requires an authorized user with read permissions on the dataset.	The response contains a list of DatasetExport resources.

Method	HTTP Request	Description	Authorization	Response
cancel	PUT https://<EDP_API_HOST>/v2/datasets_exports/{ID}/cancel	Cancel a dataset export.	This request requires an authorized user with read permissions on the dataset.	The response will contain a DatasetExport resource with the status canceled.