Exporting Data via API

Overview

The EDP provides several data accessibility and portability tools to facilitate the export of data to downstream tools for molecular analysis.

Datasets can be exported in multiple formats:

  • JSON: JSON Lines format (gzipped).

  • CSV: Comma Separated Value format (flattened, gzipped).

  • TSV: Tab Separated Value format (flattened, gzipped).

  • Excel (XLSX): Microsoft Excel format (flattened).

Exporting data can take anywhere from a few seconds to tens of minutes, depending on the number of records and selected format. Exports are processed server-side, and the output is a downloadable file. An exported JSON file can be re-imported into EDP without any modification.

Export Limits

Different export formats have different limits:

Format

Max Records

Excel

1,048,576 records

JSON

500,000,000 records

TSV

500,000,000 records

CSV

500,000,000 records

Flattened Fields (CSV/XLSX)

CSV and XLSX exports are processed by a flattening algorithm during export. The reason for this is to handle list fields, which are not well supported by Excel and other CSV readers. The following example illustrates the effects of the flattening algorithm:

The following dataset records:

{"a": "a", "b": ["x"]}
{"a": "a", "b": ["x", "y"]}
{"a": "a", "b": ["x", "y", "z"]}

will be exported to the following CSV:

a,b.0,b.1,b.2
a,x,,
a,x,y,
a,x,y,z

Export a Dataset

To export a dataset, users can retrieve it by name or ID and initiate the export. Exports can take a few minutes for large datasets. Users can always start a large export and check back when it finishes on the Activity tab of the EDP web interface. Exports can also be saved directly into a vault (with target_full_path keyword argument) and accessed from there.

In Python:

from solvebio import Dataset

dataset = Dataset.get_by_full_path('quartzbio:Public:/HGNC/3.3.1-2021-08-25/HGNC')

# Export the entire dataset (~40k records), this may take a minute...
# NOTE: `format` can be: json, tsv, csv, or excel
#       `send_email_on_completion`: enable/disable sending an email when the export is ready
export = dataset.export(format='json', follow=True, send_email_on_completion=True)

# Save the exported file to the current directory
export.download('./')

# Exports can also be saved to a path in a vault
dataset.export(target_full_path='my_vault:/path/to/json_files_folder/my_export')
Python

Exporting Large Amounts of Data

An example file size for a CSV file with 150M rows and 50 columns populated with floats and relatively short strings is about 50GB. In general, users are recommended not to work with files this size directly and instead to shrink the export by applying filters or selecting only specific columns. If necessary, users can also export in batches (e.g. export by chromosome or sample).

Export a Filtered Dataset

Users can leverage the dataset filtering system to export a slice of a dataset:

In Python:

from solvebio import Dataset

dataset = Dataset.get_by_full_path('quartzbio:Public:/ClinVar/5.2.0-20221105/Variants-GRCH37')

# Filter the dataset by field values, limit the number of results, select a subset of fields
query = dataset.query(limit=100, fields=["variant", "info.ORIGIN", "gene"]).filter(info.ORIGIN__gte=3)

# Export the query (100 records, filtered on a field)
# NOTE: `format` can be: json, tsv, csv, or excel
#       `send_email_on_completion`: enable/disable sending an email when the export is ready
export = query.export(format='json', follow=True, send_email_on_completion=True)

# Save the exported file to a specific location (optionally with a specific name)
export.download(path='./my_variants.json.gz')

Export in Batches

Users can export in batches using the Python client library, such as to export data by chromosome:

from solvebio import Datasetdataset = Dataset.get_by_full_path('quartzbio:Public:/ClinVar/5.2.0-20221105/Variants-GRCH37')# Get available chromosomesfacets = dataset.query().facets(**{'genomic_coordinates.chromosome': {'facet_type': 'terms', 'limit': 100}})print("Found {} chromosomes".format(len(facets['genomic_coordinates.chromosome'])))for chromosome, records_count in facets['genomic_coordinates.chromosome']:    # Defines a location on EDP to export to ("~/" represents a shortcut to the user's personal vault)    # Appends chromosome to the filename    target_full_path = "~/clinvar_{}.csv.gz".format(chromosome)    # Filter the query by chromosome    filtered_query = query.filter(**{'genomic_coordinates.chromosome': chromosome})    # Export    filtered_query.export(format='csv-gz', follow=False, send_email_on_completion=True, target_full_path=target_full_path)

API Endpoints

Methods do not accept URL parameters or request bodies unless specified. Please note that if your EDP endpoint is sponsor.edp.aws.quartz.bio, you would use sponsor.api.edp.aws.quartz.bio.

Dataset Exports

Method

HTTP Request

Description

Authorization

Response

create

POST https://<EDP_API_HOST>/v2/dataset_exports

Create a dataset export for a dataset.

This request requires an authorized user with read permission for the dataset.

The response contains a single DatasetExport resource.

Request Body:

In the request body, provide an object with the following properties:

Property

Value

Description

dataset_id

integer

A valid dataset ID.

format

string

The export file format.

params

object

Dataset query parameters.

target_full_path

string

(Optional) A vault location to store the export output (must be an EDP full path).

priority

integer

(Optional) A priority to assign to this task.

send_email_on_completion

boolean

(Optional) An email is sent when the export is ready (default: false)

The following export formats (format property) are available:

Format

Extension

Description

json

.json.gz

JSONL format, gzipped.

csv

.csv

Comma-separated format

csv-expand

.csv

Comma-separated format, with expanded list values.

excel

.xlsx

Excel (XLSX) format.

excel-expand

.xlsx

Excel (XLSX) format, with list values expanded.

When using an “expanded” mode, fields containing list values (multiple distinct values) will be expanded into independent columns in the output. This is useful in some downstream applications that do not natively support list within columns.

The following query parameters (params property) are supported for exports:

Property

Value

Description

limit

integer

The number of records to export (between 1 and 1,000,000).

filters

objects

A valid filter object.

fields

string

A list of fields to include in the results.

exclude_fields

string

A list of fields to exclude in the results.

query

string

A valid query string.

Method

HTTP Request

Description

Authorization

Response

delete

DELETE https://<EDP_API_HOST>/v2/dataset_exports/{ID}

Delete a dataset export.

This request requires an authorized user with write permissions on the dataset.

The response returns “HTTP 200 OK” when successful.

download | GET https://<EDP_API_HOST>/v2/dataset_exports/{ID}/download | Download a dataset export. | This request requires an authorized user with read permissions on the dataset. | The default response is a 302 redirect.

When redirect mode is disabled, the response contains a URL to the file. |

Parameters

This request accepts the following parameter:

Property

Value

Description

redirect

boolean

Return a 302 redirect to the download location (default: true).

Dataset exports may expire after 24 hours, after which the download URL will not work. Please re-run the export if necessary.

Method

HTTP Request

Description

Authorization

Response

get

GET https://<EDP_API_HOST>/v2/dataset_exports/{ID}

Retrieve metadata about an export.

This request requires an authorized user with read permissions on the dataset.

The response contains a DatasetExport resource.

Method

HTTP Request

Description

Authorization

Response

list

GET https://<EDP_API_HOST>/v2/datasets/{DATASET_ID}/exports

List the exports associated with a dataset.

This request requires an authorized user with read permissions on the dataset.

The response contains a list of DatasetExport resources.

Method

HTTP Request

Description

Authorization

Response

cancel

PUT https://<EDP_API_HOST>/v2/datasets_exports/{ID}/cancel

List the exports associated with a dataset.

This request requires an authorized user with read permissions on the dataset.

The response will contain a DatasetExport resource with the status canceled.

Request Body:

In the request body, provide a valid DatasetExport object (see create above) with status = canceled.