Querying Datasets and Files

Overview

The EDP is designed for easy access to molecular information. It provides a real-time API for querying any dataset or file on the platform through the EDP Python or R client libraries; datasets can also be queried from Bash. Complex filters can be applied when querying datasets and files; see the Filters documentation for details.

Querying Datasets

Dataset query results are returned in pages, similar to a search engine. To narrow down results, datasets can be filtered on one or more fields. Users can build queries either in a programming language (or even as raw JSON) or directly on any dataset page in the EDP web application. The easiest way to query datasets is with the EDP Python or R client libraries.

A basic query returns a page of results from the specified public dataset. Users can set the paginate parameter to True to retrieve all records, or use the limit parameter to specify how many records to retrieve. Note that in the R client, the limit parameter retrieves a maximum of 10,000 records in a single request. The query function accepts the following parameters:

| Parameter | Value | Description |
| --- | --- | --- |
| filters | objects | A valid filter object. |
| facets | objects | A valid facets object. |
| fields | string | A list of fields to include in the results. |
| exclude_fields | string | A list of fields to exclude from the results. |
| ordering | string | A list of fields to order results by. |
| query | string | A valid query string. |
| limit | integer | The number of results to return per page. |
| offset | integer | The record offset in the result set. |
| paginate | boolean | If True, returns all records. Default is False. |
| page_size | integer | The internal batch size per request. Default is 100, with a maximum size of 10,000. Increasing page_size can speed up the query, but large values can in some cases cause requests to fail because of the amount of data returned in a single response. |
| output_format | string | The output format of the query (‘csv’, ‘tsv’, or ‘json’). Default is ‘json’. |
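The relationship between limit, offset, paginate, and page_size can be sketched locally. The following is a minimal illustration, not the client's implementation: fetch_page is a hypothetical stand-in for a single request, backed by an in-memory list, and fetch_all mimics what paginate=True is described as doing by requesting batches of page_size records until the result set is exhausted.

```python
FAKE_DATASET = [{"id": i} for i in range(250)]  # illustrative records only


def fetch_page(offset=0, limit=100):
    """Hypothetical single request: return `limit` records from `offset`."""
    return FAKE_DATASET[offset:offset + limit]


def fetch_all(page_size=100):
    """Collect every record by advancing the offset one batch at a time."""
    records, offset, requests_made = [], 0, 0
    while True:
        page = fetch_page(offset=offset, limit=page_size)
        requests_made += 1
        records.extend(page)
        if len(page) < page_size:  # a short page means nothing is left
            break
        offset += page_size
    return records, requests_made


records, requests_made = fetch_all(page_size=100)
print(len(records), requests_made)
```

A larger page_size lowers the number of requests at the cost of larger responses, which is the trade-off described for that parameter.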

In Python:

from solvebio import Dataset

clinvar = Dataset.get_by_full_path('quartzbio:Public:/ClinVar/5.2.0-20210110/Variants-GRCH37')

# Set how many records to retrieve with the "limit" parameter
clinvar.query(limit=1000)

# Order the query results by clinical_significance, ascending
clinvar.query(ordering='clinical_significance')

# Prefix a field with "-" to order the results descending
clinvar.query(ordering='-clinical_significance')

# Order by multiple columns: clinical_significance descending, then gene_symbol ascending
clinvar.query(ordering=['-clinical_significance', 'gene_symbol'])
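To make the ordering syntax concrete, here is a minimal local sketch of how a leading "-" flips a field to descending order and how earlier fields take priority over later ones. It sorts a plain list of dictionaries and does not call the EDP; order_records and the sample records are hypothetical, and server-side details such as tie-breaking or handling of missing values may differ.

```python
def order_records(records, ordering):
    """Sort dicts the way an ordering list such as ['-a', 'b'] reads."""
    if isinstance(ordering, str):
        ordering = [ordering]
    ordered = list(records)
    # Apply keys from last to first; Python's sort is stable, so the
    # first key in the list ends up with the highest priority.
    for key in reversed(ordering):
        descending = key.startswith("-")
        field = key[1:] if descending else key
        ordered.sort(key=lambda record: record[field], reverse=descending)
    return ordered


records = [
    {"clinical_significance": "Benign", "gene_symbol": "MTOR"},
    {"clinical_significance": "Pathogenic", "gene_symbol": "CFTR"},
    {"clinical_significance": "Pathogenic", "gene_symbol": "BRCA2"},
]
result = order_records(records, ["-clinical_significance", "gene_symbol"])
print([(r["clinical_significance"], r["gene_symbol"]) for r in result])
```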

Saving Queries

Dataset queries can be saved and then used to make queries on datasets with a similar structure. Saved queries can be created for any dataset and can be shared with members of a user’s organization.

For example, users may save a query for a set of interesting genes and then make it available for all datasets that contain genes. If the query is shared with other users in the organization, those users can also apply it.

The Saved Queries API

To retrieve Saved Queries that apply to a dataset, or all those available:

In Python:

dataset_queries = SavedQuery.all(dataset="<DATASET_ID>")

all_saved_queries = SavedQuery.all()

To use a saved query, users can retrieve the SavedQuery object and then apply the parameters.

In Python:

saved_query = SavedQuery.retrieve("<SAVED_QUERY_ID>")

# Option 1: from the SavedQuery instance (Python only)
results = saved_query.query("<DATASET_ID>")

# Option 2: from the Dataset.query() function
results = Dataset.retrieve("<DATASET_ID>").query(**saved_query.params)
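Option 2 relies on Python keyword unpacking: each key in the saved parameters becomes a keyword argument of the query call. The mechanics can be shown with a hypothetical stand-in function named query (not the client method); the params content here is illustrative only.

```python
def query(**kwargs):
    """Hypothetical stand-in that just reports the arguments it received."""
    return dict(kwargs)


params = {"limit": 10, "ordering": "-clinical_significance"}

# These two calls pass exactly the same keyword arguments.
unpacked = query(**params)
explicit = query(limit=10, ordering="-clinical_significance")
print(unpacked == explicit)
```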

To create a SavedQuery, users can define the query parameters and provide a valid dataset, as well as give it a name and description.

In Python:

params = {
    "entities": [["gene", "MTOR"], ["gene", "BRCA2"], ["gene", "CFTR"]]
}

saved_query = SavedQuery.create(
    name="Interesting Genes",
    description="Interesting genes as defined in Pubmed article 512312",
    dataset="<DATASET_ID>",
    params=params
)

Using Saved Queries

Saved queries can be used via the EDP API or the web UI. The UI will only display queries compatible with the current dataset. This compatibility check is handled automatically by the platform.

When viewing a dataset in the web UI, previously saved queries can be retrieved by selecting “Load Filters” and then choosing one. Users can save a new query by applying filters to the dataset and then clicking “Save Filters.” For more information, users can refer to the Dataset Exploration documentation.

Querying Files

File objects can be queried and filtered on one or more fields, and the query results are returned in pages. Text files such as CSV, TXT, TSV, or BED must be uploaded with a header row: the query logic treats the first row of the file as the header, so a file without one returns incorrect results.
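The header requirement is easy to reproduce locally. This sketch uses Python's standard csv module as a stand-in for the platform's reader (an assumption; the EDP's parser is not shown in this document), and the tab-separated rows and column names are illustrative. Without a header line, the first data row is consumed as field names and disappears from the results.

```python
import csv
import io

headerless = "chr1\t100\t200\nchr2\t300\t400\n"  # illustrative BED-like rows

# With no header line, the first data row is taken as the field names.
rows = list(csv.DictReader(io.StringIO(headerless), delimiter="\t"))
print(rows)

# Adding a header row restores both records under meaningful names.
with_header = "chrom\tstart\tend\n" + headerless
fixed = list(csv.DictReader(io.StringIO(with_header), delimiter="\t"))
print(fixed[0])
```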

A basic query returns a page of results from the specified file object:

In Python:

clinvar = Object.get_by_full_path('quartzbio:Public:/ClinVar/5.2.0-20210110/ClinVar-5-2-0-20210110-Variants-GRCH37-1425664822266145048-20221110194518.json.gz')
clinvar.query()

Users can retrieve a specified number of records from the file by setting the limit query parameter:

In Python:

clinvar = Object.get_by_full_path('quartzbio:Public:/ClinVar/5.2.0-20210110/ClinVar-5-2-0-20210110-Variants-GRCH37-1425664822266145048-20221110194518.json.gz')
q = clinvar.query(limit=50)

All fields from the file can be retrieved by calling the fields method:

In Python:

fields = Object.get_by_full_path('quartzbio:Public:/ClinVar/5.2.0-20210110/ClinVar-5-2-0-20210110-Variants-GRCH37-1425664822266145048-20221110194518.json.gz').query().fields()

Users can also use the download_url() method to load files into readers such as pandas:

In Python:

from solvebio import Object
import pandas

# Get the file by ID...
f = Object.retrieve("ID")
# ...or, alternatively, by full path
f = Object.get_by_full_path("vault/path/to/file.csv")

# Get the file download URL and load it into a reader
url = f.download_url()
df = pandas.read_csv(url)

Supported File Extensions and Compressions

File querying is only supported for the following file extensions and compressions:

| File Extension | Compression |
| --- | --- |
| txt | GZIP, BZIP2 |
| csv | GZIP, BZIP2 |
| tsv | GZIP, BZIP2 |
| bed | GZIP, BZIP2 |
| json | GZIP, BZIP2 |
| parquet | GZIP |

The only supported encoding is UTF-8.

The output format of the query is set with the output_format parameter, which can be one of the following:

| Output Format | Description |
| --- | --- |
| json (default) | Applicable to all file extensions. |
| csv | Applicable only to csv, txt, tsv, or bed file extensions. |
| tsv | Applicable only to csv, txt, tsv, or bed file extensions. |
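The extension and output-format rules can be combined into a small pre-flight check before issuing a file query. This is a sketch under stated assumptions: allowed_output_formats and base_extension are hypothetical helpers, not part of the client library, and compression is recognized only through the assumed .gz and .bz2 filename suffixes without being validated per extension.

```python
QUERYABLE_EXTENSIONS = {"txt", "csv", "tsv", "bed", "json", "parquet"}
DELIMITED_EXTENSIONS = {"csv", "txt", "tsv", "bed"}
COMPRESSION_SUFFIXES = {"gz", "bz2"}  # assumed suffixes for GZIP and BZIP2


def base_extension(filename):
    """Return the extension with any trailing compression suffix removed."""
    parts = filename.lower().split(".")
    if len(parts) > 2 and parts[-1] in COMPRESSION_SUFFIXES:
        parts.pop()
    return parts[-1] if len(parts) > 1 else ""


def allowed_output_formats(filename):
    """List the output_format values the tables above permit for a file."""
    extension = base_extension(filename)
    if extension not in QUERYABLE_EXTENSIONS:
        return []  # file querying is not supported for this extension
    if extension in DELIMITED_EXTENSIONS:
        return ["json", "csv", "tsv"]
    return ["json"]


print(allowed_output_formats("variants.json.gz"))
print(allowed_output_formats("regions.bed"))
```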

Example:

In Python:

clinvar = Object.get_by_full_path('quartzbio:Public:/ClinVar/5.2.0-20210110/ClinVar-5-2-0-20210110-Variants-GRCH37-1425664822266145048-20221110194518.json.gz')
clinvar.query(output_format='json')

API Endpoints

Methods do not accept URL parameters or request bodies unless specified. Note that API requests use a separate host: if your EDP endpoint is sponsor.edp.aws.quartz.bio, send API requests to sponsor.api.edp.aws.quartz.bio.
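The host mapping in the example above can be expressed as a one-line transformation. The helper below is hypothetical and generalizes from the single example given (inserting an api label after the first label of the hostname); whether every deployment follows this pattern is an assumption, so confirm the API host for your environment.

```python
def api_host(edp_host):
    """Insert an `api` label after the first label of the EDP hostname."""
    first, separator, rest = edp_host.partition(".")
    if not separator:
        raise ValueError(f"not a dotted hostname: {edp_host!r}")
    return f"{first}.api.{rest}"


print(api_host("sponsor.edp.aws.quartz.bio"))
```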

Dataset Query

| Method | HTTP Request | Description | Authorization | Response |
| --- | --- | --- | --- | --- |
| query | POST https://<EDP_API_HOST>/v2/datasets/{DATASET_ID}/query | Query a dataset. | This request requires an authorized user with permission to query the target dataset. | The dataset query response has the structure defined below. |

Request Body:

The request body should contain valid query parameters:

| Property | Value | Description |
| --- | --- | --- |
| filters | objects | A valid filter object. |
| facets | objects | A valid facets object. |
| fields | string | A list of fields to include in the results. |
| exclude_fields | string | A list of fields to exclude from the results. |
| ordering | string | A list of fields to order results by. |
| query | string | A valid query string. |
| limit | integer | The number of results to return per page. |
| offset | integer | The record offset in the result set. |

Users can refer to the Filters documentation for more information about constructing filters.
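For users working without a client library, the request can be assembled with the Python standard library. This is a sketch rather than a reference call: the host and dataset ID are placeholders, sending the body as JSON with a list-valued ordering is an assumption based on the client examples, and authentication headers are left out because their format is not described in this section. The request is built but not sent.

```python
import json
import urllib.request

API_HOST = "sponsor.api.edp.aws.quartz.bio"  # example host from this page
DATASET_ID = "1234567890"  # hypothetical placeholder ID

body = {
    "limit": 50,
    "offset": 0,
    "ordering": ["-clinical_significance"],  # assumed to accept a list
}

request = urllib.request.Request(
    url=f"https://{API_HOST}/v2/datasets/{DATASET_ID}/query",
    data=json.dumps(body).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# Authentication is intentionally omitted here; add the credentials your
# deployment requires before passing the request to urllib.request.urlopen.
print(request.get_method(), request.full_url)
```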

The dataset query response has the following structure:

| Property | Value | Description |
| --- | --- | --- |
| dataset | string | The name of the dataset. |
| dataset_id | integer | The ID of the dataset. |
| facets | objects | Facet results (if facets are requested). |
| offset | integer | The current offset within the whole result set. |
| results | objects | A list of dataset records. |
| took | integer | Time to retrieve the records, in milliseconds. |
| total | integer | The total number of records in the result set. |
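The offset, results, and total properties are enough to decide whether another page remains. The sketch below works on a hand-written dictionary shaped like the documented response; all values are illustrative, and next_offset is a hypothetical helper rather than part of the API.

```python
response = {  # illustrative values shaped like the documented response
    "dataset": "example-dataset",
    "dataset_id": 1,
    "facets": {},
    "offset": 100,
    "results": [{"id": i} for i in range(100, 200)],
    "took": 12,
    "total": 250,
}


def next_offset(resp):
    """Return the offset for the next request, or None when done."""
    following = resp["offset"] + len(resp["results"])
    return following if following < resp["total"] else None


print(next_offset(response))
```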

Saved Queries

| Method | HTTP Request | Description | Authorization | Response |
| --- | --- | --- | --- | --- |
| get | GET https://<EDP_API_HOST>/v2/saved_queries/{ID} | Retrieve a Saved Query. | This request requires an authorized user with permission. | The response contains a SavedQuery resource. |

| Method | HTTP Request | Description | Authorization | Response |
| --- | --- | --- | --- | --- |
| create | POST https://<EDP_API_HOST>/v2/saved_queries | Create a new Saved Query. | This request requires an authorized user with appropriate permissions. | The response contains the new SavedQuery resource. |

Request Body:

| Property | Value | Description |
| --- | --- | --- |
| name | string | A short name for the Saved Query. |
| description | string | A description for the Saved Query. |
| dataset | string | The ID or full_path of a dataset to validate the query parameters against. This is needed on initial creation to ensure valid parameters. |
| params | objects | The query parameters (see the query parameters above for the query method). |
| is_shared | boolean | If True, this query will be shared with other members of your organization. |
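The properties above map directly onto a JSON request body. The following sketch assembles one with the standard library; saved_query_body is a hypothetical helper, the is_shared default of False is an assumption, and the dataset value is left as a placeholder.

```python
import json


def saved_query_body(name, description, dataset, params, is_shared=False):
    """Build the create-request body from the documented properties."""
    if not name:
        raise ValueError("a Saved Query needs a name")
    return {
        "name": name,
        "description": description,
        "dataset": dataset,
        "params": params,
        "is_shared": is_shared,
    }


body = saved_query_body(
    name="Interesting Genes",
    description="Interesting genes as defined in Pubmed article 512312",
    dataset="<DATASET_ID>",  # placeholder: the ID or full_path of a dataset
    params={"entities": [["gene", "MTOR"], ["gene", "BRCA2"], ["gene", "CFTR"]]},
    is_shared=True,
)
print(json.dumps(body, indent=2))
```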

| Method | HTTP Request | Description | Authorization | Response |
| --- | --- | --- | --- | --- |
| delete | DELETE https://<EDP_API_HOST>/v2/saved_queries/{ID} | Delete a Saved Query. | This request requires an authorized user with write permissions on the resource. | The response returns “HTTP 200 OK” when successful. |

| Method | HTTP Request | Description | Authorization | Response |
| --- | --- | --- | --- | --- |
| list | GET https://<EDP_API_HOST>/v2/saved_queries | Retrieve all Saved Queries available to a user. | This request requires an authorized user. | The response contains a list of SavedQuery resources. |
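The four Saved Query methods differ only in HTTP verb and in whether an ID appears in the path, so they can be summarized in a small lookup. This is an illustrative sketch: saved_query_request is a hypothetical helper that only builds the verb and URL, and the host and ID used here are placeholders.

```python
SAVED_QUERY_ENDPOINTS = {
    "list": ("GET", "/v2/saved_queries"),
    "create": ("POST", "/v2/saved_queries"),
    "get": ("GET", "/v2/saved_queries/{id}"),
    "delete": ("DELETE", "/v2/saved_queries/{id}"),
}


def saved_query_request(api_host, method, saved_query_id=None):
    """Return the (HTTP verb, URL) pair for a Saved Query method."""
    verb, path = SAVED_QUERY_ENDPOINTS[method]
    if "{id}" in path:
        if saved_query_id is None:
            raise ValueError(f"{method!r} requires a Saved Query ID")
        path = path.format(id=saved_query_id)
    return verb, f"https://{api_host}{path}"


print(saved_query_request("sponsor.api.edp.aws.quartz.bio", "get", "42"))
```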