Querying Datasets and Files¶
Overview¶
The EDP is designed for easy access to molecular information. It provides an easy-to-use, real-time API for querying any dataset or file on the platform through the EDP Python or R client libraries. Users can also use Bash to query datasets. Users can also apply complex filters when querying datasets and files; to learn more about using filters, users can refer to the Filters documentation.
Querying Datasets¶
Dataset query results are returned in pages, similar to a search engine. To narrow down search results, datasets can be filtered on one or more fields. Users can either build queries using a programming language (or even write raw JSON) or by building them directly on any dataset page in the EDP web application. The easiest way to query datasets is by using the EDP Python or R client libraries.
A basic query returns a page of results from the specified public dataset. Users can set the paginate parameter to True to retrieve all records or use the limit parameter to specify how many records to retrieve. Users should note that in the R client, the limit parameter allows users to retrieve a maximum of 10,000 records in a single request. Additionally, the query function accepts the following parameters:
Parameter |
Value |
Description |
---|---|---|
filters |
objects |
A valid filter object. |
facets |
objects |
A valid facets object |
fields |
string |
A list of fields to include in the results. |
exclude_fields |
string |
A list of fields to exclude in the results. |
ordering |
string |
A list of fields to order results by. |
query |
string |
A valid query string. |
limit |
integer |
The number of results to return per-page. |
offset |
integer |
The record offset in the result-set. |
paginate |
boolean |
If True, returns all records. Default is False. |
page_size |
integer |
The internal batch size per request. Default is 100, with a maximum size of 10,000. Increasing the page_size can increase the speed of the query, but large numbers can in some cases cause requests to fail due to the large amount of data coming out in a single response. |
output_format |
string |
The output format of the query (‘csv’, ‘tsv’, or ‘json’). Default is ‘json’. |
In Python:
# Users can set how many records they want to retrieve with the "limit" parameter
Dataset.get_by_full_path('quartzbio:Public:/ClinVar/5.2.0-20210110/Variants-GRCH37').query(limit=1000)
# Users can order query results using the ordering argument
# Order the query results by clinical_significance ascending
Dataset.get_by_full_path('quartzbio:Public:/ClinVar/5.2.0-20210110/Variants-GRCH37').query(ordering='clinical_significance')
# Order the query results by clinical_significance descending
Dataset.get_by_full_path('quartzbio:Public:/ClinVar/5.2.0-20210110/Variants-GRCH37').query(ordering='-clinical_significance')
# Query results can be ordered by multiple columns
# Order the query results by clinical_significance descending and gene_symbol ascending
Dataset.get_by_full_path('quartzbio:Public:/ClinVar/5.2.0-20210110/Variants-GRCH37').query(ordering=['-clinical_significance', 'gene'])
Saving Queries¶
Dataset queries can be saved and then used to make queries on datasets with a similar structure. Saved queries can be created for any dataset and can be shared with members of a user’s organization.
For example, users may save a query for a set of interesting genes. They can then make this query available for all datasets that contain genes. If shared with other users in the organization, they will also be able to apply this query.
The Saved Queries API
To retrieve Saved Queries that apply to a dataset, or all those available:
In Python:
dataset_queries = SavedQuery.all(dataset="<DATASET_ID>")
all_saved_queries = SavedQuery.all()
To use a saved query, users can retrieve the SavedQuery object and then apply the parameters.
In Python:
saved_query = SavedQuery.retrieve("SAVED_QUERY_ID")
# Option 1: from the SavedQuery instance (Python only)
results = saved_query.query("<DATASET_ID>")
# Option 2: from the Dataset.query() function
results = Dataset.retrieve("<DATASET_ID").query(**saved_query.params)
To create a SavedQuery, users can define the query parameters and provide a valid dataset, as well as give it a name and description.
In Python:
params = {
"entities": [["gene", "MTOR"], ["gene", "BRCA2"], ["gene", "CFTR"]]
}
saved_query = SavedQuery.create(
name="Interesting Genes",
description="Interesting genes as defined in Pubmed article 512312"
dataset="<DATASET_ID>",
params=params
)
Using Saved Queries¶
Saved queries can be used via the EDP API or the web UI. The UI will only display queries compatible with the current dataset. This compatibility check is handled automatically by the platform.
When viewing a dataset in the web UI, previously saved queries can be retrieved by selecting “Load Filters” and then selecting one. Users can save a new query by applying filters to the dataset and then by clicking “Save Filters.” For more information, users can refer to the Dataset Exploration documentation.
Querying Files¶
File objects can be queried and filtered on one or more fields. The query results are returned in pages. It is important to note that text files such as CSV, TXT, TSV, or BED must be uploaded with headers; otherwise, the query will return incorrect results because the query logic considers the first row from the file as the header.
A basic query returns a page of results from the specified file object:
In Python:
clinvar = Object.get_by_full_path('quartzbio:Public:/ClinVar/5.2.0-20210110/ClinVar-5-2-0-20210110-Variants-GRCH37-1425664822266145048-20221110194518.json.gz')
clinvar.query()
Users can retrieve a specified number of records from the file by setting the limit query parameter:
In Python:
clinvar = Object.get_by_full_path('quartzbio:Public:/ClinVar/5.2.0-20210110/ClinVar-5-2-0-20210110-Variants-GRCH37-1425664822266145048-20221110194518.json.gz')
q = clinvar.query(limit=50)
All fields from the file can be retrieved by calling the fields method:
In Python:
fields = Object.get_by_full_path('quartzbio:Public:/ClinVar/5.2.0-20210110/ClinVar-5-2-0-20210110-Variants-GRCH37-1425664822266145048-20221110194518.json.gz').query().fields()
Users can also use the download_url() method to load files into readers such as pandas:
In Python:
from solvebio import *
import pandas
# Get file using ID or full path
f = Object.retrieve("ID")
f = Object.get_by_full_path("vault/path/to/file.csv")
# Get file download URL and load into reader
url = f.download_url()
pandas.read_csv(url)
Supported File Extensions and Compressions¶
File querying is only supported for the following file extensions and compressions:
File Extensions |
Compression |
---|---|
txt |
GZIP, BZIP2 |
csv |
GZIP, BZIP2 |
tsv |
GZIP, BZIP2 |
bed |
GZIP, BZIP2 |
json |
GZIP, BZIP2 |
parquet |
GZIP |
The only supported encoding is UTF-8.
The output format of the query can be provided by using the output_format parameter which can be one of the following:
Output Format |
Description |
---|---|
json (default) |
applicable to all file extensions |
csv |
applicable only to csv, txt, tsv, or bed file extensions |
tsv |
applicable only to csv, txt, tsv, or bed file extensions |
Example:
In Python:
clinvar = Object.get_by_full_path('quartzbio:Public:/ClinVar/5.2.0-20210110/ClinVar-5-2-0-20210110-Variants-GRCH37-1425664822266145048-20221110194518.json.gz')
clinvar.query(output_format='json')
API Endpoints¶
Methods do not accept URL parameters or request bodies unless specified. Please note that if your EDP endpoint is sponsor.edp.aws.quartz.bio, you would use sponsor.api.edp.aws.quartz.bio.
Dataset Query
Method |
HTTP Request |
Description |
Authorization |
Response |
---|---|---|---|---|
query |
POST /v2/datasets/{DATASET_ID}/query |
Query a dataset. |
This request requires an authorized user with permission to query the target dataset. |
The dataset query response has a structure defined below. |
Request Body:
The request body should contain valid query parameters:
Property |
Value |
Description |
---|---|---|
filters |
objects |
A valid filter object. |
facets |
objects |
A valid facets object |
fields |
string |
A list of fields to include in the results. |
exclude_fields |
string |
A list of fields to exclude in the results. |
ordering |
string |
A list of fields to order results by. |
query |
string |
A valid query string. |
limit |
integer |
The number of results to return per-page. |
offset |
integer |
The record offset in the result-set. |
Users can refer to the Filters documentation for more information about constructing filters.
The dataset query response has the following structure:
Property |
Value |
Description |
---|---|---|
dataset |
string |
The name of the dataset. |
dataset_id |
integer |
The ID of the dataset. |
facets |
objects |
Facet results (if a facets are requested). |
offset |
integer |
The current offset within the whole result-set. |
results |
objects |
A list of dataset records. |
took |
integer |
Time to retrieve the records, in miliseconds. |
total |
integer |
The total number of records in the result-set. |
Saved Queries
Method |
HTTP Request |
Description |
Authorization |
Response |
---|---|---|---|---|
get |
GET https://<EDP_API_HOST>/v2/saved_queries/{ID} |
Retrieve a Saved Query. |
This request requires an authorized user with permission. |
The response contains a SavedQuery resource. |
Method |
HTTP Request |
Description |
Authorization |
Response |
---|---|---|---|---|
create |
POST https://<EDP_API_HOST>/v2/saved_queries |
Create a new Saved Query. |
This request requires an authorized user with appropriate permissions. |
The response contains the new SavedQuery resource. |
Request Body:
Property |
Value |
Description |
---|---|---|
name |
string |
A short name for the Saved Query. |
description |
string |
A description for the Saved Query. |
dataset |
string |
The ID or full_path of a dataset to validate this query parameters against. This is needed on initial creation to ensure valid parameters. |
params |
objects |
The query parameters (see query parameters above for query method). |
is_shared |
boolean |
If True, this query will be shared with other members of you organization |
Method |
HTTP Request |
Description |
Authorization |
Response |
---|---|---|---|---|
delete |
DELETE https://<EDP_API_HOST>/v2/saved_queries/{ID} |
Delete a Saved Query. |
This request requires an authorized user with write permissions on the resource. |
The response returns “HTTP 200 OK” when successful. |
Method |
HTTP Request |
Description |
Authorization |
Response |
---|---|---|---|---|
list |
GET https://<EDP_API_HOST>/v2/saved_queries |
Retrieves all Saved Queries available to a user. |
This request requires an authorized user. |
The response contains a list of SavedQuery resources. |