Filters

Overview

The EDP provides an easy-to-use, real-time API for querying datasets and files through the Python and R client libraries, which also enable the application of complex filters on dataset and file fields. Users can also utilize Bash to query and filter datasets.

Users can filter on any field in a dataset or file:

#Filter ClinVar dataset to pathogenic variants

dataset = Dataset.get_by_full_path('quartzbio:Public:/ClinVar/5.2.0-20210110/Variants-GRCH37')
dataset.query().filter(clinical_significance='pathogenic')

#Filter ClinVar file to pathogenic variants

clinvar = Object.get_by_full_path('quartzbio:Public:/ClinVar/5.2.0-20210110/ClinVar-5-2-0-20210110-Variants-GRCH37-1425664822266145048-20221110194518.json.gz')
clinvar.query().filter(clinical_significance='pathogenic')

Filters can be modified using “filter actions”, which let users adjust the condition of a filter. To specify a filter action, users can append it to the field name when building a filter:

<field>__<action>
#For example, date__gte for filtering by dates greater or equal to the input

Action

Description

iexact (default)

Field is equal to value (case-insensitive). If the field’s value is a list, this will match values within the list, not the list as a whole.

exact

Field is an exact match to value. Useful for case-sensitive string field queries.

in

Field is “one of” a list of values. (similar to Python’s in operator).

range

Field is a number within two values. Ranges are inclusive (fully closed).

gt

Field is a number greater than value.

lt

Field is a number less than value.

gte

Field is a number greater than or equal to value.

lte

Field is a number less than or equal to value.

contains

Field contains this string value.

regexp

Field value matches this regular expression. (Note: The action is only compatible with datasets)

Some filter actions (rangegtltgtelte) may only be used on numeric and date fields.

Full-text fields use the contains filter action by default and act like a typical search would. Results are ordered by relevance based on the provided search terms. When using the contains action on string fields, the system converts the filter into a regular expression: .*{VALUE}.*, which is equivalent to the SQL expressions %{VALUE}%.

String Filters

Users may filter string fields using the exact (or case-insensitive) match, regular expression match (regexp), or prefix match (prefix). Users can also match against multiple strings at once (a boolean or) using the “in" filter. By default, filters on string fields use the “equals” match.

In Python:

#Query Dataset
q = quartzbio.Dataset.get_by_full_path('quartzbio:Public:/ClinVar/5.2.0-20210110/Variants-GRCH37').query()

# Equals match
q.filter(gene='BRCA1')

# Equals match (in list)
q.filter(gene__in=['BRCA1', 'BRCA2'])

# Regular expression match
q.filter(gene__regexp='BRCA[12]')

# Prefix match
q.filter(gene__prefix='BRCA')

Text Filters

Long (paragraph-length) fields typically use the text data type. The “contains" filter in text fields works more like a search than a filter. Results that match the search term are brought back in the order of relevance.

In Python:

q = Dataset.get_by_full_path('quartzbio:Public:/MEDLINE/2.3.4-2018/MEDLINE-sample').query()

# Contains match for text fields
q.filter(abstract__contains='diabetes')

Numeric & Date Filters

Numeric and date fields can be filtered by exact match, exact match in a list (in), half-open range match (range), and standard operators (gtltgtelte). Dates are in the format YYYY-MM-DD.

In Python:

q = Dataset.get_by_full_path('quartzbio:Public:/ClinVar/5.2.0-20210110/Variants-GRCH37').query()

# Equals match
q.filter(info.ORIGIN=4)

# Equals match (in list)
q.filter(info.ORIGIN__in=[1, 2, 3])

# Range query
q.filter(info.ORIGIN__range=[1, 3])

# Operator query (gt/gte/lt/lte)
q.filter(info.ORIGIN__gt=4)

Entity Filters

EDP-supported Entities can be used for filtering, without requiring the exact field name that the Entity resides in. The entity filters are only compatible with datasets.

In Python:

clinvar = Dataset.get_by_full_path('quartzbio:Public:/ClinVar/5.2.0-20210110/Variants-GRCH37'')

# Gene entity query
clinvar.query(entities=[['gene', 'BRCA2']])

# Variant entity query
clinvar.query(entities=[['variant', 'GRCH37-13-32890599-32890599-C']])

Genomic Coordinate Filters

A dataset’s genomic build is indicated by the suffix of the dataset’s full_path. The genomic coordinate filters are only compatible with datasets.

In Python:

# GRCh37
q = Dataset.get_by_full_path('quartzbio:Public:/ClinVar/5.2.0-20210110/Variants-GRCH37'').query()

# GRCh38
q = Dataset.get_by_full_path('quartzbio:Public:/ClinVar/5.2.0-20210110/Variants-GRCH38').query()

# Position (all overlapping features) - these two queries are equivalent
q.position('chr11', 18313400)
q.position(chromosome='11', position=18313400)

# Exact position only
q.position(chromosome='11', position=17552955, exact=True)

# Range (all overlapping features) - these two queries are equivalent
q.range('chr11', 18313300, 18315000)
q.range(chromosome='11', start=18313300, stop=18315000)

# Exact range only
q.range(chromosome='11', start=18313399, stop=18313403, exact=True)

Combining Filters

The examples below show how to filter a dataset on one or two fields. In many cases, users will probably need to combine many filters into a single query.

When manually writing queries in JSON, users can combine and nest filters using boolean operators (‘and’, ‘or’, ‘not’). In the Python client, users can combine filters using the Filter and GenomicFilter classes (”&” for “and”, “|” for “or”, and “~” for “not”).

In Python:

q = Dataset.get_by_full_path('quartzbio:Public:/ClinVar/5.2.0-20210110/Variants-GRCH37').query()

# AND
f = quartzbio.Filter(gene='BRCA1') & quartzbio.Filter(clinical_significance='pathogenic')
q.filter(f)

# OR
f = quartzbio.Filter(gene='BRCA1') | quartzbio.Filter(gene='BRCA2')
q.filter(f)

# NOT
f = ~ quartzbio.Filter(gene='BRCA1')
q.filter(f)

Query Strings

Query strings are parsed into a series of terms and operators. A query string can be provided as part of an EDP dataset query in combination with filters, or as an alternative to filters.  The query strings are only compatible with datasets.

Terms in a query string can be single words - “quick” or “brown” - or a phrase surrounded by double quotes - “quick brown” - which will search for all the words in the phrase, in the same order. The query syntax is based on the Lucene query syntax.

Queries are useful to find records that best match a word or phrase, relative to others. Filters are designed to reduce the potential result set by asking yes/no questions on every record in a dataset.

Query string operators allow users to customize a search. The available options are explained below:

Field Names

By default, when no field names are specified, all string or text fields are searched for each term. Users can provide an explicit field name if they know the field in question:

status:active

Users can do an exact match on a specific field, for example:

gene:"BRCA1"

To search for one-or-more terms in a field, users can combine them with OR (default) or AND:

gene:(TTN)
gene:(TTN OR BRCA1)
gene_family:(Olfactory AND receptors)

Users can also find records with missing fields:

_missing_:sample_id

Or, records where the field has a value (i.e. “not missing”):

_exists_:sample_id

Wildcards

Wildcard searches can be run on individual terms, using ? to replace a single character, and * to replace zero or more characters:

BRCA*

Ranges

Ranges can be specified for almost any field data type. They are most useful for dates and numeric fields. Inclusive ranges are specified with square brackets [min TO max] and exclusive ranges with curly brackets {min TO max}.

For example, this query will retrieve records for all days in 2012:

date:[2012-01-01 TO 2012-12-31]

Similarly, users can also use ranges on numeric fields:

count:[1 TO 5]

And use infinite ranges:

count:[10 TO *]

Standard numeric comparison operators can also be used:

age:>10 age:>=10 age:<10 age:<=10

To combine an upper and lower bound with the simplified syntax, users can join two clauses with an AND operator:

age:(>=10 AND <20) age:(+>=10 +<20)

As processing ranges from a query string is much slower and less reliable than using an explicit range filter, users should try range filters first.

Grouping

Multiple terms or clauses can be grouped together with parentheses to form sub-queries:

(Serine OR Cysteine) AND protease

Groups can be used to target a particular field, or to boost the result of a sub-query:

status:(active OR pending) title:(full text search)^2

Reserved characters

The following characters are reserved in query strings and must be escaped with a leading backslash when used as part of a query term:

+ - = && || > < ! ( ) { } [ ] ^ " ~ * ? : \ /

For example, to search for the string (1+1)=2, the query should be written as \(1\+1\)\=2.

Advanced Filters

Users can also compose filters in JSON and apply these filters via R, Python, or the EDP UI:

In Python:

dataset = Dataset.get_by_full_path('quartzbio:Public:/MEDLINE/2.3.4-2018/MEDLINE-sample')

# Include all abstracts with "diabetes"
filters = [
    ["abstract__contains", "diabetes"]
]
dataset.query(filters=filters)

# Exclude all abstracts with "diabetes"
filters = [{
    "not": ["abstract__contains", "diabetes"]
}]
dataset.query(filters=filters)

# Find abstracts without "diabetes" from 1977
filters = [{
    "and": [
        {"not": ["abstract__contains", "diabetes"]},
        {
            "or": [
                ["date_published__regex", "*1977*"],
                ["date_created__range", ["1977-01-01", "1977-12-31"]]
            ]
        }
    ]
}]
dataset.query(filters=filters)

The advanced filter syntax is composed of the following elements:

  • {OPERATOR} is one of ANDOR, or NOT.

  • {FIELD} is a documented field name in the dataset.

  • {ACTION} is a valid field action in the format {FIELD}__{ACTION} (see below).

  • {VALUE} can be a string, numeric, or list value.

By default, a {FIELD} with no attached {ACTION} implies the “case-insensitive equals” (iexact) operator. Full-text (text data type) fields automatically use the “contains" filter action instead.

String and text actions include:

Action

Description

iexact

(default for string/text) Field is equal to value (case-insensitive). If the field’s value is a list, matches a value within the list, not the list as a whole.

exact

Field is an exact match to value. Useful for longer string and text fields.

in

Field is “one of” a list of values.

contains

Field contains this string value.

regexp

Field value matches this regular expression. (Note: The action is only compatible with datasets)

Numeric and date field actions include:

Action

Description

exact

(default for numeric/date) Field is an exact match to value.

range

Field is a number within two values (inclusive/fully-closed).

gt

Field is a number greater than value.

lt

Field is a number less than value.

gte

Field is a number greater than or equal to value.

lte

Field is a number less than or equal to value.