Filters • quartzbio.edp

Overview

The EDP provides an easy-to-use, real-time API for querying datasets and files through the R client library, which also enable the application of complex filters on dataset and file fields. Users can also utilize Bash to query and filter datasets.

Users can filter on any field in a dataset or file. Filters can be applied using 3 syntaxes:

Math syntax
API JSON syntax
API syntax in R data structure

# Filter ClinVar dataset to pathogenic variants

clinvar <- Dataset.get_by_full_path("quartzbio:Public:/ClinVar/5.2.0-20210110/Variants-GRCH37")
filters <- '[["clinical_significance", "pathogenic"]]'
records <- Dataset_query(dataset_id = clinvar$id, filters = filters)

# Filter ClinVar file to pathogenic variants

clinvar <- Object.get_by_full_path("quartzbio:Public:/ClinVar/5.2.0-20210110/ClinVar-5-2-0-20210110-Variants-GRCH37-1425664822266145048-20221110194518.json.gz")
Object.query(id = clinvar$id, filters = '[["clinical_significance", "pathogenic"]]')


# There are 3 basic syntaxes that can be used

### generally, 3 ways
# math syntax
df1 <- Dataset_query(clinvar$id, filters = filters('clinical_significance = "pathogenic"'))

# API JSON syntax
df2 <- Dataset_query(clinvar$id, filters = '[["clinical_significance", "pathogenic"]]')

# API syntax in R data structure
df3 <- Dataset_query(clinvar$id, filters = list(list("clinical_significance", "pathogenic")))

Filters can be modified using “filter actions”, which let users adjust the condition of a filter. To specify a filter action, users can append it to the field name when building a filter:

<field>__<action> 
#For example, date__gte for filtering by dates greater or equal to the input

Action	Description
iexact (default)	Field is equal to value (case-insensitive). If the field’s value is a list, this will match values within the list, not the list as a whole.
exact	Field is an exact match to value. Useful for case-sensitive string field queries.
in	Field is “one of” a list of values. (similar to Python’s in operator).
range	Field is a number within two values. Ranges are inclusive (fully closed).
gt	Field is a number greater than value.
lt	Field is a number less than value.
gte	Field is a number greater than or equal to value.
lte	Field is a number less than or equal to value.
contains	Field contains this string value.
regexp	Field value matches this regular expression. (Note: The action is only compatible with datasets)

Some filter actions (range, gt, lt, gte, lte) may only be used on numeric and date fields.

Full-text fields use the contains filter action by default and act like a typical search would. Results are ordered by relevance based on the provided search terms. When using the contains action on string fields, the system converts the filter into a regular expression: .*{VALUE}.*, which is equivalent to the SQL expressions %{VALUE}%.

String Filters

Users may filter string fields using the exact (or case-insensitive) match, regular expression match (regexp), or prefix match (prefix). Users can also match against multiple strings at once (a boolean or) using the “in” filter. By default, filters on string fields use the “equals” match.

# Filtering dataset using JSON syntax
# Query Dataset
clinvar <- Dataset.get_by_full_path("quartzbio:Public:/ClinVar/5.2.0-20210110/Variants-GRCH37")

# Equals match
Dataset_query(dataset_id = clinvar$id, filters = '[["gene", "BRCA1"]]')

# Equals match (in list)
Dataset_query(dataset_id = clinvar$id, filters = '[["gene__in", ["BRCA1", "BRCA2"]]]')

# Regular expression match
Dataset_query(dataset_id = clinvar$id, filters = '[["gene__regexp", "BRCA[12]"]]')

# Prefix match
Dataset_query(dataset_id = clinvar$id, filters = '[["gene__prefix", "BRCA"]]')


# Filtering dataset using Math syntax
# Equals match
Dataset_query(dataset_id = clinvar$id, filters = filters('(gene = "BRCA1")'))

# Equals match (in list)
Dataset_query(dataset_id = clinvar$id, filters = filters('gene IN ("BRCA1", "BRCA2")'))

# Regular expression match
Dataset_query(dataset_id = clinvar$id, filters = filters('gene regexp "BRCA[12]"'))


# Filtering dataset using R data structure
# Equals match
Dataset_query(dataset_id = clinvar$id, filters = list(list("gene", "BRCA1")))

# Equals match (in list)
Dataset_query(dataset_id = clinvar$id, filters = list(list("gene__in", c("BRCA1", "BRCA2"))))

# Regular expression match
Dataset_query(dataset_id = clinvar$id, filters = list(list("gene__regexp", "BRCA[12]")))

# Prefix match
Dataset_query(dataset_id = clinvar$id, filters = list(list("gene__prefix", "BRCA")))

Text Filters

Long (paragraph-length) fields typically use the text data type. The “contains” filter in text fields works more like a search than a filter. Results that match the search term are brought back in the order of relevance.

# Contains match for text fields
MedLine <- Dataset.get_by_full_path("quartzbio:Public:/MEDLINE/2.3.4-2018/MEDLINE-sample")

# JSON Syntax
Dataset_query(dataset_id = MedLine$id, filters = '[["abstract__contains", "Immunological"]]')

# Math Syntax
Dataset_query(dataset_id = MedLine$id, filters = filters('(abstract CONTAINS "Immunological")'))

# R data structure syntax
Dataset_query(dataset_id = MedLine$id, filters = list(list("abstract__contains", "Immunological")))

Numeric & Date Filters

Numeric and date fields can be filtered by exact match, exact match in a list (in), half-open range match (range), and standard operators (gt, lt, gte, lte). Dates are in the format YYYY-MM-DD.

# Equals match
clinvar <- Dataset.get_by_full_path("quartzbio:Public:/ClinVar/5.2.0-20210110/Variants-GRCH37'")

# JSON syntaz
Dataset_query(dataset_id = clinvar$id, filters = '[["info.ORIGIN", 4]]')

# Math syntax
Dataset_query(dataset_id = clinvar$id, filters = filters("(info.ORIGIN = 4)"))

# R data structure syntax
Dataset_query(dataset_id = clinvar$id, filters = list(list("info.ORIGIN", 4)))

# Equals match (in list)
Dataset_query(dataset_id = clinvar$id, filters = '[["info.ORIGIN__in", [1, 2, 3]]]')

# Range query
Dataset_query(dataset_id = clinvar$id, filters = '[["info.ORIGIN__range", [1, 3]]]')

# Operator query (gt/gte/lt/lte)
Dataset_query(dataset_id = clinvar$id, filters = '[["info.ORIGIN__gt", 4]]')

Entity Filters

EDP-supported Entities can be used for filtering, without requiring the exact field name that the Entity resides in. The entity filters are only compatible with datasets.

clinvar <- Dataset.get_by_full_path("quartzbio:Public:/ClinVar/5.2.0-20210110/Variants-GRCH37'")

# Gene entity query
Dataset.query(id = clinvar$id, entities = list(c("gene", "BRCA2")))

# Variant entity query
Dataset.query(id = clinvar$id, entities = list(c("variant", "GRCH37-13-32890599-32890599-C")))

Combining Filters

The examples below show how to filter a dataset on one or two fields. In many cases, users will probably need to combine many filters into a single query.

When manually writing queries, users can combine and nest filters using boolean operators (‘and’, ‘or’, ‘not’).

clinvar <- Dataset.get_by_full_path("quartzbio:Public:/ClinVar/5.2.0-20210110/Variants-GRCH37")

# Combining filters with boolean operaters with different syntaxes

# AND
Dataset_query(dataset_id = clinvar$id, filters = '[{"and": [["gene", "BRCA1"], ["clinical_significance", "pathogenic"]]}]')

Dataset_query(dataset_id = clinvar$id, filters = filters('(gene = "BRCA1") AND (clinical_significance = "Pathogenic")'))

# OR
Dataset_query(dataset_id = clinvar$id, filters = '[{"or": [["gene", "BRCA1"], ["gene_symbol", "BRCA2"]]}]')

Dataset_query(dataset_id = clinvar$id, filters = filters('(gene = "BRCA1") OR (gene_symbol = "BRCA2")'))

# NOT
Dataset_query(dataset_id = clinvar$id, filters = '[{"not": [["gene", "BRCA1"], ["gene_symbol", "BRCA2"]]}]')

Dataset_query(dataset_id = clinvar$id, filters = filters('NOT ((gene = "BRCA1") OR (gene = "BRCA2"))'))

Query Strings

Query strings are parsed into a series of terms and operators. A query string can be provided as part of an EDP dataset query in combination with filters, or as an alternative to filters. The query strings are only compatible with datasets.

Terms in a query string can be single words - “quick” or “brown” - or a phrase surrounded by double quotes - “quick brown” - which will search for all the words in the phrase, in the same order. The query syntax is based on the Lucene query syntax.

Queries are useful to find records that best match a word or phrase, relative to others. Filters are designed to reduce the potential result set by asking yes/no questions on every record in a dataset.

Query string operators allow users to customize a search. The available options are explained below:

Field Names

By default, when no field names are specified, all string or text fields are searched for each term. Users can provide an explicit field name if they know the field in question:

status:active

Users can do an exact match on a specific field, for example:

gene:"BRCA1"

To search for one-or-more terms in a field, users can combine them with OR (default) or AND:

gene:(TTN) 
gene:(TTN OR BRCA1)
gene_family:(Olfactory AND receptors)

Users can also find records with missing fields:

_missing_:sample_id

Or, records where the field has a value (i.e. “not missing”):

_exists_:sample_id

Wildcards

Wildcard searches can be run on individual terms, using ? to replace a single character, and * to replace zero or more characters:

BRCA*

Ranges

Ranges can be specified for almost any field data type. They are most useful for dates and numeric fields. Inclusive ranges are specified with square brackets [min TO max] and exclusive ranges with curly brackets {min TO max}.

For example, this query will retrieve records for all days in 2012:

date:[2012-01-01 TO 2012-12-31]

Similarly, users can also use ranges on numeric fields:

count:[1 TO 5]

And use infinite ranges:

count:[10 TO *]

Standard numeric comparison operators can also be used:

age:>10 age:>=10 age:<10 age:<=10

To combine an upper and lower bound with the simplified syntax, users can join two clauses with an AND operator:

age:(>=10 AND <20) age:(+>=10 +<20)

As processing ranges from a query string is much slower and less reliable than using an explicit range filter, users should try range filters first.

Grouping

Multiple terms or clauses can be grouped together with parentheses to form sub-queries:

(Serine OR Cysteine) AND protease

Groups can be used to target a particular field, or to boost the result of a sub-query:

status:(active OR pending) title:(full text search)^2

Reserved characters

The following characters are reserved in query strings and must be escaped with a leading backslash when used as part of a query term:

+ - = && || > < ! ( ) { } [ ] ^ " ~ * ? : \ /

For example, to search for the string (1+1)=2, the query should be written as \(1\+1\)\=2.

Advanced Filters

Users can also compose filters and apply these filters via R, Python, or the EDP UI:

# Contains match for text fields
dataset <- Dataset.get_by_full_path("quartzbio:Public:/MEDLINE/2.3.4-2018/MEDLINE-sample")

# Include all abstracts with "diabetes"
filters <- '[
    ["abstract__contains", "diabetes"]
]'
Dataset_query(dataset_id = dataset$id, filters = filters)

# Exclude all abstracts with "diabetes"
filters <- '[{
    "not": ["abstract__contains", "diabetes"]
}]'
Dataset_query(dataset_id = dataset$id, filters = filters)

# Find abstracts without "diabetes" from 1977
filters <- '[{
    "and": [
        {"not": ["abstract__contains", "diabetes"]},
        {
            "or": [
                ["date_published__regex", "*1977*"],
                ["date_created__range", ["1977-01-01", "1977-12-31"]]
            ]
        }
    ]
}]'
Dataset_query(dataset_id = dataset$id, filters = filters)

The advanced filter syntax is composed of the following elements:

{OPERATOR} is one of AND, OR, or NOT.
{FIELD} is a documented field name in the dataset.
{ACTION} is a valid field action in the format {FIELD}__{ACTION} (see below).
{VALUE} can be a string, numeric, or list value.

By default, a {FIELD} with no attached {ACTION} implies the “case-insensitive equals” (iexact) operator. Full-text (text data type) fields automatically use the “contains” filter action instead.

String and text actions include:

Action	Description
iexact	(default for string/text) Field is equal to value (case-insensitive). If the field’s value is a list, matches a value within the list, not the list as a whole.
exact	Field is an exact match to value. Useful for longer string and text fields.
in	Field is “one of” a list of values.
contains	Field contains this string value.
regexp	Field value matches this regular expression. (Note: The action is only compatible with datasets)

Numeric and date field actions include:

Action	Description
exact	(default for numeric/date) Field is an exact match to value.
range	Field is a number within two within two values (inclusive/fully-closed).
gt	Field is a number greater than value.
lt	Field is a number less than value.
gte	Field is a number greater than or equal to value.
lte	Field is a number less than or equal to value.