Formalization of Genome Interval Relations

In order to take full advantage of next-generation genomics data, I need informatics methods to be based on agreed-upon, formally specified standards that can be implemented easily, uniformly and without ambiguity. These standards should be encoded as logical formulae, so that provably correct and efficient decision procedures can be used for query answering and validation. In this paper I present the core of such a standard for sequence data: a collection of definitions of relations that hold between genomic intervals, and an algebra for performing operations upon these intervals. I show how these relations can be used to extend and formalize concepts in the Sequence Ontology (SO).


Introduction

Genome Databases and the need for inference
The genome of every organism is organized as a collection of sequentially ordered nucleotide bases, with distinct regions of the sequence comprising a variety of structural and functional elements: genes, exons, regulatory regions, untranslated regions and so on. A precise understanding of these elements is vital for the life sciences and clinical applications, yet there is no formally defined terminology for describing the different relations that can hold at the primary structure level. The lack of any such standard is a hindrance to interoperability between computer systems, which could have serious ramifications as genomic databases continue to grow in size and importance.
To illustrate the problem, consider basic biological questions such as "how many distinct introns are there in the human genome?", "what is the proportion of size of intergenic to genic regions across all sequenced eukaryotes?" and "what is the ratio of SNPs in coding regions vs those in UTRs?". These are reasonable questions that turn out to be difficult to answer without recourse to the suboptimal solution of writing programs to obtain the answers. One reason existing databases have problems answering these questions is that they are inconsistent in what information is represented explicitly and what information must be derived algorithmically. A collection of formal genome interval relations would be of use not only for precisely specifying region-based queries, but also as the basis for computable definitions of the inference rules required to obtain a set of introns given a set of exons, or to obtain UTRs and coding sequence from transcript and start and stop codon information.
These relations should extend existing work on defining relations in biology [1], in which a formal or semi-formal approach is adopted, with the definitions of relations specified as unambiguously as possible. One way to eliminate ambiguity is to specify relations using First-Order Logic (FOL) axioms. This allows the use of theorem provers to determine the consequences of the statements.

Sequence Ontology
The Sequence Ontology (SO) [2] was originally conceived of as a structured controlled vocabulary for genome databases and exchange formats. The SO can also be treated as a collection of axioms stating truth-conditions for relations between instances of biological features such as genes, transcripts and exons. The translation between SO relationships and FOL axioms is shown in table 1.
We can translate each feature type T in SO to a unary predicate such that T(x) is true whenever x is an instance of T. Thus exon(x) holds for all values of x where x is an instance of an exon. Here we treat SO as a representation of what the Basic Formal Ontology calls generically dependent continuants [3]. This means that a single instance of a SO type can have multiple molecular bearers: an individual human consists in part of trillions of chromosome molecule instances which (excluding somatic variation) in toto bear a single genome instance. It is these instances that form the domain of discourse of genome databases, rather than the individual chromosome molecules. We can derive fact axioms such as exon(ENSE00001545001) from rows in a relational database such as Ensembl [4] or Chado [5].
The SO "is a" hierarchy is translated such that "T is a U" becomes T(x) → U(x). For example, mRNA(x) → transcript(x) (every mRNA is a transcript). All-some relationships such as part of, specified using the methodology of the Relations Ontology [6], can be translated to quantified statements over individuals, such that for example the SO type-level relationship "TSS part of transcript" becomes TSS(x) → ∃y : transcript(y) ∧ part of(x, y). SO also includes relation axioms, such as transitivity of the part of relation.
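The two translation rules just described can be sketched as simple string templates. This is only an illustration: the function names `is_a_axiom` and `all_some_axiom` are hypothetical and do not correspond to any SO tooling.

```python
def is_a_axiom(sub: str, sup: str) -> str:
    """Translate 'T is a U' into the FOL axiom T(x) → U(x)."""
    return f"{sub}(x) → {sup}(x)"


def all_some_axiom(sub: str, rel: str, obj: str) -> str:
    """Translate the all-some relationship 'T rel U' into
    T(x) → ∃y : U(y) ∧ rel(x, y)."""
    return f"{sub}(x) → ∃y : {obj}(y) ∧ {rel}(x, y)"


# The two examples used in the text.
print(is_a_axiom("mRNA", "transcript"))
print(all_some_axiom("TSS", "part of", "transcript"))
```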
Compositional terms in SO generally have logical definitions, stating necessary and sufficient conditions expressed in simple genus-differentia form, i.e. an X is a G that D. For example, transposable element gene is a gene that is part of a transposable element. This translates to the definitional axiom:
transposable element gene(x) ↔ gene(x) ∧ (∃y : transposable element(y) ∧ part of(x, y))
The convention of using italics for type-level relations and bold for instance-level relations is taken from [6]. The rules for translating to instances from a genome database are not specified.
These logical definitions can be used by reasoning engines to automatically classify the ontology (i.e. infer is a relations). They can also be translated to relational database queries to find instances of implicit feature types.
SO is currently lacking computable logical definitions for many terms such as UTR, region, five prime UTR, CDS, intron and five prime coding exon. These definitions are currently specified in natural language, which means they must be translated by humans if they are to be used algorithmically (e.g. to query a database for all implicit intron or five prime UTR features by performing arithmetic operations on the ranges of stated features). SO is also lacking axioms constraining the genomic positioning of related features, for example that each exon must lie within the region of the gene of which it is a part, or that the TSS of a transcript must be upstream of that transcript.
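As an illustration of the kind of range arithmetic mentioned above, the following sketch derives implicit introns from the stated exons of a single transcript. It assumes interbase (start, end) coordinates on one strand and that all exons belong to the same transcript; the function name `infer_introns` is hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), interbase coordinates, start < end


def infer_introns(exons: List[Interval]) -> List[Interval]:
    """Return the gaps between consecutive exons of one transcript."""
    ordered = sorted(exons)
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        # Only a strictly positive gap is reported as an intron.
        if prev_end < next_start:
            introns.append((prev_end, next_start))
    return introns


exons = [(300, 450), (100, 200), (600, 700)]
print(infer_introns(exons))
```

In this coordinate convention each inferred intron shares its start junction with the end of the preceding exon, so no off-by-one adjustment is needed.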
Ideally we would like to add these kinds of logical definitions and axioms to SO, but we first need to extend the set of relations used in SO. We can build upon existing relations designed to support qualitative spatial and temporal reasoning, namely the Region Connection Calculus and the Allen Interval Algebra.

Region Connection Calculus (RCC-8)
The region connection calculus (RCC) is a system for qualitative spatial representation and reasoning. RCC abstractly describes regions (in Euclidean space, or in a topological space) by their possible relations to each other. RCC-8 consists of 8 basic relations that can hold between two regions: disconnected (DC), externally connected (EC), equal (EQ), partially overlapping (PO), tangential proper part (TPP), tangential proper part inverse (TPPi), non-tangential proper part (NTPP) and non-tangential proper part inverse (NTPPi). See figure 1.
We can compose these relations using logical operators. For example, PP = TPP ∪ NTPP. For qualitative reasoning about the relations between regions, there is a composition table [7]. Given the relation R1 between x and y and the relation R2 between y and z, the composition table allows us to determine the possible relations between x and z.
RCC-8 is commonly used in geographical information systems. It could also be of use in reasoning about biological or biochemical entities in any number of dimensions. One consideration is that RCC-8 operates over continuous regions, rather than discrete units such as nucleotides.

Allen's Interval Algebra
Allen's Interval Algebra (AIA) [8] defines possible relations between time intervals, and operations on these intervals, that can be used as a basis for qualitative reasoning about temporal descriptions of events.
Composing relations together gives a total of 8192 possible relations (2^13, one for each subset of the 13 basic relations). The composition operators are intersection (∩), union (∪) and complementation (¬). For example, the union p ∪ pi holds whenever two time intervals have no time point in common.
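A minimal sketch of these ideas follows: a function that classifies two intervals into one of the 13 basic Allen relations, with composite relations represented as sets of basic relations. The mnemonics p, pi, m and mi follow the text; the remaining mnemonics and all function names are assumptions made for illustration.

```python
from typing import FrozenSet, Tuple

Interval = Tuple[float, float]  # (start, end) with start < end


def allen_relation(a: Interval, b: Interval) -> str:
    """Return the single basic Allen relation that holds from a to b."""
    (a1, a2), (b1, b2) = a, b
    if a2 < b1:
        return "p"    # a precedes b
    if b2 < a1:
        return "pi"   # a is preceded by b
    if a2 == b1:
        return "m"    # a meets b
    if b2 == a1:
        return "mi"   # a is met by b
    if a1 == b1 and a2 == b2:
        return "e"    # a equals b
    if a1 == b1:
        return "s" if a2 < b2 else "si"   # starts / started by
    if a2 == b2:
        return "f" if a1 > b1 else "fi"   # finishes / finished by
    if b1 < a1 and a2 < b2:
        return "d"    # a during b
    if a1 < b1 and b2 < a2:
        return "di"   # a contains b
    return "o" if a1 < b1 else "oi"       # overlaps / overlapped by


BASIC = frozenset(
    {"p", "pi", "m", "mi", "o", "oi", "s", "si", "d", "di", "f", "fi", "e"}
)


def holds(relation: FrozenSet[str], a: Interval, b: Interval) -> bool:
    """A composite relation holds if the basic relation is one of its members."""
    return allen_relation(a, b) in relation


# The union p ∪ pi from the text; union and intersection are set operations,
# and complementation is BASIC - relation.
NO_COMMON_POINT = frozenset({"p"}) | frozenset({"pi"})

print(len(BASIC), 2 ** len(BASIC))
print(holds(NO_COMMON_POINT, (0, 2), (5, 9)))
print(holds(NO_COMMON_POINT, (0, 2), (2, 9)))
```

Because exactly one basic relation holds between any two intervals, every composite relation is simply a subset of `BASIC`, which is where the count of 2^13 comes from.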
Satisfiability is NP-complete for the AIA, i.e. given a collection of intervals and the relations that hold between them, we cannot in general efficiently decide whether there are time values for which the relations are true. However, there are tractable sub-algebras for which efficient decision procedures exist [9].
Whilst the AIA is generally described as consisting of temporal intervals, it can be applied to any kind of interval, including genomic intervals.One consideration is that like RCC-8, the AIA assumes continuous intervals, rather than discrete intervals.

An Algebra of Genomic Intervals
We base our Genomic Interval Algebra (GIA) on the Allen Interval Algebra (AIA), and extend it with additional relations required for reasoning about DNA sequences. AIA is better suited than RCC-8 as a basis because genome intervals, like temporal intervals, are directional. We change some of the terminology of Allen (e.g. using "upstream" and "downstream" instead of "precedes" and "precededBy"), and take some terms from RCC-8 (e.g. "adjacent"), but use Allen-based definitions.
When considering the biological meaning of these relations, it is important to stress that they hold between sequences or sequence intervals (i.e. primary structures) and not the molecules that are the bearers of those sequences. For example, an RNA molecule or intron may exhibit connectedness/adjacency between bases at the secondary structure level. Similarly, a transcription factor protein may exhibit binding between its amino acid chain and the DNA sequence upstream of a gene. We consider both these cases to be non-adjacent and disconnected at the sequence level.
The core of the GIA consists of 16 basic relations, R16, that can hold between any two intervals on the same strand of a sequence, defined in terms of Allen relations. Whilst the AIA treats intervals as primitives, we also provide definitions in terms of junctions (the equivalent being time-points in a temporal calculus), yielding a junction calculus. We then extend the core set of relations to account for strandedness, deriving an additional 32 relations.
The 16 core interval relations are shown in table 2. Some relations are defined in terms of other relations using relation-intersection and relation-union operators. These are defined as follows: intersection: (r1 ∩ r2)(x, y) ↔ r1(x, y) ∧ r2(x, y); union: (r1 ∪ r2)(x, y) ↔ r1(x, y) ∨ r2(x, y).
Note that we do not take the same set of primitives as Allen; we choose our set based on utility within genome databases and in the SO. This means that there is not an isomorphic correspondence between the GIA and Allen. We provide Allen-based definitions using relation intersection and union. Unlike Allen, our set is not pairwise disjoint. For example, adjacent to = upstream adjacent to ∪ downstream adjacent to. We also make different terminological choices from the AIA. For example, we use overlaps in a more general sense, in accord with how this term is typically used in bioinformatics, and how it is defined in the Relation Ontology.

Junction-based definitions
We also provide definitions for all interval relations in terms of point-positions or junctions. We define a proper junction as a discrete point connecting two nucleotide bases. Junctions are the union of proper junctions and the outermost boundary points of a sequence (this inclusive definition of junction simplifies the axioms).
We define two functions α and ω, each of which maps an interval to a point (junction), corresponding to the start and end of the interval. We introduce a relation succ which holds between two junctions separated by a base, such that the first is at the 5' end and the second is at the 3' end. We overload the symbol < as the transitive closure of this relation: x < y ↔ succ(x, y) ∨ ∃z : (succ(x, z) ∧ z < y). We use > as the inverse of this relation and define <= as the union of < and =, and >= as the union of > and =. In the Sequence Ontology we give these relations full names, such as before.
For any interval, the start is before the end. Formally: α(x) < ω(x). Both < and > are irreflexive, and for any non-circular genome they are anti-symmetric, i.e. ¬∃x, y : x < y ∧ y < x.
Table 3 gives the definition of interval relations in terms of junctions. We can eliminate the function terms from the definitions by translating to composition rules, such that for example:
ω(x) <= α(y) ↔ upstream of(x, y)
is translated to:
has end(x, x_s) ∧ before or on(x_s, y_e) ∧ start of(y_e, y) ↔ upstream of(x, y)
We can eliminate the variables using an equivalence assertion and a relation chain:
has end • before or on • start of ↔ upstream of
Here has start and has end are functional relations between a region and a junction, with inverses start of and end of.
These additional relations give us an algebra over junctions and relations which we call GIA_J, which can be used with qualitative reasoning systems. Note the correspondence of succ to the function +1, and of < and > to their arithmetic counterparts; this correspondence allows the use of either arithmetic operations or qualitative reasoning.
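The arithmetic reading of the junction calculus can be sketched as follows, with junctions as integers, succ as +1 and α and ω as the `start` and `end` fields of an interval. The definition of upstream of is the one given in the text (ω(x) <= α(y)); the definitions of `downstream_of`, `upstream_adjacent_to` and `adjacent_to` are assumptions, since Table 3 is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    start: int  # α(x): junction at the 5' end
    end: int    # ω(x): junction at the 3' end

    def __post_init__(self) -> None:
        # Axiom from the text: for any interval, the start is before the end.
        if not self.start < self.end:
            raise ValueError("the start of an interval must be before its end")


def succ(j1: int, j2: int) -> bool:
    """Two junctions separated by exactly one base."""
    return j2 == j1 + 1


def upstream_of(x: Interval, y: Interval) -> bool:
    return x.end <= y.start  # ω(x) <= α(y), as in the text


def downstream_of(x: Interval, y: Interval) -> bool:
    return upstream_of(y, x)  # assumed: inverse of upstream of


def upstream_adjacent_to(x: Interval, y: Interval) -> bool:
    return x.end == y.start  # assumed: corresponds to Allen's "meets"


def adjacent_to(x: Interval, y: Interval) -> bool:
    return upstream_adjacent_to(x, y) or upstream_adjacent_to(y, x)


exon = Interval(100, 200)
intron = Interval(200, 300)
print(upstream_of(exon, intron), adjacent_to(intron, exon))
```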

Derived Reverse-Complement Relations
The major difference between temporal and genomic intervals is that DNA is stranded.The GIA must therefore be an extension of the AIA to fully account for strandedness.
We treat a double-stranded DNA molecule as bearing two distinct sequence intervals s+ and s−, one per strand, related by the RC relation. Each junction j_a^+ on s+ has a unique cognate junction j_a^− on s− related via RC, such that upstream and downstream are reversed: if j_a^+ < j_b^+ then j_b^− < j_a^−. Further axioms can be added to state the relationship between base types on opposing strands (not shown here).
For any genome interval relation r ∈ R16, we can define a reverse-complement cognate, r^R. We obtain definitions for these automatically using the formula: r^R(x, y) ↔ ∃z : RC(y, z) ∧ r(x, z). In table 4 we show only one RC relation, upstream overlaps^R. This is equivalent to upstream overlaps with RC applied to the second argument. Note that unlike upstream overlaps, this is a symmetric relation. Conversely, nt contained by^R (not shown) is the inverse of nt contains^R. See figure 2 for an illustration of why this is the case.
For any r ∈ R16, we can further define a relation r^U = r ∪ r^R. Table 4 illustrates this with upstream overlaps^R ∪ upstream overlaps, which we call upstream overlaps^U. These union relations correspond to common use cases (e.g. Region-of-Interest queries in Genome Browsers [10][11][12]), so we should have intuitive names for them. However, from the point of view of axiomatisation, it is simpler to treat these as derived rather than basic relations.
We have 16 relations in the core set (including inverse relations for non-symmetric relations). We declare relations for this core set, plus their RC equivalents, plus the union set. This gives us 48 relations in total (16 × 3). This may seem excessive, but as we will see we need most of them for our use cases.
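The derivation of RC and union relations from a core relation can be sketched with higher-order functions. This sketch makes several assumptions that are not taken from the paper: junctions on each strand are numbered 0..SEQ_LEN in that strand's own 5' to 3' direction, so the cognate of junction j is SEQ_LEN − j; and upstream overlaps is taken to be Allen's "overlaps". All names are hypothetical.

```python
from typing import Callable, Tuple

Interval = Tuple[int, int]  # (start, end) junctions on one strand
Relation = Callable[[Interval, Interval], bool]

SEQ_LEN = 1000  # hypothetical sequence length


def rc(iv: Interval, seq_len: int = SEQ_LEN) -> Interval:
    """Project an interval onto the opposite strand (start and end swap roles)."""
    start, end = iv
    return (seq_len - end, seq_len - start)


def rc_cognate(r: Relation) -> Relation:
    """r^R: the relation r with RC applied to the second argument."""
    return lambda x, y: r(x, rc(y))


def union(r1: Relation, r2: Relation) -> Relation:
    """Relation union: holds if either relation holds."""
    return lambda x, y: r1(x, y) or r2(x, y)


def upstream_overlaps(x: Interval, y: Interval) -> bool:
    # Assumed definition: x starts before y and ends inside y.
    return x[0] < y[0] < x[1] < y[1]


upstream_overlaps_R = rc_cognate(upstream_overlaps)
upstream_overlaps_U = union(upstream_overlaps, upstream_overlaps_R)

x = (100, 300)  # on s+
y = (600, 800)  # on s-, projects to (200, 400) on s+
print(upstream_overlaps_R(x, y), upstream_overlaps_R(y, x))
```

Under these assumptions the RC cognate comes out symmetric for this pair, in line with the remark in the text, while the plain relation does not.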

Operations over collections of intervals
Many genomic features correspond to collections of intervals on a sequence, for example the coding part of a multi-exon gene. Neither RCC-8 nor AIA defines operations over collections of regions or intervals. We propose a simple extension in GIA for dealing with such collections.
We overload the relations that are used for single intervals, but provide distinct definitions. We also define new relations that are only applicable to collections. For example, in figure 2 genes A and B stand in an o^R (overlaps) relation, even though their respective exons share no bases. However, we can introduce a new relation interleaves^R, defined such that the exons in gene A interleave (on the opposite strand) the exons in gene B. Note that it is important to distinguish between the sets of exons interleaving and the genes overlapping.
Table 4 (columns: Relation, Definition, Inverse; recoverable row: uo, upstream overlaps): Example of reverse-complementation cognate relations for the upstream overlaps relation. Note the interaction between RC and inverse relations: in particular the inverse of uo^R does not correspond to a single named relation.
We present here a subset of the full set of relations, which have yet to be finalized.We treat each collection as a set of intervals.
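Overlaps must hold for some pair of elements, one from each collection: overlaps(X, Y) ↔ ∃x ∈ X, ∃y ∈ Y : overlaps(x, y).
Adjacency must hold for at least one pair, and in addition there should be no overlap: adjacent to(X, Y) ↔ (∃x ∈ X, ∃y ∈ Y : adjacent to(x, y)) ∧ ¬overlaps(X, Y).
Upstream must hold for all elements: upstream of(X, Y) ↔ ∀x ∈ X, ∀y ∈ Y : upstream of(x, y).
Interleaves is more complex.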
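The first three collection-level relations can be sketched directly from their prose descriptions. The sketch below ignores strand, uses interbase (start, end) tuples, and assumes simple arithmetic definitions for the single-interval relations; interleaves is not attempted. All function names are hypothetical.

```python
from typing import Sequence, Tuple

Interval = Tuple[int, int]  # interbase (start, end), start < end


def overlaps(x: Interval, y: Interval) -> bool:
    return x[0] < y[1] and y[0] < x[1]  # share at least one base


def adjacent_to(x: Interval, y: Interval) -> bool:
    return x[1] == y[0] or y[1] == x[0]


def upstream_of(x: Interval, y: Interval) -> bool:
    return x[1] <= y[0]


def coll_overlaps(xs: Sequence[Interval], ys: Sequence[Interval]) -> bool:
    """Some pair, one element from each collection, overlaps."""
    return any(overlaps(x, y) for x in xs for y in ys)


def coll_adjacent_to(xs: Sequence[Interval], ys: Sequence[Interval]) -> bool:
    """At least one pair is adjacent and no pair overlaps."""
    some_adjacent = any(adjacent_to(x, y) for x in xs for y in ys)
    return some_adjacent and not coll_overlaps(xs, ys)


def coll_upstream_of(xs: Sequence[Interval], ys: Sequence[Interval]) -> bool:
    """Every element of xs is upstream of every element of ys."""
    return all(upstream_of(x, y) for x in xs for y in ys)


def extent(xs: Sequence[Interval]) -> Interval:
    """Smallest single interval covering the whole collection."""
    return (min(s for s, _ in xs), max(e for _, e in xs))


# Two interleaved exon sets: the gene extents overlap, the exons do not.
gene_a = [(0, 10), (30, 40)]
gene_b = [(15, 25), (45, 55)]
print(coll_overlaps(gene_a, gene_b), overlaps(extent(gene_a), extent(gene_b)))
```

The example mirrors the distinction drawn for figure 2: the two exon collections share no bases, yet the intervals spanned by the two genes overlap.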

Composition table
The full composition table for GIA_J is too large to show but is available as part of the Genome Intervals relations file. Some examples include:
Transitivity of upstream of: upstream of • upstream of → upstream of
starts and finishes compose to make (non-tangential) containment: starts • finishes → nt contained by
The RC upstream of relation is transitive over downstream of (consider Ax1 u^R Bx2 d Bx1 in figure 2): upstream of^R • downstream of → upstream of^R
The full composition table was derived by an automated theorem prover (see methods).

Extending the Sequence Ontology
We can use the genome interval relations above to extend the SO, adding new logical definitions and constraint axioms; we call the resulting artefact SO+. These logical definitions can be used both for reasoning within the ontology and to infer the presence of unstated genomic features in genome databases. The constraint axioms can be used to detect inconsistencies within the ontology, and to provide constraints for genome databases.

Logical Definitions
Logical definitions provide necessary and sufficient conditions in computable form. We have created 140 new definitions for existing SO types based on the GIA relations. Table 5 shows a subset of these. SO uses the type-level version of these relations.

Table 5 (columns: Type, Genus, Differentia).
bioRxiv preprint doi: https://doi.org/10.1101/006650; this version posted June 27, 2014. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
[...] these. Note that we exclude the relationships which are trivially obtained from the definitions in table 5.

Type             Relationship
TSS              upstream of primary transcript
polyA sequence   downstream adjacent to mRNA
polyA site       end of mRNA
Table 6: Proposed new relationship axioms for SO using genome interval relations. These are type-level relationships that can be translated to instance-level relationships such that for example TSS(x) → ∃y : primary transcript(y) ∧ upstream of(x, y).
We can use these type-level relationships as constraints on a genome database. We must do this carefully, bearing in mind that these axioms make an open-world assumption: just because an instance in reality is entailed to exist, it does not follow that it must be explicitly represented in the genome database.
Note that many of the relationships in table 6 are in fact underconstrained as constraints. For example, take TSS upstream of primary transcript. This is an extremely weak axiom: it states that every TSS is upstream of some primary transcript, but the TSS and transcript do not have to be otherwise related. This will be trivially true: a TSS located at the end of a chromosome sequence will be upstream of all the other genes on the same strand.
In fact we want to say that every TSS lies upstream of the same transcript that it regulates. The current version of SO uses a part of relation between the TSS and the primary transcript to indicate that every TSS is functionally coupled to a primary transcript. We can write a fully constrained axiom: part of(x, y) ∧ TSS(x) ∧ transcript(y) → upstream of(x, y). We can use this axiom to check instance-level data, as found in genome databases and data files.
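A check of instance-level data against the fully constrained axiom might look like the following sketch. The feature identifiers, coordinates and the in-memory data layout are invented for illustration and assume all features lie on the same strand; the function name `tss_violations` is hypothetical.

```python
from typing import Dict, List, Tuple

Feature = Tuple[str, int, int]  # (SO type, start junction, end junction)


def upstream_of(x: Feature, y: Feature) -> bool:
    return x[2] <= y[1]  # ω(x) <= α(y)


def tss_violations(
    features: Dict[str, Feature], part_of: List[Tuple[str, str]]
) -> List[Tuple[str, str]]:
    """Return stated part of pairs that break
    part of(x, y) ∧ TSS(x) ∧ transcript(y) → upstream of(x, y)."""
    bad = []
    for child, parent in part_of:
        x, y = features[child], features[parent]
        if x[0] == "TSS" and y[0] == "transcript" and not upstream_of(x, y):
            bad.append((child, parent))
    return bad


# Invented example data.
features = {
    "tss1": ("TSS", 99, 100),
    "tx1": ("transcript", 100, 500),
    "tss2": ("TSS", 640, 641),
    "tx2": ("transcript", 600, 900),
}
part_of = [("tss1", "tx1"), ("tss2", "tx2")]
print(tss_violations(features, part_of))
```

Only stated facts are checked, which is in keeping with the open-world reading described above: a transcript with no recorded TSS is not reported as a violation.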

Reasoning over genome intervals
We can use SO+ (the SO extended with GIA relations) to perform reasoning tasks. These tasks can be broken down according to whether we are performing validation of axioms or inference of unstated axioms. We can further break this down according to whether we are performing reasoning over just SO+, or over the combination of SO+ and some genomic instance data.
We can use a number of different systems for reasoning.Relational databases are generally considered to be the least expressive (meaning that we cannot translate some axioms to the relational model), but are also considered to scale to large genome-sized datasets.At the other extreme are first-order logic theorem provers, which allow for high expressivity, but do not scale well.In between there are a number of different systems that can be roughly divided into partially overlapping rule-based and description-logic (DL) approaches.
Rule-based approaches extend the expressivity of the relational model with arbitrary recursive rules; the OBO-Edit reasoner is an example of a rule-based system. More expressive systems include disjunctive datalog engines such as DLV. OWL-DL (Web Ontology Language Description Logic) was designed to maximize expressivity while retaining decidability, and there are a variety of OWL-DL reasoners.
The set of relations consisting of the core 16 basic GIA relations does not correspond to any of the tractable sub-algebras of the Allen Algebra (for example, the adjacent to relation, defined in terms of Allen as m ∪ mi, appears in none of the tractable subsets). However, this is not a major concern for the majority of genome databases, in which interval junctions can always be assigned to discrete ordered bases.
We first present examples of reasoning using SO+, and then examples of reasoning using a combination of SO+ and genome databases.

Reasoning over the ontology
Given an ontology consisting of a set of asserted relationships, a reasoner can infer the entailed relationships.The full set of entailed relationships for a relation R is called the deductive closure of R.
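For a single transitive relation such as is a, the deductive closure can be sketched as a fixed-point computation over asserted pairs. The edge set below is a simplified illustration: only "mRNA is a transcript" is taken from the text, and the second edge is a hypothetical shortcut rather than the actual SO hierarchy.

```python
from typing import Set, Tuple

Edge = Tuple[str, str]  # (subject type, object type)


def deductive_closure(asserted: Set[Edge]) -> Set[Edge]:
    """Close a set of relationships under transitivity:
    R(a, b) and R(b, c) entail R(a, c)."""
    closure = set(asserted)
    while True:
        inferred = {
            (a, d) for (a, b) in closure for (c, d) in closure if b == c
        }
        if inferred <= closure:  # nothing new: fixed point reached
            return closure
        closure |= inferred


asserted_is_a = {
    ("mRNA", "transcript"),    # from the text
    ("transcript", "region"),  # hypothetical simplified edge
}
entailed = deductive_closure(asserted_is_a) - asserted_is_a
print(sorted(entailed))
```

A real reasoner also applies relation-composition axioms and logical definitions; this sketch covers transitivity only.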
Computing entailed relationships is useful for ontology maintenance: the is a hierarchy for 200 of the terms in SO is maintained automatically by the OBO-Edit reasoner, using the existing genus-differentia definitions.
Computing the deductive closure of all ontology relations is also useful for improving genome database queries.
Given that the SO is by itself several orders of magnitude smaller than a typical genome database, we can afford to use a system with higher expressivity to do the reasoning. We have used both OBO and OWL-DL reasoners to compute entailed relationships in SO+, and both give the same results.
In theory an OWL-DL reasoner can compute relationships that are difficult to compute using a rule-based approach. For example, inferring that every codon overlaps a CDS, based on the five axioms: (a) a codon is either a start or stop codon, (b) start codons start a CDS, (c) start implies overlaps, (d) stop codons stop a CDS, and (e) stop implies overlap. Currently the OBO-Edit reasoner does not make use of class unions. In this particular case it doesn't matter, as the relationship was already asserted at the codon level.
Figure 3 shows examples of entailed relationships. We can ask "how are the types 5'UTR and start codon related?" and get the answer upstream adjacent to, even though this fact is not explicitly stated in the ontology.
The other application of reasoning over ontologies is to find inconsistencies or unsatisfiable classes. Using both OBO-Edit and OWL-DL reasoners, we could find no inconsistencies between axioms within the extended SO. This is not surprising, as the SO is carefully scrutinized by its editors prior to each release, and automated procedures are currently in use with the existing axioms. We still expect that the extended SO presents more precise axioms on which these reasoners can operate.
An example of a mistake that can be detected by reasoners is: (a) three prime UTR downstream adjacent to CDS, (b) stop codon is a codon, (c) codon nt contained by CDS, (d) stop codon starts three prime UTR. This example is not entirely artificial: prior to the existence of SO there was inconsistency amongst the genomics community as to whether the stop codon should be considered part of the CDS. This inconsistency caused interoperability problems, which were solved for systems adopting SO.

Enhanced rigor in the Sequence Ontology
In describing the GIA and extending SO we came across portions of SO that were in need of more precise textual definitions. For example, intergenic region was defined as "A region containing or overlapping no genes that is bounded on either side by a gene". In formalizing this definition, using the upstream adjacent to and downstream adjacent to relations to represent "bounded on either side", we realized that it excluded the two regions at either end of a chromosome. The textual definition was extended to include the disjunctive clause "or bounded by either a gene or the end of the chromosome". Another example was splice site, which had a textual definition indicating that it was a junction but a placement in the is a hierarchy indicating a region. Once the computable definition was added, the inconsistency could be detected by a reasoner (although it was in fact detected whilst preparing the logical definition). This resulted in the definition being clarified and the addition of a new term, splice junction.

Junction-oriented vs Base-oriented
Our formulation is a junction-based, or interbase, one. We could equally have defined interval relations in terms of the bases themselves. We consider an interbase system to have a slight advantage in terms of simplicity of representation of the positioning of splice junctions, insertion regions and so on. However, the two systems would be equivalent in terms of expressivity, so the choice of one over the other is arbitrary.

Splicing, transcripts and exon identity
The examples presented in this paper make a simplifying assumption, namely that all types in the SO represent features along a DNA sequence. The relationships and definitions presented here do not account for the biology of splicing and translation. For example, exons are non-adjacent on the DNA sequence and on the unspliced RNA sequence, but become adjacent after splicing. A full treatment will have to account for this temporal aspect. One approach is to introduce different types, such as exon_G and exon_T for exons on genomes and exons on transcripts respectively. Another is to use n-ary relations such that it is possible to state "exon adjacent to exon on mature transcript" and "exon disconnected from exon on genome".

Circular genomes
The current axiomatization needs to be modified to handle circular genomes. One approach is to weaken the definition of < such that it is reflexive on circular genomes. However, this will have some consequences for the other axioms, which need to be fully worked out. For example, every feature would be upstream of every other feature.
Another solution would be to have some kind of probabilistic metric or arbitrary cut-off, whereby junctions are no longer considered upstream if they loop around the circular DNA too far.But this would be difficult to integrate into these existing axioms.
The most likely approach is to use origin of replication (oriC in bacteria) as the origin of the sequence.

Conclusions
We have defined a collection of genome interval relations, and used them to define an extension of the Sequence Ontology (SO). This extension and these relations will soon become part of the core SO. The relations help clarify the meaning of terms in the SO to humans, and can be used by automated reasoning systems to assist with the construction and quality control of the ontology.
In addition, the relations are useful for querying over datasets and for checking their conformance to constraints in the SO. The relations help clarify the meaning of certain queries to human beings, such that a query for "all exons upstream of gene ABC" has precise semantics. The extended SO can be used to enhance database queries such that implicit features such as introns are found. We found the most effective system for querying over genome datasets was a relational database with the help of a query expansion system. OWL-DL systems do not yet scale over genome-sized datasets.
The composition table is useful for making inferences over the ontology, but for making inferences over data it is simpler to use arithmetic definitions rather than a composition table.
We believe that as genome datasets grow in size, complexity and importance, the need for formal computable specifications of genomic data will increase. Specifications such as the one outlined here will be vital for ensuring the semantic and biological correctness of important datasets, and for performing advanced queries over these datasets.

Defining the Genome Interval Relations
We used OBO-Edit [13] to specify the genome interval relations and to create an extended subset of SO that used these relations for genus-differentia definitions. This was stored in an OBO Format 1.3 file.

Generation of composition table
To generate the full composition table, we translated the genome interval relations to Prover9 syntax and used the Prover9 tool to calculate the table by brute force, attempting to prove every possible R1 • R2 → R for all values of R, R1 and R2.
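The brute-force loop over every R1 • R2 → R can be illustrated without a theorem prover by exhaustively checking a small finite model. This is a different technique from the Prover9 proofs described above: enumeration over a bounded set of junctions can refute a candidate composition rule or give evidence for it, but it does not prove it in general. The three relations and their arithmetic definitions below are assumptions chosen for illustration.

```python
from itertools import combinations, product
from typing import Callable, Dict, List, Tuple

Interval = Tuple[int, int]
Relation = Callable[[Interval, Interval], bool]

# Assumed arithmetic definitions for a small subset of relations.
RELATIONS: Dict[str, Relation] = {
    "upstream of": lambda x, y: x[1] <= y[0],
    "downstream of": lambda x, y: y[1] <= x[0],
    "adjacent to": lambda x, y: x[1] == y[0] or y[1] == x[0],
}


def all_intervals(n_junctions: int) -> List[Interval]:
    """Every interval (start < end) over junctions 0..n_junctions-1."""
    return list(combinations(range(n_junctions), 2))


def composition_holds(r1: str, r2: str, r: str, n_junctions: int = 7) -> bool:
    """Check r1 • r2 → r on every triple of intervals in the finite model."""
    ivs = all_intervals(n_junctions)
    return all(
        RELATIONS[r](x, z)
        for x, y, z in product(ivs, repeat=3)
        if RELATIONS[r1](x, y) and RELATIONS[r2](y, z)
    )


def composition_table(n_junctions: int = 7) -> Dict[Tuple[str, str], List[str]]:
    """Try every possible r1 • r2 → r, as in the brute-force procedure."""
    names = list(RELATIONS)
    return {
        (r1, r2): [r for r in names if composition_holds(r1, r2, r, n_junctions)]
        for r1, r2 in product(names, repeat=2)
    }


table = composition_table()
print(table[("upstream of", "upstream of")])
print(table[("adjacent to", "adjacent to")])
```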

Availability
Common Logic specifications of the relations can be obtained from the GitHub repository.

Figure 1: Relations in the Region Connection Calculus

Figure 2: Example of genomic sequence interval relations: two interleaved genes A and B on opposite strands. The lookup table shows the mnemonics for the relations between any two feature intervals. To determine the relation xRy, look up (row:x column:y). For example the first exon of A (Ax1) is upstream of the reverse-complement projection of all the exons of B.

Figure 2 illustrates these relations with a simplified example.

Figure 3: Reasoning over ontologies: this example illustrates the deductive closure involving a few SO types, and makes use of the relation composition axiom upstream adjacent to • started by → upstream adjacent to to infer that every five prime UTR is upstream adjacent to a start codon (inferred relationships are shown with dashed lines). Not all inferences are shown. This is an example of qualitative/symbolic reasoning: we can make inferences even without arithmetic, using the GIA.

Table 1: Translation table for relationships in the Sequence Ontology (SO).

Table 2: 16 core relations in the Genome Interval Algebra. Glyphs depict the relation holding between A and B, with the strand indicated by an arrow. This table provides definitions based on Allen relations. We provide both human-friendly names (e.g. upstream of) and mnemonics (e.g. u).

Basic single-strand relations