Abstract
The same nucleotide sequence can encode two protein products in different reading frames. Overlapping gene regions are known to encode higher levels of intrinsic structural disorder (ISD) than non-overlapping genes (39% vs. 25% in our viral dataset). Two explanations for elevated ISD have been proposed: that high ISD relieves the increased evolutionary constraint imposed by dual-coding, and that one member of each pair was recently born de novo in a process that favors high ISD. Here we quantify the relative contributions of these two hypotheses, as well as a third hypothesis that has not previously been explored: that high ISD might be an artifact of the genetic code. We find that the recency of de novo gene birth explains ∼32% of the elevation in ISD in overlapping regions of viral genes, with the rest attributed, by a process of elimination, to relieving constraint. While the two reading frames within a same-strand overlapping gene pair have markedly different ISD tendencies, their effects cancel out such that the properties of the genetic code do not contribute overall to elevated ISD. Same-strand overlapping gene birth events can occur in two different frames, favoring high ISD either in the ancestral gene or in the novel gene; surprisingly, most de novo gene birth events contained completely within the body of an ancestral gene favor high ISD in the ancestral gene (23 phylogenetically independent events vs. 1). This can be explained by mutation bias favoring the frame with more start codons and fewer stop codons.