Coding sequence-independent homology search identifies highly divergent homopteran putative effector gene family

Many genomes contain rapidly evolving and highly divergent genes whose homology to genes of known function often cannot be determined from sequence similarity alone. However, coding sequence-independent features of genes, such as intron-exon boundaries, often evolve more slowly than coding sequences and can provide complementary evidence for homology. We found that a linear logistic regression classifier using only structural features of rapidly evolving bicycle aphid effector genes identified many putative bicycle homologs in aphids, phylloxerids, and scale insects, whereas sequence similarity search methods yielded few homologs in most aphids and no homologs in phylloxerids and scale insects. Subsequent examination of sequence features and intron locations supported homology assignments. Differential expression studies of newly-identified bicycle homologs, together with prior proteomic studies, support the hypothesis that BICYCLE proteins act as plant effector proteins in many aphid species and perhaps also in phylloxerids and scale insects.


30
One challenge that is faced by evolutionary and functional genetic studies is that 31 many genes have been identified as "orphan" or lineage-specific genes because 32 homologous sequences cannot be identified outside of a limited taxonomic range.

33
Many lineage-specific genes may not be truly novel, however, since sequence 34 divergence can cause homologs to become undetectable by sequence-search 35 methods (Vakirlis et al., 2020;Weisman et al., 2020). Identifying such extremely 36 divergent homologs remains a significant bioinformatic challenge and limits functional 37 inferences derived from homology (Loewenstein et al., 2009). 38 Genes involved in host-parasite systems often evolve extremely divergent 39 sequences as a result of genetic conflict (Eizaguirre et al., 2012;Paterson et al., 2010). 40 Aphids and their host plants represent one such antagonistic pair. Aphids are small 41 insects that feed by inserting their thin mouthparts (stylets) into the phloem vessels of 42 plants to extract nutrients (Dixon, 1978). Like many herbivorous insects, aphids 43 introduce effector molecules into plant tissues to manipulate the physiology and 44 development of plants to the insects' advantage (Elzinga and Jander, 2013;Hogenhout 45 and Bos, 2011). For example, aphids introduce calcium binding proteins that prevent 46 the plant's ability to block phloem cell transport in response to phloem vessel damage 47 (Will et al., 2007). It is believed that aphids introduce a wide range of effector proteins 48 and that these molecules contribute to the debilitating effects of multiple aphid species 49 on plants, which imposes significant financial damage on most major agricultural crops 50 (Schaible et al., 2008). 51 group of such genes that shared sequence similarity. These 476 genes are extremely 114 divergent from one another at the amino-acid sequence level and appear to have 115 evolved rapidly due to positive natural selection. We noticed, however, that non-116 sequence specific features of bicycle genes appeared to be relatively well conserved. with the same phase are rare in most genomes (Ruvinsky and Watson, 2007). 124 We hypothesized that some aspects of bicycle gene structure may evolve 125 more slowly than the primary protein sequences and thus provide an evolutionary 126 signal to detect distantly related bicycle homologs (Aguiar et al., 2015). For example, 127 the pattern of exon phases can allow identification of highly-divergent homologous 128 genes (Roy and Gilbert, 2005;Ruvinsky and Watson, 2007). Development of a full 129 model to characterize gene structure evolution is beyond the scope of this paper, but   that bicycle genes contain many microexons, the removal of average internal exon 163 size had almost no effect on model performance ( Figure 1C), suggesting that internal 164 exon size is not a uniquely defining feature of bicycle genes. When we applied this classifier to all genes in the H. cornu genome, we 166 identified five false negative and 181 "false positive" genes. Since we chose a 167 probability cutoff that should report an approximately equal number of false negatives 168 and false positives, we hypothesized that the vast majority of "false positives" are 169 newly discovered bicycle homologs. We clustered these genes together with the 170 original bicycle genes based on sequence similarity (Figure 2A), and found that 528 171 represented a single group of related genes (Cluster 2, Figure 2A) that all share 172 sequence similarity with, and included, the originally defined 476 bicycle genes ( Figure   173 2C). 174 We also identified a second cluster of 129 genes that did not share strong 175 sequence similarity with the originally defined Bicycle proteins and encode proteins 176 with incomplete CYC motifs (Cluster 1, Figure 2A). Approximately 20 of these genes 177 were strongly enriched in the salivary glands of gall-inducing foundresses, but many of

191
Since some of these new putative bicycle homologs shared little apparent 192 sequence similarity with the originally defined bicycle genes, we sought additional 193 evidence for homology. Since these genes were identified by our gene-structure based 194 classifier, we hypothesized that other details of gene structure, such as intron positions 195 in the sequence, may provide additional evidence of shared ancestry. We therefore 196 performed gene-structure-aware multiple sequence alignment of the original bicycle 197 genes and the new putative bicycle genes identified by the classifier (Gotoh, 2021). 198 Initial attempts to align all 653 candidate proteins were uninformative and we 199 hypothesized that this was because these proteins display a large range of lengths and proteins into four groups based on protein length (blue dotted lines in Figure 3A). Logo 202 plots of these four groups revealed a striking pattern, proteins contain one, two, three, 203 or four CYC motifs ( Figure 3B-E), and we therefore call these unicycle, bicycle, tricycle, 204 and tetracycle genes, respectively. This evolutionary comparison suggests that the 205 CYC motif is a functional unit, which can be multimerized within individual proteins. 206 We then divided all 653 potential H. cornu bicycle homologs into four length 207 categories and performed gene-structure aware protein sequence alignments to 208 search for shared intron position between the originally defined and newly discovered 209 bicycle homologs. We identified many evolutionarily shared intron positions between 210 11 the two classes of putative homologs (Figure 3 Supplement 1A-D), providing evidence, 211 independent of our classifier, that all of these genes shared a common ancestor.  There are two major clades of gall forming aphids, the Hormaphididae, to which 226 H. cornu belongs, and the Pemphigidae. These two aphid families are thought to have 227 shared a common ancestor that induced galls. Therefore, to determine whether bicycle 228 genes were present in the common ancestor of gall forming aphids, we assembled and

12
We applied the H. cornu bicycle gene classifier to predicted genes of the T. ulmi 234 genome and identified 279 candidate bicycle homologs. In contrast, sequence-based 235 search using BLAST and hmmer identified 1 and 53 candidate homologs at E-value < 236 0.01, respectively. We clustered the 279 T. ulmi genes discovered by the gene-  To test whether these genes likely shared a common ancestor, we performed gene-  Figure 4F), suggesting that they may 251 contribute to the effector protein repertoire in T. ulmi. 252 The discovery of bicycle genes in T. ulmi using the gene-structure based 253 classifier provides evidence that the classifier can identify putative bicycle homologs 254 even when the genes cannot be identified using sequence-based homology methods 255 and that bicycle genes were likely present in the common ancestor of gall-forming 256 13 aphids. We therefore next tested whether the classifier can detect putative bicycle 257 homologs in an aphid species that does not induce galls. The gene-structure based classifier detected 159 putative bicycle homologs in 263 the genome of Acyrthosiphon pisum, a species that does not induce galls (Li et al.,264 2019; Richards et al., 2010). In contrast, sequence-based search using BLAST and 265 hmmer identified 1 and 5 candidate hits at E-value < 0.01, respectively. Multiple  To explore whether these A. pisum bicycle homologs may contribute to the 273 effector gene repertoire of A. pisum, we performed RNA-seq of salivary glands and 274 carcasses. We found that 55% of the putative bicycle genes are strongly enriched in 275 salivary glands ( Figure 5B). In addition, 19 proteins encoded by genes with multiple 276 microexons were identified by a previously-published proteomic study of proteins 277 secreted by A. pisum into an artificial food medium (Dommel et al., 2020), and we 278 found that at least 13 of these proteins are bicycle homologs (Table S1). The presence of bicycle homologs in three divergent aphid species, H. cornu, T. 287 ulmi, and A. pisum implies that bicycle genes were present in the common ancestor of 288 aphids, which lived approximately 280 MYA. To further explore the origins and 289 evolution of bicycle genes, we downloaded genomes and RNA-seq data for nine 290 additional aphid species and ten outgroup species from NCBI. We annotated predicted 291 genes in all genomes using the same bioinformatic pipeline that we have used 292 previously to discover bicycle genes in other species, applied the bicycle gene 293 classifier to all predicted genes, and manually annotated all putative bicycle homologs 294 guided by RNA-seq data. One caveat of this analysis is that many of these genomes 295 are fragmented into many contigs and genes bridging contigs are often misannotated . In addition, we employed 303 chromosome-level genome assemblies for two outgroup species where no predicted 304 bicycle genes were found, P. venusta (Li et al., 2020) and T. vaporariorum (Xie et al.,305 2020), suggesting that the failure to identify bicycle genes in these species did not 306 result from poor genome assemblies. 307 We estimated the phylogeny for these 22 species using whole-genome 308 proteomic predictions (Emms, 2019), and this phylogeny is in general agreement with  In many of the aphid genomes we studied, our classifier identified a single 342 extremely large gene containing more than 100 microexons and encoding a protein of 343 approximately 2000 amino acids as a candidate bicycle homolog ( Figure 7A). Since this 344 gene was so large and most existing aphid genomes are incompletely assembled, we 345 observed that many of these gene models were incomplete. We therefore performed 346 de novo transcript assembly for all species studied using TRINITY (Grabherr et al.,347 2011), and assembled consensus transcripts of these long candidate bicycle genes 348 manually.

349
Since we had previously found that bicycle gene length is correlated with the  Supplement 1A), suggesting that they share a common ancestor.

363
These large genes appear to be divergent bicycle proteins that have duplicated 364 the core motif many times. We therefore call these megacycle genes. Compared to  information independent of both sequence and protein structure (Bazan, 1991;Betts, 387 2001; Brown et al., 1995). This source of information is becoming increasingly available 388 with the recent availability of many well-assembled, annotated genomes. It is possible 389 to conceptualize a model that jointly utilizes both sequence and exon-intron structure 390 in homology search, but we realized that bicycle genes are encoded with extremely We found that a linear classifier using information on gene length, exon sizes, 394 exon numbers, and exon phases provided a highly accurate predictor of bicycle 395 homologs that could not be identified using sequence similarity searches. This 396 predictor allowed identification of bicycle homologs across aphids and in several 397 outgroup taxa ( Figure 6). Homology was supported by post hoc observation of CYC 398 motifs and shared introns across many homologs. Thus, sequence independent 399 features provide substantial evidence of homology and integration of even a few 400 sequence independent features into existing sequence-based search methods may 401 substantially increase power to detect ancient homology relationships between genes.

402
All gene structure features that we used contributed to classifier performance, but no 403 single feature was critically important for classifier performance. Therefore, in future 404 work it would be ideal to incorporate multiple features of gene structure into predictive 405 models.

406
For several reasons, our classifier may have underestimated the number of 407 bicycle homologs in each species. First, this approach is sensitive to the quality of 408 genome annotation, which is itself dependent on genome assembly quality. We found 409 that most automatically annotated gene models of bicycle genes required manual re-

486
We observed, in this paper and previously (Korgaonkar et al., 2021), pervasive, repeated, and 487 23 apparently strong positive selection on bicycle genes and here we report that some bicycle 488 genes have diverged so dramatically that they cannot be recognized as bicycle genes from In the training set, the dependent variable of the GLM is whether the transcript was 507 previously classified as a bicycle gene (Korgaonkar et al., 2021), where bicycle genes were 508 coded as 1 and non-bicycle genes were coded as 0. We used only genes classified as either 509 bicycle genes or previously annotated genes in the training set due to the possibility that   genome was annotated for protein-coding genes using BRAKER (Hoff et al., 2019) with 559 multiple RNA-seq libraries (Table S3). 560 This genome has been deposited to GenBank under the accession JAIUCT000000000. Candidate bicycle genes from H. cornu, T. ulmi, and A. pisum identified by the 566 classifier were manually re-annotated in APOLLO (Dunn et al., 2019) guided by RNAseq data 567 viewed in IGV ( Thorvaldsdóttir et al., 2013). Samples used for analysis are described in Table   568 S3-5. Differential expression analyses were performed as described previously (Korgaonkar et 569 al., 2021) and full analysis scripts are provided at https://doi/figshare_spaceholder.

577
We searched for bicycle homologs in all aphid species and outgroups using Position-578 Specific Interactive Basic Local Alignment Search Tool (PSI-BLAST) (Altschul, 1997) and 579 hmmer (Eddy, 2011 (Table S5). We downloaded RNAseq data from NCBI (Table S4) and predicted coding genes 607 using BRAKER (Hoff et al., 2019) with the same workflow we had used previously that had 608 allowed discovery of bicycle genes (Korgaonkar et al., 2021).