Whole Genome Assembly of a Hybrid Trypanosoma cruzi Strain Assembled with Nanopore Sequencing Alone

Trypanosoma cruzi is the causative agent of Chagas disease, which causes 10,000 deaths per year. Despite the high mortality caused by the pathogen, relatively few parasite genomes have been assembled to date; even some commonly used laboratory strains do not have publicly available genome assemblies. This is at least partially due to T. cruzi’s highly complex and highly repetitive genome: while describing the variation in genome content and structure is critical to better understanding T. cruzi biology and the mechanisms that underlie Chagas disease, the complexity of the genome defies investigation using traditional short read sequencing methods. Here, we have generated a high-quality whole genome assembly of the hybrid Tulahuen strain, a commercially available Type VI strain, using long read Nanopore sequencing without short read scaffolding. Using automated tools and manual curation for annotation, we report a genome with 25% repeat regions, 17% variable multigene family members, and 27% transposable elements. Notably, we find that regions with transposable elements are significantly enriched for surface proteins, and that on average surface proteins are closer to transposable elements compared to other coding regions. This finding supports a possible mechanism for diversification of surface proteins in which mobile genetic elements such as transposons facilitate recombination within the gene family. This work demonstrates the feasibility of nanopore sequencing to resolve complex regions of T. cruzi genomes, and with these resolved regions, provides support for a possible mechanism for genomic diversification.


Trypanosoma cruzi causes Chagas disease, a poorly understood and potentially fatal illness that is estimated to infect 6 million people worldwide. Chagas disease exhibits substantial phenotypic variability: only 30% of infected patients develop symptoms following chronic infection, which can involve different organ systems. Moreover, even parasite strains adapted to laboratory culture show phenotypic differences in drug susceptibility, in vitro growth capacity, and experimental infectivity in mice (Bice and Zeledon, 1970; Brener and Chiari, 1965; [...] T. cruzi isolates are usually categorized into six genotypes termed Discrete Typing Units (DTUs), based on the results of several PCR and gel electrophoresis steps. However, these genotypes insufficiently describe the full gamut of parasite diversity and are not strongly associated with any clinical phenotype (Messenger et al., 2015; Zingales et al., 2012). Nevertheless, there are few high-quality whole genomes for T. cruzi available. This is partly because basic features of the parasite's genome are difficult to study, owing to the genome's repetitiveness, variability in genome sizes, number of chromosomes, spontaneous aneuploidies and polyploidies, and lack of synteny between strains. Moreover, two parasite genotypes (DTUs V and VI) are the results of hybridization between other parasite genotypes, resulting in highly heterozygous genomes that are even more difficult to resolve (Sturm et al., 2003). Additionally, most of the genomic diversity between strains is accounted for by six highly diverse, multicopy gene families, referred to as multi-gene families [...]

We sought to develop a scalable pipeline for generating T. cruzi genomes to fully resolve these multigene families and transposable elements (TEs). We chose to develop this pipeline using the Tulahuen strain.

The Tulahuen strain is commercially available from ATCC and has been used in numerous experimental studies, especially drug susceptibility studies due to its endogenously expressed [...]

Here, we have used Oxford Nanopore Technologies (ONT) long read sequencing, without supplementation with short Illumina reads, to generate a whole genome of the Tulahuen lacZ clone C4 strain. We find a high proportion of heterozygosity in the Tulahuen genome and annotate the genome for multigene family members and transposable elements. Using the newly [...] raises interesting questions about the mechanism of virulence factor diversification.

bioRxiv preprint doi: https://doi.org/10.1101/2023.07.27.550875; this version posted July 27, 2023. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

[...] encoding for multigene family members, the search was performed against the entire assembly, and the genes that were found outside coding regions were defined as pseudogenes.

All downstream analyses were performed and all figures were generated in R. All scripts used to generate the assembly, annotations, and analysis are available at https://github.com/mugnierlab/Hakim2023.

Raw data and the final assembly will be deposited upon manuscript submission.

We then used BUSCO scores to assess assembly completeness based on the presence of conserved orthologous genes shared in the euglenozoan phylum. The Tulahuen assembly has a BUSCO completeness of 88.7%, which is slightly lower than recent assemblies using both short and long reads, though higher than most assemblies using short reads alone, including the reference CL Brener assembly (Fig 1B). This suggests the need for updated assemblies using long read technologies. Notably, we find that one BUSCO gene, which encodes UV excision repair RAD23-like protein, is reported as fragmented in every T. cruzi assembly assessed here, indicating that the gene is likely divergent from orthologs in other euglenozoan organisms.

It is important to note that BUSCO completeness, while a useful benchmarking tool for recovery of conserved, non-repetitive regions of the genome, may fail to accurately assess the resolution of more complex, repetitive regions of a genome, especially if the single copy genes are less likely to occur within repetitive regions. For example, our attempt at assembling this genome using Flye produced a genome that was 96% BUSCO complete, but only 26 Mb long, suggesting that the conserved regions were well resolved in this assembly, but that much of the genome made up by repetitive regions, especially diverse MGF members and TEs, was lost. To more fully assess the accuracy of a T. cruzi genome assembly, additional analyses may be beneficial, especially ones tailored to a specific assembly method's known systematic errors, such as ONT's known issues in homopolymer resolution.
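The contrast between BUSCO completeness and assembly size described above can be checked mechanically. The following is a minimal Python sketch, not the pipeline used in this study: the function names, the 90% BUSCO threshold, the 0.8 size fraction, and the expected genome size supplied by the caller are all illustrative assumptions.

```python
from typing import Dict, Iterable


def contig_lengths(fasta_lines: Iterable[str]) -> Dict[str, int]:
    """Return {contig_name: length_in_bp} from FASTA-formatted lines."""
    lengths: Dict[str, int] = {}
    name = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token as the contig name.
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def looks_collapsed(busco_pct: float, total_bp: int, expected_bp: int,
                    min_busco: float = 90.0, min_fraction: float = 0.8) -> bool:
    """Flag an assembly that scores well on BUSCO but is much shorter
    than expected, which may indicate lost repetitive sequence.

    min_busco and min_fraction are arbitrary illustrative thresholds.
    """
    return busco_pct >= min_busco and total_bp < min_fraction * expected_bp


# Toy input only; a real run would stream lines from an assembly FASTA file.
toy = [">contig1 toy record", "ACGTAC", "GGT", ">contig2", "TTTT"]
total = sum(contig_lengths(toy).values())
print(total)
```

A check of this kind complements BUSCO rather than replacing it: the caller has to supply an expected genome size from an independent estimate, which the sketch deliberately does not hard-code.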

To evaluate systematic errors of the Nanopore-only assembly, we took advantage of a NEO:LacZ construct cloned into the nuclear genome of the Tulahuen strain available from ATCC (Buckner et al., 1996). We found LacZ tandemly expressed ten times within the insertion [...] are identical to each other in this assembly, suggesting that the differences between the recovered sequences and the published constructs are likely true SNPs and not results of sequencing or assembly error. Importantly, we find no indels in this region; indels are a known systematic error of Nanopore sequencing that typically requires short read supplementation to correct. Our ability to resolve this known sequence without indels is likely a result of the sequencing data from the new R10.4.1 chemistry.

To further assess whether our assembly was affected by a systematic error that led to indels in the assembly, we compared the average open reading frame (ORF) length between this assembly and the reference genome, CL Brener, and two other high quality long read assemblies, Brazil A4 and TCC. We chose Brazil A4 because of its high BUSCO completeness, and TCC because it is also a hybrid strain. Indels in low complexity regions are problematic when estimating ORFs, as indels will cause frame-shifts across the whole contig and result in erroneously short predicted open reading frames. We find that the distribution of ORF lengths is comparable for each genome (Fig 1C).
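The effect of a homopolymer indel on predicted ORF length can be illustrated with a toy example. The sketch below is an assumption-laden illustration in Python, not the ORF caller used for Fig 1C: it scans only the three forward frames, treats ATG as the sole start codon, and uses two short made-up sequences that differ by one deleted A in an AAA run.

```python
from typing import List

STOP_CODONS = {"TAA", "TAG", "TGA"}


def orf_lengths(seq: str) -> List[int]:
    """Lengths in codons (start included, stop excluded) of ATG-to-stop
    ORFs found in the three forward reading frames of seq."""
    seq = seq.upper()
    lengths: List[int] = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                lengths.append((i - start) // 3)
                start = None
    return lengths


# Made-up sequences for illustration only.
INTACT = "ATGGCCGCCAAACTGACCGCCGCCGCCGCCTAA"
# Same sequence with one A removed from the AAA homopolymer.
WITH_DELETION = "ATGGCCGCCAACTGACCGCCGCCGCCGCCTAA"

print(orf_lengths(INTACT))         # [10]
print(orf_lengths(WITH_DELETION))  # [4]
```

In this toy case the single-base deletion shifts the frame onto a premature TGA, so the predicted ORF shrinks from 10 codons to 4. An assembly with frequent indel errors would therefore be expected to show a shorter ORF length distribution than comparable genomes.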

Transposable elements are physically distributed bimodally throughout the genome.

Following assembly, we annotated the genome for repetitive regions that are generally difficult to resolve during assembly, specifically multi-gene family members and TEs. We find that a large proportion of the genome is made up of these elements, again in agreement with previous work: 25% are simple repeats, 27% are transposable elements, and 22.7% are MGF members. We found many RNA transposable elements, the most abundant of which were LINE and LTR retrotransposons (Fig 1D). We also found DNA transposons, which have not [...] for T. brucei compared to other ORFs; a closer look at the genes in this peak reveals many genes involved in TE biology, such as reverse transcriptase and RNase H (supplemental table 2). This observation in a better annotated genome further supports the hypothesis that TE-mediated diversification may be evolutionarily conserved in trypanosomes.
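The comparison of how close different classes of coding regions lie to TEs can be sketched as a nearest-feature distance calculation. The Python below is a minimal illustration under stated assumptions, not the analysis code from the repository: the tuple layouts, the class labels, and the brute-force search are hypothetical choices, coordinates are treated as 0-based and end-exclusive, and no statistical test is included.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # (start, end), 0-based, end-exclusive


def gap(a: Interval, b: Interval) -> int:
    """Distance in bp between two intervals; 0 if they overlap or touch."""
    return max(0, max(a[0], b[0]) - min(a[1], b[1]))


def distances_by_class(
    genes: List[Tuple[str, int, int, str]],  # (contig, start, end, gene_class)
    tes: List[Tuple[str, int, int]],         # (contig, start, end)
) -> Dict[str, List[int]]:
    """For each gene, find the distance to the nearest TE on the same
    contig, and group those distances by gene class."""
    tes_by_contig: Dict[str, List[Interval]] = {}
    for contig, start, end in tes:
        tes_by_contig.setdefault(contig, []).append((start, end))

    out: Dict[str, List[int]] = {}
    for contig, start, end, gene_class in genes:
        contig_tes = tes_by_contig.get(contig)
        if not contig_tes:
            continue  # no TE on this contig, so the distance is undefined
        nearest = min(gap((start, end), te) for te in contig_tes)
        out.setdefault(gene_class, []).append(nearest)
    return out


# Toy annotations for illustration only.
toy_tes = [("c1", 100, 200), ("c1", 1000, 1100)]
toy_genes = [
    ("c1", 210, 300, "surface"),
    ("c1", 500, 600, "other"),
    ("c2", 0, 50, "other"),
]
print(distances_by_class(toy_genes, toy_tes))
```

The per-class distance lists returned here could then be summarized or compared with whatever statistical test is appropriate. The brute-force search is quadratic, so a genome-scale run would likely use sorted intervals or an interval tree instead.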

We have produced a full genome for a hybrid T. cruzi strain frequently used in biomedical research. Despite the complexity of this genome, it was assembled with relatively few reads: a total of 3.6 Gbp, though 3.8% of those bases mapped to multicopy mitochondrial DNA, which was not used for the chromosomal assembly. This is the first T. cruzi genome that has been assembled with ONT long reads alone, without supplementation of other technologies, and the quality of this genome is comparable to those of other high-quality genomes. Low startup costs of Nanopore technology compared to Illumina or PacBio, as well as the relatively low sequencing depth required, suggest that Nanopore sequencing may prove an excellent tool for generating whole genomes from a large number of strains, especially in low resource settings.
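The read budget above translates into usable nuclear bases and an approximate depth as follows. This is a back-of-the-envelope Python sketch: the 3.6 Gbp total and the 3.8% mitochondrial fraction come from the text, while the genome size is not stated in this section, so the value used below is a loudly labeled placeholder that should be replaced with the actual assembly length.

```python
def nuclear_bases(total_bases: float, mito_fraction: float) -> float:
    """Bases remaining after removing the fraction mapped to mitochondrial DNA."""
    if not 0.0 <= mito_fraction <= 1.0:
        raise ValueError("mito_fraction must be between 0 and 1")
    return total_bases * (1.0 - mito_fraction)


def mean_depth(bases: float, genome_bp: float) -> float:
    """Average sequencing depth: sequenced bases divided by genome size."""
    if genome_bp <= 0:
        raise ValueError("genome_bp must be positive")
    return bases / genome_bp


# Values from the text: 3.6 Gbp sequenced, 3.8% mapped to mitochondrial DNA.
usable = nuclear_bases(3.6e9, 0.038)  # roughly 3.46 Gbp

# PLACEHOLDER, not a value from this study: substitute the real assembly length.
ASSUMED_GENOME_BP = 50e6
print(round(mean_depth(usable, ASSUMED_GENOME_BP), 1))
```

Only the first quantity, about 3.46 Gbp of non-mitochondrial sequence, follows directly from the reported numbers. The depth figure depends entirely on the genome size supplied.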

The main systematic error, indels within homopolymers, seems to have a minimal effect on the predicted ORF lengths in this assembly; this is likely due to the new R10.4.1 sequencing technology, which in independent assessments shows improved resolution for these challenging regions (Sereika et al., 2022).

Using the long read assembly, we were able to annotate multi-gene family members and transposable elements and describe their relationship to each other in linear genome sequence.

TEs in T. cruzi are known to often cluster together (Olivares et al., 2000). Novel to this study is our observation that there seems to exist a genomic compartment containing coding [...]