Origin Matters: Using a Local Reference Genome Improves Measures in Population Genomics

Genome-level sequencing enables us to ask fundamental questions about the genetic basis of adaptation, population structure, and epigenetic mechanisms, but usually requires a suitable reference genome for mapping population-level re-sequencing data. In some model systems, multiple reference genomes are available, giving researchers the challenging task of determining which reference genome best suits their data. Here we compare the use of two different reference genomes for the three-spined stickleback (Gasterosteus aculeatus), one novel genome derived from a European gynogenetic individual and the published reference genome of a North American individual. Specifically, we investigate the impact of using a local reference versus one generated from a distinct lineage on several common population genomics analyses. Through mapping genome resequencing data of 60 sticklebacks from across Europe and North America, we demonstrate that genetic distance among samples and the reference impacts downstream analyses. Using a local reference genome increased mapping efficiency and genotyping accuracy, effectively retaining more and better data. Despite comparable distributions of the metrics generated across the genome using SNP data (i.e., π, Tajima’s D, and FST), window-based statistics using different references resulted in different outlier genes and enriched gene functions. A marker-based analysis of DNA methylation distributions had a comparably high overlap in outlier genes and functions, yet with distinct differences depending on the reference genome. Overall, our results highlight how using a local reference genome decreases reference bias to increase confidence in downstream analyses of the data. Such results have significant implications in all reference-genome-based population genomic analyses.

including phylogenomics (Fang et al., 2018;Wang et al., 2013), comparative evolution (Palmer 82 All reads belonging to the same individual from different lanes were combined using 243 MergeSamFile, and then duplicate reads were flagged using MarkDuplicates. Variant calls 244 were performed using GATK v4.0.6.1 (McKenna et al., 2010), calling variants in all genomes 245 simultaneously, split by chromosome. The final set of SNPs was produced using hard filtering 246 following the best practise workflow (see supplementary methods for filtering thresholds; 247 (Depristo et al., 2011). Genome mapping and variant calls were conducted on the QMUL 248 Apocrita High Performance Computing Cluster (King et al., 2017). 249 The gynogenesis process employed here aims to purge heterozygous variation, enabling us 250 to reconstruct complex genomic regions. To assess the proportion of the genome that remains 251 heterozygotic we used the same GATK variant calling and filtering pipeline detailed above with 252 the paired-end Illumina libraries used to polish the gynogen assembly. 253

254
In order to assess the phylogenetic relationship between the reference genomes and the 60 255 resequenced genomes, we compared maximum-likelihood phylogenies based on each variant 256 call with its respective reference genome. Maximum-likelihood phylogenies for both SNP calls 257 were inferred using RAxML v8.2.11 (Stamatakis, 2014). We randomly sampled 1% of all 258 segregating sites by setting the "-select-random-fraction" parameter to 0.01 in the GATK 259 function SelectVariants. The resulting VCFs were converted to PHYLIP format (Felsenstein, 260 1989 The LUMPY call was genotyped using SVTyper v0.7.1 (Chiang et al., 2015), and all SVs 288 marked with the "LowQual" DELLY flag were removed. To ensure SVs found by both programs 289 were the same, only SVs with an overlap of at least 50% were accepted and merged into a 290 cohort level VCF file using SURVIVOR v1.0.3 (Jeffares et al., 2017). For downstream 291 analyses, we included autosomal duplications, deletions, and inversions with a length of 292 between 1kb and 1Mb supported by at least 6 split or discordant reads. All samples that were 293 homozygote non-reference or heterozygote across all samples were analysed separately as 294 these variants likely arise from the reference genome assemblies. Genes with coordinates 295 entirely nested within SVs were defined as structurally variable genes (SVG) using the 296 foverlaps function in the R package data.table v1.9.6. 297 Cytosine methylation ratios in CpG sites was estimated for each fish and differentially 312 methylated sites (DMS) were calculated between the two treatment groups (no parasite 313 exposed or exposed to the nematode parasite Camallanus lacustris) respectively (for more 314 information, see Sagonas et al., 2020) using the R package MethylKit v1.14.2. CpG 315 methylation ratios were estimated by calculating the number of reads mapping to a given 316 position carrying a cytosine divided by those reads carrying either C or T. CpG sites with 317 coverage below 10X and sites that have more than 99.9th percentile of coverage were 318 discarded in each sample from downstream analyses. We selected DMS between treatments 319 using the following criteria: change in fractional methylation larger than 15%; q-values lower 320 than 0.01 (SLIM method), and presence of methylated Cs in at least 50% of the samples within a treatment group. Differentially methylated genes were identified using the genomation R 322 package v.1.1.0 (Akalin et al., 2015); a gene was considered differentially methylated if at least 323 one DMSs was located no further than 1.5-kb upstream and 500 bases downstream of it. 324 zygosity, and population-genetic indices (i.e., Tajima's D, π, and FST). For mapping efficiency, 338 the proportion of mapped reads that were singletons or duplications were independently used 339 as response variables. An interaction between reference genome origin (i.e., North America 340 or Europe) and population origin (i.e., North America or Europe) were assigned as fixed 341 effects, and population ID was set as a random factor. The same approach was used for all 342 analyses, independently replacing the response variable with the zygosity categories or 343 population-genetic indices, retaining the same fixed and random effects.

348
The contig-level European gynogen assembly is 458.4Mb, making it 1.0% smaller than the 349 North American reference genome (463.0Mb; Peichel et al., 2017). We achieved an N50 of 350 0.746Mb, comprised of 1,906 contigs (table 1) Next, we called SNPs in the paired-end libraries used to polish the European genome 361 assembly to assess the remaining heterozygosity in the gynogen genome assembly, which is 362 expected to be homozygous after the gynogenesis procedure. In total we identified 299,931 363 SNPs (average genome-wide read depth of 60.6) in the genome: 3,118 SNPs were 364 homozygote non-reference, and 296,813 were heterozygote. Hence, after two generations of 365 gynogenesis only 0.07% of the genome remains polymorphic, compared to 0.53% ± 0.01% 366 (mean ± SE) across all samples mapped to both reference assemblies (supplementary table 367 S1). These SNPs likely stem from paralogous sequences that were not identified by the 368 genome assembler or from errors in the SNP calling process. 369

370
To assess the impact of reference genome origin on downstream analyses, we started by 371 mapping whole genome resequencing reads from 60 individuals to the reference genomes. In total, we identified 10,672,162 SNPs using the North American reference genome, and 373 10, 757 fig. 3).  respectively, and specifically 298 and 299 genes were 1-to-1 orthologs between the two 501 assemblies. A total of 204 of those genes with DMS (68%) were shared in both genomes. 502 There were three significantly enriched GO terms (protein binding, calcium ion binding, and 503 binding) shared in both analyses, and two enriched GO terms (cation binding and metal ion 504 binding) only identified when using a divergent reference genome (Table 2).

506
Until genome-graphs are more widely available, individual reference genomes remain an 507 integral part of population genomic analyses. However, the reference genome can introduce 508 mapping biases that significantly influence downstream analyses and inferences (Bohling,  mapping and genotyping performance. Specifically, mapping efficiency was significantly better 517 using the European reference genome, but the increase in performance was greater for local 518 samples. When using a local reference, more genomic sites were genotyped, and genome 519 window-based estimates of TD and FST increased whilst π slightly decreased. Similarly, 520 structural variants (SVs) analysis gave slightly different results based on the reference 521 genome used. Consequently, most GO analyses resulted in only a minor proportion of 522 matching enriched GO functions when using different reference genomes. In contrast to the 523 window-based methods, the marker-based DNA methylation analysis pipeline was relatively 524 less affected by reference genome origin, but still about one third of differentially methylated 525 genes was uniquely identified by one but not both references. 526 genome-wide estimates of FST and TD were higher when using a local reference genome, 568

Genome Assembly of a Gynogenetic Individual
whereas the opposite pattern was true for estimates of nucleotide diversity (π). Despite the 569 differences in genome-wide estimates, the genome-wide distribution of outlier windows of π, 570 FST, and TD did not significantly differ across the genome for the same resequenced 571 populations mapped to different reference genomes. Crucially, however, the detection of 572 outlier genes was strongly impacted when using different reference genomes, even when 573 conservatively limiting the analysis to 1-to-1 orthologs identified among the references. 574 Specifically, genes overlapping π outliers had a low proportion of matching orthologs in the 575 same populations when mapped to different reference genomes. Genes overlapping outliers 576 from scans for FST and TD were more consistent, but still revealed a large number of 577 differences. In total, only 5.45% of the orthologs among assemblies were mapped to different 578 chromosomes, clearly affecting the detection of outlier genes but only explaining a small 579 proportion of differences observed. The difference in gene placement may instead highlight 580 numerous resolved differences among assemblies that have accumulated since the 581

588
Similar to the effect of reference genome origin on SNP-based scans, the detection of SVs 589 appears to be affected by both reference genome origin and the assembly regardless of origin. 590 Firstly, fewer deletions and inversions were identified when using a local reference genome. 591 This result follows expectations, as there is less time between MRCA for SVs to build up 592 between the sampled population and the reference genome. Secondly, more deletions and 593 duplications and fewer inversions were detected when using the European reference, 594 irrespective of population origin, suggesting differences in the genome assembly plays a role 595 in the detection of SVs. However, the differences in SV detection did not translate into any 596 significant differences in the number of genes overlapping deletions. Overall, our results 597 suggest the detection of SVs may be affected by a combination of mapping efficiency, time to 598 MCRA, and the methods used to assemble a reference genome. 599

600
The most consistent analysis in terms of overlap among reference genomes was the marker-601 based DNA methylation test. Firstly, using a local reference genome significantly increased 602 mapping efficiency, which resulted in more methylated sites and DMS being detected. The 603 number of genes overlapping the DMS using either reference genomes was the most 604 consistent among all analyses, with one fewer gene (0.14% of total genes) and ortholog 605 (0.34% of total 1-to-1 orthologs) being identified when using a local reference. The DMS 606 analyses recovered a relatively high proportion of overlapping outlier orthologs among 607 reference genomes (68% of 1-to-1 orthologs), but still revealed an effect of the choice of 608 reference. The higher proportion of overlap compared to window-based analyses may be due 609 to the marker-based approach, which is less sensitive to genomic translocation, genome 610 evolution, or assembly errors.

612
The largest effect of reference genome origin was in the GO enrichment analyses of outlier 613 genes from genome scans, with only a minor proportion of enriched GO terms overlapping 614 when using different reference genomes. Given the overall small proportion of overlapping 615 outlier genes, these results were to be expected. GO enrichment analyses are particularly 616 sensitive to minor changes in the number of genes or annotations in lists of genes (Gaudet & 617 Dessimoz, 2017). It should be noted, however, that to allow for a direct comparison of the 618 effects of the different reference genomes, we focused on 1-to-1 orthologs. GO  The original assembly used entirely Sanger sequence data (F. C. Jones et al., 2012), 632 compared to the PacBio and Illumina sequence data used for the gynogen genome. As such, 633 only when there is a significant interaction between population and reference origin and no 634 obvious bias in the observations towards either reference can we exclude that the differences 635 in sequencing and assembly methodologies are not the primary cause for the observed 636 patterns.
The aim of this study was to investigate the effects of a local reference genome and its effects 638 on downstream analyses. The reference-specific patterns of our results highlight that there is 639 no simple solution. We suggest that the quality of the reference genome and annotations 640 remains the single most important factor when choosing which reference to use. However, 641 when multiple similar quality references are available, a local reference genome offers higher 642 mapping efficiencies and decreases the proportion of missing data. The smallest reference 643 effect among our analyses was for the marker-based methylation analysis, which had 644 markedly more overlap among outliers in comparison to the window-based approach or the 645 SV analysis, but still had over 30% of outliers that did not overlap. Taken together, using a 646 local reference genome should increase the confidence of inferences made within a study, 647 even if the difference is only minor. 648