CoalQC - Quality control while inferring demographic histories from genomic data: Application to forest tree genomes

Estimating demographic histories using genomic datasets has proven to be useful in addressing diverse evolutionary questions. Despite improvements in inference methods and availability of large genomic datasets, quality control steps to be performed prior to the use of sequentially Markovian coalescent (SMC) based methods remains understudied. While various filtering and masking steps have been used by previous studies, the rationale for such filtering and its consequences have not been assessed systematically. In this study, we have developed a reusable pipeline called “CoalQC”, to investigate potential sources of bias (such as repeat regions, heterogeneous coverage, and callability). First, we demonstrate that genome assembly quality can affect the estimation of demographic history using the genomes of several species. We then use the CoalQC pipeline to evaluate how different repeat classes affect the inference of demographic history in the plant species Populus trichocarpa. Next, we assemble a draft genome by generating whole-genome sequencing data for Mesua ferrea (sampled from Western Ghats, India), a multipurpose forest plant distributed across tropical south-east Asia and use it as an example to evaluate several technical (sequencing technology, PSMC parameter settings) and biological aspects that need to be considered while comparing demographic histories. Finally, we collate the genomic datasets of 14 additional forest tree species to compare the temporal dynamics of Ne and find evidence of a strong bottleneck in all tropical forest plants during Mid-Pleistocene glaciations. Our findings suggest that quality control prior to the use of SMC based methods is important and needs to be standardised.


Introduction 72
Coalescent theory continues to be the fundamental tool for the study of gene genealogies in a 73 population genetic framework. Changes in coalescence time and the Watterson estimator of 74 genetic diversity (θ w ) along the genome serve as a record of the population history of a 75 species through time. Increasing availability of whole genomic datasets for a multitude of 76 species has made it possible to analyse demographic histories to answer a suite of questions 77 such as host-parasite co-evolution  Genome assembly quality encompasses multiple factors such as sequence contiguity 157 (generally quantified as N50), number and length of gaps, the fraction of genes assembled 158 (quantified using BUSCO's) and fraction of the genome assembled (quantified based on the 159 percent of reads mapping to the genome assembly). To assess the effect of genome quality on 160 demographic inference, we compared Ne trajectories estimated using a single human 161 individual (NA12878) mapped to five different versions (hg4, hg10, hg15, hg19, and hg38) 162 of human genome assemblies with varying levels of quality. We found that all the measures 163 of genome quality used by us showed an improvement in recent versions of the human 164 genome (see Table S1). The estimated effective population size (Ne) showed greater 165 variability between genome assembly versions during ancient i.e., 1-7 MYA (Mean of 166 standard deviations in Ne of each atomic interval from 43 to 64 = 0.84) and recent i.e., 0-15 167 KYA (Mean of standard deviations in Ne of each atomic interval from 0 to 6 = 0.82) 168 compared to mid-time period i.e., 100-400 KYA (Mean of standard deviations in Ne of each 169 atomic interval from 18 to 32 = 0.29) (see Fig. 1 To ensure that the effect of genome assembly quality is not limited to just the human 186 genome, we compared the demographic histories inferred from the initial and recent versions 187 of the Tribolium castaneum and Danio rerio genomes (see Table S1). We find that similar to 188 the differences seen between different versions of the human genome assembly, the estimates 189 of Ne inferred from different versions of the genome show distinct trends (see Fig. S1 heterozygosity showed an increasing trend towards older atomic intervals, we found that the 246 Ts/Tv ratio did not show any discernible trend (see Fig. 3). Similar estimates of the Ts/Tv 247 ratio in repeat and non-repeat regions suggests that the heterozygous sites identified in repeat 248 regions are truly polymorphic. 249 250 Repeat regions of the genome have very high coverage due to the mapping of reads 251 from multiple copies and low callability due to a large number of mismatches across the 252 reads mapped to the same genomic region. Hence, some studies tend to mask genomic 253 regions based on criteria determined based on coverage or callability instead of the presence 254 of repeats. We separately evaluated the effect of masking genomic regions based on coverage 255 or callability classes (see Fig. S3 and S4) and find that masking based on these criteria needs 256 to be treated independently of repeat region-based masking. 257 258

Which biological and technical factors influence demographic inference? 259
Several technical factors such as the optimal PSMC parameter settings, sequencing platform 260 used, the prevalence of cross-contamination from closely related species, misleading or 261 incomplete metadata in public datasets are important considerations during the comparative 262 interpretation of demographic history. Similarly, biological factors such as the prevalence of 263 whole-genome duplications, changes in the karyotype or genome size and high intraspecific 264 variation in genetic diversity also need to be considered. We generated whole-genome 265 sequencing data for a tropical plant species (Mesua ferrea) and use it as an example to 266 understand how biological and technical factors influence demographic inference. For any 267 newly sequenced genome, the PSMC parameters -r (initial theta/rho ratio), -p (pattern of 268 parameters specifying distribution of free intervals and atomic intervals) and -t (the 269 maximum time to TMRCA) need to be optimised so that all the atomic intervals have 270 sufficient number of recombination events. 271

The maximum time to TMRCA 272
For the first PSMC run of the Mesua ferrea genome, -t 5 -r 5 -p "4+25*2+4+6" options were 273 used. However, the resultant PSMC output did not have enough recombination events in 274 some of the atomic intervals. Therefore, the -p parameter was optimized until all the atomic 275 intervals had a sufficient number of recombination events. A mutation rate of 2.5e-09 per site 276 per year i.e. 3.75e-08 per site per generation was used assuming 15 years of generation time 277 for scaling the results. The resultant trajectory did not go back to older (i.e., beyond 150 Kya; 278 see Fig. 4a) time points. So, we decided to optimize all the parameters so that the trajectory 279 will give meaningful results beyond 150 Kya. Hence, the maximum time to TMRCA, i.e., -t 280 parameter was increased so that the trajectory extended to older (i.e., 150 to 400 Kya; see 281 we considered some of the longest scaffolds and visualised the assignment of specific 291 genomic regions to various atomic intervals. Comparing the atomic intervals assigned to the 292 same genomic region at different values of -t, we found that genomic regions which were 293 assigned to older atomic intervals for smaller values of -t were assigned to relatively recent 294 atomic intervals with an increase in the -t parameter (see Fig. 4b). This redistribution of 295 regions with increasing values of -t can be better understood by looking at changes in the 296 distribution of lengths of genomic regions assigned to each atomic interval (see Fig. S5). For 297 instance, in the case of Mesua ferrea the length of older atomic intervals tends to decrease 298 with increasing values of -t. Genomic regions contributing to the older atomic intervals at 299 higher values of the -t parameter become shorter and highly heterozygous. We ensured that 300 such short high heterozygosity regions are not merely variant calling artefacts by visualising 301 the atomic intervals assigned to genomic regions along scaffolds with associated 302 heterozygosity and callability at these regions (see Fig. 4b). 303

305
The initial theta/rho ratio 306 The ratio of heterozygosity to recombination rate generally does not have an impact over the 307 trajectory inferred using PSMC, but an increase in the value of -r does increases the number 308 of recombination events and achieves convergence faster. We used the same -p and -t values 309 (i.e., -t 65 -p "3+2*17+15*1+1*12") with different values of -r. With decreasing values of -r, 310 the Ne trajectory of Mesua ferrea extended further back in time (i.e., beyond 400KYA; see 311 we found that a re-sequencing dataset labeled as Mesua ferrea sampled from Yunnan, China 321 (see Table S2) was available for download. Surprisingly, the demographic trajectory inferred 322 using this dataset extended back in time with -t of 5 (optimised with -r 5 -p "4+25*2+5*2") 323 and gave a different inference (see Fig. S7). However, we found that the sequencing for the 324 sample from Yunnan had been performed using the BGISEQ-500 platform. In order to rule 325 out the possibility of sequencing platform-specific technical issues, we compared the 326 demographic trajectory of the Human individual NA12878 sequenced using BGISEQ-500 327 (see Table S2 for dataset details) with the trajectory obtained for the same individual when 328 the sequencing was performed using Illumina platform (see Fig. S8). We did not find any 329 differences in the Ne trajectories estimated using BGISEQ-500 and the Illumina platform. Using genomes and re-sequencing datasets of several species, we here explore how genome 377 quality (contiguity, prevalence of gaps, assembly completeness), repeat abundance, 378 technological heterogeneity (sequencing platform used, parameter settings optimised) and 379 biological factors (changes in genome size) can affect the inference of demographic history. 380 The scripts used for our analysis are implemented in the form of a re-usable pipeline with 381 detailed documentation. We are thus confident that these quality control strategies and the 382 associated pipeline will prove useful while comparing demographic trajectories between 383 species to obtain insights into the underlying processes. The following paragraphs highlight 384 our major findings and their relevance with respect to previous studies. 385 386

Genome quality 387
We demonstrate that the estimates of Ne for the same sequencing dataset can be drastically 388 different when the quality of the reference genome assembly changes. A recent study by 389 Certain historical time periods might be of greater importance due to specific climatic 441 or geological events that have occurred in that time frame. Hence, it can be desirable to have 442 a greater resolution while estimating the effective population size histories during these time 443 periods. While increasing the number of free intervals in a particular time period will increase 444 the resolution it can also lead to certain atomic intervals having too few recombination 445 events. Hence, the atomic intervals need to be distributed such that each atomic interval has 446 more than 10 recombination events after the 20th iteration. We hope that our results provide 447 some clarity regarding the strategies to be used while choosing the -p parameter. Future studies that plan to compare datasets originating from these technologies are likely to 474 benefit from our comparative analysis. Parameter settings used for running PSMC have also 475 been investigated in considerable detail with guidelines for optimal use of parameters. 476 Having investigated various sources of technical variability and parameter settings, we 477 consider the genomic datasets of 15 forest trees and compare their demographic histories as a 478 case study. We not only provide guidelines for performing filtering of genomic datasets but 479 also develop the CoalQC pipeline that we hope will become a standard part of quality control 480 prior to demographic analysis using whole-genome datasets. Our genome assembly length of 614.35 MB is slightly less than but comparable to previous 502 estimates based on flow cytometry. To assess the quality of the assembled genome, we 503 employed multiple quality assessment methods. While N50, N75, etc. are accepted metrics of 504 genome contiguity, the number of genes assembled in the assembly serves as a metric of 505 assembly completeness. The amount of repeats assembled is also one of the relevant metrics 506 which informs about the quality of the assembly in low-complexity regions. The Quast 507  program was used to calculate assembly statistics i.e. N50, N75, 508 Number of N´s per 100 KB, etc. (see Table S3). BUSCO (Benchmarking Universal Single-509 Copy Orthologues) (Simão et al. 2015) was used to assess the completeness of the assembly, 510 using eudicotyledons_odb10 (see Table S4 and S12) and embryophyta_odb10 (see Table S5) 511 dataset together with previously sequenced genomes of order Malpighiales. LTR_retriever's 512 LAI module was used to determine assembly quality based on the LAI (LTR Assembly 513 Index) score which assesses repeat content assembled (see Table S6).  Table S7). 525 The raw sequencing read data was used to separately assemble the chloroplast 526 genome using the NOVOPlasty (Dierckxsens et al. 2017) program. The Maturase K gene 527 sequence of Mesua ferrea was used as a seed sequence and Garcinia mangostana chloroplast 528 genome was used as a reference. The assembler uses seed sequence to find reads that cover 529 this sequence and starts overlapped sequence assembly. The assembled chloroplast genome 530 had two sets of contigs. The orientation of the contigs was determined by dot-plot analyses 531 (see

Datasets used 542
To demonstrate the utility of our quality control pipeline and generality of our observations 543 across diverse taxa we used species from different phyla such as nematodes, plants, 544 vertebrates, etc. The published genome assemblies used as reference genomes were 545 downloaded from NCBI/UCSC genome browser (details provided in Table S8). We searched 546 the European Nucleotide Archive (ENA) for genomic sequencing datasets and downloaded 547 those datasets that had >20X coverage (see Table S2). The raw read datasets were mapped to 548 corresponding unmasked genomes using the short-read aligner BWA-MEM (Li 2013) with 549 default settings. 550 551

Genome assembly quality comparison 552
The latest version of the Human genome assembly, i.e., hg38, was downloaded from 553 Ensembl, whereas previous assemblies i.e. hg19, hg15, hg10, and hg4 were downloaded from 554 UCSC genome browser (see Table S8). The genome assembly statistics i.e. N50 statistics and 555 Number of N's per 100 Kb were calculated using Quast. SAMTOOLS flagstat module was 556 used to get mapping percentages for each assembly using mapped alignments of each 557 assembly. To get assembly completeness statistics, BUSCO was used with dataset 558 Mammalia_odb9 on each of the assemblies. Genome assembly quality was also assessed for 559 red flour beetle (Tribolium castaneum) and zebrafish (Danio rerio) genomes. These 560 comparative assembly quality statistics are available in Table S1. were set as -t 5, -r 5, -p "4+25*2+4+6". The output was evaluated to see if a sufficient 569 number of recombination events had occurred in each atomic interval. If there were some 570 atomic intervals that did not have at least ten recombination events after the 20th iteration, 571 then the -p parameters were modified. For example, if -p parameter "4+25*2+4+6" is set, it 572 has 64 atomic intervals distributed across 28 free intervals i.e. (1+25+ 1+ 1). Changing the 573 distribution of atomic intervals across free intervals, e.g., "8+25*2+2+4" would be the first 574 step to see if enough recombination events are obtained. If not, then changing the free 575 intervals, i.e., "3*2+1*10+15*2+1*14+1*4" (written as free intervals * atomic intervals) 576 was done to get sufficient recombination events. Only after obtaining a sufficient number of 577 recombination events, the -p parameter was finalised. PSMC was run with the -d (decode) 578 option for identifying genomic regions that contributed to each atomic interval. After 579 obtaining the PSMC output, an appropriate mutation rate and generation time were used to 580 generate the scaled plot using the psmc_plot.pl script. For bootstrapping analyses, the psmcfa 581 file was first split into equal lengths of 5 MB, and was used for 100 runs of PSMC.  Table S2 and S8 for details). PSMC analysis was performed on 626 each of these species with default parameters i.e. -t5 -r5 -p "4+25*2+4+6". A mutation rate 627 estimate of 2.5e-09 per site per year which has been used for Populus trichocarpa in an 628 earlier study (Bai et al. 2018) was used for all the species. Generation time for each species 629 was obtained through a literature search. For each species per generation, mutation rates were 630 obtained using corresponding generation times (Table S9).  Whereas, tropically distributed species have a common trend of decline during and after 1103 Mid-Pleistocene glaciations. Some of the species such as Faidherbia albida were able to 1104 recover from these bottlenecks, which translates into their adaptation to dryer 1105 environments but most of the other plants were not able to recover from the same. were not supported by any read), Low-coverage (Bases in the genome which showed  1131 small support of reads compared to mean) and Poorly-mapped (Bases in the genome  1132 which showed poor mapping quality of reads). All of these classes were masked one at a 1133 time and the effect of each non-callable group was evaluated. After that all the non-1134 callable groups were merged as a non-callable class and they were masked followed by 1135 another