A new method for determining ribosomal DNA copy number shows differences between Saccharomyces cerevisiae populations

Ribosomal DNA genes (rDNA) encode the major ribosomal RNAs (rRNA) and in eukaryotic genomes are typically present as one or more arrays of tandem repeats. Species have characteristic rDNA copy numbers, ranging from tens to thousands of copies, with the number thought to be redundant for rRNA production. However, the tandem rDNA repeats are prone to recombination-mediated changes in copy number, resulting in substantial intra-species copy number variation. There is growing evidence that these copy number differences can have phenotypic consequences. However, we lack a comprehensive understanding of what determines rDNA copy number, how it evolves, and what the consequences are, in part because of difficulties in quantifying copy number. Here, we developed a genomic sequence read approach that estimates rDNA copy number from the modal coverage of the rDNA and whole genome to help overcome limitations in quantifying copy number with existing mean coverage-based approaches. We validated our method using strains of the yeast Saccharomyces cerevisiae with previously-determined rDNA copy numbers, and then applied our pipeline to investigate rDNA copy number in a global sample of 788 yeast isolates. We found that wild yeast have a mean copy number of 92, consistent with what is reported for other fungi but much lower than in laboratory strains. We show that different populations have different rDNA copy numbers. These differences can partially be explained by phylogeny, but other factors such as environment are also likely to contribute to population differences in copy number. Our results demonstrate the utility of the modal coverage method, and highlight the high level of rDNA copy number variation within and between populations.

between repeats resulting in reads from all rDNA copies mapping to a single reference rDNA unit, 85 thus providing a high coverage signal that is proportional to copy number. Existing bioinformatic 86 approaches calculate the mean rDNA read coverage and normalize to the mean WG coverage to 87 estimate copy number [5,12,25,34,49], thus assuming that mean coverage represents the "true Here we present a bioinformatics pipeline that measures rDNA copy number using modal (most 98 frequent) NGS read coverage as a way to overcome the limitations of the mean coverage 99 bioinformatics approach. We assessed the parameters important for performance and validated the 100 pipeline using S. cerevisiae strains with known rDNA copy numbers. We then employed our 101 pipeline to investigate whether S. cerevisiae populations maintain different homeostatic rDNA 102 copy numbers.  Coverage files for the whole genome and rDNA were obtained using a four step pipeline:

148
Step-1: Processed reads were mapped to the indexed W303-rDNA genome using bowtie2 (v. Step-2: The subsequent SAM format alignment was converted to BAM format using the Step-3: Mapped reads in the BAM file were sorted according to the location they mapped to in 156 W303-rDNA using the SAMtools sort command:  Step-4: Per-base read coverages across the entire W303-rDNA genome and the rDNA were The pipeline for modal calculation of rDNA copy number from an alignment of sequence reads 178 to a reference genome containing one rDNA copy is available through Github 179 (https://github.com/diksha1621/rDNA-copy-number-pipeline). and 1/50 th for the whole genome, and a window size of 600 bp for both estimates, were used.

244
Violin plots were plotted using the ggplot() package in R. approach is that modal coverage will provide an estimate of the relative coverage representation 286 of a given region in a genome that is more robust to biases away from normality than the mean or  the rDNA copy number means and ranges. We found that, as expected, the accuracy and precision 324 (defined here as similarity to known copy number and copy number range, respectively) of the 325 pipeline was poorer at lower coverage levels, while larger sliding window sizes could compensate 326 for a lack of reads to improve both measures (Fig 3). Coverage levels above 10-fold with a sliding 327 window size between 500-800 bp produced accurate rDNA estimates. However, our method also 328 demonstrated adequate performance even with a coverage level of 5-fold, when the sliding window 329 was 600-700 bp (Fig 3). We found that the method works similarly when just using the rRNA 330 coding region (Supplementary Figure 3) rather than the full repeat, which is important as the full 331 rDNA unit sequence is often not available. We also examined the performance of median 332 coverage, but found that while it had greater precision compared to the modal coverage approach, 333 the accuracy was poorer (Supplementary Figure 4). Given the rapid rate at which copy number  while similar to the reported copy numbers for these strains, are not identical. Therefore, to check 352 the actual copy numbers of these strains, and to provide a direct validation of our modal pipeline 353 method, we next experimentally determined the rDNA copy numbers of these strains.

355
We chose ddPCR to experimentally determine rDNA copy number because it is less sensitive than 356 qPCR to biases in secondary structure regions that are common in the rDNA coding region [22].

357
The ddPCR data showed that the rDNA copy numbers of our strains are similar to those calculated 358 by our modal coverage method, with both methods suggesting that the "80-copy" strain actually 359 has substantially fewer copies than reported (Supplementary Table 2; Supplementary Figure   360 5A), perhaps due to a stochastic change in copy number that has occurred in our version of this 361 strain. We also compared our modal coverage approach with the mean coverage calculated from 362 the same datasets. We used a simple mean calculation to match the implementation of our modal 363 approach, using the same down-sampled 10-fold WG coverage datasets. The copy number 364 estimates made using the mean coverage approach were uniformly lower than the other estimates 365 (Supplementary Table 2), which we suggest is the result of sequencing biases against regions in 366 the rRNA coding region. Importantly, correlating read coverage and ddPCR copy number 367 estimates showed the modal coverage slope was closer to the expected value of 1 than the mean 368 coverage slope (Fig 4). We also estimated the copy number using pulsed field gel electrophoresis  all reads map to a single repeat copy and the sequence is known, such as mitochondrial and 396 chloroplast genome copy numbers. Given its strong performance, we applied our method to We obtained WG sequence data for 788 isolates from the 1002 Yeast Genome project. Reads for 409 each isolate were downsampled to 10X genome coverage, mapped to our W303-rDNA reference 410 genome, and rDNA copy numbers estimated using our modal coverage pipeline. The rDNA copy 411 numbers ranged between 22-227 (x ̅ = 92) across the 788 isolates (Supplementary Table 3). The  Table 4). 414 However, the copy number we estimate are, in general, much lower than those (~150-200) to be a property of these isolates.

423
The difference in copy number between lab and wild S. cerevisiae isolates suggests that S.     of copy number ranges (Fig 7). These results suggest that S. cerevisiae rDNA copy number is no 511 more variable than that of S. brevipes at least, and illustrate the tremendous variation in rDNA 512 copy number that is likely to be present in many eukaryotic species.