Estimating repeat spectra and genome length from low-coverage genome skims with RESPECT

The cost of sequencing a genome is dropping much faster than the cost of assembling and finishing it. Lightly sampled genomes (genome-skims) could be transformative for genomic ecology, and k-mer-based results have shown the advantage of this approach for identification and phylogenetic placement of eukaryotic species. Here, we revisit the basic question of estimating genomic parameters such as genome length, coverage, and repeat structure, focusing specifically on estimating the k-mer repeat spectrum. Using a mix of theoretical and empirical analysis, we show that there are fundamental limitations to estimating the k-mer spectra due to ill-conditioned systems, with implications for other genomic parameters. We get around this problem with a novel constrained optimization approach (Spline Linear Programming), in which the constraints are learned empirically. On reads simulated at 1X coverage from 66 genomes, our method, REPeat SPECTra Estimation (RESPECT), had 2.2% error in length estimation compared to the 27% error previously achieved. On shotgun-sequenced read samples with contaminants, RESPECT length estimates had a median error of 4%, in contrast to other methods with a median error of 80%. Together, the results suggest that low-pass genomic sequencing can yield reliable estimates of the length and repeat content of the genome. The RESPECT software will be publicly available at https://github.com/shahab-sarmashghi/RESPECT.git.

where r_h denotes the number of distinct k-mers that appear exactly h times in the genome. As the value of o_h depends on r, c, L, and also on sequencing error, we consider the inverse problem of estimating genomic parameters given o as input. The problem was studied in a seminal paper by Li and Waterman [17], who mostly considered the case of high coverage and no sequencing errors. Williams et al. [18] improved upon this model by ignoring o_1, assuming that a large proportion of unique k-mers can be attributed to sequencing errors. This assumption works better for high coverage, because at low coverage many informative k-mers are also seen only once. Hozza et al. [19] point this out and focus attention on k-mer spectra. Their method, CovEst, models spectra using a geometric distribution with unknown parameters, uses that parameterized model to estimate both the parameters and r_1, r_2, r_3, and improves estimates even for low coverage and high error.
A distinct but related line of research relates to estimating o itself by sub-sampling or streaming reads. Melsted and colleagues [20,21] describe streaming algorithms to estimate o_1 as well as the moments F_k = ∑_i i^k o_i. Interestingly, these moments can also be used to estimate genome parameters. For example, E[F_1] = λL, where λ = (1 − (k − 1)/ℓ)c denotes the k-mer coverage, or the average number of k-mers covering a position, derived from reads of length ℓ. We note that streaming is akin to low-coverage sampling, and we consider the case of estimating parameters over a range of λ.

Estimating genome repetitiveness and other parameters using k-mers
While previous research has emphasized the estimation of genome length and coverage, we focus specifically on estimating the k-mer spectrum r, defined below. Consider a genome of length L. Decompose the genome into a collection of all fixed-length (overlapping) sequences of length k, called k-mers. Let variables r_j (j ≥ 1) denote the number of k-mers that occur exactly j times in the genome. When k is large enough (k ≫ log_4 L), high values of r_j for j ≥ 2 can be attributed to the repetitive structures in the genome rather than chance similarities. Therefore, we define r = [r_1, r_2, ...] as the (k-mer) repeat spectrum of the genome.
While repetitive sequences occur in a variety of arrangements in terms of their multiplicity, complexity, and the size of the repeating unit, the repeat spectrum provides a valuable summary of the extent of repetition in the genome as well as other parameters. For example, the genome length can be estimated as L = k − 1 + ∑_j j·r_j ≈ ∑_j j·r_j. Define the uniqueness ratio of a genome as r_1/L, the ratio of the number of k-mers seen only once to the genome length (which is approximately the total number of k-mers in the genome). We computed the uniqueness ratio for 622 eukaryotic genomes in RefSeq using k = 31 (S1 Fig). The ratio revealed a broad spectrum of values, ranging from 0.287 for A. tauschii (Tausch's goatgrass) to 0.995 for a mite species, V. jacobsoni (Fig 1A). As expected, there is some phylogenetic correlation, and the variation of the uniqueness ratio within a genus (intra-generic) is significantly lower than the inter-generic variation (S2 Fig). At higher taxonomic ranks, we observed that plants had a significantly lower uniqueness ratio compared to other groups (Fig 1B), consistent with a prevalence of whole-genome duplication (WGD) events (see Methods). Nevertheless, the correlation is not strong enough to predict uniqueness ratios solely from taxonomy. For example, the rice species O. sativa and O. brachyantha have different ratios, 0.91 and 0.75, respectively.
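The identity L = k − 1 + ∑_j j·r_j and the uniqueness ratio can be computed directly from an assembled sequence. A minimal sketch (toy sequence and k = 3 for readability, whereas the analysis above uses k = 31; reverse complements and ambiguous bases are ignored here):

```python
from collections import Counter

def repeat_spectrum(genome: str, k: int) -> Counter:
    """r[j] = number of distinct k-mers occurring exactly j times in genome."""
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    return Counter(counts.values())              # multiplicity -> #distinct k-mers

genome = "ACGTACGTTT"                            # toy example
k = 3
r = repeat_spectrum(genome, k)
total_kmers = sum(j * rj for j, rj in r.items())  # equals L - k + 1
L = k - 1 + total_kmers                          # exact identity from the text
uniqueness_ratio = r[1] / L                      # r_1 / L, as defined above
```

On real assemblies the Counter would be replaced by a disk-backed k-mer counter, but the identities are the same.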
The repeat spectrum provides other insights. In genomes composed largely of unique sequences, r_1/L ≈ 1 and the r_j values decrease rapidly for j ≥ 2, with log r_1/r_5 ≥ 4.5 (Fig 1C). On the other hand, genomes with higher repetitive content show a smoother decrease of r_j values (Fig 1D), with log r_1/r_5 ≤ 2.5 (S3 Fig). Additionally, a genome that has duplicated very recently will have r_1 ≈ 0 and a very high value of r_2. Over time, however, r_1 increases due to the accumulation of mutations. Similarly, r_j > 0 for large values of j suggests the presence of interspersed repeats.
Our method, RESPECT (REPeat SPECTra estimation), derives genomic length and coverage from low-coverage genome skims, while also providing insight into the repeat structure. We showed, through a mix of theoretical reasoning and empirical evidence, that the k-mer repeat spectra estimation problem is fundamentally difficult because of severe ill-conditioning of the system. In fact, the spectra are hard to estimate even when the coverage and sequencing error rate are known. We resolve this problem for the case of known coverage and sequencing error by imposing constraints on r_h and solving a constrained optimization problem. This approach provides greatly improved estimates of r, which in turn lead to even better estimation of coverage, genome length, and sequencing error through a stochastic iteration method. Results on genomes sampled from different parts of the tree of life and with differing repeat structures illustrate the validity of our approach.
Fig 1. Characterizing repeats at the k-mer level. A: RefSeq plant taxonomy. The species are color-coded based on the uniqueness ratio, from red (highly repetitive) to blue (non-repetitive). B: Uniqueness ratio distribution among the four major taxonomic groups of eukaryotes in RefSeq. Plants (green) have significantly lower r_1/L compared to invertebrates (pink), mammals (yellow), and other vertebrates (blue). The p-values shown on the figure are the result of statistical tests that the uniqueness ratio is lower among plants than among the other groups. Also, to understand the extent of the difference, we tested whether the ratios are lower among plants by an X% margin; the results are p-value = 1.1 × 10^−6 (5% margin), 4.3 × 10^−6 (10% margin), and 4.2 × 10^−6 (10% margin) when comparing plants against invertebrates, mammals, and other vertebrates, respectively. C: Dot-plot of the V. jacobsoni genome's (self-)alignment, with very few off-diagonal points and a rapidly decaying repeat spectrum (r_1/L = 0.99). D: Dot-plot of D. citri's highly repetitive genome, marked by many off-diagonal elements and a smoothly decreasing repeat spectrum (r_1/L = 0.51). https://doi.org/10.1371/journal.pcbi.1009449.g001

A simple model for estimating repeat spectra from unassembled data performs poorly
Assume that reads in the genome-skim are sequenced with a fixed mean error rate of ε per bp, and that the read start positions follow a Poisson distribution with mean coverage λ per bp. Denote the observed k-mer data as the vector o = [o_1, o_2, ...], where o_h is the number of k-mers observed exactly h times in the genome-skim input. The value o_h is the outcome of a random variable O_h that depends on the parameter set F = {λ, ε, r} (see Methods: 'Modeling genomic parameters'). Specifically, we assume that each k-mer with copy number j in the genome is sampled h times according to a Poisson distribution with rate dependent on k and F. Let P_hj represent the probability of h observations of a k-mer with copy number j. Then, in expectation, E[O_h] = ∑_j r_j P_hj + 1_{h=1} E (Eq 1), where E is the expected number of erroneous k-mers, which in turn depends on F. F could then be estimated by minimizing the discrepancy ‖o − (rP^T + 1_{h=1} E)‖ (Eq 2). In principle, an iterative procedure could be used to solve the optimization: we start with initial estimates of λ and ε and use them to compute P and E; then, we can use the least-squares (LS) method to find the r that minimizes ‖o − (rP^T + 1_{h=1} E)‖ (Eq 2) (see Methods: 'Least-squares estimate of repeat spectrum').
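The expected-count model of Eq 1 can be sketched directly; `expected_counts` and `poisson_pmf` are illustrative names, and the error term uses the formula E = Lλ(1 − (1 − ε)^k) given later in Methods:

```python
import math

def poisson_pmf(h: int, rate: float) -> float:
    return math.exp(-rate) * rate ** h / math.factorial(h)

def expected_counts(r, lam, eps, k, L, n):
    """mu_h = sum_j r_j * P_hj + 1{h=1} * E   (Eq 1), for h = 1..n, where
    P_hj = Poisson(h; lam * j * (1 - eps)^k) and E = L * lam * (1 - (1 - eps)^k)
    is the expected number of erroneous k-mers."""
    err_kmers = L * lam * (1.0 - (1.0 - eps) ** k)
    mu = []
    for h in range(1, n + 1):
        signal = sum(rj * poisson_pmf(h, lam * j * (1.0 - eps) ** k)
                     for j, rj in enumerate(r, start=1))
        mu.append(signal + (err_kmers if h == 1 else 0.0))
    return mu

# repeat-free sanity check: with r = [L] and no errors, mu_h = L * Poisson(h; lam)
mu = expected_counts(r=[1000], lam=1.0, eps=0.0, k=31, L=1000, n=3)
```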
To study the accuracy of this model for repeat spectra estimation, we simulated genome skims at 1X coverage with no sequencing errors (E = 0) for all 622 RefSeq genomes in the four major taxonomic groups of eukaryotes. A subset of 66 species was selected as the test set, sampled such that their uniqueness ratios (r_1/L) covered the observed range. For a baseline test, we assumed that the coverage λ was known, so that r could be estimated by minimizing ‖o − rP^T‖_2 (Eq 2). Using an LS solver (see Methods: 'Least-squares estimate of repeat spectrum'), we obtained highly accurate estimates of r_1 on the test data (Fig 2A; LS method). However, even in this simple case, with perfect knowledge of coverage and no sequencing error, the error in estimating r_j increased rapidly with increasing j, as the LS solution was often sparse and the estimation set r_j = 0 for many j's, contrary to their true values in the genome.
Empirical and theoretical results showed that the poor performance could be attributed to severe ill-conditioning. We proved that the condition number of P grows exponentially with the number of spectra (see S1 Appendix). Therefore, small changes in o relative to E[O] (Eq 1), for example due to sampling variability or the simplifying assumptions of the model, led to very large errors in the estimates of r.
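The growth of the condition number is easy to observe numerically. A small sketch, assuming the error-free Poisson observation matrix P with λ = 1 (the construction mirrors the model above):

```python
import math
import numpy as np

def build_P(n: int, lam: float = 1.0) -> np.ndarray:
    """P[h-1, j-1] = Poisson(h; lam * j): error-free observation matrix."""
    return np.array([[math.exp(-lam * j) * (lam * j) ** h / math.factorial(h)
                      for j in range(1, n + 1)] for h in range(1, n + 1)])

# condition number explodes as the number of spectra n grows
conds = {n: np.linalg.cond(build_P(n)) for n in (5, 10, 20)}
```

Even modest truncation dimensions already yield condition numbers at which sampling-level perturbations of o swamp the small singular directions of P.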

Overview of RESPECT algorithm
This negative result suggested a fundamental limitation to the use of k-mer based methods for estimating repeat spectra. Regularization is a standard remedy for ill-conditioned systems. However, most regularization methods enforce sparsity, and r is known not to be sparse. A second challenge is that both the observed counts and the k-mer spectra are heavily skewed towards lower indices. Thus, a small (even 1%) relative error in r_1 could lead to a much larger error in r_j for j > 1.
To get around the ill-conditioning problem, we focused on constraining the possible values of r. We observed empirically that the ratio of consecutive spectral values r_{j+1}/r_j was tightly constrained across all taxonomic groups. A similar, albeit less tight, constraint was observed for r_3/r_2 (Fig 2C) and other consecutive ratios as well (S5, S6 and S7 Figs). These observations provided the basis of a constrained linear program for estimating r. As a first step, we added the constraint L_j ≤ r_j/r_{j+1} ≤ U_j for each j, where L_j and U_j are the smallest and largest r_j/r_{j+1} ratios over the training genomes, and solved the resulting LP to find r (see Methods: 'Linear programming for constrained optimization based estimates'). This approach significantly improved the average error in estimating the spectra at multiplicity j = 3 and higher (Fig 2A; LP method), and resulted in a small improvement at j = 1, 2 as well.
Fig 2. The genome-skims are simulated at 1X with no sequencing error. B: Correlation between the true r_2/r_1 ratios and our estimates of r_1/∑_{i≥1} r_i for each genome. C: A similar correlation plot between the true r_3/r_2 and the estimated r_2/∑_{i≥2} r_i. In both B and C, the true spectral ratios on the Y axis are computed from the assemblies, and the estimated indices on the X axis are obtained by applying the LP method to the simulated skims described in A. https://doi.org/10.1371/journal.pcbi.1009449.g002
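A sketch of such a constrained LP, using SciPy's `linprog` as a stand-in for the solver (the paper uses Gurobi) and the standard slack-variable reduction of the L1 objective; the toy spectrum and ratio bounds are illustrative:

```python
import math
import numpy as np
from scipy.optimize import linprog

def build_P(n, lam):
    """P[h-1, j-1] = Poisson prob. that a copy-j k-mer is seen h times."""
    return np.array([[math.exp(-lam * j) * (lam * j) ** h / math.factorial(h)
                      for j in range(1, n + 1)] for h in range(1, n + 1)])

def lp_spectrum(o, P, lower, upper):
    """Minimize sum_{h>=2} |o_h - (P r)_h| subject to L_j <= r_j/r_{j+1} <= U_j
    and r >= 0, via the L1 -> LP reduction with slack variables t_h."""
    n = len(o)
    nt = n - 1                                   # slacks for bins 2..n (w_1 = 0)
    c = np.concatenate([np.zeros(n), np.ones(nt)])
    A_ub, b_ub = [], []
    for i, h in enumerate(range(1, n)):          # rows for bins 2..n
        row = np.concatenate([P[h], np.zeros(nt)]); row[n + i] = -1.0
        A_ub.append(row); b_ub.append(o[h])      #  (P r)_h - t_h <= o_h
        row = np.concatenate([-P[h], np.zeros(nt)]); row[n + i] = -1.0
        A_ub.append(row); b_ub.append(-o[h])     # -(P r)_h - t_h <= -o_h
    for j in range(n - 1):                       # ratio constraints
        row = np.zeros(n + nt); row[j] = -1.0; row[j + 1] = lower[j]
        A_ub.append(row); b_ub.append(0.0)       # r_j >= L_j * r_{j+1}
        row = np.zeros(n + nt); row[j] = 1.0; row[j + 1] = -upper[j]
        A_ub.append(row); b_ub.append(0.0)       # r_j <= U_j * r_{j+1}
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub), bounds=(0, None))
    return res.x[:n]

# toy recovery: o generated from a known spectrum at known k-mer coverage
r_true = np.array([1000.0, 100.0, 10.0, 2.0])    # ratios all within [2, 20]
P = build_P(4, 2.0)
o = P @ r_true                                   # noiseless expected histogram
r_est = lp_spectrum(o, P, lower=[2.0] * 3, upper=[20.0] * 3)
```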
Using the repeat spectra from the 556 training genomes, we observed a strong correlation between r_2/r_1 and r_1/∑_{i≥1} r_i (Fig 2B). Therefore, we estimated r_2/r_1 using the LP estimate of r_1/∑_{i≥1} r_i and a spline fitted on the training data based on a generalized additive model [22,23] (see Methods: 'Spline Linear programming'). The estimated r_2/r_1 value and the LP-estimated r_1 value provided a new estimate (named SLP) of r_2. In a similar fashion, we computed SLP estimates of r_{j+1} from the LP estimates of r_j and r_j/∑_{i≥j} r_i for j = 2, 3, 4, 5 (Fig 2C, S5, S6 and S7 Figs, and Methods: 'Spline Linear programming'). Using this additional information, learned from the training genomes and captured by the fitted splines, we obtained a significant reduction in the average error of repeat spectra estimation (SLP vs. LP in Fig 2A). To solve the full optimization problem in Eq 2, we used a simulated annealing procedure. Specifically, starting with initial estimates of the parameters obtained under a no-repeat assumption, at each iteration a new value for λ is proposed, and the SLP method is used to estimate r. If a candidate λ results in a reduction in error, the algorithm accepts the move. Moreover, to avoid getting stuck at local minima, moves to states with higher error are occasionally accepted as well. Lastly, the initial estimate of ε is corrected for the repetitiveness of the genome using a regression learned over a subset of training genomes (S8 Fig). The algorithm is outlined below (also see Methods: 'RESPECT algorithm' for a detailed description).

Algorithm 1:
1. Compute initial estimates of λ and ε under the no-repeat assumption.
2. Compute the initial values of P and the error function E.
3. For t = 1, ..., N repeat:
3.1. Choose λ_next randomly within a neighborhood of the current λ, and compute P_next.
3.2. Solve for r_next using the SLP method.
3.3. Use P_next and r_next to compute E_next.
4. Correct the initial estimate of ε, and update λ.
5. Output c = λℓ/(ℓ − k + 1), r, L = B/c, and ε at the end of the iterations (B is the total number of nucleotides sequenced).
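The accept/reject logic of steps 3.1-3.3 can be sketched as a generic simulated-annealing loop; the 1/t cooling schedule, the proposal width, and the quadratic toy error surface are our assumptions, and the SLP solve is abstracted into `error_fn`:

```python
import math
import random

def simulated_annealing(error_fn, lam0, n_iter=200, temp0=1.0, seed=0):
    """Propose lambda near the current value, accept improving moves, and
    occasionally accept uphill moves to escape local minima (1/t cooling)."""
    rng = random.Random(seed)
    lam, err = lam0, error_fn(lam0)
    best_lam, best_err = lam, err
    for t in range(1, n_iter + 1):
        lam_next = lam * rng.uniform(0.8, 1.25)   # neighborhood proposal
        err_next = error_fn(lam_next)             # the SLP solve would go here
        temp = temp0 / t                          # cooling schedule (assumed)
        if err_next < err or rng.random() < math.exp((err - err_next) / temp):
            lam, err = lam_next, err_next
            if err < best_err:
                best_lam, best_err = lam, err
    return best_lam, best_err

# toy error surface with a minimum at lambda = 1.0, standing in for Eq 6
best_lam, best_err = simulated_annealing(lambda x: (x - 1.0) ** 2, lam0=3.0)
```

In RESPECT itself, `error_fn(lam)` would build P from λ, run the SLP method to get r, and return the Eq 6 error.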

Estimating genome lengths
We applied RESPECT and CovEst to simulated genome-skims (Illumina reads sampled from the 66 test genomes at 1X coverage with a 1% sequencing-error rate) and compared their relative errors in the estimation of r_1 through r_5 and genome length (Fig 3), after convergence (see S9-S14 Figs for the convergence of RESPECT's estimates). The median RESPECT error in estimating r_1 was less than 1.5% (average: 2.9%), while the median error of CovEst was 15% (average: 34%). The error profile extended to higher multiplicities, where, as noted earlier, CovEst used a parametric model. The tight relation between r_1 and r_2 and the large absolute difference between the two values implied that a small error in r_1 would translate into a large relative error for r_2, and we indeed observed this for r_2. Similarly, the RESPECT estimates of genome length were highly accurate, with a median error of 2.2% (average: 4.1%), in contrast to 27% (average: 40%) for CovEst (Fig 3B). RESPECT estimates were better than CovEst's in 62 out of 66 species, often by considerable margins (Fig 3C). For example, in 54/66 species RESPECT's error was less than 5%, while CovEst's error exceeded 50% in one third of the test genomes. In fact, CovEst severely underestimates the length for these genomes (S15 Fig).
For 18/66 test genomes, the CovEst estimate was less than the true length by a factor of 4 or more (S16 Fig). RESPECT relies on models trained on available assemblies, so we tested whether its performance depended on the amount and taxonomic composition of the training data. RESPECT's performance remained robust in these scenarios (S17(A) Fig). Moreover, its performance improved slightly (fewer outliers) with additional training data (S17(B) Fig).
We repeated the same experiment at sequencing coverages of 0.5X, 2X, and 4X (S18 Fig). At 0.5X coverage, the median error of RESPECT was 16% (average: 18%), while CovEst had 88% median error (average: 75%) and underestimated the length by a factor of 8 or more in half of the species (S19 Fig). We also compared the performance of RESPECT among different taxonomic groups. In general, plants and invertebrates had higher error rates compared to both vertebrate groups (S22 Fig), consistent with their lower uniqueness ratios (Fig 1B). In fact, we observed a statistically significant negative correlation between the estimation error and the uniqueness ratio (S23 Fig). We additionally tested RESPECT on simulated genome-skims at 1X coverage from 10 bacterial genomes, and the results did not suggest any bias against prokaryotic genomes (S24 Fig), despite the fact that we trained our model on eukaryotic genomes.

Estimating genome length using sequenced short reads
A key difference between sequenced and simulated reads is the presence of 'contaminants', i.e., reads from non-target species. Differences may also include the presence of adapter sequences, duplication of reads by the sequencing platform, lower or higher sequencing error rates due to DNA quality, and length variation of reads. Therefore, we tested RESPECT on genome-skims obtained from NCBI's Sequence Read Archive (SRA) database [24]. We downloaded high-coverage raw reads for 29 test species (from all four major taxonomic groups of eukaryotes in RefSeq), including highly repetitive plant genomes, and compared the results with the corresponding genome assemblies of the same species. After preprocessing the raw reads using BBTools [25] to remove adapter sequences and duplicate reads, we used Kraken [26] to remove contaminant reads of microbial or human origin (see Methods: 'SRA preprocessing and contamination filtering'). We note that this is an imperfect process, as these tools work only when the contaminating organisms have a highly related member in the reference databases [27]. We discarded 10 samples because > 40% of reads (after removing adapters) were either duplicates of other reads or came from external DNA sources (Table A in S1 Appendix). For the remaining 19 samples, duplicates and reads classified as contaminants were removed, and unclassified reads were sub-sampled to 1X coverage. In 16 out of 19 samples, RESPECT's error was less than 11% (median: 4%), including highly repetitive genomes such as A. tauschii (r_1/L = 0.29), Z. mays (maize) (r_1/L = 0.32), S. salar (salmon) (r_1/L = 0.48), and N. tabacum (r_1/L = 0.57), where the abundance of repeats made length estimation challenging (Fig 4, Table 1). In contrast, CovEst had less than 30% error in only 4 samples (median error: 80%) (Fig 4).
For the highly repetitive genomes, CovEst's length estimates ranged from 1/11 to 1/7 of the assembled sequence lengths, a 10 to 30 times larger error compared to RESPECT (see Table 1). In 3 samples, RESPECT had relatively high errors. For SRR085103 (domestic ferret), 99.9% of the reads did not in fact map to the available reference assembly of the domestic ferret M. putorius. Together with the relatively low percentage of duplication (9%), the data suggest a mislabeling of the sample species. For Coquerel's sifaka (P. coquereli), we observed a large gap between the total sequence length (2.8 Gbp) and the total ungapped length (2.1 Gbp) of the assembly, suggesting some challenges with the assembly. The Cape elephant shrew (E. edwardii) was the last sample where the RESPECT length estimate of 4.5 Gbp exceeded the RefSeq (GCF_000299155.1) assembly length (3.8 Gbp) by over 10%. Interestingly, the uniqueness ratio of the assembly was r_1/L = 0.72, which contrasted with the RESPECT-estimated uniqueness ratio of r_1/L = 0.65 from the short-read data. Upon investigation, we found that a more recent assembly for E. edwardii (GCA_004027355.1), not yet in RefSeq, had an assembled length of 4.3 Gbp with r_1/L = 0.66, matching the RESPECT estimates (4.5 Gbp and 0.65, respectively). The difference between the total sequence length and the ungapped length in GCA_004027355.1 was only 1 Mbp, in contrast to > 500 Mbp for GCF_000299155.1. Together, these data suggest that GCA_004027355.1 better assembles the repetitive regions, and that the RESPECT length estimation error was < 5%, despite using only 1X coverage.

The role of WGD versus high copy repeat elements in shaping genome repeat structure
Predicting polyploidy and recent WGD is challenging because mutation and gene loss after a WGD event can erode the polyploidy signal. Specifically, a WGD event initially drives the uniqueness ratio (r_1/L) to 0. Subsequently, as mutations accumulate, the r_1/L ratio moves gradually towards 1 in a process that may be specific to the species and hard to predict. Nevertheless, it should be skewed toward smaller values for recent WGD events. Independently, the presence of high copy repeats due to DNA transposons and retrotransposons can lead to very high copy numbers of a small set of oligomers. To capture the contribution of high copy repeat elements, we defined the 'High Copy Repeats per Million (HCRM)' value as the average count (per million base-pairs) of the 10 most highly repetitive k-mers. HCRM values varied across species, ranging from 2 to 3738 in our set of 622 RefSeq genomes (S25 Fig). We observed some correlation between the HCRM values of species of the same genus, especially among vertebrates (S26 Fig). However, as with uniqueness ratios, the phylogenetic signal was not pronounced enough to predict HCRM from taxonomy. Analytical calculations showed that the probability of high HCRM values (≥ 200) in a genome with a random set of k-mers is negligibly small (P ≤ 10^−100) (see Methods: 'Statistical analysis of the repeat structure'), suggesting that high HCRM values could not be explained solely by WGD events and were likely due to high copy (transposon) repeats. We compiled species with known recent WGD events (Table B in S1 Appendix). Such species expectedly had low r_1/L: for example, only 14% of species with recent WGD had r_1/L values ≥ 0.8, in contrast with 64% of all species that had r_1/L values higher than 0.8. Surprisingly, 93% of species with recent WGD also had low HCRM values (≤ 200) (Fig 5), and there was a strong association between the occurrence of recent WGD events and the (r_1/L, HCRM) values (p-value: 1.8 × 10^−23; see Methods: 'Statistical analysis of the repeat structure'). Our results suggest that genomes with low HCRM and low r_1/L are strong candidates for WGD events.
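HCRM is straightforward to compute from k-mer counts; a minimal sketch (exact counting on a toy sequence with k = 4, whereas the analysis above uses k = 31 on full genomes):

```python
from collections import Counter

def hcrm(genome: str, k: int = 31, top: int = 10) -> float:
    """High Copy Repeats per Million: average count of the `top` most frequent
    k-mers, scaled to counts per million base pairs of genome."""
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    top_counts = [c for _, c in counts.most_common(top)]
    return sum(top_counts) / len(top_counts) * 1e6 / len(genome)

# a perfectly repetitive toy genome yields an extreme HCRM value
value = hcrm("ACGT" * 2500, k=4)
```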

Discussion
In this paper, we revisited the problem of estimating genomic parameters (length, sequencing coverage, k-mer spectra) from low-coverage shotgun sequencing data. The problem has been studied previously and was considered challenging due to the need for simultaneous inference of coverage and sequencing errors along with the k-mer spectra. However, our results suggest that the problem remains challenging even when there is no error and the coverage is known. This is due to two factors: (a) the linear system is ill-conditioned, so that a small change in the k-mer counts due to random sampling can lead to large changes in the estimated k-mer spectra; (b) values in the k-mer spectra show a skewed and non-sparse distribution in which r_1 dominates; r_1 is important for length estimation, but controlling for small errors in r_1 leads to larger errors in the other r_h values. We provide evidence of both, but future work will clarify the relative importance of each factor.
Proposed solutions for ill-conditioning use regularization, but those methods generally enforce sparse solutions, whereas the true k-mer distribution is not sparse. Our work resolves this issue through an empirical estimation of k-mer spectral ratios based on finished genomes. This approach is viable given the many finished genomes with different repeat characteristics. Our study, with 622 genomes of which around 10% were held out for testing, is the largest empirical study of its kind.
As expected, accurately estimated k-mer spectra led to better estimation of genomic parameters such as length, with RESPECT performing significantly better than the previous best method, sometimes by orders of magnitude. Our results also have lower variance than those of other methods.
As coverage increases, all methods perform well. However, at coverage 8X and higher, partial assemblies become possible and small contigs can start to be assembled. In those cases, alternative methods for estimating genome length may be possible, but our method works well even at 0.5X coverage.
We used every genome for which both the assembled sequence and the raw reads were available at the time of submission. Recently, new data has been released, and we tested our method on 10 additional samples with very similar performance (S28 Fig). The presence of contaminants is a significant barrier to accurate estimation, and is in fact challenging even for assembly. As data sampling and DNA extraction methods improve, this problem will likely become less severe. In parallel, we are also working to improve computational approaches for removing contamination.
While most k-mer based statistics were developed as an initial step prior to deep sequencing and assembly, they may have an important role to play in the independent analysis of genomes. Many genomes are roughly 1 Gb or smaller, so acquiring genome-skims for a majority of organisms, and even for multiple individuals in a population, is a feasible goal. Methods that work on these reduced representations can be transformative for studying dramatic and short-term changes in bio-ecology. We can envision technologies where a sampled individual's genome-skim is used to quickly estimate its genome length and repeat structure, remove contaminating reads, identify the organism or place it confidently in the tree of life, and finally assess the robustness of the population through analysis of heterozygosity. Our paper contributes to the first step of this vision.

Comparing r_1/L distributions over different sets
To compare two sets of values and test whether the values in one set are greater than those in the other, we used the Mann-Whitney U test. Formally, if X and Y are random samples from two populations, the test statistic U is given by the number of times x is greater than y over all pairs (x, y) ∈ X × Y. The Mann-Whitney U test is non-parametric and does not restrict the samples to a particular family of distributions. The test also allows the user to specify a location shift μ and examine the alternative hypothesis that X − Y > μ. By gradually increasing μ and computing the p-value, we can gauge the extent of the difference between X and Y.
To test whether two sets of numbers are drawn from the same distribution, we used the two-sample Kolmogorov-Smirnov (KS) test. The test statistic is a distance between the empirical distribution functions of the samples from the two sets. We used the R 'stats' package [31] to compute the p-values for both tests.
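Both tests are also available in SciPy; a small sketch on synthetic stand-in samples, where the margin hypothesis X − Y > μ is tested by shifting one sample by μ (the shift trick is our workaround, since `scipy.stats.mannwhitneyu` has no location-shift argument):

```python
import numpy as np
from scipy.stats import mannwhitneyu, ks_2samp

rng = np.random.default_rng(0)
x = rng.normal(loc=0.9, scale=0.05, size=200)   # stand-in: vertebrate r_1/L
y = rng.normal(loc=0.6, scale=0.10, size=200)   # stand-in: plant r_1/L

# one-sided Mann-Whitney U: is X stochastically greater than Y?
u_stat, u_p = mannwhitneyu(x, y, alternative='greater')

# margin hypothesis X - Y > mu, via the shift-by-mu trick
mu = 0.10
_, u_p_margin = mannwhitneyu(x - mu, y, alternative='greater')

# two-sample KS: are X and Y drawn from the same distribution?
ks_stat, ks_p = ks_2samp(x, y)
```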

Modeling genomic parameters
We consider k-mers in a genome of length L and assume that k ≫ log_4 L, so that any k-mer is unlikely to appear more than once unless it is part of a repeated sequence. Denote the (unknown) k-mer spectrum of a genome that contains repeats by r, where r_j is the number of distinct k-mers that appear exactly j times in the genome.
The genome is shotgun sequenced using reads of length ℓ with average sequencing depth c. The total number of nucleotides sequenced is B = cL. As there are ℓ − k + 1 k-mers in each read, the k-mer coverage is given by λ = c(ℓ − k + 1)/ℓ (Eq 4). Let o denote the histogram of observed k-mer counts. The observed number of k-mers of abundance h, o_h, can be thought of as a sample allocated to the random variable O_h, whose expected value μ_h = E[O_h] depends on r, λ, L, and also on sequencing error. We assume that any base-pair is sequenced erroneously with probability ε, and that sequencing errors only result in novel k-mers. We further assume that the number of times a k-mer repeated j times is sampled follows a Poisson distribution with rate λj(1 − ε)^k. Therefore, P_hj = e^{−λj(1−ε)^k} (λj(1 − ε)^k)^h / h! denotes the probability that a k-mer repeated j times in the genome is observed with count h in the genome skim; 1_{h=1} = [1, 0, 0, ...], and E = Lλ(1 − (1 − ε)^k) is the expected number of erroneous k-mers. As λ and L are connected through Eq 4, we choose λ as the independent variable and consider L as a function of λ. Under this model, we would like to estimate (r, λ, ε) = argmin_{r,λ,ε} E(P, r, ε; o) (Eq 5), where E is a weighted p-norm of the difference between the expected and observed counts, E_{w,p}(P, r, ε; o) = ∑_h w_h |μ_h − o_h|^p (Eq 6). Note that the optimization is non-trivial because P and E are functions of (r, λ, ε) and must be estimated simultaneously.

A generic iterative optimization for parameter estimation
The dimensions of o and r in Eq 5 are determined entirely by the data and are not necessarily identical. However, we truncated both to a common dimension n = 50 for computational expediency. A generic optimization method can be described as follows.
1. Start with initial estimates of λ and ε.
2. Solve for r using Eq 6.
3. Use the estimated r and a grid search to re-estimate λ and ε.
4. Repeat from step 2 until the error has converged.
Step 2 is the key step in this procedure, and we devised a number of approaches to solve it.

Least-squares estimate of repeat spectrum
Choosing p = 2 (the Euclidean norm) and w_h = 1 for all h in Eq 6 turns the problem into a least-squares (LS) optimization. To test an LS method for estimating r, we considered the simplest, sequencing-error-free case (ε = 0), where the coverage λ is known, so that o ≈ rP^T, where P is an n × n matrix with entries P_hj as defined above. We showed (S1 Appendix) that P is non-singular, so in the error-free case it should be possible to use the estimate r^(est) = oP^{−T}. However, we observed that the effective rank of P was very small, as its eigenvalues diminish rapidly. Therefore, instead of decomposing P and explicitly computing P^{−1}, we used the non-negative least squares (NNLS) method [32] to solve r^(est) = argmin_r ‖o − rP^T‖_2. We used the nnls method from SciPy's [33] Optimize library. Unfortunately, the LS estimates were very unreliable and showed high error. In fact, we proved for λ = 1 (see S1 Appendix) that cond(P) ≥ 2^n/n. The condition number grows exponentially with n, indicating a highly ill-conditioned matrix P, where small changes in o from the expected values μ lead to large errors in the estimate of r. For these reasons, we adopted constrained optimization methods to solve for r.
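The NNLS experiment is easy to reproduce in miniature with `scipy.optimize.nnls`; the toy spectrum below is illustrative:

```python
import math
import numpy as np
from scipy.optimize import nnls

def build_P(n, lam):
    """P[h-1, j-1] = Poisson prob. that a copy-j k-mer is observed h times."""
    return np.array([[math.exp(-lam * j) * (lam * j) ** h / math.factorial(h)
                      for j in range(1, n + 1)] for h in range(1, n + 1)])

n, lam = 10, 1.0
r_true = np.array([1e6, 1e5, 1e4, 1e3, 100.0, 50.0, 20.0, 10.0, 5.0, 2.0])
P = build_P(n, lam)
o = P @ r_true                              # noiseless expected histogram
r_est, resid = nnls(P, o)                   # near-perfect recovery without noise

rng = np.random.default_rng(0)
o_noisy = o + rng.normal(0.0, np.sqrt(o))   # Poisson-scale sampling noise
r_noisy, _ = nnls(P, o_noisy)               # tail estimates typically degrade badly
```

With noiseless o the system is consistent and NNLS recovers the spectrum; realistic sampling noise, amplified by the condition number, is what destroys the tail estimates, as discussed above.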

Linear programming for constrained optimization based estimates
We used Eq 6 with w = [0, 1, 1, . . ., 1] and p = 1 to design a linear programming (LP) estimate of r: minimize ∑_{h≥2} |μ_h − o_h| such that L_j ≤ r_j/r_{j+1} ≤ U_j for each j and r ≥ 0 (Eq 7). The rationale behind setting w_1 = 0 is that o_1 contains a large number of erroneous k-mers, so we exclude it from the objective function and use the remaining bins to estimate r. As ε is not known in general, o_1 was instead used to estimate the (average) sequencing error rate, and subsequently the k-mer coverage λ.
The lower and upper bounds on r_j/r_{j+1} were determined based on the distribution R_j of spectral ratios in the 556 training genomes; therefore, we only search for candidate solutions r that satisfy the constraints. Specifically, we profiled the repeat spectra of the training genomes and set [L_j, U_j] equal to the empirical support of the R_j distribution, i.e., L_j and U_j are the smallest and largest samples observed from R_j over the training genomes. We used the Gurobi Optimizer [34] to solve the constrained optimization problem formulated in Eq 7.

Spline Linear programming
The final method of estimating r is based on the LP estimate of r and splines fitted to the spectral ratios $r_j/r_{j+1}$ as functions of $r_j / \sum_{i \ge j} r_i$. Formally, let $r^{LP}_j$ denote the LP estimate of $r_j$ obtained by constraining the spectral ratios to lie within the support of $R_j$ over the training genomes, as discussed above. For each j ∈ {1, 2, 3, 4, 5}, we used a generalized additive model (GAM), learned from the 556 training genomes, to predict $r_j/r_{j+1}$ from $r_j / \sum_{i \ge j} r_i$. Specifically, we model $y_j = r_j/r_{j+1}$ for different genomes as samples drawn from a dependent random variable $Y_j$ that follows a gamma distribution whose mean is determined by $g_j(\mathbb{E}[Y_j]) = s_j(r_j / \sum_{i \ge j} r_i)$ (Eq 8), where $g_j$ is the link function and $s_j$ is a smoothing spline. These functions allow us to capture nonlinear dependencies between the variables in our model. For j = 1, 2, we use a logarithmic link function to account for the large dynamic range of $r_j/r_{j+1}$ over the training set, and the identity link for j = 3, 4, 5. For each fitted GAM, we empirically set the smoothing parameter to balance over-fitting against goodness of fit. We used the R 'mgcv' package [35] for GAM fitting.
Plugging the LP estimates of the $r_j$'s into Eq 8, we predict the spectral ratios. Let $y^{SLP}_j$ denote the estimate of $y_j$ obtained from Eq 8 using the previous estimates of r. Starting from $r^{SLP}_1 = r^{LP}_1$, we recursively re-estimate $r_j$ for j ∈ {2, 3, 4, 5, 6} as $r^{SLP}_j = r^{SLP}_{j-1} / y^{SLP}_{j-1}$.
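The recursive re-estimation step can be sketched as follows, with a hypothetical stand-in predictor in place of the fitted mgcv GAMs (the function `toy_gam` and all numeric values below are made up for illustration):

```python
import numpy as np

def slp_reestimate(r_lp, predict_ratio):
    """Recursively re-estimate r_2..r_6 from an LP estimate of r.

    predict_ratio(j, x) stands in for the fitted GAM that maps
    x_j = r_j / sum_{i>=j} r_i to the spectral ratio y_j = r_j / r_{j+1}
    (in the paper, mgcv GAMs learned on 556 training genomes).
    """
    r = np.asarray(r_lp, dtype=float).copy()
    for j in range(1, 6):                  # predictors j = 1..5 give r_2..r_6
        x_j = r[j - 1] / r[j - 1:].sum()   # uses previously updated entries
        y_j = predict_ratio(j, x_j)
        r[j] = r[j - 1] / y_j              # r_{j+1} = r_j / y_j
    return r

# Toy stand-in predictor: a fixed ratio per j (purely illustrative).
toy_gam = lambda j, x: 10.0 * (1 + j)
r_slp = slp_reestimate([1e6, 5e4, 1e3, 5e2, 1e2, 1e1, 1e0, 1e0], toy_gam)
print(np.round(r_slp, 2))
```

Only $r_2$ through $r_6$ are overwritten; $r_1$ and the higher bins keep their LP values, matching the recursion above.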

RESPECT algorithm
For the RESPECT algorithm, we replaced the basic iterative method described above with a simulated annealing procedure, outlined in Algorithm 1, to speed up the computations. To initialize the algorithm, we assume that the genome has no repeats, r = [L, 0, 0, . . .], and that the error-free k-mer counts follow a Poisson distribution (Eq 5). Defining $\lambda_{ef} = \lambda(1-\epsilon)^k$ as the error-free k-mer coverage, we estimate its initial value from the ratio of observed counts and set the initial estimate of ε accordingly (see S1 Appendix). This estimate of ε is used throughout the algorithm but is corrected at the end based on the estimated uniqueness ratio (described below). Using the estimate of $\lambda_{ef}$, we compute P, and thus the error function E, at the start of the algorithm. For E, we chose w = [0, 1, 1, . . ., 1] and p = 1 in Eq 6, so $E = \sum_{h \ge 2} |o_h - m_h|$. With the initial values of the parameters known, RESPECT runs a simulated annealing optimization until the error converges. At each iteration, a candidate $\lambda_{next} \in [\lambda/2, 3\lambda]$ is selected uniformly at random, and $P_{next}$ is computed from $\lambda_{next}(1-\epsilon)^k$. Next, we run the SLP method on $(\mathbf{o}, P_{next})$ to get $\mathbf{r}_{next}$. Throughout the algorithm, we used truncated $\mathbf{o}_{1\times m}$, $\mathbf{r}_{1\times n}$, and $P_{m\times n}$, where the number of spectra is fixed at n = 50 (a reasonable compromise between accuracy and speed) and the number of observed count bins $m = n \cdot \max(1, \lambda_{ef})$ scales proportionally with the initial estimate of the error-free k-mer coverage. Using $(\mathbf{o}, P_{next}, \mathbf{r}_{next})$, the error $E_{next}$ of the candidate state is calculated. If moving to the candidate state reduces the error ($E_{next} < E$), the algorithm accepts the move and updates the current estimates of the parameters. In addition, to help the algorithm escape local minima and find better solutions, the simulated annealing scheme lets the algorithm probabilistically move to states with higher error.
Specifically, at iteration t, even if $E_{next} > E$, the algorithm accepts the move with probability $\exp(-(E_{next} - E)\,t/N)$.
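The annealing loop over λ can be sketched as below. The error function here is a toy stand-in; in RESPECT it would build $P_{next}$, run SLP, and return $E = \sum_{h \ge 2} |o_h - m_h|$ (the candidate range and N are taken as above, with N the iteration budget):

```python
import math
import random

def anneal(o, error_fn, lam0, n_iters=200, seed=0):
    """Sketch of RESPECT's simulated annealing over k-mer coverage lambda.

    error_fn(o, lam) stands in for the real pipeline: build P from lam,
    run SLP on (o, P) to get r, and return E = sum_{h>=2} |o_h - m_h|.
    """
    rng = random.Random(seed)
    lam, E = lam0, error_fn(o, lam0)
    lam_best, E_best = lam, E
    for t in range(1, n_iters + 1):
        lam_next = rng.uniform(0.5 * lam, 3.0 * lam)   # candidate coverage
        E_next = error_fn(o, lam_next)
        # Always accept downhill moves; accept uphill moves with probability
        # exp(-(E_next - E) * t / N), so uphill moves get rarer over time.
        if E_next < E or rng.random() < math.exp(-(E_next - E) * t / n_iters):
            lam, E = lam_next, E_next
            if E < E_best:
                lam_best, E_best = lam, E
    return lam_best, E_best

# Toy error surface with its minimum at lambda = 2 (illustrative only).
lam_hat, e_hat = anneal(None, lambda o, lam: (lam - 2.0) ** 2, lam0=1.0)
print(f"lambda_hat = {lam_hat:.3f}, E = {e_hat:.3g}")
```

The acceptance probability decays as t grows (effective temperature N/t), so early iterations explore broadly while late iterations behave greedily.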
At the end of the iterations, the initial estimate of ε (obtained under the no-repeats assumption) is corrected based on the estimated value of $r_1/L$. The correction was learned over 120 genomes randomly selected from the training set and is applied only if the estimated coverage is below 1.5X. λ is then re-computed from the corrected ε and used to compute the final estimates of coverage and genome length. The estimated sequencing error rate and repeat spectrum are also reported by the algorithm.

SRA preprocessing and contamination filtering
After downloading SRA accessions and converting them to FASTQ using the SRA Toolkit [36], we used BBDuk and Dedupe from the BBTools package to trim adapter sequences and remove duplicate reads. We then ran Kraken2 to remove contamination of prokaryotic or human origin. For plant and invertebrate samples, we filtered out any read classified against the Kraken database at confidence level 0 (very sensitive; a single matched k-mer suffices for classification). For vertebrates, due to their smaller evolutionary distance to Homo sapiens, we required confidence level 0.5 (more specific; half of the read's k-mers must match) for human classification, and confidence level 0 for everything else in the database.

Implementation details and running time
We use the 'count' and 'histo' commands from the Jellyfish [37] command-line tool to compute the k-mer histogram of input genome-skims. In each iteration of the RESPECT algorithm, we solve a constrained optimization problem using the Gurobi Python interface provided by the 'gurobipy' package. The running time of RESPECT increases slowly with coverage, as the size of P (and hence of the optimization problem solved at each iteration) scales with the initial estimate of coverage. On average, for a typical 0.5X-4X genome-skim, RESPECT takes about 2 hours to converge and produce the final parameter estimates.

Selecting species with known recent WGD events
Of the 83 RefSeq genomes in our database, we obtained WGD annotations (with estimated ages) for 44 plant species [29]. WGD annotations for the remaining 32 plant species were based on data from the 1000 Plants project [30], in which either the same species or a species from the same genus was identified as having undergone a WGD event using transcriptomic data. We also have 7 salmonid genomes whose common ancestor is thought to have undergone a WGD event about 80 My ago [28].

Statistical analysis of the repeat structure
In a random genome of length L, there are $L - k + 1 \approx L$ k-mers. Assuming each k-mer is drawn uniformly from the space of all $4^k$ possible k-mers, the copy number (CN) of a given k-mer follows $\mathrm{CN} \sim \mathrm{Binomial}(L, 4^{-k})$. For typical values of L ~ 100-1000 Mbp and k = 31, the conditions for approximating the binomial by a Poisson distribution (see, e.g., Section 5.4 of [38]) are met, i.e., $L \gg 1$ and $4^{-k} \ll 1$, hence $\mathrm{CN} \sim \mathrm{Poisson}(L/4^k)$ approximately. If the genome subsequently undergoes $n_w$ whole-genome duplication events, the genome length is multiplied by $2^{n_w}$, while the multiplicity of each k-mer increases by at most a factor of $2^{n_w}$, as mutations reduce the copy numbers of k-mers. Therefore, to reach an HCRM value of H, there must exist at least one k-mer with copy number $x \ge HL$ in the original random genome. Since under the random-genome model all k-mers are equally likely, the union bound (see, e.g., Section 1.5 of [38]) gives $\Pr(\exists\ \text{k-mer with } \mathrm{CN} \ge HL) \le L \cdot \Pr(\mathrm{CN} \ge HL)$.

To test the association between WGD events and the values of $r_1/L$ and HCRM, we used the assembled genomes of 622 RefSeq species and constructed a 2 × 2 contingency table whose columns represent species with or without an identified recent WGD, and whose rows specify whether or not the genome has $r_1/L$ and HCRM values below 0.8 and 200, respectively. We filled the table with the counts of genomes satisfying each of the four conditions and performed Fisher's exact test (using the R 'stats' package [31]), obtaining p-value = 1.8 × 10^{-23} for the association between the rows and columns of the table.
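The union bound can be checked numerically. Assuming L = 100 Mbp and k = 31 (illustrative values from the typical range above), even a copy number of 2 anywhere in a random genome is astronomically unlikely:

```python
from scipy.stats import poisson

L, k = 10**8, 31              # a 100 Mbp random genome, 31-mers
mu = L / 4**k                 # Poisson rate for the copy number of one k-mer

# Per-k-mer tail probability Pr(CN >= x) and the union bound over ~L k-mers.
for x in (2, 3):
    tail = poisson.sf(x - 1, mu)
    print(f"Pr(CN >= {x}) = {tail:.2e}; union bound L*Pr = {L * tail:.2e}")
```

Since even the x = 2 union bound is far below 1, large HCRM values cannot arise in a random genome, which is why high HCRM is informative about repeat structure and WGD.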
Supporting information

S1 Appendix. Supplementary methods and data. Detailed mathematical derivations and supplementary tables. RESPECT was tested on 10 new samples (chosen at random) made available since the original submission of the manuscript. One sample was removed during preprocessing due to a high duplication rate. The results for the remaining 9 samples are plotted along with the original test species. The two newly added samples with high error are Z. cesonia and V. riparia, whose genome lengths RESPECT overestimates by 28%. It may be that their assemblies are missing some repetitive sequences (especially for V. riparia, which has a highly repetitive genome), considering that for both species there is a gap between the reported total sequence length and the total ungapped length.