Abstract
Planarian flatworms have emerged as highly promising models of body regeneration due to the many stem cells scattered through their bodies. Currently, there is no consensus as to the number of stem cells active in each cycle of regeneration or the equality of their relative contributions. We approached this problem with a population genetic model of somatic genetic drift. We modeled the fissiparous life cycle of asexual planarians as an asexual population of cells that goes through repeated events of splitting into two subpopulations followed by population growth to restore the original size. We sampled a pedigree of obligate asexual clones of Girardia cf. tigrina at multiple time points encompassing 14 generations. Effective population size of stem cells was inferred from the magnitude of temporal fluctuations in the frequency of somatic variants and, under most of the examined scenarios, was estimated to be in the range of a few hundred. Average genomic nucleotide diversity was 0.00398. Assuming neutral evolution and mutation-drift equilibrium, the somatic mutation rate was estimated in the 10^-5 to 10^-7 range. Alternatively, we estimated Ne and somatic μ from temporal changes in nucleotide diversity π without the assumption of equilibrium. This second method suggested even smaller Ne and larger μ. A key unknown parameter in our model, on which the estimates of Ne and μ depend, is g, the ratio of cellular to organismal generations, which is determined by the tissue turnover rate. A small effective number of propagating stem cells might contribute to reducing reproductive conflicts in clonal organisms.
Introduction
Planarian flatworms are a fascinating model system for studying body regeneration. After injury they can reconstruct their body from very small pieces of tissue, and they can grow or “degrow” by regulating the number of cells in their bodies in response to nutrient availability1,2. The asexual lines in the clade (including Girardia tigrina which we used in our experiments) are generally fissiparous3,4: the grown worm splits down the middle creating a head piece and a tail piece; each half is restored to a complete body through positionally regulated cell multiplication, cell death and differentiation5. Growth and regeneration rely mainly on a very large number of stem cells (neoblasts) which comprise 25-30% of the cells in a planarian’s body by morphological observation2,6,7. Neoblasts are the only dividing cells in Planaria2, but BrdU labeling experiments suggest that only a fraction of them are active at any given time1. Due to the heterogeneity of morphological and cellular features in stem cells, as well as their partial similarity to early post-mitotic committed progeny cells, mitotic molecular markers such as PCNA and H3P or BrdU labeling cannot visualize stem cells completely and exclusively1,8. Recently, irradiation and transplantation experiments suggested the surface protein Tetraspanin as a reliable marker for isolation of pluripotent stem cells9. Nevertheless, methods involving BrdU injection, experimental wounding, or irradiation significantly alter the normal physiology of the animal. Consequently, a quantitative understanding of the number of active stem cells and their relative contributions at each cycle of tissue regeneration during natural growth and reproduction is still lacking.
Planarians are a curious case for evolutionary genetic studies, too. Because asexual planarians do not undergo the single-cell bottleneck of zygote, significant genetic heterogeneity exists within a single worm’s body10, which invokes competition among diverged cellular lineages11. Mutator alleles remain linked with the mutations they cause in these non-recombining genomes and somatic mutations are transmitted to future generations. Theoretically, deleterious mutations (if not completely recessive) could be eliminated at the cellular level locally before they reach a frequency that can affect organismal fitness, while beneficial mutations give the mutator lineage a competitive advantage. This dynamic predicts a higher optimal somatic mutation rate in clonal organisms12.
In this study we present a simplified population genetic model of the life cycle of an asexual planarian. Most of what is known about planarian biology comes from the study of Schmidtea mediterranea13–15. Here we focused on a less well understood species, Girardia cf. tigrina. In our model, somatic cells play the role of asexually reproducing individuals in an expanding population, which is the planarian body. Over time, this population doubles in size and splits into two subpopulations, i.e., the head and tail pieces. We showed that the allele frequency spectrum of somatic variants in our model system is shaped more strongly by genetic drift (random fluctuations affected by population size) than by genetic draft (selection acting on tightly linked loci). Then, we proceeded to estimate the somatic mutation rate from Ne,sc and the observed nucleotide diversity (π) according to expectations of the neutral theory. We tracked fluctuations in the frequency of somatic variants and average nucleotide diversity over >10 generations and applied the theories of genetic drift and mutation-drift interaction at the cellular level to estimate the effective size of the stem cell population Ne,sc and the somatic mutation rate μsom.
Results
A line of lab-reared asexual flatworms was established from a single individual. Cytochrome oxidase subunit 1 (COI) DNA barcode and proportion of total reads mapping to different Dugesiidae genomes both identified this lineage as Girardia cf. tigrina (Tables S1 and S2). In several years of maintaining this line, no instance of sexual reproduction was observed. For this study, the worm lineage was followed for 14 generations of splitting and regrowth. Genomic libraries were prepared from tail pieces after splits 2 (II), 6 (VI), 8 (VIII), 10 (X), 12 (XII) and 14 (XIV) and sequenced in 2 or 3 replicates (Fig. 1). One of the replicates from sample XIV failed, making samples II and XII the two samples farthest apart in time that had replicated sequenced libraries. The fastq files were trimmed, filtered and deduplicated using Trimmomatic and BBTools’ Clumpify and evaluated by FastQC (Table S3). FreeBayes was first run on individual replicates as separate samples. Bi-allelic SNPs with coverage 10-40X in all samples were subjected to principal component analysis (PCA). PCA confirmed that the difference between replicates was much smaller than that between biological samples (Fig. S1). FreeBayes was run a second time merging all the replicates of each biological sample. Allele frequencies of positions with coverage 10-60X from the merged-rep VCF were used for population genetic analysis. Mean sequencing coverage of SNPs was 12.1X for sample XIV and 19.9-36.6X for the other merged-rep samples.
Divergence from the reference genome
In the merged samples, divergence was calculated as the ratio of the number of positions in the coverage 10-60X range with alternative allele frequency (AAF) > 0.99 to the total number of positions covered 10-60X. Average divergence across the six samples was 1.112% (range 1.03-1.28%), which corroborates the taxonomic identification of our specimens as conspecific or congeneric with Girardia tigrina.
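The divergence ratio described above can be illustrated with a minimal Python sketch. The function name `divergence` and the input layout (a list of `(depth, aaf)` pairs, one per position) are assumptions made for illustration; they are not taken from the authors' pipeline.

```python
def divergence(positions, min_dp=10, max_dp=60, aaf_threshold=0.99):
    """Fraction of covered positions that look fixed for the alternative allele.

    positions: iterable of (depth, aaf) pairs, one per genomic position
               (hypothetical input layout, assumed for this sketch).
    """
    # Keep only positions inside the coverage window used in the text.
    covered = [aaf for depth, aaf in positions if min_dp <= depth <= max_dp]
    if not covered:
        raise ValueError("no positions within the requested coverage range")
    # Count positions whose alternative allele frequency exceeds the threshold.
    fixed = sum(1 for aaf in covered if aaf > aaf_threshold)
    return fixed / len(covered)


if __name__ == "__main__":
    toy = [(20, 1.0), (5, 1.0), (30, 0.5), (40, 0.0), (15, 0.995)]
    print(divergence(toy))  # the depth-5 position is excluded before counting
```

Positions outside the coverage window are dropped from both the numerator and the denominator, which matches the definition in the text.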
Somatic drift vs. somatic draft
The allele frequency spectra from samples II and XII are plotted in Fig. S2. Assuming the reference allele to be ancestral in most positions, patterns of the alternative (derived) allele frequency roughly match the neutral expectation except for a local peak at f_alt ≈ 1 representing positions of fixed divergence from the reference. Simulations have shown that in the absence of recombination, strong selection on linked loci can distort the distribution of allele frequencies in a specific way: the density of derived allele frequency falls off much more steeply under linked selection (draft) than under neutral evolution (drift)16. Denoting derived allele frequency ν and the corresponding density f(ν), the density falls as f(ν) ∼ ν^-1 under drift but as f(ν) ∼ ν^-2 under draft16. In our dataset, the f(ν) ∼ ν^-1 model (R^2 = 0.938) fits the data better than the f(ν) ∼ ν^-2 model (R^2 = 0.832) (Table S4 and Fig. S3). This suggests that: 1) The observed AFS is more consistent with drift than draft, although draft cannot be rejected and probably plays some role (0.832 is not much smaller than 0.938). 2) The drift model fits quite well, which validates the idea of estimating stem cell Ne based on somatic drift. Fig. S4 illustrates temporal fluctuation in alternative allele frequency of 9 randomly picked loci.
Effective size of stem cell population and somatic mutation rate
We calculated Ne,sc and somatic mutation rate μsom by two different approaches: 1) using somatic drift and the magnitude of change in somatic variant frequencies (equations 1-3, see Methods), and 2) using the rate of change in heterozygosity (π) across generations (equation 7). In both cases, the parameters of interest were estimated with and without generation XIV, which was not sequenced in replicates (Table 1). The first approach uses allele frequencies from samples at the two ends of the longest interval of generations, II-XII or II-XIV. The second approach uses nucleotide diversity (π) from all generations. We estimated F̄k and Ne from individual replicates of generations II and XII to evaluate replicate reproducibility (Table S5). Subsampling of SNPs showed that although the 370 SNPs we analyzed were all located on the same contig and were inherited together during asexual reproduction, they provided relatively independent pieces of information on somatic drift (Fig. S5). This was probably because we sequenced tens of thousands of cells but were looking at positions with coverage 10-60X. Reads mapping to different positions likely came from different, possibly divergent cell lineages that had followed different past trajectories and accumulated different mutations and gene conversions through generations of clonal reproduction. This was equivalent to sequencing multiple loci in different subsets of individuals in the population. The estimates of Ne,sc in Table 1 change considerably depending on the choice of final sampling generation (XII or XIV). This variation is more than expected from somatic drift alone. It is possible that there is more biological stochasticity in stem cell division patterns over time, depending on environmental factors that were not accounted for in our experiments. It could also be influenced by experimental errors, which suggests that future experiments must involve more control conditions, more replicates and more time points spaced over longer intervals.
The Ne,sc estimate from drift analysis also varies almost linearly with the assumed number of elapsed generations (t), which in turn depends on the ratio of cell to organismal generations (g). Although the rate of tissue turnover in planarians under natural physiological conditions has not been measured experimentally, a theoretical lower bound for g can be obtained by assuming that 1) a stem cell’s proliferation rate is independent of its age, and 2) when the worm grows fast under favorable conditions, most cells are young, and therefore the rate of apoptosis is negligible compared to the rate of cell division. Under these assumptions, the number of cellular generations will be approximately half the number of organismal generations because, after each split and regrowth event, half the cells come directly from the previous generation (number of generations elapsed = 0) and the other half are newly produced by stem cells (number of generations elapsed = 1), bringing the average over all cells to 0.5. Although this assumption is not realistic, it offers a way to estimate min(Ne,sc) and max(μsom). Homeostatic tissue turnover becomes increasingly significant the more slowly the worm grows because more and more somatic cells age, die, and are replaced, adding to the number of elapsed cellular generations. The smallest estimate from drift analysis is Ne,sc = 71.1, which corresponds to g = 0.5 based on allele frequencies from generations II and XII (Table 1).
Nucleotide diversity (π) for each sample was estimated as the product of nucleotide diversity at bi-allelic SNPs (π∗) and the proportion of polymorphic positions in the sample. Average nucleotide diversity across all samples was π̅ = 0.00398. Assuming mutation-drift balance under neutrality, the equilibrium somatic mutation rate was calculated according to the haploid form of the equation, π = 2Neμ, or equivalently μ = π/(2Ne). Estimates of Ne,sc and μsom under several scenarios are given in Table 1. The eight examined parameter sets estimate μ in the range 7.2 × 10^-7 to 2.7 × 10^-5. These values should be interpreted with caution because nucleotide diversity shows a decreasing trend down the generations in our experiment (Fig. 2), indicating that the system may not be at drift-mutation balance and somatic genetic variation may not be in a steady state.
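As a worked illustration of the haploid equilibrium relation π = 2Neμ, the short Python sketch below rearranges it in both directions. The function names are hypothetical, and the inputs in the usage lines are simply the average π and the smallest Ne,sc quoted in the text; the result is an illustration of the arithmetic and is not meant to reproduce any specific entry of Table 1.

```python
def mu_from_pi(pi, ne):
    """Equilibrium mutation rate under the haploid neutral relation pi = 2 * Ne * mu."""
    if ne <= 0:
        raise ValueError("effective population size must be positive")
    return pi / (2.0 * ne)


def pi_from_mu(mu, ne):
    """Expected equilibrium nucleotide diversity for a haploid population."""
    return 2.0 * ne * mu


if __name__ == "__main__":
    pi_bar = 0.00398   # average nucleotide diversity reported in the text
    ne_small = 71.1    # smallest drift-based Ne,sc reported in the text
    mu = mu_from_pi(pi_bar, ne_small)
    print(f"mu = {mu:.3e}")
    # Round trip: plugging mu back in recovers the input diversity.
    print(f"pi = {pi_from_mu(mu, ne_small):.5f}")
```

Because μ is inversely proportional to Ne in this relation, the smallest Ne,sc scenario yields the largest μ, which is why the lower bound on g gives max(μsom).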
We derived a model describing the change in heterozygosity (H) under the combined effects of drift and mutation in its general form, i.e., without the assumption of equilibrium (Eq. 7). According to this equation, the difference between the current value of H and its equilibrium value shrinks by a constant factor each generation, so that H approaches equilibrium exponentially. Equation 7 cannot be linearized and there is no closed-form solution for the best fit; thus, numerical methods must be used. We used a nonlinear least-squares algorithm to find the parameter values for Ne,sc, μsom and π0 that best fit the observed trend of π over generations (Fig. 2, Table 1). These differ from the estimates inferred from somatic drift and temporal variance of allele frequencies and suggest even smaller Ne and larger μsom.
Discussion
In the absence of direct molecular and cellular evidence and effective tools for in vivo analysis of natural growth and regeneration in planarians, we decided to use the classical population genetic theory to model stem-cell-based body regeneration in an asexual line of G. tigrina. Natural populations of G. tigrina comprise sexual and asexual lines and show variable ploidy (2n or 3n)3,17. Visual inspection of somatic allele frequencies of nine example loci showed what appeared to be random fluctuations (Fig. S4). Evaluating the density of the allele frequency spectrum further corroborated the role of somatic drift at the cellular level (Table S4, Fig. S3).
The properties of our experimental system match the underlying assumptions and structure of the theoretical model reasonably well. In the original scheme theorized by Waples, a population is sampled at two time points and an estimate of effective population size (Ne) is derived from observed temporal changes in allele frequencies18. To make the model more analytically tractable and free of certain restricting assumptions, Waples recommends a sampling-before-reproduction plan to ensure that there will be no overlap (and therefore no covariance) between the reproducing individuals (contributing to Ne) and the individuals sampled for allele frequency estimation. Such a separation is guaranteed in our model system, as tail pieces are sequenced while head pieces grow to create the next generation. The theory we used assumes discrete generations, but we know planarian cells do not divide and die synchronously. Results from applying the classical Ne,Fk method to populations with overlapping generations may be biased19. Accuracy of Ne estimation can be improved by examining more loci, using loci with intermediate allele frequencies, sampling more individuals and increasing the time between samples (at least 3-5 generations apart, preferably more)20,21. Following these guidelines, we estimated Ne,sc from the longest interval between our samples. Our estimate of Ne,sc comes from 370 positions, which is much larger than the number of loci recommended in the references to achieve acceptable accuracy and precision. We also omitted SNPs with minor allele frequency <0.1. Subsampling analysis confirmed that despite being on the same contig, these positions provided independent bits of information (Fig. S5).
Our estimates of the effective number of stem cells, even under the most permissive scenarios, are still much smaller than the number of stem cells suggested by microscopic observations (tens of thousands). It has been suggested that 20-30% of cells in a worm’s body are stem cells22–24. A 1-cm long Dugesia was estimated to contain approximately 2 × 10^5 cells25. This gives an estimate of 40000-60000 stem cells, which is 1 or 2 orders of magnitude higher than our estimate of a few hundred to a few thousand (Table 1). The most likely explanation is that a small fraction of stem cells, e.g., stem cells closer to the fission site, contribute disproportionately to regeneration. It should be noted that we estimated the number of active stem cells in the head piece, which is known to have fewer stem cells than the tail piece, with the area anterior to the eyes practically devoid of any stem cells. Underestimation of the homeostatic tissue turnover rate reflected in g (ratio of cellular to organismal generations) is a possibility. Hopefully, future experimental data will quantify tissue turnover so that g can be specified with more certainty. It has been shown that selection will not affect F̂k much under plausible circumstances, especially when t is small26. It has been suggested that variable selection and changes in demographic parameters can lead to overestimation of Fk and underestimation of Ne18, but these are unlikely to have played significant roles in our model system of worms reared in controlled lab conditions. The effective population size of stem cells can vary from generation to generation not only due to stochasticity of activation at the cellular level, but also because worms do not always split into exactly equal or consistently proportionate pieces. The number of stem cells in the head piece will depend on the size of the head piece. When the population size (of stem cells or otherwise) varies from generation to generation, Ne estimates the harmonic mean of N across generations, which is influenced very strongly by the smallest values of N.
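The sensitivity of the harmonic mean to small values of N can be seen in a short Python sketch. The per-generation population sizes below are made-up numbers chosen only to illustrate the effect; they are not data from this study.

```python
from statistics import harmonic_mean, mean

# Hypothetical stem cell numbers in four successive generations:
# three large head pieces and one unusually small one.
sizes = [1000, 1000, 50, 1000]

arithmetic = mean(sizes)
harmonic = harmonic_mean(sizes)

print(f"arithmetic mean: {arithmetic:.1f}")
print(f"harmonic mean:   {harmonic:.1f}")
```

A single small generation pulls the harmonic mean far below the arithmetic mean, so occasional small head pieces could lower the long-term Ne,sc estimate substantially.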
One of the caveats of our model is that it is a well-mixed model, equivalent to a panmictic population. The existence of body structure in planarians means that this assumption is not accurate. However, this is a common simplifying assumption in almost all population genetic models except those focusing specifically on spatial heterogeneity. It is also assumed in the theory we used here for its original purpose of estimating Ne in organismal populations, although natural populations of animals and plants are not realistically panmictic. The planarian body structure may cause Ne,sc to be smaller than the microscopically observed number of stem cells, since stem cells closer to the fission site probably play a more important role in regeneration. However, it is unlikely to violate the assumptions of our model and affect its conclusions substantially because of three mitigating factors: 1) After experimental amputation, natural fission, or during the processes of growth and degrowth, the worm’s body undergoes extensive reshaping (known as morphallaxis)2,28. This means that body structure is likely not preserved from generation to generation, and therefore random activation of stem cells is an appropriate model for estimation of effective population size. 2) Irradiation-amputation experiments show that stem cells or their progeny can migrate long distances from their original position to the wound site to contribute to regeneration1,2. The reconstruction of the body after fission is not restricted to stem cells adjacent to the fission site, although they might contribute more. 3) Population structure (lack of complete mixing) has a similar effect on the probability of identity by descent of alleles (inbreeding coefficient, denoted F), which is the basis for the definition of Ne: the effective population size of a real population is equal to the census size of an ideal population which results in the same amount of identity by descent or F per generation. Strong population structure reduces heterozygosity because it effectively divides the actual population into a number of small drift-prone sub-populations with poorly connected gene pools. In the case of our experimental system, population structure within the stem cell population would mean that not all stem cells are equally likely to contribute to regeneration, with the ones close to the fissure being more likely to do so. Therefore, the main conclusion of small Ne remains valid, but it could be interpreted as population structure or random activation of a small fraction of stem cells.
However, population geneticists do not solely attribute Ne to neutral processes; there is abundant literature on the effects of background selection and selective sweeps on Ne29,30. It was therefore important to determine which is the primary force shaping Ne in our system: drift or selection? Drift inevitably happens in any population of finite size, whereas selection only sometimes occurs with an effective strength. Drift is the default assumption unless there is significant evidence that it cannot explain the data, in which case other processes are invoked. In our analysis of the allele frequency spectrum, we found no evidence that selection is a stronger force than drift in this dataset. We are not dismissing the possible occurrence of selection, but the data suggest that it is not a major determinant of Ne in our system. If selection is not strong enough to alter the AFS significantly, to the point that the AFS is slightly better explained by drift, using a more complex model incorporating weak selection does not seem warranted.
Another limitation of this study is that the value of g, the ratio of cellular to organismal generations, is unknown. As far as we know, it is unknown even for the most well-studied planarians. Our estimates of Ne and μ from both methods depend on this parameter. Our model is still useful because it provides a relatively simple analytical framework for the estimation of two important cellular and genetic parameters from sequencing data, and reduces the problem to one unknown value. The value of g depends on several aspects of planarian biology, most importantly the rate of homeostatic cell turnover. Future experiments will focus on quantifying this value. In the absence of a precise value for g, we decided to determine its upper and lower bounds where possible and calculate Ne and μ based on those bounds. The upper bound for g is theoretically infinite and in practice depends on environmental factors, e.g., the feeding regime, as well as inherent ones. The lower bound for g, however, is 0.5, which would be the case if all cells were immortal. Although we can be certain that this is not a realistic possibility, it provides a useful basis for calculation of min(Ne) and max(μ).
The estimated somatic mutation rate is of the order of 10^-5 to 10^-7, which is orders of magnitude higher than the norm in sexually reproducing eukaryotes (often in the 10^-8 to 10^-9 range). There are several points of consideration here. First, in most organisms, including humans, the rate of mutation in somatic tissues is about an order of magnitude higher than in the germline31,32. Second, theory predicts that mutation rate evolves to a minimum rate in sexual populations but could evolve to a non-minimal optimum in asexual populations under particular circumstances33,34. It has been shown that in the presence of strong selection among somatic cell lineages, a higher optimal mutation rate can evolve, because mutator alleles can benefit from the advantageous mutations they cause while deleterious mutations are eliminated before they get the chance to be transmitted to the next generation11,12. The high level of somatic heterogeneity, also observed by Nishimura et al.10, can provide the basis for such strong somatic selection. Third, the neutral expectation of π = 2Neμ is derived assuming a steady state level of genetic variation (mutation-drift balance). However, we observed a rapid loss of heterozygosity in our samples over the generations (Fig. 2), which is in contrast with previous observations in a sexual line of S. mediterranea27. Inference from loss of heterozygosity produces even smaller Ne and larger μsom estimates than those estimated from drift. Most natural systems are expected to maintain a steady state level of genetic variation over the long term. We speculate that the observed decreasing trend in π is partly due to the fast growth and splitting of worms under favorable lab conditions. Under slower growth, more homeostatic tissue turnover would happen, which would allow for the accumulation of as many new mutations as are eliminated by somatic drift. Spontaneous sexualization has been reported in some G. tigrina lines, which could potentially restore genetic diversity7, although this was not observed during our experiments.
Although the estimates from both methods change in the same direction with the key parameter g, there is considerable difference between the two sets of estimates. This will be addressed in future work by gathering more data and developing more complex simulation-based models incorporating body structure, homeostatic tissue turnover, variable growth rates and splitting patterns.
Methods
Worm collection and maintenance
The worms were collected from a stream in Almese, Italy on September 23, 2009. They were separated into individual vials, kept in standard rearing conditions and fed beef liver once a week followed by water exchanges. Over more than two years, the worms reproduced exclusively asexually through fissiparity. All the splits occurred naturally; no artificial cutting, wounding or injection was performed. At the beginning of the experiment, a single lineage was chosen and followed for 14 generations of natural splitting and regrowth. After generations II, VI, VIII, X, XII and XIV, the tail piece was frozen for sequencing while the head piece was left to grow and further the lineage (Fig. 1).
Sequencing, QC and duplicate removal
Genomic libraries were produced from frozen tail pieces with 2-3 replicates according to the protocol described previously35 and sequenced on Illumina HiSeq. Raw fastq files were examined by FastQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc). Low quality segments were removed and short trimmed reads (<36 bases) dropped using Trimmomatic v.0.3836 in the paired-end mode with the following options: /<PATH>/trimmomatic-0.38.jar PE <INPUT files> <OUTPUT files> ILLUMINACLIP:/<PATH>/adapters/TruSeq3-SE.fa:2:30:12:1:true LEADING:3 TRAILING:3 MAXINFO:40:0.4 MINLEN:36. Only proper pairs (both mates surviving) were processed further. PCR duplicates were removed using the Clumpify command of BBTools (https://jgi.doe.gov/data-and-tools/bbtools) in paired-end mode with the following options: /<PATH>/clumpify.sh -Xmx230g <INPUT files> <OUTPUT files> dedupe=t reorder. In addition to removing duplicates, Clumpify sorts fastq files for more efficient compression which reduces compressed file size and accelerates future processing steps. Trimmomatic and Clumpify output files were re-evaluated by FastQC.
Taxonomic identification
Based on initial morphological inspection, the specimens were tentatively assigned to family Dugesiidae. Molecular identification was carried out using cytochrome oxidase subunit 1 (COI) barcode sequences 37. Pre-aligned COI barcodes of 72 species belonging to the order Tricladida including 16 species from family Dugesiidae were downloaded from the Barcode of Life database (boldsystems.org) public records. Whole genomic sequencing reads from two of our samples were mapped (separately) to the Tricladida COI sequences using bowtie2 (--end-to-end alignment default options). All our samples come from a single founder and belong to the same lineage; however, two samples were tested to ensure the reproducibility of identification. To verify the COI identification by a wider genomic scan, 1000 reads were randomly subsampled from each one of the 24 fastq files pertaining to the 12 paired-end sequenced samples and then pooled together. This pooled fastq was mapped to the concatenated fasta file comprising the four published Dugesiidae genomes on NCBI (Schmidtea mediterranea asexual strain CIW4, S. mediterranea sexual strain S2, Dugesia japonica and Girardia tigrina). Reads were mapped using bowtie2 end-to-end and local alignment options (both in the --very-sensitive mode). Although the samples had been sequenced paired-end, for taxonomic identification purposes alignments were performed separately for forward and reverse reads in the unpaired mode to avoid under-mapping of pairs due to insert size variation caused by indels and structural variations, which would not be uncommon in the taxonomic range of a large and diverse order such as Tricladida.
Mapping and Variant calling for population genetic analysis
Properly paired, trimmed and deduplicated reads were aligned to the G. tigrina genome using bwa mem with default options except setting the minimum seed length to 15 (-k 15) to facilitate the mapping of more divergent reads38. Variants were called using FreeBayes39, which is specifically suited to variant calling from Pool-seq data38 and has been used to study comparative genomics of flatworms before40. FreeBayes was run with options -F 0.01 -C 1 -m 20 -q 20 --pooled-continuous --use-reference-allele. This configuration is recommended for variant calling from pooled-seq data with an unknown number of pooled samples (https://github.com/ekg/freebayes) (in our study, the exact number of cells per sample is unknown). The --use-reference-allele option ensures that the output includes positions where the base call in the samples differs from the reference allele even if they are monomorphic across the samples. Reads that might map to several parts of the genome were discarded by applying the -m 20 filter (minimum mapping quality of 20). This is especially important in working with potentially highly repetitive genomes, which is the case for well-studied flatworms41. The current G. tigrina assembly comprises >255k scaffolds. This created an error in the running of FreeBayes, which we suspect was due to an internal algorithmic step concatenating the names of all chromosomes into a single string. To circumvent this problem, and since we needed only tens of SNPs for a reliable estimate of Ne, the bam files and the reference fasta files were filtered to contain only the longest contig (MQRK01218062.1, length=267531). The bam files were manually re-headered in two ways: 1) Rep-by-rep: a different sample name (bam header ‘SM’) was assigned to each replicate. 2) Merged-reps: the same sample name (bam header ‘SM’) was assigned to all replicates of the same biological sample, the effect of which is to pool the reads from replicates during variant calling and the consequent calculation of allele frequencies. Sample XIV.8, for which only the reverse mate sequences were available, was excluded from variant calling and later analyses. Thus, all remaining biological samples except sample XIV were represented by at least two independently sequenced replicates. Coverage statistics of the re-headered bam files were obtained using the samtools mpileup function. VCF files were filtered using vcftools to keep bi-allelic SNPs only. Custom Python code was run to keep only SNPs in the coverage range of 10-40X and 10-60X in rep-by-rep and merged-rep VCFs, respectively. Only samples from generations II and XII were used for population genetic analyses (see Results). Consistency of allele frequency estimates among replicates was evaluated via principal component analysis.
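The coverage filter described above might look roughly like the following Python sketch. This is not the authors' script: the function names are hypothetical, and the code assumes a tab-delimited VCF whose FORMAT column includes a per-sample `DP` (read depth) key. A site is kept only if it is a bi-allelic SNP and every sample's depth falls inside the window.

```python
BASES = {"A", "C", "G", "T"}


def passes_coverage(vcf_line, min_dp=10, max_dp=60):
    """True if the record is a bi-allelic SNP with min_dp <= DP <= max_dp in all samples."""
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 10:          # need the 8 fixed columns, FORMAT and >= 1 sample
        return False
    ref, alt, fmt = fields[3].upper(), fields[4].upper(), fields[8]
    # A bi-allelic SNP has one single-base REF and one single-base ALT.
    if ref not in BASES or alt not in BASES:
        return False
    keys = fmt.split(":")
    if "DP" not in keys:          # assumption: per-sample depth is stored under DP
        return False
    dp_index = keys.index("DP")
    for sample in fields[9:]:
        values = sample.split(":")
        if dp_index >= len(values) or not values[dp_index].isdigit():
            return False          # missing depth, e.g. "."
        if not min_dp <= int(values[dp_index]) <= max_dp:
            return False
    return True


def filter_vcf_lines(lines, min_dp=10, max_dp=60):
    """Keep header lines and records that pass the coverage filter."""
    return [ln for ln in lines
            if ln.startswith("#") or passes_coverage(ln, min_dp, max_dp)]
```

Under these assumptions, calling `filter_vcf_lines(lines, 10, 40)` would correspond to the rep-by-rep window and `filter_vcf_lines(lines, 10, 60)` to the merged-rep window.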
Population genetic analyses
Divergence from the reference genome was calculated from each merged-rep sample as the ratio of SNPs covered 10-60X with alternative allele frequency AAF > 0.99 to the total number of positions covered 10-60X in the corresponding sample.
To compare the influence of drift and draft on the allele frequency spectrum, a vector of densities of allele frequencies for SNPs with AAF 0.01 − 0.9 was obtained using the density() function in R. Denoting AAF as ν and its corresponding density as f(ν), the goodness-of-fit of three linear models, f(ν) ∼ ν, f(ν) ∼ ν^−1 and f(ν) ∼ ν^−2, was assessed by their p-values and R² values16.
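The model comparison can be illustrated with a minimal sketch in plain Python. It assumes the density vector has already been computed (for example by a kernel density estimate), covers only the R² part of the comparison and not the p-values, and uses hypothetical function names.

```python
def r_squared(x, y):
    """R^2 of a simple linear regression of y on x (with intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # For one predictor, R^2 equals the squared Pearson correlation.
    return sxy * sxy / (sxx * syy)


def compare_spectrum_models(nu, density):
    """Return R^2 for the three models f ~ nu, f ~ nu^-1 and f ~ nu^-2."""
    transforms = {
        "nu": lambda v: v,
        "nu^-1": lambda v: 1.0 / v,
        "nu^-2": lambda v: v ** -2,
    }
    return {
        name: r_squared([transform(v) for v in nu], density)
        for name, transform in transforms.items()
    }


# Synthetic spectrum that decays exactly as 1/nu, used only as an illustration.
nu_grid = [0.05 * k for k in range(1, 19)]
toy_density = [1.0 / v for v in nu_grid]
fit_quality = compare_spectrum_models(nu_grid, toy_density)
```

For the synthetic 1/ν spectrum, the ν^−1 model gives the highest R² of the three.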
Effective population size of stem cells was calculated from the change in allele frequencies of SNPs between generations using the theory laid out by Waples18,42. In this method, a parameter Fk is calculated from the transgenerational difference in allele frequencies; its expected value depends on the sample sizes at the initial and final sampling points (Si and Sj), the number of generations elapsed (t) and the effective population size (Ne). We calculated Fk at each SNP (Equation 1), where pi and pj are the frequencies of the alternative allele at generations i and j, respectively (here: i = 2, j = 12). Only SNPs with AAF 0.1 − 0.9 were included to avoid biases introduced by very rare minor alleles18. The average Fk over loci (F̄k) was plugged into Equation 2, which was rearranged to obtain Ne,sc (Equation 3). In the original formulation designed for a single locus, Si and Sj are sample sizes at generations i and j18. Here, sample size is replaced by sequencing coverage: Si and Sj are the harmonic means of sequencing coverage across all examined SNPs in generations i and j, respectively. The number of generations elapsed between the two sampling points is represented by t. Since we are modeling cells as individuals, t in our calculations must reflect cellular generations. As far as we know, the rate of tissue turnover in planarians has not been measured quantitatively. We therefore defined g as the unknown ratio of cellular generations to organismal generations and evaluated its effect on the calculated Ne,sc. In addition to the combined gen. II and gen. XII samples, we calculated Fk and Ne from all pairs of gen. II and gen. XII replicates to evaluate the effects of experimental variance (vs. biological variance or drift) on our findings.
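The temporal method can be sketched in a few lines of Python. The formulas in this sketch are assumptions and are not copied from Equations 1-3: Fk is taken as Pollak's estimator for a bi-allelic locus, (pi − pj)² / (p̄(1 − p̄)) with p̄ = (pi + pj)/2, and its expectation is taken in the haploid approximation E(Fk) ≅ t/Ne + 1/Si + 1/Sj, with t equal to g times the number of organismal generations. All function and argument names are hypothetical.

```python
from statistics import harmonic_mean, mean


def fk_biallelic(p_i, p_j):
    """Assumed per-SNP Fk (Pollak's estimator for two alleles)."""
    p_bar = (p_i + p_j) / 2
    return (p_i - p_j) ** 2 / (p_bar * (1 - p_bar))


def temporal_ne(p_i, p_j, cov_i, cov_j, organismal_gens, g=1.0):
    """Estimate Ne from allele frequency change between two time points.

    Assumes E(Fk) ~ t/Ne + 1/Si + 1/Sj (haploid form), with t = g * organismal_gens.
    """
    # Keep only SNPs whose initial AAF lies in the 0.1-0.9 window.
    keep = [k for k, p in enumerate(p_i) if 0.1 <= p <= 0.9]
    f_bar = mean(fk_biallelic(p_i[k], p_j[k]) for k in keep)
    # Sequencing coverage stands in for sample size (harmonic mean over SNPs).
    s_i = harmonic_mean([cov_i[k] for k in keep])
    s_j = harmonic_mean([cov_j[k] for k in keep])
    drift_signal = f_bar - 1 / s_i - 1 / s_j
    if drift_signal <= 0:
        # Sampling noise explains all of the observed change: no finite estimate.
        return float("inf")
    return g * organismal_gens / drift_signal


# Toy data: two SNPs, coverage 50X at both time points, 10 organismal generations.
toy_ne = temporal_ne([0.5, 0.5], [0.6, 0.4], [50, 50], [50, 50], organismal_gens=10)
```

Under these assumptions the estimate scales linearly with g, so doubling the assumed number of cellular generations per organismal generation doubles Ne.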
Nucleotide diversity (π) for each sample was calculated from positions covered 10-60X in that sample (Equation 4), where π∗ is nucleotide diversity at bi-allelic SNPs covered 10-60X, calculated from the reference and alternative allele counts (Rc and Ac, respectively). The second term on the right side of equation 4 is the proportion of polymorphic positions. No restriction on initial AAF was imposed for the calculation of π.
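A minimal sketch of this calculation follows. It assumes that π∗ at a SNP is the usual unbiased heterozygosity estimator 2·Rc·Ac / (n(n − 1)) with n = Rc + Ac, and that π is the mean π∗ over SNPs multiplied by the proportion of polymorphic positions. The exact estimator used in the study may differ, and the function names are hypothetical.

```python
def pi_star(ref_count, alt_count):
    """Assumed per-SNP diversity: chance that two distinct reads carry different alleles."""
    n = ref_count + alt_count
    return 2 * ref_count * alt_count / (n * (n - 1))


def nucleotide_diversity(allele_counts, n_positions):
    """Mean pi* over SNPs, scaled by the proportion of polymorphic positions.

    allele_counts: list of (Rc, Ac) pairs for bi-allelic SNPs in the coverage window.
    n_positions:   number of positions (variable or not) in the same coverage window.
    """
    if not allele_counts:
        return 0.0
    mean_pi_star = sum(pi_star(r, a) for r, a in allele_counts) / len(allele_counts)
    proportion_polymorphic = len(allele_counts) / n_positions
    return mean_pi_star * proportion_polymorphic


# Toy example: one SNP with 10 reference and 10 alternative reads among 100 positions.
toy_pi = nucleotide_diversity([(10, 10)], n_positions=100)
```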
The somatic mutation rate under the assumption of mutation-drift equilibrium was estimated according to the neutral theory expectation in haploid form. The haploid form of the neutral equation was chosen because in our Pool-seq data each genomic position in a sample is represented by individually sequenced chromosomes rather than diploid genotypes. Correspondingly, sample size at each position is set equal to the sequencing coverage at that position.
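As a worked illustration, assume that the haploid neutral expectation takes the standard form E(π) = 2Neμ. Both this form and the round value Ne = 500 are assumptions made here for illustration; 500 is not an estimate from the data. The genome-wide average π = 0.00398 would then give:

```latex
\[
\mu \cong \frac{\pi}{2N_e} = \frac{0.00398}{2 \times 500} = 3.98 \times 10^{-6}
\]
```

Because Ne itself depends on the unknown ratio g, the value of μ obtained this way shifts inversely with the assumed g.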
We also estimated Ne and the somatic mutation rate by a second approach. We derived a new formula to model the trans-generational change in heterozygosity (π) under mutation-drift non-equilibrium (Equation 7); details of the derivation are presented below. We used a nonlinear least-squares fitting algorithm (algorithm “port” implemented in the R function nls()) to estimate H0 (π0), Ne and μ from the rate of reduction in π over generations. Ne = 500, μ = 1 × 10−5 and π0 = 0.005 were given as starting values, and the lower bound for all three parameters was set to 0. The maximum number of iterations was set to 100.
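The quantity that such a fit minimises can be sketched as follows. The model form is an assumption based on the haploid version of the non-equilibrium result derived below, π_t ≅ 2Neμ + (π0 − 2Neμ)·exp(−(1/Ne + 2μ)t), with t counted in cellular generations. The sketch only evaluates the model and the residual sum of squares and does not reproduce the bounded “port” optimisation of nls(). Function names are hypothetical.

```python
import math


def pi_model(t, pi0, ne, mu):
    """Assumed haploid non-equilibrium model for diversity after t cellular generations."""
    pi_equilibrium = 2 * ne * mu      # value approached as t grows
    decay_rate = 1 / ne + 2 * mu      # per-generation rate of approach
    return pi_equilibrium + (pi0 - pi_equilibrium) * math.exp(-decay_rate * t)


def sum_squared_residuals(params, times, observed):
    """Objective that a least-squares routine would minimise over (pi0, Ne, mu)."""
    pi0, ne, mu = params
    return sum(
        (obs - pi_model(t, pi0, ne, mu)) ** 2
        for t, obs in zip(times, observed)
    )


# Illustration using the starting values quoted in the text; the time points are made up.
start = (0.005, 500.0, 1e-5)
toy_times = [0, 20, 40, 60]
toy_observed = [pi_model(t, *start) for t in toy_times]
toy_sse = sum_squared_residuals(start, toy_times, toy_observed)
```

A bounded optimiser would search over (π0, Ne, μ) from the starting values, with all three constrained to be non-negative, until this sum stops decreasing.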
Deriving the equation for change of heterozygosity under mutation-drift non-equilibrium
The probability of identity by descent (IBD), designated F (sometimes called the inbreeding coefficient), equals the relative loss of heterozygosity (H).
Note: This F has nothing to do with the Fk parameter introduced before in equations 1-3. It is a completely different entity.
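In a diploid population, F is increased by drift and reduced by mutation according to the following equation:
$$F_t = \left[\frac{1}{2N_e} + \left(1 - \frac{1}{2N_e}\right)F_{t-1}\right](1-\mu)^2$$
Under mutation-drift equilibrium $F_t = F_{t-1} = \hat{F}$; assuming μ ≪ 1, this reduces to:
$$\hat{F} \cong \frac{1}{4N_e\mu + 1}$$
And:
$$\hat{H} = 1 - \hat{F} \cong \frac{4N_e\mu}{4N_e\mu + 1}$$
This result for Ĥ is equal to the expected value of nucleotide diversity (π) under the infinite sites model. For small values of π, Ĥ = E(π) ≅ 4Neμ.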
These are classic results found in most population genetics textbooks43,44. But how can the recursive equation 11 be solved if equilibrium cannot be assumed? We found two solutions with equivalent results:
Solution 1: A trick that usually works for recursive equations that converge to an equilibrium value is to subtract the equilibrium value from both sides (of Eq. 11): Now we have a recursion that can be easily solved and taken back to F0: This simplifies to: This means that the deviation of the existing F from the equilibrium F̂ is reduced by a factor of (1 − 1/(2Ne))(1 − μ)^2 every generation. Solving for H and assuming 4Neμ ≪ 1:
Solution 2: We can write the difference in H from one generation to the next, convert discrete time to continuous time to arrive at a differential equation, and solve it. The difference equation can be easily obtained from eq. 11 but is also given in population genetics texts44: This is very intuitive: heterozygosity is reduced by H/(2Ne) per generation as a result of drift, and is increased by 2μ(1 − H) via new mutations. The differential form of eq. 16 is: The general form of equation 17 and its integral are: Substituting y with H and x with t: Assuming π̂ ≪ 1 like before: Noting that (1 − x)^t ≅ e^(−xt), it is evident that for small x, eq. 17 approximates eq. 12. We substituted 2Ne with Ne to be compatible with the haploid model we used throughout this study when applying this equation to the planaria data (Equation 7).
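The two solutions can be written out explicitly as in the sketch below. It starts from the textbook diploid recursion for F and is a reconstruction under that assumption, not a copy of the numbered equations. The symbols λ, a and b are introduced here only for compactness.

```latex
\begin{align*}
% Textbook recursion and its equilibrium
F_t &= (1-\mu)^2\left[\frac{1}{2N_e} + \left(1-\frac{1}{2N_e}\right)F_{t-1}\right],
\qquad
\lambda \equiv (1-\mu)^2\left(1-\frac{1}{2N_e}\right) \cong 1 - \frac{1}{2N_e} - 2\mu \\
% Solution 1: subtract the equilibrium value from both sides
F_t - \hat{F} &= \lambda\,\bigl(F_{t-1} - \hat{F}\bigr)
\;\Rightarrow\;
F_t - \hat{F} = \lambda^{t}\,\bigl(F_0 - \hat{F}\bigr) \\
H_t &= \hat{H} + \lambda^{t}\,\bigl(H_0 - \hat{H}\bigr)
\cong 4N_e\mu + \bigl(H_0 - 4N_e\mu\bigr)\left(1 - \frac{1}{2N_e} - 2\mu\right)^{t} \\
% Solution 2: difference equation, then continuous time
\Delta H &= -\frac{H}{2N_e} + 2\mu\,(1-H)
\;\Rightarrow\;
\frac{dH}{dt} = 2\mu - \left(\frac{1}{2N_e} + 2\mu\right)H \\
\frac{dy}{dx} &= a - b\,y
\;\Rightarrow\;
y(x) = \frac{a}{b} + \left(y_0 - \frac{a}{b}\right)e^{-bx},
\qquad a = 2\mu,\quad b = \frac{1}{2N_e} + 2\mu \\
H_t &= \hat{H} + \bigl(H_0 - \hat{H}\bigr)\,e^{-\left(\frac{1}{2N_e} + 2\mu\right)t}
\cong 4N_e\mu + \bigl(H_0 - 4N_e\mu\bigr)\,e^{-\left(\frac{1}{2N_e} + 2\mu\right)t} \\
% Haploid form: replace 2N_e with N_e
\pi_t &\cong 2N_e\mu + \bigl(\pi_0 - 2N_e\mu\bigr)\,e^{-\left(\frac{1}{N_e} + 2\mu\right)t}
\end{align*}
```

Here a/b = 4Neμ/(1 + 4Neμ) recovers the equilibrium heterozygosity. The discrete and continuous results agree because (1 − x)^t ≅ e^(−xt) when x = 1/(2Ne) + 2μ is small.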
Acknowledgement
This work was supported by NIH Grant No. GM098741, American Cancer Society Grant No. 130920-RSG-17-114-01-RMC and the postdoctoral PBBR grant by Sandler Foundation and UCSF.
Footnotes
Several additional tests have been added to further examine replicate reproducibility. A new equation for temporal change in heterozygosity under mutation-drift non-equilibrium has been derived and applied to the data. Main conclusions (small Ne, high somatic mu) remain unchanged.