Revisiting the Origin of the Octoploid Strawberry

The cultivated strawberry, Fragaria ×ananassa, originated in France approximately 270 years ago via hybridization between two wild species introduced from North and South America. Both the cultivated strawberry and its parental species are octoploids with 2n=8x=56 chromosomes. In the recent publication of the genome of the cultivated strawberry, the authors present a novel phylogenetic hypothesis, proposing that each of the four subgenomes originated from a different 2n=2x=14 diploid progenitor. They further suggest that the hexaploid species Fragaria moschata was a direct ancestor of the strawberries. We reanalyzed the four octoploid subgenomes in a phylogenomic context, and found that only two extant diploids were progenitors, a result that is consistent with several previous studies. We also conducted a phylogenetic analysis of genetic linkage-mapped loci in the hexaploid F. moschata, and resolved its origin as independent of the octoploids. We identified assumptions in their tree-searching algorithm that prevented it from accepting extinct or unsampled progenitors, and we argue that this is a critical weakness of their approach. Correctly identifying their diploid progenitors is important for understanding and predicting the responses of polyploid plants to climate change and associated environmental stress.

Genome Database for Rosaceae (https://www.rosaceae.org/). The former was subdivided by the Edger et al. 1 based on diploid subgenome assignment into four sets of seven chromosomes. The outgroup and subgenome assemblies were converted to 20X genomic coverage of random 100 bp sequences using BBTools randomreads.sh (https://jgi.doe.gov/data-and-tools/bbtools/). The above sets of sequence reads were aligned with BWA (v 0.7.12) 14 to the Fragaria vesca v 4.1 genome assembly 15 after masking with the F. vesca v 4.1 transposable element library downloaded from the Genome Database for Rosaceae (https://www.rosaceae.org/).
The twelve resulting alignments were converted to a variant call format (vcf) file with SAMtools (v 1.9) 16 with the default settings of the mpileup and call options. The vcf file was converted into a multisample variant format (mvf) file using MVFtools (v 0.5.1.4) 17 . All heterozygous sites were converted to N, to account for the fact that the octoploid subgenome and outgroup sequences were derived from haploid genome assemblies. MVFtools was used to automate maximum-likelihood (ML) estimates of phylogeny using RAxML (v 8.2.12) 18 with the GTR+Γ model of sequence evolution and 100 bootstrap replicates. A taxon was excluded from the analysis of a 100 kb window if it had <10% of aligned sites. Analyses were conducted for the seven base chromosomes (Fig. 1b, Fig. 2) and for 2191 non-overlapping windows of 100 kb across the seven chromosomes (Fig. 1c, Fig. 3, Supplementary Table 1). The phylogenetic position of each subgenome relative to diploid species was recorded (see Fig. 3  The hexaploid linkage mapping involved four steps: quality filtering of paired-end capture reads, mapping reads to the above-mentioned diploid F. vesca v 4.1 genome assembly, genotype calling in polyploids using POLiMAPS 2 , and linkage mapping using OneMap 20 . First, we removed adapters and low-quality portions of paired-end reads using Trimmomatic (v 0.35) 12 as described above, and merged the paired-end reads using PEAR (v 0.9.6) 21 with a minimum overlap size of 20 bp. Second, we mapped both the merged and un-merged paired-end reads to the F. vesca v 4.1 reference using BWA (v 0.7.12) 14 . The sorted BAM files were generated with SAMtools (v 1.9) 16 and then were used to create the mpileup file. Third, with the mpileup file, we conducted polyploid genotype calling using POLiMAPS 2 , in which heterozygous and homozygous loci were identified with the default parameters, except for the depth of ≥32X per progeny for the hexaploid. Lastly, to construct maternal and paternal linkage groups (LGs), we assigned SNPs to the most likely LGs based on a logarithm of odds (LOD) threshold of 5 using OneMap 20 in R (v 3.3.3) 22 .
LGs with at least 20 SNPs were used for subsequent analyses.
To infer the phylogenetic replacement of F. moschata in relation to the octoploid strawberry and other diploid Fragaria, we first extracted the quality-filtered reads that contained the above LG SNPs, and mapped these reads to the F. vesca v 4.1 reference to generate consensus LG sequences using POLiMAPS. We next used POLiMAPS and the variant call format file described in the above Phylogenomic Analysis of Octoploid Subgenomes section to generate multiple sequence alignments among F. moschata LG sequences, the four octoploid reference subgenomes, diploid Fragaria genomes and the outgroup Potentilla. Phylogenetic inference was conducted for each of chromosomes 1-7 and for maternal and paternal LGs separately, using the ML method with the GTR+Γ model and 100 bootstrap replicates in RAxML (v 8.0.26) 18 .
To test the hypothesis of Edger et al. 1 that F. nipponica and F. viridis are the progenitors (sister taxa) of the respective 'Camarosa_nipponica' and 'Camarosa_viridis' subgenomes, we built the ML tree of constrained topology that reflects this hypothesis ((Camarosa_nipponica, nipponica),( Camarosa_viridis, viridis),(Camarosa_vesca,bracteata),(Camarosa_iinumae, iinumae)) ( Table 3), for each of the seven base chromosomes. These constrained trees were then compared to the ML unconstrained trees using the Shimodaira-Hasegawa (SH) test in RAxML.
To test the hypothesis of Edger et al. 1 that the hexaploid F. moschata is an evolutionary intermediate in the formation of the octoploid strawberry, we built ML trees of constrained topologies that reflect this hypothesis (Table 4). Specifically, under this hypothesis, each of the three maternal or paternal LGs should be sister to one of the octoploid subgenomes (i.e. Camarosa_viridis, Camarosa_iinumae, and Camarosa_nipponica). We built six constraint trees by alternating the combinations of the three LGs and the three octoploid subgenomes, for each of chromosomes 1-7 and for maternal and paternal LGs separately (Table 4). These constrained trees were then compared to the ML unconstrained trees using the SH test in RAxML.    -LogeL: negative log likelihood; Δ -LogeL: the difference in -LogeL between the best ML constraint tree and the ML unconstrained tree for each of chromosomes 1-7.
The P values of the SH tests are denoted as asterisks: ** , P < 0.01.  bootstrap support. This topology is shared by five of the seven chromosomes (Fig 1). c,     Table 1). For each window, we recorded the diploid species sharing the most recent common ancestor (MRCA) with each octoploid subgenome, and we further noted when this diploid was "sister" to that subgenome. If the MRCA diploid was not sister to that subgenome, we labelled these as "clade". This generally occurred when subgenomes were sister to each other, e.g. 'Camarosa viridis' and 'Camarosa nipponica' in Fig. 1b. The two subspecies of F. vesca and the closely related F. mandshurica and F. bucharica were grouped together for MRCA scoring, and combined into a single color category. We restricted the category "vesca sister" to windows where the octoploid subgenome is sister to one subspecies of F. vesca; all other positions were scored as "vesca clade". More than one diploid species shared a MRCA with multiple subgenomes in 0.2% of the 2191 trees, and these were labelled "ambiguous".
When an octoploid subgenome was excluded from a 100 kbp window (<10% of aligned sites) it was scored "absent". These regions primarily correspond to large duplications, e.g. at the 3´ end of chromosome 2 and 7 in the F. vesca subgenome.  Chromosome 1 has incomplete F. moschata LG numbers. Numbers associated with branches are ML bootstrap support values from 100 replicates.