Assembly of whole-chromosome pseudomolecules for polyploid plant genomes using outbred mapping populations

Despite advances in sequencing technologies, assembly of complex plant genomes remains elusive due to polyploidy and high repeat content. Here we report PolyGembler for grouping and ordering contigs into pseudomolecules by genetic linkage analysis. Our approach also provides an accurate method with which to detect and fix assembly errors. Using simulated data, we demonstrate that our approach is of high accuracy and outperforms three existing state-of-the-art genetic mapping tools. Particularly, our approach is more robust to the presence of missing genotype data and genotyping errors. We used our method to construct pseudomolecules for allotetraploid lawn grass utilizing PacBio long reads in combination with restriction site-associated DNA sequencing, and for diploid Ipomoea trifida and autotetraploid potato utilizing contigs assembled from Illumina reads in combination with genotype data generated by single-nucleotide polymorphism arrays and genotyping by sequencing, respectively. We resolved 13 assembly errors for a published I. trifida genome assembly and anchored eight unplaced scaffolds in the published potato genome. PolyGembler, a method for grouping and ordering contigs into complete pseudomolecules by combining long-read sequencing and genotype information from an outbred mapping population, improves the accuracy for assembly of polyploidy plant genomes.

H igh-quality genome assembly plays an essential role in plant genomic and genetic analyses. Even though the recent advances in third-generation long-read sequencing technologies significantly improved the contiguity of genome assemblies 1 , it is so far not possible to construct a complete polyploid plant genome with sequence data alone due to polyploidy and high repeat content. Long-distance linkage information, such as physical maps, genetic maps, optical maps, syntenic maps, chromatin interactions and Hi-C contact maps, is crucial in constructing complete genome assemblies. Genetic maps have been widely adopted as they provide chromosomal-scale linkage information. The construction of many high-quality chromosomal-level genome assemblies for plants involved scaffolding with genetic linkage maps as the last step for building pseudomolecules 2,3 . The idea is straightforward: a genetic map is constructed for genetic markers associated with scaffolds and the map is then used to anchor scaffolds to construct pseudomolecules 4 . The completeness of the pseudomolecules largely depends on the density of the genetic linkage map. To obtain a full map of chromosomes, genetic markers should cover as much of the genome as possible. However, genome-wide genetic marker discovery for a mapping population with a considerable number of individuals is nontrivial. Several methods have been proposed for high-throughput genetic marker discovery with cost-effective next-generation sequencing (NGS) technology, such as reduced-representation libraries 5 , restriction site-associated DNA sequencing (RAD-seq) 6 and genotyping by sequencing (GBS) 7 .
The genotype data generated with NGS-based high-throughput genetic marker discovery methods, including reduced-representation libraries, RAD-seq and GBS, are usually of large scale with an abundance of missing values and genotyping errors. Genetic linkage analysis for such datasets is challenging. The conventional genetic mapping tools, such as MAPMAKER 8 and R/qtl 9 , have been optimized for relatively small but high-quality marker sets. Moreover, these tools were designed for inbred lines. The development of inbred lines, however, could be difficult, expensive and time consuming, especially for polyploids 10 . Several methods have since been proposed for outbred lines, including OneMap 10 , JoinMap 11 and Lep-MAP2 (ref. 12 ), but only for diploids. TetraploidMap was the first tool specifically designed for mapping outbred tetraploid species based on dominant and codominant marker information 13 . This tool was later extended to TetraploidSNPMap to make full use of the single-nucleotide polymorphism (SNP) dosage data 14 . polymapR is another genetic mapping tool for outbred polyploids utilizing SNP dosage data 15 . Due to the underlying design, however, these methods are intrinsically sensitive to missing values and genotyping errors, and thus require stringent quality control on the input genetic markers. For whole-chromosome pseudomolecule construction, this could dramatically reduce the proportion of the genome covered by the genetic linkage map.
Here we describe a novel method for the construction of whole-chromosome pseudomolecules for polyploids by genetic mapping. This method relies on the availability of a high-density marker Technical RepoRT NATUrE GENETIcs set on an F 1 outbred population and reference contigs or scaffolds. We first perform haplotype phasing for the mapping population at the scaffold level. We then estimate the genetic distance between each pair of scaffolds with the phasing results and further perform a linkage analysis to construct genetic linkage maps for scaffolds, which are used to build whole-chromosome pseudomolecules. This method is computationally efficient and robust to the presence of substantial amounts of missing genotype data and genotyping errors, and thus can well handle data generated using NGS-based high-throughput genetic marker discovery methods. Using simulated datasets, we demonstrate substantial improvements of our approach over existing genetic mapping algorithms and the ability to construct whole-chromosome pseudomolecules for both diploid and tetraploid genomes. We applied our method to the construction of pseudomolecules for several real datasets, including GBS data for the diploid I. trifida, SNP array data for the autotetraploid potato, and RAD-seq data for the allotetraploid Zoysia japonica. The resultant pseudomolecules showed high collinearity with the reference genomes.

results
Overview of method. We have developed a new method called PolyGembler (polyploid genetic linkage assembler) for the assembly of polyploid genomes by genetic linkage analysis. Figure 1 provides an overview of PolyGembler. The method assumes the availability of genome-wide data for genotyping, such as GBS, RAD-seq and array data, collected on an F 1 outbred family, as well as high coverage (that is, greater than 30×) whole-genome sequence data on a reference sample, or alternatively a set of reference contigs or scaffolds. First, the genotyping data of the mapping population are mapped to the reference scaffolds to call variants (Fig. 1a). The resultant genotype data or allele depth data are then used to infer haplotypes for each scaffold. The haplotypes are utilized to detect assembly errors and to calculate recombination fractions (RFs) between each pair of scaffolds ( Fig. 1b-d). Next, the linkage information between the scaffold pairs is used for a graph-based clustering algorithm for linkage group construction (Fig. 1e). Ideally, the scaffolds in each linkage group are originated from the same chromosome. The order of the scaffolds in each linkage group is determined by running a multidimensional scaling (MDS) algorithm and the directions of the scaffolds are calculated by solving a carefully designed traveling salesman problem (TSP) to minimize the sum of adjacent RFs along the linkage map (Fig. 1f). Finally, the scaffold-based genetic linkage maps are used to construct pseudomolecules.
Application to simulated data. Simulated data were used to evaluate PolyGembler. First, we simulated a diploid genome and a tetraploid genome. Genome-wide Illumina short reads and nanopore long reads were then simulated, respectively, from the diploid and tetraploid genome for genome assembly. To construct pseudomolecules from the genome assemblies, we also simulated outbred mapping populations consisting of two parents and 190 F 1 progeny and then further simulated GBS data (Methods).

NATUrE GENETIcs
total of 3,348 scaffolds of ~338 Mb were covered by these SNPs, constituting ~69% of the genome. Among the 20 assembly errors confirmed with the Quality Assessment Tool for Genome Assemblies (QUAST; Methods), PolyGembler identified three without false positives. To investigate the accuracy of RF estimation, we mapped the scaffolds back to reference chromosomes to determine the physical positions (Methods). For scaffold pairs that located on the same chromosomes, the estimated RFs correlated with the physical distances very well, while for scaffold pairs that mapped to different chromosomes the RFs were consistently large (Fig. 2a,b). The linkage groups constructed by this method perfectly recognized 15 reference chromosomes. Within each linkage group, the order of the mapped scaffolds demonstrated high correlation with the true order on the corresponding reference chromosome (Fig. 2e). The Spearman's rank correlation coefficients between them ranged from 0.967-1.000.
Tetraploid simulation. A total of ~10.9 million nanopore reads of ~48.5 gigabases (Gb) with an N50 of ~6.7 kb were simulated, representing ~100× depth of coverage of the genome (Methods). The resultant genome assembly consisted of 722 scaffolds of ~484 Mb with an N50 of ~1.1 Mb. The GBS data were simulated at ~40× around restriction recognition sites (Supplementary Note 1). We identified 58,733 SNPs from the GBS data located on 707 scaffolds for pseudomolecule construction. These scaffolds covered nearly the whole genome. Out of the 135 misassembled scaffolds confirmed with QUAST (Methods), PolyGembler identified 86 without false positives. The results for RF estimation were similar to those for diploid simulation (Fig. 2c,d).
Fifteen pseudomolecules corresponding to reference chromosomes were constructed without misgrouping. The resultant pseudomolecules were highly collinear with reference chromosomes (Fig. 2f). The Spearman's rank correlation coefficients between the scaffold order determined by PolyGembler and the true order fell within the range of 0.956-1.000. To investigate the proposed method thoroughly, we also simulated GBS data at ~20×. With the lower-depth data, the number of assembly errors detected by PolyGembler significantly decreased (19 out of 135). However, the results for linkage mapping and pseudomolecule construction were still quite accurate (Extended Data Fig. 1).
Application to I. trifida GBS data. The I. trifida mapping population consists of two parents and 210 F 1 progeny. It was derived from the crossing of two diploid I. trifida heterozygous genotypes and is called M9 × M19. We ran PolyGembler with the GBS data of this mapping population to construct pseudomolecules from the de novo genome assembly ITR_r1.0 (ref. 16 ). The ITR_r1.0 genome assembly consists of 77,400 scaffolds. The total size is ~512 Mb and the N50 statistic is ~43 kb. We identified 35,646 SNPs covering 2,469 scaffolds of ~231 Mb, constituting ~45% of the genome. We used another I. trifida genome assembly, NCNSP0306 (ref. 3 ), as a reference genome to validate the results. We detected 13 misassembled scaffolds in ITR_r1.0. Figure 3a depicts the RF estimations along the scaffold Itr_sc000015 with 30 independent runs. There was a potential misassembly around 177 kb, where the mean RF estimation was 0.432 (s.d. = 0.012). By mapping it to NCNSP0306 chromosomes, we found that while the first ~166 kb was mapped to chromosome 4, the last ~138 kb was mapped to chromosome 8 (Fig. 3b,c). The mapping results supported the misassembly of scaffold Itr_sc000015, as well as the misassembly position. The other misassemblies detected by PolyGembler were also confirmed by mapping to NCNSP0306 chromosomes (Supplementary Table 1). We split the potentially misassembled ITR_r1.0 scaffolds to generate new scaffolds and excluded the original ones from linkage analysis.
We mapped the scaffolds to the NCNSP0306 chromosomes to investigate the accuracy of RF estimation. When the physical distance was small, it showed a strong correlation with the estimated RFs. As the physical distance increased, the estimated RFs clearly deviated from the expected values calculated from conventional genetic mapping functions (Fig. 4a). One possible explanation is that the calculation of physical distances between scaffolds was not accurate due to structural variations between the two genome varieties or misassemblies in the reference chromosomes. Another possible reason, and probably the major reason, is the absence of genetic recombinations in large pericentromeric heterochromatin blocks in the I. trifida genome, as has been widely reported for the potato genome 17 . The RFs for scaffolds mapped to different chromosomes were consistently large (Fig. 4b).
Fifteen pseudomolecules in high concordance with the reference chromosomes were constructed (Fig. 4e). The pseudomolecules were not well assembled in several pericentromeric regions. This was mainly because these regions are highly repetitive and the original genome assembly ITR_r1.0 is extremely fragmented in these regions. To complete the genome assembly, we probably need to improve the quality of scaffolds in these regions; for example, with long reads. There was a clear discrepancy between the pseudomolecule and NCNSP0306 chromosome 3 (Fig. 4e). It is not known whether it was a misassembly by our approach, a misassembly in NCNSP0306 chromosome 3 or a structure variation between the two genome varieties. We mapped the pseudomolecule to the genome assembly of Ipomoea nil 18 -another species from the genus Ipomoea. The result showed that the pseudomolecule represented a better collinearity with I. nil chromosome 2, suggesting that NCNSP0306 chromosome 3 might be misassembled (Extended Data Fig. 2).

NATUrE GENETIcs
Application to Infinium 8303 potato array data. Two outbred mapping populations genotyped with the Infinium 8303 SNP array were used to construct pseudomolecules from Potato Genome Sequencing Consortium (PGSC) doubled-monoploid version 3 scaffolds 17 . The B2721 mapping population was derived from the cross Atlantic × B1829-5, consisting of 156 F 1 progeny. A total of 5,736 SNPs were identified covering 721 scaffolds, of which 175 scaffolds with fewer than three SNPs were excluded from the further

NATUrE GENETIcs
analysis. The other mapping population was derived from the cross 12601ab1 × Stirling, consisting of 190 F 1 progeny. The variant calling ended up with 4,109 SNPs from 679 scaffolds, of which 455 scaffolds with at least three SNPs were retained. The phasing algorithm failed to generate any feasible haplotype calls for 79 and 100 scaffolds, respectively, for the two mapping populations due to a high level of homozygosity (Supplementary Note 2). The pairwise RF estimations showed very similar patterns to those for the I. trifida dataset (Fig. 4c,d and Extended Data Fig. 3a,b). For genetic linkage map construction, scaffolds were assigned to 12 linkage groups for both dataset, corresponding to the 12 PGSC version 4.03 chromosomes, without misassignment. The order of scaffolds within each linkage group was consistent with the order on chromosomes despite a few incorrect orderings for closely located scaffolds ( Fig. 4f and Extended Data Fig. 3c). The shapes of the genetic linkage maps were nearly identical to those constructed from TetraploidSNPMap (Extended Data Figs. 4 and 5). The total sizes of the anchored scaffolds were ~429 and ~366 Mb, constituting ~59 and ~53% of the genome, respectively. We anchored eight scaffolds of ~2.8 Mb, which were not placed in any PGSC version 4.03 chromosomes, from the two mapping populations (Supplementary Table 2).
Application to Z. japonica RAD-seq data. Z. japonica has an allotetraploid genome with 20 chromosomes. A total of ~4.2 million PacBio reads of ~38.6 Gb with an N50 of ~12.2 kb were assembled for the Z. japonica cultivar Yaji (Methods). The resultant genome assembly consisted of 1,075 scaffolds of ~329 Mb with an N50 of ~2.6 Mb. The RAD-seq data of a mapping population comprising 60 F 1 progeny derived from a cross between Carrizo and El Toro 19 were used for pseudomolecule construction with PolyGembler. We identified 25,080 SNPs covering 297 scaffolds of ~303 Mb, constituting ~92% of the genome. PolyGembler identified six misassembled contigs of approximately 24 Mb, which were split before linkage analysis. The final genetic map consisted of 21 linkage groups, of which a small linkage group comprising two contigs of ~222 kb in total was excluded from the pseudomolecule construction. The resultant . Column N reports the number of SNPs in the genetic linkage map. N was sometimes <1,000 because SNPs were either filtered out or failed to be assigned to any linkage group. Column N is not reported for PolyGembler, as it was always 1,000. PolyGembler (2) and PolyGembler(4) report the results for diploid and tetraploid simulations, respectively. -, no meaningful results reported.

Technical RepoRT
NATUrE GENETIcs pseudomolecules can be grouped into ten homoeologous pairs, each with similar length, ranging from approximately 10-23 Mb. We mapped the pseudomolecules to the rice reference genome (Oryza sativa release 7) 20 for validation (Fig. 5). The pseudomolecules were mapped in pairs and showed clear collinearity with the rice chromosomes, suggesting a high degree of authenticity. The dataset was also used for genetic linkage map construction in the originally study 19 . The genetic linkage maps reported for the parents Carrizo and El Toro consisted of 2,408 and 1,230 RAD markers, distributing across 21 and 20 linkage groups, respectively. Compared with the genetic linkage maps constructed using our approach, the genetic markers included were significantly fewer. The genetic linkage map for the parent El Toro was further used in another study for genome assembly of Z. japonica accession Nagirizaki for pseudomolecule construction 21 . In that study, total sequences of ~273 Mb were anchored in pseudomolecules (30 Mb fewer than the pseudomolecules constructed in the current study). We performed pairwise comparisons between the pseudomolecules from the two genome assemblies. The results showed that the two genome assemblies were highly consistent (Extended Data Fig. 6).
Comparison with well-known genetic mapping tools. We compared our genetic mapping method with three existing tools designed for outbred mapping populations, including OneMap (version 2.0-4) 10 and Lep-MAP2 (ref. 12 ) for diploid genomes and TetraploidSNPMap 14 for tetraploid genomes. The marker sets used for comparisons consisted of 1,000 SNPs sampled from a diploid and a tetraploid simulated outbred mapping population comprising 190 F 1 progeny with known linkage groups and order (Methods). The missing data rates ranged from 0-0.8, while the genotyping error rates ranged from 0-0.4. For each configuration, we conducted 30 independent experiments for PolyGembler, OneMap and Lep-MAP2 and five experiments for TetraploidSNPMap, for which only a graphical user interface was provided by the author. We compared the accuracy for grouping and ordering, as well as the running time. The accuracy for grouping was measured using the normalized mutual information (NMI) between the constructed linkage groups and true groups 22 . NMI ranges from 0-1. Larger NMI values suggest more similar groupings, and specifically NMI = 1 indicates identical groupings. The accuracy for ordering was measured by the switch error rate (R e ) of the marker order on the genetic linkage map compared with the true order on chromosomes 23 . We also investigated the phasing accuracy of PolyGembler using the correct phasing rate (R c ) as an evaluation, which was calculated as the proportion of the correctly phased loci with respect to the true haplotypes (Methods).
The results are reported in Table 1, with details in Supplementary  Table 3. Evidently, PolyGembler was never worse and often better than the other methods in accuracy for grouping, and consistently outperformed the other methods in accuracy for ordering. Besides, PolyGembler can handle datasets with much higher missing and error rates. Here we call a linkage map with NMI ≥ 0.9 and R e ≤ 0.1 acceptable. For the diploid simulation, we gained acceptable results for datasets constituting up to 70% missing genotypes from PolyGembler and 40% missing genotypes from Lep-MAP2. With OneMap, we obtained accurate groupings for datasets with up to 50% missing data, but the orderings were consistently error prone. The performance on the genotyping errors showed a similar trend. The largest error rate that PolyGembler could handle reached 30%, while for Lep-MAP2 it dropped to 5%. For OneMap, the groupings were perfect for datasets with up to 10% genotyping errors; however, the orderings were inaccurate. PolyGembler also outperformed the other two methods for the data with mixtures of missing data and genotyping errors. PolyGembler produced acceptable results for up to 30% missing data and 10% genotyping errors, while for Lep-MAP2 acceptable results were generated for up to 20% missing data and a 5% error rate. Lep-MAP2 filtered out a significant proportion of SNPs for high missing data or genotyping error rates. Regarding the running time, PolyGembler was the fastest, followed by Lep-MAP2, and both were much faster than OneMap. Lep-MAP2 was occasionally faster than PolyGembler as it discarded many SNPs. For tetraploid datasets, PolyGembler consistently outperformed TetraploidSNPMap in both grouping and ordering. TetraploidSNPMap generated acceptable results for datasets of missing data up to 40% or genotyping errors up to 5%, or a combination of them up to 10% missing data and a 2% error rate. In contrast, PolyGembler can handle datasets of missing data up to 50% or genotyping errors up to 20%, or a combination of 30% missing data and a 10% error rate. For speed, TetraploidSNPMap had an average running time of less than 1 min, which was substantially faster (40-50×) than PolyGembler. A high level of phasing accuracy was observed for PolyGembler. The R c was consistently >95% with diploid datasets of missing data up to 70% or genotyping errors up to 20%, or a combination of them up to 30% missing data and a 10% error rate. For tetraploid datasets, the missing and error rates that could be well handled by PolyGembler slightly decreased. To gain an R c of the same level, the datasets needed to be with a missing rate no greater than 40% or an error rate no greater than 10%, or a combination of them no greater than 20% and 5%, respectively. Since we included the SNP loci for introducing missing data and genotyping errors in the calculation of R c (Methods), the results indicated that PolyGembler imputed the missing data as well as corrected genotyping errors with high accuracy.

Discussion
We have described a genetic anchoring method that harnesses genotype data from a mapping population and reference genome assembly to construct chromosomal-scale pseudomolecules. Several methods have been proposed under this framework, such as POPSEQ 24 and RPGC 25 . These methods have mainly focused on integration of established tools for read mapping, variant calling and genetic mapping, to build computational pipelines for organizing contigs or scaffolds. Since they largely depend on conventional genetic mapping algorithms, these methods have several limitations. First, they require high-quality genetic markers and lack robustness to genotyping errors and missing data. However, variant calling for such high-quality genetic markers from genome-wide genotyping sequence data, especially from low-coverage sequence data, is nontrivial. Second, these methods do not scale to large datasets. The conventional genetic mapping tools have been designed to handle datasets with up to several thousand genetic markers but do not scale to hundreds of thousands of genetic markers generated with the current whole-genome genotyping approaches. Lastly, these methods can rarely be used for polyploid genomes. To solve these problems, we used a divide-and-conquer strategy for genetic anchoring. We first performed haplotype phasing using a hidden Markov model (HMM) at the scaffold level. The genetic distances between each pair of scaffolds were then estimated from the haplotypes and further used to perform a scaffold-level linkage analysis. Finally, the genetic linkage maps of scaffolds were converted to whole-chromosome pseudomolecules.
Scaffold-level linkage analysis effectively reduces the computational complexity. Presumably, a well-designed heuristic algorithm is required to order the markers within each linkage group as it is a non-deterministic polynomial hard problem. However, in our method, the size of the problem is dramatically reduced, which enables us to combine the MDS algorithm with the exact TSP solver CONCORDE to order markers efficiently and accurately. The scaffold-level design also enables us to perform haplotype phasing in a highly parallel manner. In the present study, the method was used to handle datasets with up to 3,348 scaffolds. Moreover, the computational time and resources required for haplotype phasing were almost linearly related to the number of genetic markers, and the

Technical RepoRT
NATUrE GENETIcs method can easily handle scaffolds with up to 10,000 genetic markers (Extended Data Fig. 7). The high scalability makes it possible to apply the method to other more extensive genotyping techniques, such as whole-genome or transcriptome re-sequencing approaches, which could have millions of genetic markers. The generalization of the proposed method to high-ploidy genomes is straightforward. In this study, we focused on diploids and tetraploids. However, this method can construct genetic linkage maps for higher-ploidy genomes. Our biggest challenge when dealing with higher-ploidy species is computational, especially in the haplotype phasing step. For hexaploids, the number of hidden states for HMM increases to 14,400. This is a large number of states but remains computationally tractable. However, for even higher levels of ploidy, computation becomes difficult.
The fundamental idea for haplotype phasing is to model the process of gamete formation of the parents with a Markov chain that starts with the first and ends with the last genetic marker along a chromosome. Similar models were used in PolyHap 26 and TetraOrigin 27 . PolyHap assumed a given number of ancestral haplotypes shared by the target population. The recombinations were allowed between any ancestral haplotypes, leading to an intractably large state space as the number of ancestral haplotypes and ploidy increased. The state spaces for TetraOrigin and our method are nearly identical except that TetraOrigin also considered double reductions. Double reduction was not considered in our method mainly because the accurate identification of them requires high-quality genotyping data. TetraOrigin used SNP array data to detect double reductions 27 . However, with low-coverage GBS or RAD-seq data, it is difficult to obtain such high-quality genotypes without filtering out considerable markers, which is undesirable if we want to cover the whole genome. Even though double reduction could introduce some bias in RF estimation, it can be safely ignored if quadrivalent pairing is rare. For organizing the assembly scaffolds, the bias is negligible.
We have demonstrated the ability of our method in the construction of whole-chromosome pseudomolecules for genome assembly. For the real datasets, the genome coverage of the pseudomolecules ranged from 45% for the diploid I. trifida to 92% for the allotetraploid Z. japonica. The huge difference arose because of the quality of the input genome assemblies, especially the contiguity. The scaffold N50 statistics of the input genome assemblies were approximately 43 kb and 2.6 Mb, respectively, for I. trifida and Z. japonica. Even though generating high-quality plant genome assemblies remains challenging, the recent advances in the sequencing technologies 28,29 and related computational tools facilitate the construction of highly contiguous genome assemblies even for very complex plant genomes 30 . In the context of the rapid development of genome assembly techniques, it is hoped that our method could serve as an option among the others to finish a genome assembly.

Online content
Any methods, additional references, Nature Research reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/ s41588-020-00717-7.

NATUrE GENETIcs
Methods Data simulation. We simulated a reference genome with the sequence data of I. trifida 16 . The sequence data were first assembled to generate contigs (Supplementary Table 4). Contigs were then used to generate complex repeats considering the copy number of the contigs inferred from the sequence depth (Supplementary Table 5 and Supplementary Note 1). Finally, we randomly concatenated these repetitive sequences and generated a reference genome comprising 15 chromosomes with a total size of ~482 Mb (Supplementary Table 6). We simulated the genotypes of a diploid and a tetraploid outbred mapping population both consisting of two parents and 190 F 1 progeny using PedgreeSim (version 2.0) 31 . The genotype data for comparison studies were randomly sampled from the mapping populations (Supplementary Note 1). To simulate GBS data, we mapped the genotypes back to the reference genome to generate individual genomes and simulated the GBS protocol including sequence errors to produce GBS sequences 7 (Supplementary Note 1). To simulate the reference genome assemblies for diploid genomes, we simulated Illumina reads with ART (version 2.5.8) 32 , followed by DISCOVAR de novo (version 52488) 33 assembly, whereas for tetraploid genomes, we simulated nanopore reads with DeepSimulator (version 1.5) 34 and used Flye (version 2.5) 35 for assembly (Supplementary Note 1).
Construction of Z. japonica genome assembly. The publicly available whole-genome PacBio sequence data for the Z. japonica cultivar Yaji were downloaded from the National Center for Biotechnology Information (NCBI) repository under the accession number SRP110561. The sequence data consist of ~4.2 million reads, with a total size of ~38.6 Gb and an N50 read length of ~12.2 kb. Flye (version 2.5) 35 was used to assemble the sequence data with default parameters.
Variant calling for mapping populations. The variant calling for GBS and RAD-seq data was performed with TASSEL-GBS (version 5.2.43) 36 and Stacks (version 2.41) 37 , respectively, using default parameters. Then, we extracted the allele depth of each SNP and performed a new round of genotyping with the R package updog (version 1.1.3) 38 , taking into account that the mapping population was a full-sib family resulting from one generation of a biparental cross. The R package fitPoly (version 3.0.0) was used to call genotypes from the signal intensities for Infinium 8303 SNP array data. We filtered out SNPs represented significant distortions to the expected segregation ratio. Precisely, for each SNP and for each possible combination of parental genotypes, we first removed F 1 progeny with no genotypes or unexpected genotypes with respect to the parental genotypes (that is, genotyping errors). Then, if we had at least 50 F 1 progeny retained, we performed an exact multinomial test using the xmulti function from the R package XNomial (version 1.0.4) to measure the segregation. We kept a SNP site if there existed a combination of parental genotypes such that the genotyping error rate was no greater than 5% and the P value for the exact multinomial test was no smaller than 0.001. We chose a small P value threshold to include more SNPs for linkage analysis. The PGSC provided a coordinate file describing the positions of Infinium 8303 potato array SNPs on the PGSC version 4.03 pseudomolecules. This was used to anchor the SNPs to PGSC doubled-monoploid version 3 scaffolds 17 .

Scaffold haplotype phasing.
Here we describe our model for haplotype phasing. By solving this model, we can reconstruct the underlying inheritance pattern of the outbred mapping population, to infer the parental haplotypes and which of these haplotypes have been inherited by each F 1 progeny.
Notation. We write a collection in square brackets to indicate an ordered list, while the use of wavy brackets indicates an unordered list. We write π = [π 1 ,...,π u ], a permutation of the sequential number 1 to u. Then, for an unordered list Consider an outbred mapping population derived from a cross between two individuals P and Q with ploidy level h (so h = 2 for diploid genomes), and h is an even number greater than zero. Denote by F an F 1 progeny derived from P and Q and we have n F 1 progeny. For a chromosome with m markers, we assume the availability of allele depth data or genotype data collected on each individual at each marker. We only consider biallelic markers. Denote by x j and y j the total and reference allele count, respectively, for an individual at the jth marker, j = 1,...,m. The individual could be either a parent or F 1 progeny. Here we drop the subscripts of individuals for convenience as we only need one individual to demonstrate the algorithm.
We use a h × m matrix to represent the h copies of the chromosome for an individual. Two parents are indexed (that is, P =  that f 1 ,...,f h/2 and f h/2+1 ,...,f h were passed to F by parents P and Q, respectively. For example, f i could be [1 ×4 2 ×6 ], which indicates that the chromosome is a recombinant of parental haplotypes 1 and 2 inherited from a gamete of P, and the recombination occurred between the fourth and fifth marker. Now, consider all chromosomes together at the jth marker. Obviously, we have P j = [1,...,h] and Q j = [h + 1,...,2h], irrespective of j. There are Perm(h,h/2) 2 possible values for F j (that is, the product of types of gametes the two parents can produce), where Perm calculates the partial permutations. Without considering the order of chromosomes within gametes, there are Comb(h,h/2) 2  and P jþ1 I (that is, the jth and (j + 1) th markers on the chromosome). We introduce a parameter set α = [α 1 ,...,α m−1 ] to model the transitions between the homologous chromosomes during the meiosis process. Precisely, α j is the probability of a jump occurring between different homologous chromosomes between the jth and (j + 1)th markers, j = 1,...,m − 1. This can be written as: The parameter α j is positively related to the physical distance between the jth and (j + 1)th markers. When the two markers are tightly linked, α j should be extremely small to prevent haplotype states from jumping frequently. Actually, α j could be used to approximate the RF between the two markers. The transition probability between P j I and P jþ1 I is then written as: For unphased P 0 j I and P 0 jþ1 I , the transition probability is written as: Since the inheritance of chromosomes from two parents is independent, the transition probability between F j and F j+1 in the unphased mode is then written as: Likelihood of allele depth data. For an individual, the probability of observing the reference allele at a marker with parental chromosome subset s ⊆ S is calculated as: We drop the subscript for s, and θ j is calculated for each s. The count of the reference allele at the jth marker, y j , follows a binomial model of probability θ j with x j trials; that is: Likelihood of genotype data. With genotype data, x j and y j represent the total allele count and reference allele dosage, respectively. Apparently, we have x j = h, irrespective of j, and y j = 0,...,h. For a chromosome set s ⊆ S, y j follows a Poisson binomial distribution (that is, the sum of h independent Bernoulli trials with success probabilities associated with s). The probability of y j = g, where g = 0,...,h, is calculated as: Technical RepoRT

NATUrE GENETIcs
where s g is the collection of all subsets of g elements that can be selected from s. A c is the complement of A with respect to s (that is, A c = s\A More computational details. The model described above can be summarized as a HMM. Each marker is a time step; the hidden states are the underlying parental chromosomes (S) and the observations are the allele depth or the genotype data collected on the mapping population. The transition probabilities are calculated with equation (4) and the emission probabilities are calculated with equation (6) or (7). The parameters for the model include α = [α 1 ,...,α m−1 ] and {θ ij }, where i = 1,...,2h and j = 1,...,m. The Baum-Welch algorithm was used to estimate these parameters. We used Dirichlet priors on all parameters. Let α j  1 2 Beta μ α 1 � e �2ρδj ð Þ ; μ α e �2ρδj ð Þ I , where δ j is the physical distance between the jth and (j + 1)th markers along the chromosome, and ρ = 10 −8 per base pair, reflecting the background recombination rate. Let θ ij ∼ Beta(μ θ (1 − β j ),μ θ β j ), where β j is the B allele frequency estimated from the genotype data for the jth marker. We use μ α = 10 5 and μ θ = 10 for initialization of the expectation-maximization algorithm and μ α = μ θ = 0.1 for the maximization step.
RF estimation. The RF between two markers was estimated by the proportion of the number of recombinants to the total number of haplotypes in the F 1 progeny. Assume a genome of ploidy h and a mapping population of n F 1 progeny, and r recombinants between the two markers in the F 1 haplotypes. The RF was then calculated as: Two kinds of RFs were estimated in the proposed method: (1) to detect assembly errors, we calculated RFs between adjacent markers along the scaffold; and (2) to estimate genetic distance between two scaffolds, we calculated RFs between the outermost markers. The RF calculations within a scaffold were straightforward. We counted the number of jumps from one parental haplotype to another in F 1 haplotypes. However, the calculation of RF between two scaffolds needed to consider all four possible directions as the orientations of the scaffolds on the chromosome were unknown. Besides, as we ran the haplotype phasing algorithm for each scaffold independently, we considered all possible correspondences between parental haplotypes. See Supplementary Note 3 for more computational details and an example.
Assembly error detection. To detect misassemblies, RFs between the adjacent markers along a scaffold were calculated. If the RF between two adjacent markers was larger than the predefined threshold 0.1, we considered it a misassembly at that position. Apparently, the physical distance between two markers should be taken into consideration. However, it is difficult to derive a universal relationship between the physical distances and RFs, especially when the physical distance becomes large. Therefore, we did not consider positions where the physical distance between the adjacent markers was larger than 1 Mb. The incorrect scaffolds detected by the algorithm were split at the misassembled positions to generate new scaffolds. We ran haplotype phasing analysis on the new scaffolds and recalculated the pairwise RFs. For the genome assemblies generated from the simulated data, QUAST (version 5.0.2) 39 was employed to detect the genuine misassemblies by comparing the scaffolds with the reference chromosomes with default parameters. The genuine misassemblies identified by QUAST were used to validate the misassemblies detected by the proposed method.
Superscaffold construction and multipoint analysis. We introduced multipoint haplotype phasing analysis to improve the accuracy of RF estimations. The multipoint analysis ran the haplotype phasing algorithm on superscaffolds, which were constructed by nearest neighbor joining. Precisely, we chose the scaffold pair that represented the smallest RF. The scaffolds were combined to form a superscaffold if the RF was smaller than the predefined threshold 0.27 (approximately 30 cM with the Kosambi mapping function). The gap between the two scaffolds was measured by the RF. The RFs between the superscaffold and the other scaffolds were then updated. This process was repeated until there was no such closely linked scaffold pair. Multipoint analysis is crucial for short scaffolds as the number of markers that mapped to them could be small, which makes haplotype phasing solely based on them inaccurate.
Linkage group construction. We constructed a graph for linkage group construction with scaffolds as vertices and genetic linkages as edges. To simplify the graph, we excluded edges with RFs greater than the predefined threshold 0.38 (approximately 50 cM with the Kosambi mapping function). The retained edges were weighted based on the RFs. For an edge with RF γ, the weight was calculated as K(0.5) − K(γ), where K(.) is a function that converts the RFs to genetic distances with the Kosambi mapping function. For graph clustering, an information-theoretic approach, as implemented in the R package igraph 40 , was employed, which detects the modularity structure of a graph by minimizing the expected description length of the trajectory of a random walk 41 .
Scaffold ordering within linkage groups and pseudomolecule construction. The scaffold ordering within each linkage group was performed with the MDS algorithm, as implemented in the R package MDSMap (version 1.1) 42 . Once the order of the scaffolds was determined, we further performed a TSP to find the optimum directions of the scaffolds to minimize the sum of adjacent RFs. The start and end of a scaffold were treated as two independent points in the TSP. The TSP was carefully designed so that in the optimum solution: (1) the Hamiltonian circuit can be equivalently converted to a Hamiltonian path, as is required for scaffold ordering; (2) the order of scaffolds determined by the MDS algorithm remains unchanged; and (3) the start and end of a scaffold are adjacent (Supplementary Note 3). The exact TSP solver CONCORDE (release 03. 12.19) was employed in this research to find exact solutions of TSPs to guarantee feasible scaffold orderings. The genetic linkage maps of scaffolds were converted to pseudomolecules. The adjacent scaffolds were joined with 10-kb gaps.
True linkage groups, scaffold orders and distances, and collinear blocks. We mapped scaffolds to the reference chromosomes using Minimap2 (version 2.17) 43 . A scaffold was considered as being mapped to a reference chromosome if at least 50% of the base pairs were collinear with that, and only that, chromosome. The true linkage groups of scaffolds were determined by the reference chromosomes they mapped to. The order of the scaffolds within a linkage group and the pairwise distances were determined by the positions on the chromosome. We used MUMmer (version 3.23) 44 for mapping pseudomolecules to reference chromosomes to detect collinear blocks.
Calculation of the correct phasing rate. The correct phasing rate is calculated as the proportion of the correctly phased loci with respect to the true haplotypes 45  , where m is the number of genetic markers and 1 is the indicator function. The correct phasing rate R c is defined as: For a dataset of multiple individuals and multiple scaffolds, we calculated R c as the total number of matched loci divided by the total number of loci. To evaluate the power of imputing missing data and the robustness in handling genotyping errors for PolyGembler, the haplotypes before introducing missing data and genotyping errors were used as true haplotypes.
Statistical testing. For SNP filtering, we performed exact multinomial tests with the xmulti function from the R package XNomial (version 1.0.4) to check whether the segregation of a SNP followed the law of segregation. The log-likelihood ratio was used as the test statistic to calculate P values. The original sample sizes for the M9 × M19, B2721, 12601ab1 × Stirling, Carrizo × El Toro and simulated mapping populations were 210, 156, 190, 60 and 190, respectively. We required a minimum sample size of 50 for testing after removing samples with no genotypes or unexpected genotypes with respect to the parental genotypes. We rejected the null hypothesis if the P value was <0.001. To compare PolyGembler with OneMap and Lep-MAP2, we performed two-tailed paired-sample t-tests to compare the accuracy for grouping and ordering and the central processing unit (CPU) time (Supplementary Table 3). The sample size was 30.
Reporting Summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
Data for the simulation studies, including comparisons with other methods and studies of M9 × M19 I. trifida and the B2721 potato, are available from http://data. genomicsresearch.org/Projects/polyGembler. Data for the 12601ab1 × Stirling potato mapping population were provided by C. Hackett. Data for the Z. japonica mapping population Carrizo × El Toro are available from the NCBI repository under the accession number SRP055007. The whole-genome PacBio sequence data for the Z. japonica cultivar Yaji are available from the NCBI repository under the accession number SRP110561. Data related to the PGSC version 4.03 pseudomolecules are available from http://solanaceae.plantbiology.msu. edu. The I. trifida de novo genome assembly ITR_r1.0 is available from http:// sweetpotato-garden.kazusa.or.jp. The I. trifida de novo genome assembly NCNSP0306 is available from http://sweetpotato.plantbiology.msu.edu. Release 7 of the O. sativa reference genome is available from http://phytozome.jgi.doe.gov. The genome assembly of the Z. japonica accession Nagirizaki is available from http://zoysia.kazusa.or.jp. Source data are provided with this paper.

Code availability
The software PolyGembler, presented in this article, and its documentation are publicly available at GitHub (https://github.com/c-zhou/polyGembler).

nature research | reporting summary
April 2020 Corresponding author(s): Lachlan Coin Last updated by author(s): May 7, 2020 Repor�ng Summary Nature Research wishes to improve the reproducibility of the work that we publish. This form provides structure for consistency and transparency in repor�ng. For further informa�on on Nature Research policies, see our Editorial Policies and the Editorial Policy Checklist.

Sta�s�cs
For all sta�s�cal analyses, confirm that the following items are present in the figure legend, table legend, main text, or Methods sec�on.

n/a Confirmed
The exact sample size (n) for each experimental group/condi�on, given as a discrete number and unit of measurement A statement on whether measurements were taken from dis�nct samples or whether the same sample was measured repeatedly The sta�s�cal test(s) used AND whether they are one-or two-sided Only common tests should be described solely by name; describe more complex techniques in the Methods section.
A descrip�on of all covariates tested A descrip�on of any assump�ons or correc�ons, such as tests of normality and adjustment for mul�ple comparisons A full descrip�on of the sta�s�cal parameters including central tendency (e.g. means) or other basic es�mates (e.g. regression coefficient) AND varia�on (e.g. standard devia�on) or associated es�mates of uncertainty (e.g. confidence intervals) For null hypothesis tes�ng, the test sta�s�c (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted Policy information about dual use research of concern Hazards Could the accidental, deliberate or reckless misuse of agents or technologies generated in the work, or the application of information presented in the manuscript, pose a threat to:

Experiments of concern
Does the work involve any of these experiments of concern: No Yes Demonstrate how to render a vaccine ineffective