Bovine pangenome reveals trait-associated structural variation from diverse assembly inputs

Advantages of pangenomes over linear reference assemblies for genome research have recently been established. However, potential effects of sequence platform and assembly approach, or of combining assemblies created by different approaches, on pangenome construction have not been investigated. Ten haplotype-resolved assemblies of three bovine trios representing increasing levels of heterozygosity were generated that each demonstrate a substantial improvement in contiguity and accuracy over the current Bos taurus reference genome, with more telomere and centromere content and an average 2.5x increase in NG50 and 11x decrease in base errors. Downsampling analysis demonstrated that diploid coverage as low as 20x for HiFi or 60x for ONT was sufficient to produce two haplotype-resolved assemblies meeting the standards set by the Vertebrate Genome Project. The assemblies were integrated into structural variant-based pangenomes that demonstrated significant consensus regardless of sequence platform, assembler algorithm, or coverage. Inspecting pangenome topologies identified approximately 900 structural variants overlapping with coding sequences; this approach revealed variants affecting QRICH2 , PRDM9 , HSPA1A , TAS2R46 and GC that have potential to affect phenotype. A 12 kb multi-allelic copy number variation encompassing a GC gene enhancer is associated with mastitis resistance in dairy cattle.


Introduction
Cattle are a substantial component of global animal-based food production, and are raised for meat, milk, or both. Two subspecies of cattle, taurine and indicine, have emerged from at least two distinct domestication events (Loftus et al., 1994;Pitt et al., 2019)), with artificial selection for production goals or environmental adaptation contributing to diversity within cattle, resulting in the current existence of hundreds of taurine and indicine cattle breeds. Interbreeding and introgressions with other bovids, like yak and banteng (N. Wu et al., 2018), further drive an increase in genetic diversity within Bovinae.
The Bos taurus reference genome was first drafted in 2004 and was based on whole genome shotgun sequence of a Hereford cow, supplemented with sequences of bacterial artificial chromosome clones prepared from DNA of her sire (Elsik et al., 2009). A major revision using long read sequencing of the same cow was recently reported (ARS-UCD1.2; ) and remains the accepted reference for conducting genomic studies in cattle due to extensive annotation efforts and connections to historical analyses, despite more recent bovine assemblies having higher contiguity and accuracy Oppenheimer et al., 2021;Rice et al., 2020).
Sequence variability between cattle breeds at both the single nucleotide (SNP) and short insertion or deletion (indel) level has been extensively characterized through referenceguided approaches (Daetwyler et al., 2014;Hayes & Daetwyler, 2019;Kim et al., 2017). However, these studies suffer from potential reference bias resulting from the use of a single taurine cattle reference assembly for SNP and indel discovery. Larger structural variants (SVs) and variation located in repetitive or challenging regions have rarely been studied across Bovinae due to the inherent limitations of short sequencing reads and incomplete reference genomes. Moreover, no reference assembly of a single individual can reflect the immense genomic diversity present in global breeds of domestic cattle (Crysnanto et al., 2019(Crysnanto et al., , 2021Crysnanto & Pausch, 2020;Talenti et al., 2021).
Pangenomes have long been proposed (Tettelin et al., 2005) as a way to better reflect variation present in a group of individuals (e.g., breed, species, clade, etc.). Pangenomes can be constructed from variants called through reference-guided approaches Hickey et al., 2020), contigs assembled from reads which failed to align to the reference (Sherman et al., 2018), or multiple whole-genome assemblies Li et al., 2020). The latter approach may more faithfully capture challenging regions and SVs due to the complexity of calling and representing nested variation. Third-generation sequencing continues to become more cost-effective and accessible, making populationscale de novo assemblies more feasible. An influx of high-quality assemblies makes the need for pangenome representations more pressing, although the effects of integrating disparate assemblies into pangenomes are unknown.
The present study employed different sequencing and assembly approaches to produce genomes from three bovine trios of varying heterozygosity. The relative strengths of Pacific Biosciences high-fidelity (HiFi) (Eid et al., 2009;Wenger et al., 2019) and Oxford Nanopore Technologies (ONT) (Mikheyev & Tin, 2014) sequencing were examined through generation of haplotype-resolved, reference-quality assemblies for the same individuals sequenced with both platforms and assembled with various algorithms. The set of assemblies were then used to evaluate the effects on pangenome construction depending on assembly approach, or when constructed from a combination of approaches. The utility of a bovine pangenome is then demonstrated by analyses that reveal novel insights into the evolutionary relationships between Bovinae.

Results
The three examined bovine trios (Figure 1a-c) reflect diverse breeding strategies (within-breed, inter-subspecies, and inter-species) and increasing heterozygosity. The first F1 (OxO) was a cross between two Original Braunvieh cattle (Bos taurus taurus), but was still substantially less inbred compared to the cow used for the ARS-UCD1.2 reference (pedigree-based coefficient of inbreeding of 0.07 compared to 0.30  respectively). The second F1 (NxB) was a cross between Nellore (Bos taurus indicus) and Brown Swiss (Bos taurus taurus) cattle and the third F1 (GxP) was an inter-species cross between a gaur (Bos gaurus) bull and a Piedmontese (Bos taurus taurus) cow. HiFi and ONT reads were collected for each F1, and short reads were collected for all animals in the trios (Figure 1d, Supplementary Table 1). F1 long reads were separated into paternal, maternal, and unknown origin (Koren et al., 2018). The success of parent-of-origin assignment improved significantly from 81.1% to 99.9% with increasing heterozygosity for HiFi reads, but was near perfectly separable for ONT at all examined heterozygosities ( (OxO, NxB, and GxP) examined in this study. The OxO and GxP were female, while the NxB was male. d) The three respective F1s were sequenced to with read N50 of 20,21,with read N50 of 65,45,and 49 Kb. Coverage is determined with respect to an assumed genome size of 2.7 Gb. F1 short reads were collected to 31-, 23-, and 37-fold coverage. e) Separating F1 reads into parental haplotype bins improved with increasing heterozygosity for HiFi, but F1 reads were nearly 100% separable for ONT even at low heterozygosity. f) HiFi reads were assembled with hifiasm, HiCanu, and Peregrine, while ONT reads were assembled with Shasta, Flye, and Raven. Bold type indicates the tools used to produce the assemblies that are discussed in detail. Assemblies were assessed individually for quality metrics (blue lines) as well as integrated together into pangenome analyses (red lines).
Contigs were assembled for each F1 using multiple assemblers for HiFi and ONT data ( Figure  1f, see methods). Contigs were scaffolded by alignment to ARS-UCD1.2 to produce the final assemblies, an approach with minimal bias given the well-curated reference and highly contiguous assemblies . The haplotype-resolved assemblies for the inter-subspecies NxB and inter-species GxP are directly breed/species-specific, while both haplotypes of the OxO constitute the same breed, resulting in a total of five novel breed/species assemblies. Variation between the paternal and maternal haplotypes of the OxO, reflecting within-breed diversity, was sufficient to generate haplotype-resolved assemblies, although only the maternal haplotype was used to represent the Original Braunvieh breed in subsequent analyses except where explicitly stated. The "quality" of each assembly was assessed by several widely used metrics: contiguity (NG50) representing the size distribution of contigs, phasing (PG50) to characterize haplotype separation, correctness (QV) quantified as Phred-scaled base-error rate, and completeness (BUSCO) which approximates the percentage of near-universal, single-copy genes that were identified. These metrics provide a coarse-grained summary of assembly quality amenable to comparisons. Table 1The assemblies produced by hifiasm for HiFi data and Shasta for ONT data were selected for further analysis based on having the best quality metrics (Table 1) and computational tractability compared to four other tested assemblers (Supplementary Table  3). These assemblies were of reference quality with every examined metric exceeding those of the current Bos taurus Hereford-based reference. Improvements over the current reference are substantial, including a 3-11x reduction in autosomal gaps, 1.8-3.6x increase in NG50, and 3-22x reduction in base errors. Furthermore, they all exceed the current standards set by the Vertebrate Genome Project (VGP) . The ONT-based assemblies were marginally above the QV targeted by the VGP, but other metrics for these assemblies such as the contig or phased contiguity are orders of magnitude greater than VGP thresholds. The HiFi-and ONT-based assemblies were generally comparable, however there were notable differences in the average assembly correctness and genome size metrics. Shasta assemblies averaged QV 41.5 after one round each of polishing with ONT reads and short reads, while hifiasm assemblies reach QV 47.6 without any polishing. The log scale of QV means that the hifiasm assemblies had a 4-fold reduction in base errors compared to the Shasta assemblies, indicating the ability of HiFi data to achieve higher quality in fewer steps.
In contrast, phasing in Shasta assemblies is better compared to hifiasm, while both platforms showed improved phasing at higher heterozygosity in agreement with the relative ability to sort F1 reads by parental origin prior to assembly. The lower PG50 of the Original Braunvieh Shasta assembly reflects limited ONT coverage that resulted in the need to perform diploid polishing, rather than directly assembling trio binned reads, and is not due to an inherent limitation imposed by the level of heterozygosity.
The mean autosomal genome size of the assemblies generated by hifiasm and Shasta was 2.57 ± 0.03 Gb and 2.48 ± 0.004 Gb respectively, such that on average each hifiasm autosome was 2.8 Mb longer than ARS-UCD1.2, and 0.34 Mb shorter for Shasta. The additional length of hifiasm autosomes was primarily due to the presence of more repetitive sequences (43.5% repetitive content in hifiasm versus 41.7% in Shasta), especially centromeric repeat sequence. Previous studies have shown that the higher accuracy of HiFi reads allows assemblers to confidently assemble through more repeats despite having shorter read lengths (Chu et al., 2021), leading to extension of autosomal contigs into flanking centromere sequence. Hifiasm assemblies also contained more sequence in contigs not assigned to chromosomes (average of 300 Mb) compared to Shasta assemblies (50 Mb). These unassigned contigs were composed primarily of repetitive sequences (Supplementary Table 4) including novel centromeric sequence and long terminal repeats. Unplaced contigs were generally higher in repeat content than the scaffolds (88% versus 48% for hifiasm and 87% versus 42% for Shasta), and thus would present a challenge to scaffolding by any technology including the reference alignment approach applied here.
Bovine autosomes are acrocentric (Blazak & Eldridge, 1977), and so a complete bovine assembly is conceptually closer to "centromere-to-telomere". Hifiasm assemblies contained substantially more centromeric sequence than Shasta assemblies, respectively averaging 2.01 and 0.14 Mb per autosome, compared to 0.08 in the ARS-UCD1.2 reference (Figure 2a). Similarly, hifiasm autosomes average 2.6 Kb of vertebrate telomeric repeats (TTAGGG) within 10 Kb of the chromosome end, compared to 0.8 Kb for the ARS-UCD1.2 reference. Telomeres were almost entirely missing in Shasta assemblies, averaging only 88 bp of telomeric repeats per autosome ( Figure 2b). Chromosomes which contain at least 50 kb of centromeric repeats at the proximal end and 500 bp of telomeric repeats at the distal end are considered to be end-to-end (but not necessarily "complete"), of which there were 5 for ARS-UCD1.2, and a mean of 13.2 and 1.2 for hifiasm and Shasta across the five breeds/species. Hifiasm and Shasta assemblies had a near equal distribution of gaps, averaging about 1.5 gaps per autosome, compared to nearly 9 in the ARS-UCD1.2 reference ( Figure 2c). These observations hold in general for all HiFi-and ONT-based bovine assemblies investigated (Supplementary Figure 1). These differences are visible on the example Brown Swiss chromosome ideograms in Figure 2d. Both HiFi-and ONT-based assemblies were able to routinely assemble the 16 Kb bovine mitochondrial DNA (Anderson et al., 1982).

Optimal sequencing coverage depths
Impact of sequencing depth on assembly quality was assessed using the Brown Swiss haplotype of the NxB as an example. Subsets of the 52x diploid HiFi and 55x haploid (trio binned) ONT reads, respectively, were randomly sampled to mimic lower sequencing depths ( Figure 3). Completeness metrics (e.g., BUSCO or k-mer content) plateau when coverage increased above 25-fold, however other metrics like contiguity or correctness continued to benefit from higher sequencing coverage with diminishing returns. The trio aware mode of hifiasm only required about 19x diploid coverage of HiFi reads to meet the VGP targets. Shasta required around 28x haploid coverage (corresponding to roughly 56x diploid coverage) to achieve the necessary QV, although 17x haploid (34x diploid) coverage fulfills the contiguity and completeness targets. The minimum VGP-satisfying coverage varies slightly for the different F1s due to different input sequencing read properties but is approximately consistent across all examined bovine trios (Supplementary Figure 2).  The higher accuracy of HiFi reads allows hifiasm to exploit sequence common to both haplotypes during trio aware assembly. Reaching a comparable quality through a trio binned approach required approximately 16% higher coverage, with 11x haploid coverage (22x diploid) necessary. The higher error rate of ONT reads makes haplotype-aware correction and phasing challenging, and so diploid assembly followed by haplotype separation or diploid-aware polishing is less effective than trio binning ( Figure 3, up triangles versus down triangles). Phasing is particularly poor, with the PG50 25 times smaller on average compared to trio binning approaches.
There is an increased risk of coverage gaps when sequencing coverage is reduced, even if the resulting assemblies achieve certain genome-wide standards. When 11x HiFi coverage is aligned to ARS-UCD1.2 and binned into 10 Kb windows, there are 950 regions on average of near total dropout (<1x coverage) across the autosomes. This drops by 23% at 13.5x coverage and by 30% at 16x coverage, as the effects of stochasticity are reduced. While the overall assembly quality does not fluctuate substantially between random subsamplings (Figure 3), it may overestimate the quality at specific regions. Furthermore, default parameters of assemblers are typically tuned to higher coverage and assembling at low coverage can introduce subtle issues. Hifiasm could underestimate a parameter related to duplication purging at coverages below 15x, resulting in a sharp transition to larger assemblies with more duplicated BUSCO genes ( Figure 3, blue circles versus red diamonds).
Manually setting the parameter to its expected value recovered similar behavior seen in higher coverage assemblies.

Constructing a bovine pangenome
Sequencing technologies and assemblers evolve rapidly, and so even recently generated bovine assemblies, including the ones reported here, have been produced under nonuniform conditions (e.g., (Crysnanto et al., 2021;Talenti et al., 2021)). Given the differences we observed between HiFi-and ONT-based assemblies, especially in comparison to the CLRbased ARS-UCD1.2 reference, it is crucial to examine how pangenomes respond to different assembly inputs.
Pangenomes were constructed with minigraph  using ARS-UCD1.2 as the initial backbone of the graph structure. Assemblies were iteratively added into the graph, and regions of synteny were ignored while sufficiently diverged subsequences (>50 bp) instead augmented the graph with new nodes ("bubbles"). Pangenomes for each autosome were constructed from all hifiasm assemblies, all Shasta assemblies, or random mixtures of the two. Graph properties of each pangenome in terms of the amount of non-reference sequence added, were generally robust to the input assemblies (Figure 4a,b). Pangenomes constructed from five hifiasm assemblies had more non-reference sequence added compared to five Shasta assemblies (82.5 Mb across 88.5k nodes versus 63.5 Mb across 90.2k nodes), in agreement with the greater completeness observed in hifiasm assemblies. Approximately 92.1% of identified SVs were common between hifiasm and Shasta pangenomes, with 3.6% and 4.3% unique to each respectively ( Figure 4c). There was not a clear bias between HiFi-and ONT-based pangenomes, with only 1.8 and 0.39 Mb nonreference sequence uniquely identified in each. These bubbles were more repetitive than autosome-wide averages (53% and 45% respectively, Supplementary Table 5). Minigraph is sensitive to the order of integration, and on rare occasions constructed significantly different bubbles, particularly for palindromic sequences, resulting in larger variance on chromosomes 7 and 12 (Supplementary Figure 3). Pangenomes constructed from lower-coverage assemblies remained robust. We selected hifiasm and Shasta assemblies generated with an average of 21.6x diploid and 24.9x haploid coverage respectively, which approximately satisfy the VGP standards. The hifiasm assemblies have a higher average QV compared to the Shasta assemblies (41.9 versus 39.3), but lower NG50 (2.3 Mb versus 19.5 Mb) and more autosomal gaps (1869 versus 323) (Supplementary Table 6). Although these assemblies are substantially worse than their fullcoverage counterparts, the resulting pangenomes are similar (Figure 4a-c). Taking the high coverage SVs as the truth set, the low coverage hifiasm and Shasta pangenomes have an F1 score of 98.5 and 94.7 for SV discovery respectively. The low coverage Shasta pangenomes tend to identify a greater number of SVs not present in other pangenomes, and so may be false positives.

Figure 4. Pangenomes are generally robust to different input assemblies. a) The number of large (>1 Kb) bubbles is highly consistent across hifiasm, Shasta, and mixed pangenomes at both full (up triangles) and lower (down triangles) coverages. b) The mean bubble size is also consistent across different inputs, but bubbles are larger on average in hifiasm pangenomes
Pangenomes constructed using the existing Angus (CLR-based)  or Simmental (ONT-based)  assemblies as backbones (Figure 4d) produced similar results compared to using the Hereford-based ARS-UCD1.2. More non-reference sequence was identified in the lower-quality Angus-backed pangenome (+13%), while the more complete Simmental-backed pangenomes had less (-6%). Reference-bias propagates through minigraph's pangenomes, such as the missing sequence in Angus chromosome 28 . CC-BY 4.0 International license perpetuity. It is made available under a preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in The copyright holder for this this version posted November 4, 2021. ; https://doi.org/10.1101/2021.11.02.466900 doi: bioRxiv preprint (Lloret-Villas et al., 2021) resulting in 25% fewer bubbles compared to using ARS-UCD1.2 (Figure 4d).

Quantifying bovine structural diversity through the pangenome
Structural variation bubbles were associated with their source assembly by retracing each "haplotype walk" through the graph. The phylogenetic topology of their evolutionary relationship was then estimated by counting the number of mutually exclusive bubbles any two assemblies have to construct a condensed distance matrix. The results are consistent with expectations, with the gaur (Bos gaurus) largely separate, followed by the Nellore (Bos taurus indicus), and then the three taurine cattle (Figure 5a). All constructed pangenomes unequivocally predicted the first two branches, while there was also good agreement within the closely related taurine cattle (Figure 5b).  . CC-BY 4.0 International license perpetuity. It is made available under a preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in The copyright holder for this this version posted November 4, 2021. ; Several chromosomes (e.g., 2, 6, 9, etc.) are observed to only predict a single taurine topology, even across different assembly inputs and coverages, while other chromosomes (e.g., 1, 12, 16, etc.) have multiple predicted topologies with similar frequencies. This may indicate that certain chromosomes harbor greater structural variation between specific taurine breeds, which may be reflected in phenotypic differences. We compared against a conventional approach, calling small variants from the parental short reads corresponding to the five haplotypes against ARS-UCD1.2 (Figure 5b). While the overall topology and magnitude is similar to that found through the pangenome, there was no significant concordance within the taurine cattle, as might be expected given the low linkage disequilibrium between small and structural variants (Yan et al., 2021).
Genome level dendrograms were mostly consistent across different combinations of inputs, demonstrating that pangenome robustness extends beyond graph properties and into applications. The same general topology was recovered for all examined assemblers, both individually and combined, as well as the lower coverage assemblies. The dendrogram correctly places all assemblies of each breed on their own branch ( Figure 5c) when 31 pangenome assemblies are included (1 reference backbone plus 5 breeds/species x 6 assemblers). Some HiFi-and all ONT-based assemblies for Original Braunvieh are only partially phased due to limited coverage, resulting in greater structural variation within these assemblies (Figure 5d). The variation between the parental haplotypes of the OxO results in the higher branching point for "O" in Figure 5c, and highlights the importance of cleanly resolved haplotypes for downstream analyses.

Pangenome topology at trait-associated SVs
Pangenome integration of haplotype-resolved assemblies representing multiple breeds/species supported investigation of the evolutionary history of a multi-allelic copy number variation (CNV) at 86.96 Mb on BTA6 encompassing an enhancer of the groupspecific component (GC) gene. This CNV has pleiotropic effects on mastitis resistance and other dairy traits, and segregates in different breeds of cattle Olsen et al., 2016;Pausch et al., 2016), although the prevalence of the CNV has not yet been determined for the breeds/species included in our bovine pangenome. The reported CNV was only observed in the Shasta assemblies of Brown Swiss and Original Braunvieh cattle ( Figure  6a,b) indicating it may contribute to variation in somatic cell score also in these breeds (Fang & Pausch, 2019). The CNV formed part of a complex superbubble , suggesting substantial allelic diversity between the haplotype-resolved assemblies. In addition to the two (Brown Swiss) and three (Original Braunvieh) copies of the 12 Kb CNV segment, there were also two shorter insertion elements detected within the superbubble from the Nellore and gaur assemblies (Figure 6a,b).
Inspection of coverage data retrieved from HiFi-and ONT-binned long read alignments against the ARS-UCD1.2 reference suggests that the Brown Swiss and Original Braunvieh haplotypes harbor between 2 and 4 additional copies of the 12 Kb segment (Figure 6c, Supplementary Notes). Coverages derived from F1 short read alignments are consistent with these values but are unable to resolve the disagreement observed between HiFi and ONT coverage (Figure 6d). While most examined assemblers predicted duplication of the 12 Kb segment in Brown Swiss and Original Braunvieh, there was poor consensus in copy number, supporting that this region remains challenging to resolve even with long reads and . CC-BY 4.0 International license perpetuity. It is made available under a preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in The copyright holder for this this version posted November 4, 2021. ; de novo assembly (Supplementary Figure 4). The same inspection did not identify the duplication in the gaur, Piedmontese and Nellore haplotypes, but did validate the two insertions of additional repeat elements exclusively in the gaur and Nellore haplotypes. We retrieved various repetitive elements in the tandemly duplicated 12 Kb segments of the haplotype-resolved Brown Swiss and Original Braunvieh assemblies (Figure 6e) broadly confirming the results of . black) to the other assemblies.

Figure 6. Topology of a tandem duplication on BTA6. a-b) Two example pangenome subgraphs of the promoter region of GC. Reference paths (including those in bubbles) are colored gray, while the tandem duplications are orange and the insertions are blue. Complex superbubbles generally have suboptimal topologies due to the lack of base-level alignment. For example, the 725 bp insertion is obvious in a), but appears as the difference between a 1,400 bp and 667 bp path in b). However, both subgraphs identify the approximately 200 bp (1) and 700 bp (2) insertions in Nellore and gaur, as well as the tandem duplication in Brown Swiss and Original Braunvieh. c) The identified CNV region shows clear coverage increase in only Brown Swiss and Original Braunvieh, across both HiFi and ONT haplotype-resolved reads. d) F1 short reads also show increased coverage for the NxB and OxO trios. The NxB coverage increase is consistent with only the Brown Swiss haplotype carrying additional copies. e) The 12 Kb repeat structure (orange) is clearly identified by RepeatMasker across all assemblies, shown here for the ARS-UCD1.2 reference and Shasta assemblies for gaur, Nellore, Piedmontese, Brown Swiss, and Original Braunvieh. The two marked gaur/Nellore insertions (1&2) are consistent with the pangenomes in a-b). One additional copy in Brown Swiss and Original Braunvieh is shown (yellow), while the tandem duplication eventually ends with a similar repeat
An SV at the ASIP gene, encoding the Agouti-signaling protein involved in mammalian pigmentation and located on chromosome 13, was also investigated. This genomic sequence harbors alleles associated with coat color variation in many species including cattle (Girardot et al., 2006;Trigo et al., 2021). Both Nellore and Brown Swiss cattle present variability in coat color ranging from near white to almost black. The NxB was born with a light coat, which darkened as the bull aged (Supplementary Notes). Our pangenome confirmed great allelic diversity between the bovine assemblies involving insertions and deletions of repetitive elements upstream of ASIP. However, the previously described variants associated with coat color variation (Trigo et al., 2021) were not in the pangenomes. Short read alignments of the Nellore sire confirmed it carried the previously described SV in the heterozygous state, but the F1 inherited the other haplotype (Supplementary Notes). Thus, the darkening of the coat we observed in the NxB is not due to previously described ASIP alleles. We identified 900 and 922 genes in the ARS-UCD1.2 genome annotation (Refseq release 106) whose coding regions overlap bubbles in hifiasm and Shasta based pangenomes respectively (Supplementary Figure 5). Of these, 808 and 845 are bubbles with haplotype path information for each assembly, and so are amenable to association investigations. Several of these genes are listed in the OMIA (Online Mendelian Inheritance in Animals) database as they harbor alleles causing phenotypic variation in cattle; for instance, both hifiasm and Shasta pangenomes contained a bubble encompassing a 36 bp deletion in ACAN resulting in the in-frame deletion of 12 amino acids in the gaur assemblies. Variants in ACAN are associated with stature in cattle (Cavanagh et al., 2007) and other species (Gibson & Briggs, 2016). The pangenome also recovered the insertion of an 11 Kb segment on chromosome 23 in all assemblies. This segment encompasses HSPA1B (Hess et al., 2018) which is not annotated in ARS-UCD1.2 (Suqueli García et al., 2017). This 11 Kb insertion is challenging to identify by inspection of short and long read alignments, mainly appearing as elevated coverage over the 2294 bp segment encompassing HSPA1A, indicating a similarly sized duplication, with soft-clipped bases extending on both sides (Supplementary Note).
The bubbles in the pangenome indicate further putatively trait-associated regions. For instance, the coding sequence of QRICH2 overlapped with multiple bubbles indicating tandem duplications of a 30 bp region of the fifth exon ( Figure 7a). Loss of function alleles in mammalian QRICH2 orthologs lead to multiple morphological abnormalities of the sperm flagella (Shen et al., 2019). We find that the fifth exon of QRICH2, which is affected by the coding sequence expansion, is transcribed in high abundance (>30 transcripts per million) in testes of mature taurine bulls. Inspection of long read alignments confirms an expansion of the coding sequence relative to ARS-UCD1.2 that extends the high molecular weight glutenin subunit of the protein by 10 amino acids in our taurine cattle, 60 amino acids in Nellore, and 50 amino acids in gaur (Figure 7b,c). This SV is challenging to resolve with short reads (Supplementary Notes). Another example of potential trait-associated SV lies in TAS2R46, related to bitter taste receptors and associated with adaptation to dietary habitats (Dong et al., 2009), which overlapped with a 17 Kb deletion in gaur (Figure 7d,e). This deletion also spanned ENSBTAG00000001761. A final example is found in PRDM9, the only known speciation gene in mammals and known to harbor alleles associated with variation in meiotic recombination within and between Bovinae (Ahlawat et al., 2016;Sandor et al., 2012;Zhou et al., 2018), where an SV overlapped with copy number variation in the zinc finger array domain (Figure 7f,g). The Nellore and gaur assemblies contained one zinc finger less, while the paternal haplotype of Original Braunvieh carried one more relative to ARS-UCD1.2. The maternal haplotype of Original Braunvieh contained the same number of zinc fingers as ARS-UCD1.2, supporting the intra-and inter-breed/species variation observed for PRDM9.
. CC-BY 4.0 International license perpetuity. It is made available under a preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in The copyright holder for this this version posted November 4, 2021. ;

Discussion
Ten haplotype-resolved assemblies for cattle and related species were constructed using recent sequencing and assembling technologies. Assemblies produced by hifiasm (HiFi) and Shasta (ONT) were substantially more contiguous and correct than the current Bos taurus reference sequence, which is a haplotype-merged assembly based on older CLR technology. The higher accuracy of HiFi reads was found to be generally more advantageous to quality measures of assembly and increased completeness of centromeric and telomeric regions compared to the longer but higher-error ONT reads. HiFi-based assemblers also required less compute and storage resources compared to ONT-based assemblers; producing haplotype-resolved hifiasm assemblies required approximately 600 CPU hours and 200 GB of peak memory usage, while the equivalent Shasta (plus polishing) assemblies took 2200 CPU hours and 750 GB of peak memory usage. Correct-then-assemble approaches like Canu (Koren et al., 2017) can be practical for smaller genomes (Wick & Holt, 2021), but on gigabase-sized mammalian genomes like in Bovinae we observed >20 Tb of peak temporary storage and >25k CPU hours for correcting only 30-fold ONT reads. Even recent referenceguided correction approaches like Ratatosk (Holley et al., 2021) still needed approximately 15k CPU hours to correct 55-fold ONT reads. Cutting-edge sequencing and bioinformatic improvements Silvestre-Ryan & Holmes, 2021), like the ONT Guppy5 basecaller, will likely assist more efficient assembly, resulting in higher QV and reduced computational load; however, currently the ONT specific requirements might be computationally prohibitive, especially when assembling many samples.
The phased assembly graph approach of hifiasm is most efficient with lower heterozygosity samples, where HiFi reads are least sortable by parental haplotype and there is more mutual sequence to exploit, but still functions with highly heterozygous F1s. The inter-species GxP trio binned assemblies were more contiguous (+25% NG50) compared to the trio aware assemblies, but contained several large (>5 Mb) misassemblies which the latter did not (Supplementary Figure 6). Phasing in both HiFi-and ONT-based assemblies improved as heterozygosity increased and allowed more cleanly resolved haplotypes, which can be beneficial to downstream analyses Yang & Chaisson, 2021). The ability of hifiasm, and to a lesser extent Shasta with diploid-aware polishing, to assemble phased haplotypes from purebred individuals also avoids ethical and logistical concerns regarding the higher heterozygosity crosses previously targeted (Koren et al., 2018).
The minigraph pangenomes were strongly comparable whether the input assemblies were all HiFi-based, ONT-based, or a mix. The greater completeness observed in hifiasm assemblies is reflected in those pangenomes containing more non-reference sequence, but no notable HiFi or ONT specific biases were observed in the pangenomes. The quality and completeness of the pangenome backbone can have an impact, seen on the incomplete Angus chromosome 28 or the ARS-UCD1.2 chromosomes generally lacking centromeric sequence, but again we found no specific bias between CLR-or ONT-based backbones. These results indicate that optimal minigraph pangenomes would use high-quality, complete genomes as the backbone, like emerging T2T assemblies (Belser et al., 2021;Nurk et al., 2021;Rengs et al., 2021). Alternatively, reference-free approaches to pangenomes  may also circumvent these issues.
Mutual variation identified through shared paths in the pangenome provides opportunities to study the phylogeny of Bovinae beyond SNPs and indels. The ability to accurately separate and represent paths within SVs also enables pangenome-based GWAS (PWAS) (Gupta, 2021), as recently explored in crops Hufford et al., 2021;Song et al., 2020). We identified multiple SVs overlapping annotated coding sequences in different bovine pangenomes, demonstrating that pangenomes provide a framework to make them amenable to association mapping. Furthermore, some of these SVs (e.g., tandem duplications in QRICH2) are inaccessible from short or noisy long read alignments and some (e.g., an insertion of HSPA1B) are challenging to resolve even with long read alignments. These cases highlight the benefits of de novo, haplotype-resolved assemblies and pangenome integration.
Two haplotype-resolved assemblies which satisfy VGP quality standards can be produced with diploid coverage of approximately 20x for hifiasm (HiFi) or 60x for Shasta (ONT). These lower coverage assemblies are nearly as efficient (90+%) for SV discovery as the full coverage assemblies, with approximately 88% efficiency for genic SV identification. Combined, these results support that cataloguing most within-breed structural variation is a tractable goal through de novo long read assembly of 50 samples per breed, sufficient to represent the effective population size. Extensive existing short-read sequencing data can then be leveraged to genotype SVs present in the pangenome (Ebler et al., 2020; and then imputed into tens of thousands of cattle previously genotyped with microarrays. . CC-BY 4.0 International license perpetuity. It is made available under a preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in The copyright holder for this this version posted November 4, 2021. ;https://doi.org/10.1101https://doi.org/10. /2021

Ethics statement
The sampling of blood from the NxB and OxO trios was approved by the veterinary office of the Canton of Zurich (animal experimentation permit ZH 200/19). All GxP protocols were approved by the Institutional Animal Care and Use Committee (IACUC) of the University of Nebraska-Lincoln, an AAALAC International Accredited institution (IACUC Project ID Project ID 1697). Gaur semen collections were approved by the Henry Doorly Zoo IACUC in 1992.

Animals
Cows from the Original Braunvieh (O), Brown Swiss (B), and Piedmontese (P) breeds were inseminated with semen samples from Original Braunvieh, Nellore (N), and gaur (G) sires, respectively. A female (OxO) and a male (NxB) calf were delivered at term. A female fetus (GxP) cross was collected by cesarean section at 119 days of gestation. Blood samples were taken from the calves and dams by trained veterinarians. Lung tissue was taken from the fetus and DNA extracted as described (Logsdon, 2020). Semen samples were available for the bulls. High-molecular weight DNA was extracted from blood and semen samples, respectively, using Qiagen's MagAttract HMW DNA Kit as described earlier (Crysnanto et al., 2021).

Sequencing
The genomes of the F1s were sequenced with long reads using PacBio HiFi and ONT. The PromethION 1D Genomic DNA by ligation SQK-LSK110 library prep kit was used for the OxO, NxB, and GxP F1s. The libraries were respectively sequenced on PromethION (R9.4.1) flowcells (nuclease wash was applied to one cell). The OxO and NxB reads were basecalled on Guppy4, while the GxP reads were basecalled with Guppy3.
Paired-end libraries (2x150 bp) were produced using parental DNA samples and sequenced on Illumina instruments.

Genome assembly
HiFi reads were first filtered with fastp (version 0.21.0) (S. , removing reads below 1 Kb length or QV20. Nanopore reads were pre-filtered to a minimum of QV7, with no length restrictions. Short reads were also filtered with fastp, using the "-g" parameter to trim polyG tails. Both HiFi and ONT reads were trio binned into paternal and maternal haplotypes using (Trio)Canu (Koren et al., 2018) (version e0d6bb0), using parental short reads and an estimated genome size of 2.7g.
HiCanu (Nurk et al., 2020) was used with default settings and "genomeSize=2.7g -pacbiohifi" to produce a draft set of contigs from the HiFi reads. Only contigs which were not suggested as bubbles (suggestBubble=no) were retained, which improved automated coverage detection when using purge_dups (version 1.2.3) . The final set of "assembled" contigs was obtained after following the default purging pipeline.
The trio binned hifiasm assemblies were obtained with the same assembly parameters, with the addition of the "--primary" flag to produce a primary and alternate assembly. The primary gfa file was converted to fasta as described above. Only the primary assembly was retained.
Shasta (Shafin et al., 2020) (version 0.7) assemblies were produced using the standard configuration file "Nanopore-Sep2020.conf". The assemblies used the haplotype-binned nanopore reads, except for the OBV assembly, which used the total read set. Due to the high coverage for the gaur and Piedmontese haplotypes, the parameter "minReadLength=30000" was used instead.
All nanopore assemblies except the OBV Shasta were polished using PEPPER ) (version 0.1) and the haplotype-binned nanopore reads. Each assembly was then polished with bcftools (Danecek et al., 2021) (version 1.12) using the merfin  (commit version 1331fa5) filtered vcf output of DeepVariant (Poplin et al., 2018) (version 1.1). The DeepVariant input was F1 short reads where reads containing k-mers specific to the other haplotype were excluded with meryl. The OBV Shasta assembly was polished with PEPPER-DV (version 0.4), tagging the intermediate bam file with haplotypebinned read IDs. The subsequent short read polishing was as described above.
Raven (Vaser & Šikić, 2021) (version 1.5.0) assemblies were obtained from nanopore reads with default parameters, except for "-p 0". Polishing with the default two rounds of racon (Vaser et al., 2017) resulted in slightly lower QV compared to post hoc polishing with PEPPER.
Flye (Kolmogorov et al., 2019) (version 2.8.3-b1725) assemblies were constructed with "-genome-size=2.7g --nano-corr" from Ratatosk (version 0.4) (Holley et al., 2021) errorcorrected nanopore reads. The nanopore reads were corrected using a reference-guided approach, taking the haplotype-specific hifiasm assembly as the reference. Ambiguous IUPAC codes were randomly replaced with equal probability of an appropriate nucleotide. The contig set was taken from the pre-scaffolding result of Flye and was not polished due to the pre-corrected reads having sufficient accuracy.

Coverage downsampling
Sequencing subsets were made through seqtk (https://github.com/lh3/seqtk) (version 1.3-r115-dirty) with the command "seqtk seq -f {sample}", including the "-A" flag to drop quality scores where appropriate. In the case of repeat trials, the "-s {seed}" flag was set with a randomly generated 64-bit integer to ensure a unique subsampling of data. The reduced coverage assemblies were conducted identically to the full coverage methods unless explicitly mentioned otherwise.

Pangenome construction and bubble extraction
Pangenomes were constructed on a per chromosome basis using minigraph (vesion 0.15-r426) , with default parameters. The selected assemblies were added to the graph in a randomly shuffled order, where repeated constructions from the same input set may differ due to the ordering. Pangenomes were visualized with Bandage (version 0.8.1) (Wick et al., 2015).
Haplotype paths were called with minigraph, using the "--call --xasm" option. Bubble intersections across assemblies were created with UpSetPlot (https://github.com/jnothman/UpSetPlot) (version 0.6.0) and converted into a condensed distance matrix by assessing when two assemblies did not take the same path through a bubble.

Pangenome bubble and gene overlapping
All entries in the ARS-UCD1.2 annotation (RefSeq release 106) with a "CDS" description were extracted into a bed file. Bubbles from the pangenome were also extracted into a bed file using "gfatools bubble". The two bed files were then intersected using bedtools (version 2.30.0) (Quinlan & Hall, 2010) with the intersect command and "-wo" to find potential traitassociated SVs. We then filtered the list by checking all five breed/species were successfully assigned a haplotype path through the bubble, to remove incomplete associations.

Availability of data
HiFi reads for the OxO and NxB F1s are available at the study accession PRJEB42335 under sample accession SAMEA7759028 and SAMEA7765441.
ONT reads for the OxO and NxB are available at the study accession PRJEB42335 under sample accession SAMEA10017983 and SAMEA10017982.
Short reads for the OxO and NxB are available under accession number SAMEA9986200 and SAMEA7589752. Parental short reads are available at SAMEA9986201 & SAMEA9986199 (OxO) and at SAMEA6163185 & SAMEA9533783 (NxB).
Long and short read sequencing data for the GxP trio are available at the study accession PRJEB48481 under secondary accessions SAMEA10563833, SAMEA10563834, and SAMEA10563835. support provided by Dr. Anna Bratus-Neuenschwander from the ETH Zürich technology platform FGCZ (https://fgcz.ch) for sequencing and DNA fragment analysis. The results reported here were made possible with resources provided by the USDA shared compute cluster (Ceres) as part of the ARS SciNet initiative. We thank the USMARC Core Facility staff for outstanding technical assistance. Also thank B. Lee, J. Carlson, K. McClure, H. Clark, H. Sadd, M. Sadd, and B. Shuck for outstanding technical support. We thank the North American Piedmontese Association for their enthusiastic support and assistance.

Funding
This work was financially supported from the Federal Office for Agriculture (FOAG), Bern, Switzerland, and the Swiss National Sciences Foundation (SNSF).