Haplotype-Resolved Cattle Genomes Provide Insights Into Structural Variation and Adaptation

We present high quality, phased genome assemblies representative of taurine and indicine cattle, subspecies that differ markedly in productivity-related traits and environmental adaptation. We report a new haplotype-aware scaffolding and polishing pipeline using contigs generated by the trio binning method to produce haplotype-resolved, chromosome-level genome assemblies of Angus (taurine) and Brahman (indicine) cattle breeds. These assemblies were used to identify structural and copy number variants that differentiate the subspecies and we found variant detection was sensitive to the specific reference genome chosen. Six gene families with immune related functions are expanded in the indicine lineage. Assembly of the genomes of both subspecies from a single individual enabled transcripts to be phased to detect allele-specific expression, and to study genome-wide selective sweeps. An indicus-specific extra copy of fatty acid desaturase is under positive selection and may contribute to indicine adaptation to heat and drought.

identified by short read sequencing of the parents prior to assembly with TrioCanu. The initial assemblies comprised 1747 Angus haplotigs and 1585 Brahman haplotigs ( Table 1).
The haplotig N50 was 29.4 Mb and 23.4 Mb for the Angus and Brahman, respectively.
For the present study, additional data were generated for the same hybrid fetus, including ~12x Hi-C reads, ~167x Bionano optical map, and ~84x Illumina paired-end reads (Fig. 1), to provide haplotype-resolved scaffolding and identify assembly errors. Following haplotig assembly, two sets of scaffolds, one based on Hi-C and the other on optical map data, were generated for each haplotype. Three different scaffolding programs (3D-DNA, Proximo and SALSA2) using the Hi-C data were evaluated (Supplementary Note 1). SALSA2 was found to be the best scaffolder and produced the closest agreement with the latest cattle reference ARS-UCD1.2. The scaffold N50 produced by SALSA2 was larger than those generated by optical map scaffolding, but the latter had the advantage of more accurate chimeric haplotigs detection, which resulted in 29 and 36 breaks in the Angus and Brahman haplotigs, respectively (Supplementary Note 2). Without these chimeric breaks, four apparent, incorrect inter-chromosomal fusions created in the initial haplotigs, involving two Brahman chromosomes (13 and 15) and six Angus chromosomes (8,9,12,20,23,28), would have remained unresolved.
After validation against a recombination map, gap filling, and error correction, the final assemblies, UOA_Angus_1 and UOA_Brahman_1 had similar chromosome sizes and excellent co-linear chromosome alignment with the current cattle reference, ARS-UCD1.2 ( Supplementary Fig. 2). Unlike some of the recent PacBio-based assemblies 8,9 , which required an additional polishing step with Illumina short reads to correct the high indel error rates, the haplotype-resolved assemblies only required correction of a very small number of coding sequences, showing that polishing with short reads was unnecessary (Supplementary Note 3).
The Brahman genome was annotated by Ensembl and NCBI whereas the Angus genome was annotated only by Ensembl (Supplementary Note [3][4]. A comparison of annotation features between the Angus, Brahman and Hereford reference genomes is given in Supplementary

Assembly benchmarking and sequence contiguity assessments
The per-base substitution quality values (QVs) for the UOA_Angus_1 and UOA_Brahman_1 reference assemblies were 44.63 and 46.38, respectively (Supplementary Table 2, Supplementary Note 5). The QV represents the phred-scaled probability of an incorrect base substitution in the assembly, hence these QVs indicate that the assemblies are more than 99.99% accurate at single base level. This is similar to the latest water buffalo assembly UOA_WB_1 (QV 41.96) and surpasses the recent goat ARS1 assembly (QV 34.5) by an order of magnitude. The Angus and Brahman assemblies had ~93% BUSCO completeness score, which demonstrates a high-quality assembly of genes (Supplementary Table 3).
The Angus and Brahman assemblies have few gaps compared to most existing mammalian reference assemblies, and are comparable to the human GRCh38, the new Hereford cattle ARS-UCD1.2 and the water buffalo UOA_WB_1 reference genomes (Fig. 2a). For example, the Angus chromosome 24 was assembled without gaps. In terms of contiguity, these new cattle reference genomes are comparable to the recent water buffalo UOA_WB_1 assembly 9 , which is the most contiguous ruminant genome published to date with fewer than 1000 contigs ( Supplementary Fig. 3), although it is not fully haplotype-resolved. While the cattle autosomes showed excellent contiguity, the Brahman X and Angus Y chromosomes were interrupted by 91 and 69 gaps, respectively.

Resolution of longer repeats
The use of long PacBio reads substantially improved repeat resolution compared with the previous cattle assembly UMD3.1.1, which was assembled from Sanger sequences 10 (Fig.   2b). Approximately 49% of both Angus and Brahman assemblies consist of repeat elements, which is consistent with other published mammalian assemblies, including human GRCh38, Hereford cattle ARS-UCD1.2, water buffalo UOA_WB_1 and goat ARS1. The two largest repeat families identified were Long Interspersed Nuclear Element (LINE) L1 and LINE/RTE-BovB, which covered ~25% of the chromosomes in both cattle sub-species. Satellite or centromeric repeats (>10 kb) accounted for 21% and 14% of repeats in unplaced scaffolds of Angus and Brahman, respectively ( Supplementary Fig. 4). The 7% higher satellite and centromeric repeats in Angus unplaced scaffolds is likely due to the presence of the Y chromosome in the Angus haplotype. The combination of the three most frequent repeat families, LINE L1, LINE/RTE-BovB and satellite/centromeric repeats, covered ~40% of all unplaced bases, and repeat sequences were most frequently responsible for breaking sequence contiguity. The three cattle assemblies constructed using PacBio long reads that resolved repeats > 2.5 kb, UOA_Angus_1, UOA_Brahman_1 and ARS-UCD1.2, provide significant improvements in repeat resolution over the previous Sanger-based cattle assembly (UMD3.1.1) (Fig. 2b). All 29 cattle autosomes are acrocentric and in these assemblies, 20 contained centromeric repeats within 100 kb of chromosome ends, demonstrating that they approach chromosome-level. Nine Angus and eight Brahman chromosomes have centromeric repeats larger than 10 kb near the chromosome end. Six vertebrate telomeric repeats (TTAGGG)n were found within 1000 kb of chromosomal ends in Angus and five in Brahman assemblies.

Discovery of indicus-specific fatty acid desaturase 2
One of the most diverged genomic regions between Brahman and Angus was observed on chromosome 15 (Fig. 3a). A region of ~1.4 Mb has three copies of fatty acid desaturase 2like genes (FADS2P1) in Brahman whereas the homologous region in the Angus only has two FADS2P1 genes (Fig. 3b, c). In both Brahman and Angus the FADS2P1 genes are encoded by 10 to 12 exons and the entire regions were assembled completely without gaps for both genomes. The region also contains six genes annotated as olfactory receptor-like, with unknown functions, which had differences in their predicted gene models between Brahman and Hereford assemblies. Within the ~1.4 Mb region, there is a high level of sequence divergence for ~200 kb, which is where an extra copy of FADS2P1 lies in Brahman.
Searches for FADS2P1 in other ruminant species with high-quality genome assemblies revealed that only Brahman has three copies of the gene. The additional FADS2P1 gene is ~53 kb long, and is flanked by two other conserved FADS2P1 genes. Searching WGS short read sequences from 38 animals used in this study showed that only Brahman animals had the extra copy, which was not present in any of the taurine individuals. (Supplementary Fig.   5). Considering that the Brahman genome is derived from four indicine breeds, the extra FADS2P1 is likely a Bos taurus indicus-specific gene. We used CODEML to search for positively selected amino acid residues in FADS2P1 and identified 16 significant positively selected sites, 10 of which are located in a small exon 7 of only 60 bp (Fig. 3d,   Supplementary Table 4).

Selective sweeps in Brahman
We compared the SNP patterns of 5 Brahman with 33 individuals from six taurine breeds to identify signatures of selective sweeps (Fig. 4a, Supplementary Note 6). We searched 100 kb windows spanning the whole genome for those where there was high level of homozygosity within Brahman individuals but with more segregating polymorphic variants in taurine breeds. This identified a total of 128 genes in 60 selective sweep intervals. Among these candidate selected genes, 80% were protein-coding, 1 was a pseudogene and the remainder were RNA-based genes (Supplementary Table 5). No biological pathways were found to be significantly over-represented among the positively selected protein-coding genes. The heat shock protein HSPA4, a member of the Hsp70 family, was amongst genes identified as under selection, which was also found in a search for selective sweeps in African cattle 11 . We also identified DNAJC13, a member of a gene family known to act as co-chaperones of heatshock proteins, in another selective sweep region. In addition to heat tolerance related genes, selective sweep regions included genes involved in a range of biological processes including metabolic process, cellular component organization or biogenesis, cellular process, localization, reproduction, biological regulation, response to stimulus, developmental process, rhythmic process, biological adhesion and multicellular organismal process.
A total of 231 unique QTL covering all six major QTL types listed in the cattle QTL database overlapped with positively selected genomic regions. The major QTL types were reproduction (26%) followed by exterior phenotype (22.9%) and milk (22.1%) traits (Fig. 4c).
Ten of 60 selective sweep intervals did not overlap with any of the currently identified QTL.

SNP and INDEL differences between Brahman and Angus
Using WGS short reads from 5 Brahman and 6 Angus individuals, we identified ~24 million Brahman SNPs and ~11 million Angus SNPs, which were annotated using the corresponding reference genomes (

Structural variant differences between Brahman and Angus
We assessed the structural continuity of our Brahman and Angus genome assemblies against the current cattle reference genome assembly, ARS-UCD1.2, and against WGS datasets from 38 animals representing seven breeds, to ascertain the benefit of using haplotype-resolved assemblies for variant calling. To assess structural variant (SV) differences between Brahman and Angus and the cattle reference genomes, the haplotyperesolved assemblies were aligned to the ARS-UCD1.2 reference (Hereford). This detected insertions, deletions, tandem expansions, tandem contractions, repeat expansions and repeat contractions 12 (Supplementary Fig. 6). Both tandem expansion/contraction and repeat expansion/contraction are repeat-type SVs. Detection of SV was limited to sizes of 50 -10,000 bp, and the total bp affected by SVs in Angus and Brahman were 10.9 Mb and 21.8 Mb. This translates to approximately 0.4% and 0.8% of the Angus and Brahman genomes, respectively. Among the six classes of SVs examined, insertion/deletion types were the most prevalent in both Brahman and Angus genomes compared to ARS-UCD1.2.
We extracted Brahman-and Angus-specific SV to study their distribution in genic and intergenic regions (Fig. 5a). For Brahman-specific SV, insertion/deletion types overlapped ~4% of all genes whereas each of the repeat-type SV overlapped ~1% of genes. In contrast, the Angus-specific insertion/deletion type SV overlapped ~1-2% of all genes and the repeattype SV overlapped less than 1% of genes. Therefore, the majority of SV were found in intergenic regions and whenever they overlapped with genes were generally localized within introns. Over-representation of Gene Ontology (GO) terms were detected for Angusspecific insertions and tandem contractions and Brahman-specific insertion/deletion SVs at FDR-adjusted P-value < 0.05 (Supplementary Table 7 Using WGS reads from different datasets, we used a two-tiered approach to identify subspecies-specific CNVs that were masked by the absence or poorer resolution of sequence in the ARS-UCD1.2 reference. The input dataset for these analyses came from ~10x WGS short reads of 38 animals representing seven cattle breeds. Each set of reads was aligned to all three reference genome assemblies (Hereford, Brahman, and Angus) and processed with SV callers designed to detect read depth differences and paired-end/splitread discordancy, respectively. The first approach (read-depth variation) included the use of the Vst statistic 13,14 to identify genes with copy number variation between taurine or indicine lineages using the Brahman, Angus or ARS-UCD1.2 assemblies, (Fig. 5b, Supplementary Fig.   7). Only autosomes were considered. Six CNV genes were found in Brahman whereas four and eight CNV genes were found in Angus and Hereford, respectively (Fig 6a-c). Prediction of CNV genes was sensitive to the assembly chosen, e.g. only TMPRSS11D and betadefensin-like precursor were found to be copy number variable in more than one assembly.
Among the 18 CNV genes differentiating indicine from taurine genomes, six unique gene families were identified, which were beta defensin, workshop cluster, trypsin-like serine protease, T-cell receptor alpha chain, tachykinin receptor, and interferon-induced very large GTPase, all of which have immune-related functions. All of the CNV genes from these six families showed higher copy number in the indicine cattle lineage regardless of the assembly used. An olfactory receptor, two long non-coding RNAs and one putative protein, FAM90A12P, also had higher copy numbers among indicine animals. In contrast, ubiquitinconjugating enzyme E2D3 and two keratin-associated protein 9 genes (KRTAP9-1, KRTAP9-2) had higher copy numbers in the taurine lineage.
We quantified the effects of using different reference assemblies for paired-end/split-read (PE) SV discovery. All SV calls of this type were converted into Hereford coordinates to facilitate comparisons. We removed 17, 9, and 18 PE SVs from the Brahman, Angus and Hereford assemblies that were likely false positives, as they were larger than 1 Mb and did not correspond to aberrant read depth signal to support their SV calls. On average, 0.5% of each cattle genome was covered by CNV regions (CNVRs) (Fig. 6d). The majority of CNVRs (at least 76% from each assembly) were found to be unique to one assembly. Among the Brahman CNVRs, only 10% intersected with Angus CNVRs, which suggests mis-assembly in the Hereford reference potentially due to compression of repetitive elements that are more difficult to resolve without phasing haplotypes using the trio binning method.

Phasing of full-length transcripts in haplotype-resolved genomes
Among the PacBio error corrected Iso-Seq (CCS) reads pooled from seven tissues of the F1 hybrid fetus, 3,275,676 reads (55%) were classified as full-length non-concatamer (FLNC) reads. After processing with the isoseq3 software, 193,974 full-length, high-quality (HQ) consensus transcripts were generated. We mapped the HQ transcripts to the Brahman reference and obtained 99,329 uniquely mapped transcripts covering 20,940 nonoverlapping loci. Using the SQANTI2 transcript characterization tool, 83% of the Iso-Seq transcripts fell into coding regions of the Brahman annotation (Fig. 7a). As many as 68% of the transcripts could be considered as novel, because they were categorized as "novel in catalog" (NIC), "novel not in catalog" (NNC), antisense, intergenic or genic. The transcript length distribution ranged from 85 to 11,872 bp, with a median of 3853 bp and a mode of ~4 kb (Fig. 7b). Our haplotype-resolved genomes allowed us to explore genes with allelic imbalance in expression. (Fig. 7d). All tissues showed evidence of imbalance in allelic expression (Shapiro test, P-value < 0.01), which was most pronounced for liver, lung, muscle and placenta, whereas brain, heart and kidney were less affected. However, as mammalian brain consists of a wide range of cell types and hence transcriptional complexity, brain tissue was chosen to demonstrate the phasing of transcripts to explore allele-specific expression. The most highly expressed Angus gene with allelic imbalance (ratio of 8 Angus : 1 Brahman) in the brain was ARIH2 (also known as TRIAD1), which is known to play a role in protein degradation via Cullin-RING E3 ubiquitin ligases 15 (Fig. 7e,f). ARIH2 expression in the liver, lung, muscle and placenta was also higher from the Angus allele than the Brahman or maternal allele. The HQ transcripts included 23 different transcript isoforms of ARIH2, however, 66% of transcripts for this gene across the seven tissues were represented by only three isoforms. The annotated exons of this gene were in good agreement with the RNA-Seq data ( Supplementary Fig. 8).
The most highly expressed Brahman gene with allelic imbalance (ratio of 1 Angus : 6 Brahman) in the brain was Calmodulin (CaM), a heat-stable Ca 2+ -binding protein that mediates the control of numerous physiological processes, including metabolic homeostasis, phospholipid turnover, ion transport, osmotic control, and apoptosis 16 ( Supplementary Fig. 9). Surprisingly, we also found allelic imbalance (ratio of 1 Angus : 16.5 Brahman) in pregnancy-associated glycoprotein 1 (PAG1) with a higher expression of the Brahman allele in the brain and placenta but undetectable in other tissues. This gene was previously thought to be placenta-specific and is used as a biomarker for embryo survival 17 .

Discussion
Traditional genome assembly approaches collapse haplotypes and therefore do not allow accurate assembly or the study of divergent, heterozygous regions. Here we demonstrate a new assembly approach that yielded highly contiguous, haplotype-resolved Brahman and Angus cattle genomes from an F1 hybrid of the two subspecies. Our analyses demonstrated that previous studies, which mapped indicine sequences onto the taurine reference UMD3.1.1 4,11 , likely inflated the number of genetic variants that are present between the two subspecies by up to 4%. Calling SNPs in transcripts from a diploid hybrid with both haplo-genomes decoded provides accurately phased transcripts for studies on the role of allele-specific expression in, e.g., hybrid vigor or heterosis. The phasing of Iso-Seq transcripts in reciprocal crosses will facilitate the exploration of breed-specific effects on parental imprinting, which has been shown in maize 18 .
We found that the choice of reference assembly had a large impact on SV calling. After converting SVs from each assembly onto the Hereford assembly coordinates and calculating the intersection, we identified 1.3 Mbp SVs present in the Angus and Brahman assemblies were not present in the Hereford assembly. This suggests that either the Hereford assembly was not as representative of the true structural variation in these regions or that there were assembly errors in the Angus and Brahman assembly that generated false positive SVs. The latter is less likely given the high accuracy of the Angus and Brahman genomes. Conversely, we identified 0.9 Mbp SVs shared between only the Hereford and Angus assembly, which may represent true genomic structural differences between taurine and indicine cattle.
The discovery of an indicus-specific, additional copy of fatty-acid desaturase 2 gene (FADS2P1), that has been under positive selection, further highlights the benefits of highquality haplotype-specific assemblies. The FADS2P1 gene region in both Brahman and Angus span ~1.4 Mb of sequence, while the two FADS2P1 genes in the water buffalo span ~1 Mb.
The orthologous region in goat is ~1 Mb, but contains gaps. Taking phylogenetic and information on conservation of synteny together, the most parsimonious explanation is that the extra FADS2P1 was duplicated in the indicine lineage after divergence from taurine cattle. Rapid evolution at the FADS2P1 locus resulted in neofunctionalization of the additional gene in indicine animals, with profound changes seen in the small exon 7.
FADS2 is a pleiotropic gene with known functions in the biosynthesis of unsaturated fatty acids, lipid homeostasis, inflammatory response, and promotion of myocyte growth and cell signaling [19][20][21] . A non-synonymous SNP in exon 7 of Japanese Black cattle is significantly associated with linoleic acid 22 composition. While we do not know the functional significance of positively selected residues in the additional FADS2P1 copy in Brahman, the SNP reported in the Japanese Black shows the importance of exon 7 in FADS2 function.

Studies in rats have shown linoleic acid is an important component of skin ceramides and its
deficiency increases water permeability of the skin 23 . Comparisons between indicine and taurine animals have shown differences in fatty acids 24 and types of phosphatidylcholines 25 .
We hypothesize Bos indicus has three copies of FADS2P1 genes to regulate the composition of fatty acids that constitute the cell membranes and could alter water permeability and heat loss from skin. The significant differences in phenotype, energy metabolism and adaptation to heat stress of indicine cattle have been linked to the thyroid hormone axis [26][27][28] . One of the selective sweep regions in Brahman contains thyroid hormone receptor β (THRB), a ligand-activated pleiotropic transcription factor that modulates the expression of a large number of genes 29 .
Thyroid hormones are intrinsically connected to the growth hormone -insulin-like growth factor axis (GH-IGF) 30 . Insulin-like growth factor 1 receptor (IGF1R) was found in another selective sweep region. Polymorphisms in IGF1R have been associated with age of puberty in Brahman cattle 31 . In comparison with taurine cattle Brahman tend to reach puberty late, which may have been under positive selection as a consequence of adaption to harsh tropical environments, ensuring that cows are more mature and robust at the time of first calving.
Quantitative trait loci for reproduction featured prominently in the comparison of Brahman selective sweep regions with known cattle QTL. Amongst the reproduction traits, QTL related to calf size and calving ease were overrepresented. Brahman cows deliver a small calf that is less likely to result in dystocia and still birth 32 , which is one major benefit of the introgression of indicine genetics into more productive taurine breeds. Selective sweep regions thus provide candidate genes for maternal control of birth weight.
Brahman cattle may be better adapted to harsher environments because they have slower protein turnover 33 . Relative to Angus, Brahman have much lower expression of ARIH2 in key metabolic organs, such as the skeletal muscle and no detectable expression in the liver.
ARIH2 promotes ubiquitylation of DCNL1, which is a co-E3 ligase that performs cullin neddylation, a process that regulates one-fifth of ubiquitin-dependent protein turnover 15 .
CNV analysis revealed a decreased number of ubiquitin-conjugating enzyme E2D3 genes in the indicine lineage, which suggests lower protein turnover in indicine animals. As a result, Brahman possibly have lower endogenous energy expenditure and protein turnover, and thus are better able to withstand stressful conditions than taurine cattle.
The CNV analyses of Brahman and Angus genomes revealed that six gene families with immune related functions and putative roles in response to disease challenge and external parasites are expanded in the indicine lineage. Conversely, KRTAP9-2, a gene with significantly altered gene expression following tick infestation 34 , is expanded in the taurine lineage, which has also been reported in previous CNV studies 14,35 . Further studies are needed to elucidate how changes in copy number of KRTAP9-2 affect its expression and its role in tick resistance.
In conclusion, the approach used here is able to create haplotype-resolved genome assemblies that are of higher quality than traditional haplotype-collapsed assemblies.
Availability of these high-quality assemblies has enabled us to better resolve structural variants and identify regions under selection that may be involved in adaption to the environment. Looking forward, it is clear that high-quality haplotype-resolved assemblies together with long-read transcripts information will underpin studies on genome function, regulation and the control of phenotypes. Gb of Sequel data were produced, which gave a total sequence yield of 366 Gb with the mean read length of ~10.4 kb. Assuming a genome size of 2.7 Gb, the raw PacBio data represents ~136x coverage.

Bos taurus hybrid
Illumina sequencing libraries for both parents (i.e. sire and dam) and F1 fetus were prepared using TruSeq PCR-free preparation kits (Illumina, San Diego, CA). A total of ~55x, ~60x and ~84x coverage of 150 bp paired-end reads were generated for the sire, dam and F1 fetus, respectively. In order to assemble phased haplotigs for the F1 Brahman-Angus hybrid, we used the Trio-binning method introduced by Koren et al 6 . Briefly, 21-mers in both sire and dam Illumina reads were identified and 21-mers unique to one or other parent were used to assign the F1 PacBio long reads to the parent of origin. Approximately 1% of the PacBio reads were excluded from the assembly as they lacked parent-of-origin-specific 21-mers, due to their shorter lengths. Long reads that were binned into paternal and maternal groups were assembled separately with TrioCanu v1.6.

Hi-C library preparation and sequencing
A Sau3AI Hi-C library was prepared (Phase Genomics, Seattle WA) as follows: approximately 200 mg of fetal lung tissue was finely chopped and then cross-linked in Proximo crosslinking solution. The 5' overhangs after Sau3AI digestion were filled with biotinylated nucleotides, and free blunt ends ligated. After ligation, crosslinks were reversed and the free DNA was column purified and sonicated to approximately 600 bp peak fragment size (Bioruptor, Diagenode). Hi-C junctions were bound to streptavidin beads and washed to remove unbound DNA. Washed beads were used to prepare sequencing libraries using the HyperPrep kit (Kapa) following manufacturer's protocols. In total, 203 million 2x 81 bp read pairs were sequenced on NextSeq Illumina platform.

Scaffolding of contigs with Hi-C
All Hi-C reads were mapped to each breed-specific set of haplotigs using BWA 37 . A haplotype score for a pair was defined as the sum of the percent identity multiplied by match length for each read end (unmapped read ends were assigned a score of 0). Each read pair had two scores, one per haplotype. Pairs with a higher score for one haplotype were considered breed-specific and assigned to their respective haplotype. Pairs with a tied score were considered homozygous and assigned to both haplotypes for scaffolding.
Three different Hi-C based scaffolding programs, 3D-DNA 38 , Proximo (Phase Genomics) and SALSA2 39 were evaluated for scaffolding contigs. Further detail on the comparison between the scaffolders is given in Supplementary Note 1. Reads were mapped with the Arima mapping pipeline (https://github.com/ArimaGenomics/mapping_pipeline commit 72c81901c671203a86ca4675457004a71d0cd249) and converted to bed format prior to SALSA2 scaffolding (https://github.com/machinegun/SALSA git commit 863203dd094aaf9b342c35feedde7dabeec37b44), which was run with parameters '-c 10000 -e GATC -m yes' for both breed-specific haplotigs.

Bionano DNA isolation and assembly
DNA was extracted from 10 mg kidney tissue from the F1 hybrid using the Bionano Animal Tissue DNA Isolation Kit (P/N 80002) with slight modifications as follows: the frozen tissue was crushed in liquid nitrogen, placed in 2% formaldehyde in Bionano animal tissue homogenization buffer (Document number 30077, Bionano-Prep-Animal-Tissue-DNA-Isolation-Soft-Tissue-Protocol.pdf), and blended with a rotor-stator. The homogenate was passed through a 100 um nylon filter, fixed on ice for 30 minutes in 2 mL 100% ethanol, and centrifuged for 5 minutes at 2000 g. The resulting pellet was re-suspended in homogenization buffer and added to pre-warmed agarose to make 0.8% agarose plugs. High molecular weight DNA was extracted from the agarose plugs, labelled, stained, and imaged on a Bionano Saphyr system 40 . Further detail on de novo optical map assembly is given in Supplementary Note 2. Iso-Seq data were generated from brain, heart muscle, kidney, liver, lung, skeletal muscle and placenta (cotyledon) tissue. Iso-Seq SMRT bell libraries were created according to the PacBio protocols. Briefly, two size selected cDNA pools were created, one with an average cDNA size ~ 3 kb and the second with a cDNA size of ~7 kb. The two pools were then combined for SMRTbell™ Template Preparation. The final average library size was ~5 kb as measured by a bioanalyser. Each SMRTbell library was loaded onto the Sequel at approximately 50pM.

Identification and phasing of full-length transcripts
The Iso-Seq data was processed using the isoseq3.1.0 software on the PacBio Bioconda  (4) polishing each isoform to create high-quality, full-length transcript sequences.
The high-quality transcript sequences were then mapped to the Brahman reference genome using minimap2 (v2.15-r905) and filtered for alignments that had ≥99% coverage and ≥95% identity. Redundant and degraded transcripts were collapsed using the Cupcake tool (https://github.com/Magdoll/cDNA_Cupcake). SQANTI2 41 was used to annotate transcripts for various features such as known isoforms with full-splice match (FSM) or incompletesplice match (ISM), novel isoforms in catalog (NIC) or not in catalog (NNC), and other novel genes that are antisense, overlap with intergenic or genic regions.
In order to phase transcripts using the Iso-Seq data, we ran IsoPhase, which is a part of the Cupcake tool, against the Brahman reference. IsoPhase first piles up the FLNC reads of all the isoforms of a gene and calls substitution SNPs using a one-sided Fisher exact test with Bonferroni correction at a P-value cut-off of 0.01. It then infers haplotypes based on the phasing information provided by the FLNC reads. The output defines the inferred haplotypes for each transcript and estimates the relative abundance of each allele. We ran IsoPhase using the pooled set of all FLNC reads from all tissues, then later demultiplex them to create an abundance matrix that is specific for each haplotype, per isoform FLNC count for each tissue. To compare the abundance of transcripts across tissues, we normalized the counts by dividing the FLNC counts for each haplotype-isoform by the total number of FLNC counts in that tissue, multiplied by a million to obtain the transcript per million (TPM) number.
IsoPhase was able to identify at least one SNP in 6273 of the genes. We then validated the IsoPhase SNPs using (1) SNPs called from RNA-Seq data of brain, liver, lung, muscle, placenta, and (2) Angus SNPs derived from mapping Illumina WGS short reads of the F1 hybrid to the Brahman reference. For RNA-Seq, read mapping was performed with Hisat2 42 whereas the genomic short reads were mapped using BWA v0.7.15 37 . SNPs were called using GATK v4 43 . As the RNA-Seq had greater coverage than Iso-Seq and the SNPs called from genomic DNA included non-transcribed regions, only SNPs that were in positions covered by at least 40 full-length Iso-Seq reads were retained. The proportion of an allele from each breed, to assess allelic imbalance, was calculated as the normalized count of the Brahman allele divided by the sum of normalized count of both Brahman and Angus alleles.

Scaffold validation with recombination map
Scaffold contiguity was assessed using a previously published recombination map 44 . Briefly, the recombination map probe sequences were aligned using BWA MEM to the scaffolds and the coordinates were arranged in a directed acyclic graph, using a custom script. A contiguity break between consecutive recombination map-ordered probes in the scaffolds was considered an error, however, we tolerated one mismatched probe in a window of three consecutive probes (Hamming distance = 1) to avoid false positive detection due to mapping ambiguity. Despite having Hi-C sequences, some scaffolds that belonged to chromosomes could not be joined together, which necessitated the use of recombination map markers to join and orientate these scaffolds.

Gap filling and polishing
After checking scaffolds with recombination maps 44 , the Angus and Brahman scaffolds that contained 343 and 369 gaps, respectively, were gap filled with PBJelly 45 v15.8.24 using haplotype-specific PacBio subreads. The default parameters of PBJelly were used, except for the support module, where the options "captureOnly and spanOnly" were used. This step closed 52 and 61 gaps in Angus and Brahman scaffolds, respectively. Two rounds of ArrowGrid (see URLs) was run to polish the scaffolds to give quality scores.

Assembly evaluation and genome annotation
The assemblies were evaluated with BUSCO v2.0.1 46 and other metrics that include compression/expansion (CE) errors. Annotations were created using the Ensembl gene annotation system 47 and the NCBI pipeline. Further detail on the annotation process is given in Supplementary Notes 3-4 and for assembly evaluation, detail is given in Supplementary Note 5.

Repeat analysis
RepeatMasker version open-4.0.7 (see URLs) was used to search for repeats in the UOA_Angus_1 and UOA_Brahman_1 assemblies by identifying matches to RepBase (version RepBase23.10.embl) 48 . Repeats in the current water buffalo assembly (UOA_WB_1) and cattle assembly (UMD3.1.1) were downloaded from the NCBI. Repeats with matches less than or equal 60% identity were filtered out. Centromeric repeats were identified by searching repeats that belonged to the family 'Satellite/centr' in Repbase. The vertebrate telomeric repeat, 6-mer TTAGGG, was identified by RepeatMasker. The search for at least 2 consecutive identical TTAGGG repeats within 1000 kb of chromosome ends was done to detect presence of telomeres.

Gap comparisons and sequence contiguity
To evaluate gaps and sequence contiguity, the Angus and Brahman assemblies were compared to the water buffalo, human and Hereford cattle assemblies. Only sequences that belong to autosomes and sex chromosomes were retained for analysis, whereas unplaced and mitochondrial sequences were filtered out. The tool seqtk v1.2-r94 (see URLs) was used to count gaps with similar code implementation as those used for the water buffalo genome 9 . Potential adapters in the sequence reads were removed using AdapterRemoval v2.2.1 52 .

Single-nucleotide polymorphism and indel calls
Following trimming, the reads were checked with FASTQC again to ensure that only highquality reads were retained. Reads were then mapped to both the Angus and Brahman assemblies separately using BWA v0.7.15 37 with the option "mem". Samtools v1.8 53 was used to convert the resulting alignment to sorted bam format. Duplicate reads, that may be due to PCR artifacts, were marked with Picard 54 MarkDuplicates. The bam files from each individual animal were merged with GATK v4 43

Structural variant and copy number variant analyses
WGS short read data sets from the same 38 animals used for SNP and indel calls were aligned to the UOA_Angus_1, UOA_Brahman_1 and ARS-UCD1.2 with BWA MEM 37 and further processed with Samtools v1.9 53 . Read-pair and split-read profile structural variants were called with the lumpy-sv v0.2.13 56 pipeline, lumpyexpress, using default parameters for each sample. lumpy-sv VCF files were converted to BEDPE format using the vcfToBedpe script included in the lumpy-sv software package. Copy number estimates for genomic segments were calculated from normalized WGS read depth using JaRMS v0.0.13 as previously described 57 . As JaRMS estimates of genomic copy number are distributed around a value of "1" as the normal diploid copy number count, we multiplied the "levels" estimates from the JaRMS program by two to obtain the adjusted copy number state of genomic regions. JaRMS copy number estimates were used to estimate the population differentiation of taurine and indicine cattle on a per-gene basis using the Vst metric 13,14 . A custom script (CalculateVstDifferences.py) was used to automate the calculation of Vst and generation of data tables for plotting. Genes that had a Vst higher than 0.3, which is equivalent to the top 1% Vst, and a difference in average copy number between groups greater than three were considered have a significant difference in copy number between taurine and indicine populations.
In addition to using short WGS reads from the 38 individuals of seven breeds to find structural variants, the haplotype-resolved Angus and Brahman genomes were aligned with the high-quality ARS-UCD1.2 cattle reference to assess structural variants. The advantage of aligning to ARS-UCD1.2 was to standardize the structural variants specific to each haplotype on a common coordinate system. Contigs obtained by breaking final scaffolds at gap positions from UOA_Angus_1 and UOA_Brahman_1 were aligned using nucmer v4 58 to the ARS-UCD1.2 assembly to identify the larger structural differences (50 bp to 10,000 bp) using Assemblytics 12 . The nucmer alignment parameters were "--maxmatch -t 4 -l 100 -c 500", which was followed by delta-filter with the option "-g". Assemblytics parameters followed the default settings, which were "Unique sequence length required: 10000, Maximum More detail on similar positive selection methodology can be found in our study on mammalian GSTs 66 .

Identification of selective sweep regions
To uncover genetic variants involved in indicine adaptative selection, we designed a strategy to identify selective sweeps using the haplotype-resolved Brahman genome. This method is analogous to the Cross Population Extended Haplotype Homozygosity (XP-EHH) 67

Code availability
Custom scripts can be found at GitHub repository at the following URL:    Figure 1. An overview of assembly methods. Long PacBio reads were binned to the respective haplotypes using parental specific k-mers prior to assembly by TrioCanu.
Unassigned reads were discarded. Each haplotig assembly was scaffolded separately with Hi-C and optical map data. Both Hi-C and optical map scaffolding have chimeric haplotig detection capability and the optical map is better at resolving chimeras. Where possible the break positions of optical map-based cuts were adjusted based on F1 coverage and alignment with the other breed haplotig. Hi-C and optical map-based scaffolds were consolidated and cattle recombination maps were used to validate the assembly. Finally, haplotype specific long reads were used to fill gaps and polish the sequence.      level for ARIH2, which is the most highly expressed Angus gene in the brain f) Tissue-specific allelic expression at the transcript level for ARIH2 in brain, heart, kidney, liver, lung, muscle and placenta. A denotes "Angus" and B denotes "Brahman".