A hepatitis B virus (HBV) sequence variation graph improves sequence alignment and sample-specific consensus sequence construction for genetic analysis of HBV

Hepatitis B virus (HBV) remains a global public health concern, with over 250 million individuals living with chronic HBV infection (CHB) and no curative therapy currently available. Viral diversity is associated with CHB pathogenesis and immunological control of infection. Improved methods to characterize the viral genome at both the population and intra-host level could aid drug development efforts. Conventionally, HBV sequencing data are aligned to a linear reference genome and only sequences capable of aligning to the reference are captured for analysis. Reference selection has additional consequences, including sample-specific ‘consensus’ sequence construction. It remains unclear how to select a reference from available sequences and whether a single reference is sufficient for genetic analyses. Using simulated short-read sequencing data generated from full-length publicly available HBV genome sequences and HBV sequencing data from a longitudinally sampled individual with CHB, we investigate alternative graph-based alignment approaches. We demonstrate that using a phylogenetically representative ‘genome graph’ for alignment, rather than linear reference sequences, avoids issues of reference ambiguity, improves alignment, and facilitates the construction of sample-specific consensus sequences genetically similar to an individual’s infection. Graph-based methods can therefore improve efforts to characterize the genetics of viral pathogens, including HBV, and may have broad implications in host pathogen research.


INTRODUCTION
sequences to identify the most appropriate reference sequence but this is both computationally expensive and can still fail within the context of recombinant or mixed infections.
Nevertheless, the use of unrepresentative reference sequences can interfere with characterizing pathogen diversity, resulting in false or missing mutations and biased phylogenetic relationships. [33][34][35] Reference selection can also affect the ability to accurately derive sample-specific 'consensus' sequences, which provide an approximation of the genome sequence causing an infection. This issue of reference ambiguity is especially problematic for CHB, as a set of phylogenetically representative HBV reference sequences has only recently been proposed. 36 Furthermore, given the extreme diversity of HBV, the use of a single reference sequence, even of the correct HBV genotype, may be insufficient. 32,33 Aligning to a phylogenetically representative 'genome graph' constructed from many different HBV genome sequences rather than a single HBV genome sequence could potentially avoid these reference ambiguity issues. Genome graphs are comprised of 'nodes' which reflect stretches of genetic sequence connected by 'edges' which determine the path a genome sequence traverses across a subset of nodes within the graph. 31,34,37,38 Graph-based structures containing genetic variation from multiple genomes have been shown to improve sequence alignment and variant calling for highly variable regions of the human genome and microbial organisms like Escherichia coli. 31,34,39 A graph-based reference containing a representative sampling of the genetic variation observed across all known HBV genotypes/subgenotypes might also improve sequence alignment and variant calling, as well as enable the generation of accurate sample-specific consensus sequences for HBV-related genetic analyses. Consensus sequences reflect the most commonly observed nucleotide at each site across the . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 12, 2023. genome, inferred from aligning sequencing data against a specific reference sequence.
The construction of consensus sequences is a typical objective of viral-focused genetic analyses, 40,41 including HBV. 42,43 However, whether a graph-based approach can improve viral sequence alignment and sample-specific consensus sequence construction has, to our knowledge, yet to be demonstrated.
In this study, we leverage 2,837 publicly available full-length HBV genomes, simulated high-throughput sequencing data from these HBV sequences, and real-world longitudinally collected sequencing data from an individual with CHB to identify the optimal alignment method as determined by the proportion of successfully aligned HBV sequencing data. Additionally, by comparing alignment derived sample-specific consensus sequences, the accuracy of graph vs. linear reference-based alignment methods will be evaluated.

Source of genetic sequence data
Full-length HBV genome sequences: A set of non-redundant full-length HBV genomes (N=2,837) was obtained from the publicly available resource provided by McNaughton et al. 36 Briefly, 7,108 full-length HBV genomes were obtained from the HBVdb database (https://hbvdb.lyon.inserm.fr/HBVdb/) and recombinant or highly similar full-length HBV genome sequences were removed. A set of 44 sequences representative of all phylogenetically identified genotypes, subgenotypes, and genetically distinct clades was then identified for use as reference sequences in downstream analyses.
High-throughput CHB sequencing data: HBV-targeted sequencing data from an individual included in a longitudinal cohort study of treatment naïve individuals with CHB was obtained via the NCBI Sequence Read Archive (BioProject ID: 479693). 44 Sample-reference graph was created using the pangenome graph builder (PGGB) pipeline (https://github.com/pangenome/pggb), which performs pairwise whole-genome alignment using wfmash and graph induction using the seqwish software. 48,49 PGGB can then sort and order the graph via partial order alignment using smoothxg (https://github.com/pangenome/smoothxg).
The variation graph toolkit (VG, v1.39) was used to perform all graph-related format conversions, indexing, sequence-to-graph alignment, and collating of mapping/alignment statistics, as described in the VG documentation. 31,50 Fast short-read alignment via the VG giraffe mapper was accomplished by creating a haplotype-aware graph index where each reference genome was indexed as a unique haplotype. 51 Highly accurate but more computationally intensive graph-based mapping was performed using the VG map mapper.
Establishing internal validity of the HBV reference graph: All reads simulated from the graph-embedded HBV genomes were concatenated and then randomly subsampled to 20,000X coverage seven times. Coverage-based subsampling was performed using rasusa. 52 These subsampled HBV sequencing datasets were aligned to the HBV reference graph using the haplotype-aware VG giraffe mapper. For the graph to be internally valid, we required >99% of the reads simulated from HBV genomes embedded within the graph to successfully align.
To assess whether each path within the graph was utilized during sequence alignment, and to test whether aligned sequences had the highest alignment scores to graphembedded HBV genomes which were more genetically similar to the aligned sequences, each full-length HBV sequence (N=2,837) was aligned to the graph using VG map. A 'correct' alignment was observed if the reference path with the highest alignment score was of the same HBV genotype as the query sequence. Path-specific alignment scores . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 12, 2023. ; https://doi.org/10.1101/2023.01.11.523611 doi: bioRxiv preprint were also derived for alignments made using the set of simulated high throughput sequencing data from HBV genomes not used in graph construction (N=59). Briefly, by identifying the graph nodes for each path with successful alignments, the genome path with the most alignments was able to be identified ( Figure S1). Path-specific alignment scores were derived using the sum of weights estimated for each node involved in a successful alignment. Weights reflect the path depth of each node (i.e. the number of genome sequences containing/traversing through the node), with nodes traversed by a single HBV genome weighted heavily and nodes traversed by all genomes weighted least ( Figure S2). A 'correct' alignment was observed if the path with the highest weighted alignment score was of the same HBV genotype as the genome sequence used to simulate the HBV sequencing data.

Alignment of HBV sequencing datasets -graph vs. linear references
Simulated HBV sequencing data: To determine whether a graph-based reference improves sequence alignment compared to linear reference-based approaches for HBV sequencing datasets, we aligned the combined simulated high-throughput sequencing data (generated from 59 HBV genomes not included within the graph) to the graph using VG giraffe and to each linear reference sequence (N=44) using BWA-MEM. The proportion of successfully aligned reads was obtained using 'VG stats' and 'SAMtools flagstat', 53 respectively. While comparisons of the computational time and resources required for variation graph and linear-reference-based aligners have been performed previously, 50 '/usr/bin/time' estimates for the alignments using BWA-MEM, VG giraffe, VG giraffe in fast-mode, and VG map can be found in Table S1. To approximate a more realistic scenario, in which the observed genetic diversity spans a subset of HBV genotypes known to circulate within a geographic region rather than all currently known HBV genotypes/subgenotypes, linear reference and graph-based alignment . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 12, 2023. ; https://doi.org/10.1101/2023.01.11.523611 doi: bioRxiv preprint comparisons were performed using simulated sequencing data from randomly selected HBV genotype B (N=6) and C (N=12) sequences, the primary genotypes endemic in East and Southeast Asia. 54,55 We estimated the depth of coverage across the HBV genome for the alignment of all simulated high-throughput sequencing data to each linear reference using 'SAMtools depth'. Genotype-specific depth estimates were obtained by estimating the mean alignment depth across alignments made using references of the same genotype via a sliding-window approach (50bp wide) in R. Local minima in depth were estimated using the ggmisc package in R. To approximate site-specific depth of coverage across the HBV genome from the graph-based alignment, the start site of each successfully aligned read was used to infer coverage by estimating a rolling sum of the median number of reads within a sliding window the length of the simulated reads (125bp).
To facilitate the comparison of whether regions of poor coverage corresponded to loci of increased pairwise diversity, the local nucleotide sequence diversity across the set of reference sequences (N=44) was estimated using a sliding window approach (150bp wide) in R using the pegas package. 56 Alignment of real CHB sequencing data: To determine the approximate sequencing depth for each CHB sample (N=5), raw sequencing data were aligned to each linear reference sequence (N=44) using BWA-MEM. 57 Alignment quality was assessed using Qualimap (v2.2.1). 58 The proportion of successfully aligned reads were estimated using 'SAMtools flagstat'. For each sample, the linear reference with the highest proportion of successfully aligned reads was an HBV subgenotype B2 sequence (GenBank ID: GU815637). For alignments to this reference, mean depth of coverage ranged from 82,930X-334,157X. To reduce computational time and resources required for our analyses, QC-passed reads were down-sampled to obtain an average coverage of . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 12, 2023. ; https://doi.org/10.1101/2023.01.11.523611 doi: bioRxiv preprint 20,000X. To test whether subsampling altered the proportion of successfully aligned reads, subsampled reads were also aligned to each linear reference sequence and the proportion of successful alignments was compared to the alignments involving all QCpassed sequencing data using a binomial generalized linear mixed model (GLMM) with random intercepts in R. The GLMM treated each alignment as a binomial outcome (successful alignment vs. not successful alignment), with the total number of referencespecific alignments used as weights. Whether alignments of these subsampled reads to an extended linear reference, obtained by concatenating the first 120bp of each reference to the end of each sequence, altered alignment statistics were also assessed using the same GLMM performed in R. To identify whether reads which failed to align to sub-optimal linear references (non-HBV subgenotype B2) were uniformly distributed across the genome, unaligned reads from each linear reference-based alignment were re-aligned to the best performing linear reference. The genome-wide distribution of these 'rescued' reads was visually assessed in R.
Graph-based alignment of subsampled CHB sequencing data was performed using the VG map mapper. For samples with higher alignment proportions to the graph than any linear reference, unmapped reads from the best performing linear reference for each sample were re-mapped to the graph and the genome-wide distribution of the reads 'rescued' via graph alignment was visualized using R. To identify and visualize the loci where HBV sequence was rescued via graph alignment, the rescued reads were queried via BLAST against a compacted de Bruijn Graph comprised of the reference sequences and de novo (reference-free) assembled HBV haplotypes from each sample created using Bifrost and visualized with Bandage. 59, 60 We also performed BLAST in Bandage using these successfully re-mapped reads against the HBV reference graph only to confirm that rescued reads mapped to regions of increased graph complexity.
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 12, 2023. ; https://doi.org/10.1101/2023.01.11.523611 doi: bioRxiv preprint

Derived consensus sequences -graph vs. linear reference sequences
Simulated HBV sequencing data: Consensus sequences were obtained from linear reference-based alignments of simulated non-graph derived sequencing data using iVar. 61 iVar was developed to analyze amplicon-based viral sequencing data and leverages SAMtools to call variants and derive a consensus from the most common nucleotide across each position in an alignment file. We used a minimum base-level quality score of 20 and depth threshold of 10 while accounting for ambiguous nucleotides. For graph-based alignments, we used a wholly graph-based variant calling approach leveraging the alignments across all paths using VG (https://github.com/vgteam/vg), followed by consensus generation via bcftools. 62 Longitudinal CHB sample consensus sequences: Prior to performing alignment and variant calling for the real CHB samples, QC-passed paired-end reads were merged using PEAR and filtered to retain the highest quality reads >150bp long for analysis via bbmap. 63,64 Reads were aligned to each linear reference or the HBV reference graph, followed by iVar-based consensus sequence identification. We also performed variant calling using the LoFreq software, a variant calling tool able to identify even lowfrequency variants from high-coverage data across diverse genetic sequencing datasets, 65 for each linear reference-based alignment followed by consensus generation using a majority allele rule for each site (i.e. alleles with frequency >50% were integrated into the consensus sequence) via bcftools. For LoFreq-derived consensuses, we used the same depth threshold (10) used in iVar and estimated insertion/deletion qualities which were used in addition to LoFreq's method of combining base-level, mapping, and alignment quality information to determine variant quality and identify the majority nucleotide at each position, accounting for insertions/deletions. Graph-based alignment was performed using VG giraffe. VG-based variant calling using the graph alignments . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 12, 2023. ; https://doi.org/10.1101/2023.01.11.523611 doi: bioRxiv preprint and consensus generation were obtained via bcftools. For these CHB samples, we also derived consensus sequences after re-aligning the successful graph-aligned reads to a single path within the graph via VG, termed surjection, followed by iVar consensus construction. Graph-aligned reads were surjected into the path corresponding to the best-performing linear reference.

Consensus sequence comparisons
Consensus sequence comparisons from simulated HBV sequencing data: Comparisons between each simulation-based consensus sequence and the full set of HBV genomes from which reads were simulated were performed using Mash, 66 which estimates a genetic distance metric, the Mash distance, based on the estimated mutation rate between two sets of sequences and the jaccard index (the fraction of k-mers shared between the comparison sequences). The Mash distance also approximates average nucleotide identity (ANI) estimates, with ANI equivalent to one minus the distance estimate, while also having the benefit of facilitating comparisons between sequences/sequencing datasets of variable lengths/sizes. 66 Given the short length of the HBV genome (3.2kb), a k-mer sequence length of 7 was used for Mash distance estimations. 67,68 The consensus sequence with the lowest estimated genetic distance with the set of full-length HBV genome sequences can be inferred to be the most accurate or genetically representative consensus sequence.
Identifying accurate consensus sequences from real CHB sequencing data: To facilitate comparisons between CHB-derived consensus sequences and to identify the most genetically similar consensus to the HBV quasispecies of each sample, we estimated the Mash distance between each consensus and the subsampled HBV sequencing data which aligned to the best performing reference (linear or graph-based) for each sample.
We also performed de novo HBV strain-level assembly using SAVAGE and VG-Flow to . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 12, 2023. ; https://doi.org/10.1101/2023.01.11.523611 doi: bioRxiv preprint identify the viral haplotypes comprising each CHB infection. 69,70 For each sample, the best-performing linear reference was added to the SAVAGE output for VG-Flow to improve strain-level contiguity and assembly. The set of sample-specific viral haplotypes with frequencies >1% were included in all pairwise genetic distance comparisons. The consensus sequence with the lowest estimated genetic distance with the HBV-specific high throughput sequencing data can be inferred to be the most accurate and genetically representative consensus sequence for each sample.

Simulations to assess HBV sequence-to-graph alignment and coverage
Short Illumina-like reads were simulated from 59 genetically diverse HBV genomes encompassing 9 HBV genotypes and aligned to a non-overlapping set of 44 phylogenetically representative HBV genome sequences reflecting the known breadth of HBV diversity. Despite all reads being of HBV origin (N=500,002 reads), only 84.3% to 96.6% of sequences successfully aligned to these 44 linear references (Figure 1). In contrast, >99.9% of this diverse simulated HBV sequencing data successfully aligned to an HBV reference graph constructed using the same set of 44 phylogenetically representative HBV genome sequences. To ensure that loci from across all HBV genomes used to create the graph are adequately represented by the reference graph, seven randomly subsampled sets of simulated high-throughput HBV sequencing data (N=507,938 reads) generated from these 44 genomes were aligned to the HBV reference graph, with 100% of reads always aligning to the graph.
To reflect a more realistic scenario utilizing simulated HBV sequencing data, we limited the simulated data used in our alignment comparisons to those generated from HBV . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 12, 2023. ; https://doi.org/10.1101/2023.01.11.523611 doi: bioRxiv preprint sequences of genotypes B or C. While >99.9% of simulated reads from genotypes B and C successfully aligned to the reference graph, a high proportion of these reads also successfully aligned to linear reference sequences of genotype B (97.9%-98.4%) and C (97.6%-98.8%) (Figure S3).
The simulated HBV sequences that failed to align to the linear references (3.4%-15.7%) were not uniformly distributed across the genome, with loci observed to have precipitous drops in coverage corresponding to loci of increased genetic diversity (Figure 1). Within these loci, drops in coverage were highly heterogeneous across reference sequences of different HBV genotypes, with the lowest proportion of successfully aligned HBV sequencing data and the most precipitous drops in coverage in regions of increased nucleotide diversity occurring for HBV genotypes H and G reference sequences. As >99.9% of sequencing data successfully aligned to the HBV reference graph, no major coverage differences were observed.
To determine whether graph-aligned HBV sequences aligned best to the path/reference sequence embedded within the graph of the same HBV genotype as the query sequence, all full-length HBV genome sequences (N=2,837) and each set of simulated short-read HBV sequencing data generated from the HBV genomes used in the simulations (N=59) were aligned to the HBV reference graph. For each alignment, query sequences always resulted in paths of the same HBV genotype having the highest alignment score (Table 1), demonstrating the importance of representing each phylogenetically distinct HBV genotype within the reference graph. These results also demonstrate that the path-specificity of sequence-to-graph alignment can enable HBV genotype prediction using either the alignment score directly or a metric based on the path-depth of nodes with successful alignments for genome-length and high-throughput HBV sequences, respectively. . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 12, 2023. ; https://doi.org/10.1101/2023.01.11.523611 doi: bioRxiv preprint

Alignment of real CHB sequencing data to an HBV reference graph
Unlike our simulated HBV sequencing datasets, real CHB-derived HBV sequencing data can reflect extensive genetic variation due to both host and pathogen-derived evolutionary pressures in addition to any sample processing or sequencing-related errors. Additionally, real CHB sequencing data can have highly variable quality and coverage distributions across the genome. For the analyses of real longitudinally collected CHB samples, graph-based sequence alignment consistently achieved higher proportions of successfully aligned HBV sequence data compared to any single linear reference (N=44), with 98.6%, 98.7%, 93.5%, 99.4%, and 98.7% of successfully aligned sequence for samples SRR747499, SRR747500, SRR747501, SRR747502, SRR747513 (Figure 2), respectively ( Figure S4). The choice of linear reference had a significant effect on the proportion of aligned sequences across the five samples, with a per-sample difference between the best and worst performing linear reference ranging from 32.8% to 38.1%. The best performing linear reference (HBV subgenotype B2, GenBank ID: GU815637) resulted in 98.5%, 98.6%, 92.4%, 99.3%, and 98.6% of aligned sequences for each sample (SRR747499, SRR747500, SRR747501, SRR747502, SRR747513, respectively), which were all lower than the proportion of successful graph-based alignments. Notably, differences were also observed between references of the 'correct' HBV genotype (B), with a per-sample difference between the best and worst-performing reference ranging between 7.2% (85.3% vs. 92.4% for SRR7471501) and 7.8% (90.8% vs. 98.6% for SRR7471513). Thus, compared to graphbased alignment across these samples, up to 8.8% of HBV sequencing data can be missed due to the use of a linear reference sequence of the correct HBV genotype compared to the HBV reference graph.
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 12, 2023. ; https://doi.org/10.1101/2023.01.11.523611 doi: bioRxiv preprint HBV-derived sequences that failed to align to the non-subgenotype B2 reference were also not uniformly distributed across the genome (Figure S5). The distribution of where rescued reads aligned to the B2 reference is informative as more reads from non-HBV genotype B references were rescued across the HBV genome except in the precore/core region. At this locus, the average distribution of rescued reads from genotype C was always the lowest (Figure S5), which is unsurprising as the pre-core/core region within HBV subgenotype B2 reflects a known recombination event between genotypes B and C. 55 For reads which still failed to align to the best performing linear B2 reference sequence across each sample, 30.5%-63.2% were rescued via graph-based alignment (N=130,154, 71,526, 83,348, 61,692, 52,071 reads rescued, respectively). The distribution in the start sites of all rescued reads was similarly non-uniformly distributed.
Interestingly, the loci in which graph-based alignment rescued the most reads also corresponded to loci with increased pairwise nucleotide diversity estimated across the 44 phylogenetically representative proposed HBV reference sequences (Figure S6), suggesting regions of increased genetic diversity globally may correspond to loci of increased intra-host sequence variation in real CHB samples.
There was no significant difference in the observed proportion of successfully aligned reads when using all or a subset of the QC-passed HBV sequencing data across the real CHB samples (P>0.99). Additionally, there was no significant difference in the proportion of aligned reads when the full-length linear reference sequences or extended linear reference sequences were used across the linear reference-based alignments (P=0.76) ( Figure S7).
Graph-derived consensus sequences are more genetically similar to HBV sequencing datasets . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 12, 2023. ; https://doi.org/10.1101/2023.01.11.523611 doi: bioRxiv preprint While no single full genome-length HBV sequence could realistically capture the sequence variation observed across our simulated high throughput HBV sequencing data, the graph-based variant calling performed using the variation graph toolkit (VG) provided a consensus sequence with the lowest genetic distance to the full set of HBV genomes used in the simulations (Figure S8) mash distances ranging between 7.50x10 -2 and 7.82x10 -2 . Using the Mash distance as an approximation of average nucleotide identity (ANI), consensus sequences had ANIs ranging between 92.2%-92.5%, with the consensus inferred from VG-based variant calling having the highest ANI (92.5%).
For consensus sequences derived using the subset of HBV sequencing data generated from HBV genotypes B/C only, VG-based variant calling also resulted in the sequence with the lowest Mash distance and highest ANI compared to the full HBV genotype B/C sequences. However, all genotype B and C specific consensus sequences had similar Mash distances, ranging between 6.01x10 -2 and 6.17x10 -2 , and ANIs ranged between 93.8% and 94.0% (Figure S9).
For analyses of the real CHB sequencing data, the de novo assembled viral haplotypes always had the lowest Mash distance compared to the HBV sequencing data for each sample. This suggests viral haplotypes comprising an individual's CHB quasispecies better approximate the overall sequence diversity of an infection than any derived consensus sequence.
For consensus sequence comparisons, a graph-based variant calling approach resulted in consensus sequences with the lowest average Mash distance and highest ANI compared to each sample-specific set of HBV sequencing data across the longitudinal CHB samples. Our graph-based consensus sequence construction method provided improvements (i.e., a reduction in genetic distance) over attempts involving linear reference sequences when variants were identified via LoFreq and every sample other . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 12, 2023. ; https://doi.org/10.1101/2023.01.11.523611 doi: bioRxiv preprint than SRR7471499 when consensus sequences were generated using iVar (Figure 3,   Figure S10). For this sample, graph-based variant calling and consensus sequence generation resulted in the same genetic similarity estimate (ANI=89.3%) as a consensus derived using a subgenotype C1 reference sequence (GenBank ID: DQ089781). While both iVar and LoFreq can be used to identify variants across a diverse set of viral pathogens, 71,72 LoFreq has repeatedly been used to identify variants from real CHBderived HBV sequencing data. 73,74 Additionally, while both consensus identification methods used the same site-specific depth threshold, our LoFreq-based approach accounted for insertions and deletions, potentially explaining its consistently lower Mash distance compared to the consensus sequences obtained via iVar (Figure 3, Figure   S10).

Discussion
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 12, 2023. ; In this study, we confirm that the choice of reference plays a critical role in the alignment of high throughput HBV sequencing data and can influence the construction of samplespecific consensus sequences in genetic studies of CHB. We also demonstrate that sequence variation graphs can improve upon widely accepted methodologies used for sequence alignment of highly diverse pathogens such as HBV. Using both real CHB and simulated high diversity HBV sequencing datasets, we show that alignment to a phylogenetically representative reference graph results in a higher proportion of successful sequence alignment and facilitates the generation of accurate samplespecific consensus sequences.
As the benefits of sequence-to-graph alignment are greatest for highly diverse sequencing datasets, the utility of graph-based sequence alignment is dependent upon the research question of interest. For example, sequence-to-graph alignment recovers only marginally more simulated sequencing data generated from a subset of HBV genotype B and C sequences ( Figure S3) compared to linear reference-based approaches using genotype B or C reference sequences. Furthermore, linear referencebased sequence alignment is highly successful at capturing HBV sequences from regions across the HBV genome with non-extreme global sequence diversity (Figure 1).
While our results demonstrate that for regions of increased diversity any single linear reference is likely insufficient to capture the genetic variation observed across all HBV genotypes/subgenotypes or mixed CHB infections, many CHB infections are comprised of a single HBV genotype, and thus linear reference-based alignment using a correct genotype/subgenotype sequence would not be expected to omit important information. This is, however, not guaranteed as we find >8% of viral sequence can fail aligning to a phylogenetically representative linear reference sequence. For more genetically diverse infections, a hybrid approach in which a linear reference-based alignment is followed by . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 12, 2023. graph alignment of unmapped reads could also solve issues related to reference ambiguity while limiting the computational burden associated with graph-based sequence alignment. Notably, we find reads rescued via graph alignment largely originate from regions across the HBV genome of increased global sequence diversity ( Figure S6), suggesting these loci could also correspond to regions of increased intrahost genetic variation.
The generation of consensus sequences is an important product of microbial-focused genomic analyses, and novel software and workflows devoted to generating pathogen consensus sequences, including for SARS-CoV-2, 40 continue to be developed. In addition to providing an accurate characterization of the genome comprising a clinical infection, publicly available consensus sequences enable molecular epidemiologyfocused research, allow for large-scale phylogenetic analysis, and can aid disease surveillance efforts. [75][76][77][78] Consensus sequences can also serve as the 'reference' sequence in subsequent bioinformatic analyses, reducing the number of spuriously identified variants for HBV. 32 Thus, care should be taken to ensure the most accurate and genetically representative sequences are obtained from clinical CHB or other HBVinfected samples. We show that graph-based alignment and variant calling can often improve upon linear reference-based approaches to derive sample-specific consensus sequences, even when such efforts utilize reference sequences of the correct HBV subgenotype, which produces consensus sequences less genetically similar to an individual's CHB quasispecies than sequences derived via graph alignment. However, given the minute differences in average nucleotide identity between consensus sequences obtained from any of the best performing linear references and our graphbased approach, the deleterious consequences of linear reference-based alignment are . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 12, 2023. ; https://doi.org/10.1101/2023.01.11.523611 doi: bioRxiv preprint likely minimal when effort is made to first identify a genetically representative reference sequence for use in alignment.
However, when reference selection is not carefully considered, unrepresentative reference sequences can impact the fidelity of consensus sequences and other downstream phylogenetic-focused analyses. 33,79 It is therefore possible that some publicly available full-length HBV genome sequences are not the most accurate HBVrelated sample-specific consensus sequences. While alternative approaches to infer sample-specific consensus sequences exist, 5,32 our approach using the Mash distance to compare and identify the consensus sequence that best approximates the set of HBV sequencing data or de novo assembled haplotypes could provide a mechanism by which sample-specific consensus sequences are compared and selected for use as ideal reference sequences.
While sequence-to-graph alignment requires more computational resources than linear alignment-based approaches, especially for the VG map mapper (Table S1), if the goal is to capture and retain as much HBV-related sequencing data as possible for analysis, we show that graph-based methods outperform traditional linear reference-based alignment for HBV. We should note that effective tools enabling sequence-to-graph alignment and the subsequent identification of graph-derived genetic variation are a relatively recent development. Improvements in computational performance have already been demonstrated through graph-simplification and the development of more advanced mapping and variant identification models, [80][81][82] with further improvements expected. 83 While alternatives to graph-based alignment which leverage multiple reference sequences have also been developed, such as alignment using multiple linear references in tandem, 84,85 their performance has not been assessed using HBV . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 12, 2023. ; https://doi.org/10.1101/2023.01.11.523611 doi: bioRxiv preprint sequencing data. Furthermore, an added benefit of graph-based approaches is that differences observed between the embedded paths/reference sequences can be utilized during variant calling to identify loci of genetic variation, in addition to mutations inferred from the alignment of sequencing data directly. The ability to leverage graph topology was demonstrated in our use of path depth to infer the genotype of HBV sequences aligned to the reference graph. Alternative graph-based prediction methods, including models for HBV subgenotype prediction or recombination detection, are worth further exploration. Whether metrics linked to graph topology or complexity, including path depth, can be used to better characterize the viral genetics of CHB quasispecies, either within specific regions or across the genome, 86,87 or the genetic diversity of HBV generally, remains unexplored. For example, we observe the distribution of path depth within the phylogenetically representative HBV reference graph approximates a universal gene frequency distribution typical of many bacterial species (Figure S2), 34 despite there being no distinction between a core and accessory genome for HBV.
Future efforts should investigate the utility and potential clinical importance of these graph-derived measures of genetic complexity for HBV and other microbial pathogens of public health importance. For example, graph-to-graph comparisons could enable the analysis of genetic sequence data in ways that Euclidean data structures cannot.
Graph-based approaches are being increasingly used to investigate highly genetically diverse microbial pathogens and regions of the human genome. In this study, we demonstrate the limitations of using linear HBV reference sequences to derive consensus sequences for CHB samples. Furthermore, we hope to mitigate issues of HBV reference ambiguity by making this HBV reference graph publicly available, which will also promote the use of graph-based advances in genetic analyses to improve our understanding of CHB genetics.
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 12, 2023. ; https://doi.org/10.1101/2023.01.11.523611 doi: bioRxiv preprint

Data and code availability
The longitudinal HBV sequencing data utilized in this study is available as an NCBI BioProject under the accession PRJNA479693. Simulated HBV sequencing data and the HBV reference graph have been deposited on Zenodo and can be accessed using the following doi: 10.5281/zenodo.6646207.
Code used to construct and index the HBV reference graph, align sequencing data to the graph, and infer a consensus sequence can be accessed at https://github.com/dduchen/HBV_reference_graph_manuscript. . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 12, 2023. ; Figure 1: Alignment and depth of coverage across the HBV genome for simulated HBV sequencing data. On the left, points reflect the proportion of successfully aligned simulated reads, colored by either the genotype of the reference or whether graph-based alignment was performed. The Y axis reflects the proportion of successfully aligned reads. Labels indicate the reference sequence genotype or graph used in the alignment and the proportion of reads aligned. On the right, the X axis reflects genome position, with gene regions provided as colored bars along the base of the figure. C=core, P=polymerase, S=Surface, X=X. The top right panel reflects the average nucleotide diversity across the genome, with the Y axis reflecting the average pairwise nucleotide diversity (0-1). For the bottom right panel, the Y axis reflects depth of coverage across the HBV genome obtained when using reference sequences of different HBV genotypes, using the same genotype-specific color scheme as the left panel. Significant drops in coverage are indicated with the grey vertical lines.