Metagenomics Strain Resolution on Assembly Graphs

We introduce a novel bioinformatics pipeline, STrain Resolution ON assembly Graphs (STRONG), which identifies strains de novo when multiple metagenome samples from the same community are available. STRONG performs coassembly followed by binning into metagenome assembled genomes (MAGs), but uniquely it stores the coassembly graph prior to simplification of variants. This enables the subgraphs for individual single-copy core genes (SCGs) in each MAG to be extracted. It then threads reads from the samples back through these subgraphs to compute per-sample coverages for their unitigs. These graphs and their unitig coverages are used in a Bayesian algorithm, BayesPaths, that determines the number of strains present, their sequences (haplotypes) on the SCGs, and their abundances in each of the samples. Our approach avoids the ambiguities of read mapping and exploits more of the information on co-occurrence of variants in reads than if variants were treated independently, whilst at the same time exploiting the correlation of variants across samples that occurs when they are linked in the same strain. We compare STRONG to the current state of the art on synthetic communities and demonstrate that we can recover more strains, more accurately, and with a realistic estimate of uncertainty deriving from the variational Bayesian algorithm employed for the strain resolution. On a real anaerobic digester time series we obtained strain-resolved SCGs for over 300 MAGs that, for abundant community members, match those observed from long Nanopore reads.

There is a growing realisation that to fully understand microbial communities it is necessary to resolve them to the level of individual strains [35]. The strain is for many species the fundamental unit of microbiological diversity, because two strains of the same species can have very different functional roles. The classic example is E. coli, where one strain can be a dangerous pathogen and another a harmless commensal [24]. The best definition of a strain, and the only one that avoids ambiguity, is a set of clonal descendants of a single cell [15,39], but strain genomes by this definition can only reliably be determined by sequencing cultured isolates or single cells [30]. The former is not representative of the community and the latter is still too expensive and low-throughput for many applications, as well as producing only fragmentary genomes. For these reasons, there is a practical need for efficient methods that can profile microbial communities at high genomic resolution.

In contrast to 16S rRNA gene sequencing, shotgun metagenomics has the potential to resolve microbial communities to the strain level, because it generates reads from throughout the genomes of all the community members. It also has the additional advantages of reduced levels of bias and the capability to reconstruct genomes. There are many methods for reference-based strain resolution from metagenome data [1,35,42], but they are, and will continue to be, limited by the challenge of comprehensively isolating and sequencing the genomes of diverse microbial strains. Comprehensive reference genome databases may be possible for a few slowly evolving species or particularly well-studied pathogens, but for the entirety of a complex community they are unlikely ever to be tractable.
For example, in a recent de novo large-scale binning study of the relatively well-studied human gut microbiome, it was found that 77% of the species recovered did not have a reference genome in public databases [31]. This suggests that even less of the strain-level diversity in those samples would be represented in a genome database. These observations motivate the need for de novo methods of metagenomic strain resolution.

In the metagenomics context, we adopt the definition of a 'metagenome strain' as a clonal subpopulation with sufficiently low levels of recombination with other strains that it can be distinguished genetically from them. This does not require that recombination between strains does not occur, rather that, either because of physical separation or selection, it has not been sufficiently strong relative to the rate of mutation [40] to generate a continuum of diversity throughout the genome. This means members of a 'metagenome strain' may differ substantially from each other, particularly in rapidly evolving accessory regions, and the subpopulation as a whole may descend from multiple cells, but with a core genome that has descended from a single cell in the recent past. This is equivalent to the definition of 'lineage' in [29]. For ease, in the discussion below we will refer to strain in the metagenome context when properly we mean this looser definition of a strain as a genetically distinct subpopulation.

The challenge of insufficient coverage can be addressed by sequencing more deeply. More difficult to address is the problem of repeats. Just as they do in isolate genome sequencing, intra-genomic repeats such as the 16S rRNA operon will lead to uncertainty in metagenomic assemblies, but if multiple closely related strains from the same species are present then they will possess potentially large regions of shared sequence.
If the strain genomes are of divergence comparable to the reciprocal of the read length then very complex graphs will result; for typical short-read sequencing (75-150 bp) this corresponds to strains at around 98-99.5% sequence identity. The result is that it is not possible to find long paths in the graph that can be unambiguously assembled into long contiguous sequences, or contigs. For this reason metagenome assemblies for strain-diverse communities can comprise millions of contigs when made from short-read data, with the added drawback that in the metagenomics context we do not even know which contig derives from which species. For species that contain multiple very similar strains (>99.9% identity) we expect better assemblies, but the variants are then too far apart to be linked or phased by Illumina reads. In that case we may resolve the large-scale genome structure but not the sequences of the individual strains, which we will refer to as their haplotypes.

Metagenomic contig binning methods attempt to mitigate the problem introduced by standard metagenome sample processing approaches, wherein the origin of each sequence read is unknown. Contig binning works because contigs deriving from the same or similar genomes will share features that can be learnt without prior knowledge. These features can be sequence composition, but it is also possible to use per-sample coverage depths of contigs as a more powerful feature, if multiple samples are available from the same (or very similar) communities [2]. There are now numerous algorithms capable of using both coverage across samples and composition to automatically cluster contigs, and of determining from single-copy core gene (SCG) frequencies when the resulting bins are good-quality metagenome assembled genomes (MAGs) [3,13].
These tools enable genome bins to be extracted de novo from metagenomes, and are becoming crucial for studying unculturable organisms, contributing to many exciting discoveries, such as the description of the Candidate Phyla Radiation [9] or an improved understanding of the diversity of nitrogen fixers in the open ocean [14].

The resolution of genome binning, though, is limited by the resolution of the assembler. With a typical maximum kmer length of around 100, the best case is that we can resolve to about 1% sequence divergence, so that bins correspond to something between a species and a strain. In the presence of strain diversity, contigs that are shared across strains will become a consensus of the strains present; in the ideal situation their sequence would be that of the most abundant strain, but even this is not guaranteed. Contigs that are part of the accessory genome and present in a subset of strains may be successfully binned with the core genome, but they may not be if they are too short or divergent in coverage. Consequently, if multiple strains are present in the assembly, the MAGs that result from binning will be an imperfect composite of multiple strains.
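As an illustration of the two feature types described above, the sketch below (our own illustrative code, not part of CONCOCT or any published binner; all function names are ours) builds a contig feature vector from k-mer composition plus log-transformed per-sample coverage, so that contigs from the same genome sit close together in feature space:

```python
from collections import Counter
from itertools import product
import math

def composition_features(seq, k=4):
    """Normalised k-mer frequency vector in fixed lexicographic order."""
    kmers = [''.join(p) for p in product('ACGT', repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = max(sum(counts[m] for m in kmers), 1)
    return [counts[m] / total for m in kmers]

def contig_features(seq, coverages, pseudo=1.0):
    """Concatenate composition with log-transformed per-sample coverage."""
    return composition_features(seq) + [math.log(c + pseudo) for c in coverages]

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

With multiple samples, two contigs from the same genome share a coverage profile and therefore land closer together than a contig from a genome with a different abundance trajectory, which is the signal multi-sample binners exploit.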

Strains in a metagenome can exhibit variation in shared genes, such as insertions/deletions and single-nucleotide variants (SNVs), as well as in their accessory gene complements. Recently, we introduced DESMAN [32] to resolve subpopulations in MAGs using variant frequencies on contigs when multiple samples from a community are available. This is similar to contig binning using coverage, but it can be viewed as a relaxed form of clustering closer to non-negative matrix factorisation, because each variant can appear in more than one subpopulation haplotype. Similar strategies had been proposed prior to DESMAN but using variant frequencies on reference genomes, e.g. Lineages [29] and ConStrains [27]. DESMAN has limitations, however [19]; in particular, it treats every variant as independent, ignoring the co-occurrence of variants in reads, which is a powerful extra source of information when strain divergence is greater than the inverse of read length, when we would expect most reads to contain more than one variant. This issue can be addressed by keeping track of which variants appear in which reads, but that requires extra bookkeeping [18].
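The generative model underlying this relaxed, NMF-like clustering can be sketched as a simple forward model (our own illustrative code, not DESMAN itself): the frequency of a variant in a sample is the summed abundance of the strains carrying it, divided by the total strain abundance.

```python
def variant_freqs(haplotypes, abundances):
    """Forward model for variant frequencies across samples.

    haplotypes: one list per strain of 0/1 indicators over variant positions
    abundances: one list per strain of its abundance in each sample
    Returns a variants x samples matrix of expected variant frequencies.
    """
    n_var = len(haplotypes[0])
    n_samp = len(abundances[0])
    freqs = []
    for v in range(n_var):
        row = []
        for s in range(n_samp):
            total = sum(a[s] for a in abundances)
            # abundance of strains that carry variant v in sample s
            carrying = sum(a[s] for h, a in zip(haplotypes, abundances) if h[v])
            row.append(carrying / total)
        freqs.append(row)
    return freqs
```

Variants carried by the same strain share a frequency profile across samples; recovering the binary haplotype matrix and abundance matrix from the observed frequencies is the factorisation problem, and because a variant can belong to several strains the rows are not a hard partition.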

To address these limitations, we introduce a new method, STRONG (Strain Resolution ON Graphs), for analysing metagenome series when multiple samples are available, either from the same microbial community, e.g. longitudinal time series, or from cross-sectional studies where the communities are similar enough to share a significant fraction of strains. STRONG can determine the number of 'metagenome strains' in a MAG formed from binning of a coassembly of all the samples, together with their sequences across multiple single-copy core genes, which we define as the strain haplotype, and the coverages of each strain in each sample. STRONG avoids the limitations of the variant-based approaches by resolving haplotypes directly on assembly graphs using a novel variational Bayesian algorithm, BayesPaths.

This graph-based approach allows more complex variant structure and incorporates read information. The usefulness of graphs for understanding microbial strains has been noted before, and efficient algorithms have been developed for querying complex graphs and extracting more complete representatives of MAGs in the presence of strain diversity [10]. STRONG, however, is the first time that graphs have been used in an automated workflow to actually decompose that strain diversity into haplotypes across multiple genes using multiple samples. We compare STRONG to the current state of the art, DESMAN, on synthetic microbial communities and a real metagenome time series from an anaerobic digester. In the former case we validate using the known genome sequences, and for the latter we compare abundant MAGs with haplotypes derived independently from Oxford Nanopore MinION long reads.

STRONG pipeline
The detailed pipeline is described in the Methods but the key steps are summarised in Figure 1 and reiterated here. We start from multiple samples of the same community:

Step 1) Co-assembly with metaSPAdes and storage of a high-resolution graph (HRG).
Step 2) Contig binning with CONCOCT and annotation of single-copy core genes (SCGs).
Step 3) Mapping of SCGs onto the HRG and extraction of individual SCG assembly graphs together with per-sample unitig coverages.
Step 4) Joint solution of SCG assembly graphs from each MAG with BayesPaths to determine strain number, haplotypes and per-sample coverages.
In the final step, BayesPaths simultaneously solves for the number of strains present, their coverage in each sample, and their sequences on the SCGs. SCGs from the same MAG are linked through the binning process and jointly solved in the strain resolution procedure to generate linked strain-resolved sequences for each SCG. We will refer below to the SCG sequences for a given strain as its haplotype. The pipeline also applies DESMAN [32] to the same MAGs for comparative purposes, and will perform benchmarking if known genomes are available. It is important to note that some SCGs will be filtered during the BayesPaths procedure (see Methods), so that sequence inference is only performed on a subset in the final output.
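Step 3 above reduces, in essence, to a bookkeeping exercise: for each unitig of an SCG subgraph, accumulate the bases of the reads threaded onto it, per sample, and divide by the unitig length. A minimal sketch of that computation (our own illustrative code; the triple format is an assumption, not STRONG's actual data structures):

```python
def unitig_coverages(unitig_lengths, alignments):
    """Per-sample average coverage depth for each unitig.

    unitig_lengths: dict mapping unitig id -> length in bp
    alignments: iterable of (sample, unitig, aligned_bases) triples,
                one per threaded read fragment
    """
    cov = {u: {} for u in unitig_lengths}
    for sample, unitig, bases in alignments:
        cov[unitig][sample] = cov[unitig].get(sample, 0) + bases
    # average depth = total aligned bases / unitig length
    return {u: {s: b / unitig_lengths[u] for s, b in per.items()}
            for u, per in cov.items()}
```

The resulting unitigs-by-samples coverage matrix, together with the subgraph topology, is the input that a BayesPaths-style decomposition works on.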

Synthetic data sets

In order to provide an example metagenome data set with a known strain configuration for each species, we created a synthetic community comprising 100 strains with known genomes, deriving from 45 species: 20 species represented by a single strain, 10 with two strains, 5 with three, 5 with four and 5 with five strains. We then generated four data sets from this community with the same total number of reads (150 million 2×150 bp) but increasing sample numbers (3, 5, 10 and 15 samples). This configuration, where most species have a single strain, might be an appropriate approximation to the human gut microbiome [38]. We denote these data sets Synth S03, Synth S05, Synth S10 and Synth S15. For each sample number, random species abundances were generated from a log-normal distribution, with strain proportions from a Dirichlet. Full details of the synthetic sequence generation are given in the Methods.
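The abundance-generation scheme described above can be sketched as follows (a minimal illustration; the function name and default parameter values are ours, not the actual settings used in the Methods). It uses the standard construction of a symmetric Dirichlet as normalised Gamma(alpha, 1) draws:

```python
import random

def community_profile(n_species, strains_per_species, sigma=1.0, alpha=1.0, seed=42):
    """Draw species abundances from a log-normal distribution and split each
    species among its strains with Dirichlet(alpha) proportions."""
    rng = random.Random(seed)
    profile = []
    for s in range(n_species):
        species_ab = rng.lognormvariate(0.0, sigma)
        # Dirichlet via normalised independent Gamma(alpha, 1) draws
        gammas = [rng.gammavariate(alpha, 1.0)
                  for _ in range(strains_per_species[s])]
        total = sum(gammas)
        profile.append([species_ab * g / total for g in gammas])
    return profile
```

For the community described in the text, `strains_per_species` would be twenty 1s, ten 2s, five 3s, five 4s and five 5s, giving 100 strains over 45 species.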

The STRONG pipeline was then applied to each of these data sets in turn. In Figure 2 we illustrate the STRONG output for a single gene, COG0532 'Translation initiation factor IF-2' [37], from one MAG, Bin 55 of the ten-sample synthetic data set. For this gene the haplotypes were found without errors.

Figure 2. BayesPaths algorithm. This illustrates the BayesPaths algorithm for a single COG0532 from one MAG, Bin 55 of the ten sample synthetic data set. The algorithm predicted 3 strains. We show the input to the algorithm: A) the unitig coverages across samples plus B) the unitig graph without strain assignments. The outputs of the algorithm are shown in C) the assignments of haplotypes to each unitig, D) the strain intensities across samples, effectively coverage divided by read length (see Methods - BayesPaths), and E) unitig graphs for each haplotype with their most likely paths. This algorithm is explained in detail in the Methods - BayesPaths.

For each of the four synthetic data sets we considered only MAGs which were assigned to species (see Methods) with at least two strains: 20, 21, 24 and 22 MAGs from the Synth S03, Synth S05, Synth S10 and Synth S15 data sets respectively. For each MAG we mapped the predicted haplotypes for the optimal strain decomposition, for both the STRONG pipeline and the DESMAN algorithm, onto the known reference strains. We then assigned each haplotype prediction to its best matching reference. The best such match was denoted 'Found'. If multiple predicted haplotypes matched to the same reference they were denoted as 'Repeated'. If a reference had no haplotype prediction that matched to it better than the other references, it was denoted as 'Not found'. For the aggregate across these MAGs we show the total number of such strains for each of the four data sets in Figure 3.

The per base error rate of the 'Found' strains averaged across all MAGs and all data sets was just 0.052%, three times lower than that for DESMAN, 0.176%. This improvement was observed for all four data sets (see Table 1 and Figure 4). STRONG was more likely to predict the correct number of strains, doing so for 73% of MAGs summed across sample numbers versus 60% for DESMAN. It was also better at predicting the strain relative abundances, as confirmed by regressing true abundance against predicted abundance.

Figure 3. No. of strains resolved by STRONG and DESMAN algorithms in the synthetic community data sets. For MAGs with two or more strains we mapped haplotypes to the references and assigned each predicted haplotype to its best matching reference. The best such match was denoted 'Found'. If multiple haplotypes matched to the same reference they were denoted as 'Repeated'. If a reference had no predicted haplotypes matched to it, it was denoted as 'Not found'. The bars give the total numbers in each category summed over MAGs for the two methods (DESMAN and STRONG) and the panels give results for the four different data sets with increasing number of samples (Synth S03, Synth S05, Synth S10 and Synth S15).

Figure 4. Error rates in strains found against coverage depth for STRONG and DESMAN algorithms in the synthetic community data sets. For the 'Found' strains we computed the per base error rate to the matched reference, shown on the y-axis, against strain total coverage depth summed across samples on the x-axis; both axes are log transformed. The results are separated across methods (DESMAN and STRONG) and sample number in the synthetic community.

Some of the 'Not found' strains, 7 out of 63, were below the minimum coverage of detected strains (5.68), but most were not, suggesting that they were not sufficiently divergent in nucleotide sequence or coverage profile to be detected. Examination of phylogenetic trees for the haplotypes and reference genomes constructed using the SCGs revealed that in many cases 'Not found' strains had identical SCG haplotypes to those that were resolved.
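The Found/Repeated/Not found bookkeeping can be sketched as follows (our own illustrative code; the actual evaluation uses alignment-based matching of haplotypes to references, whereas this toy version assumes pre-aligned equal-length sequences and a simple positional mismatch rate):

```python
def classify_strains(predicted, references):
    """Assign each predicted haplotype to its closest reference by per-base
    mismatch rate, then label each reference Found / Repeated / Not found."""
    def error_rate(a, b):
        return sum(x != y for x, y in zip(a, b)) / len(a)

    hits = {}  # reference index -> list of (error rate, predicted index)
    for i, pred in enumerate(predicted):
        errs = [error_rate(pred, ref) for ref in references]
        best = min(range(len(references)), key=errs.__getitem__)
        hits.setdefault(best, []).append((errs[best], i))

    status = {}
    for j in range(len(references)):
        matched = sorted(hits.get(j, []))
        if not matched:
            status[j] = ('Not found', None)
        elif len(matched) == 1:
            status[j] = ('Found', matched[0][0])
        else:  # several predictions collapsed onto one reference
            status[j] = ('Repeated', matched[0][0])
    return status
```

Averaging the error rates of the 'Found' entries over MAGs and data sets gives the per base error rate statistic reported in the text.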

The BayesPaths algorithm used to resolve strains in STRONG uses variational inference (see Methods - BayesPaths), an approximate Bayesian strategy [7]. This has the advantage of providing estimates of uncertainty in the inferred haplotypes and their abundances; on the synthetic data sets, the predicted divergence of each strain tracked its actual error rate (Figure S1). Thus the divergence is a useful prediction of uncertainty in the haplotype sequence inference, enabling us to estimate error rates in real data sets in the absence of known reference sequences. Roughly speaking, the expected per base error rate is 0.01 times the divergence, so that a strain divergence of 0.1 predicts a 0.1% error rate. In real data sets, the uncertainty estimates in the abundances are also useful, placing bounds on the abundance of individual strains in each sample.
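The rule of thumb above converts directly into expected error counts; a trivial sketch (the function name is ours, and the 0.01 factor is the rough empirical calibration quoted in the text, not an exact constant):

```python
def expected_errors(divergence, gene_length_bp):
    """Expected per base error rate and expected number of erroneous bases
    for a strain haplotype, using the rough rule: rate ~= 0.01 * divergence."""
    rate = 0.01 * divergence
    return rate, rate * gene_length_bp
```

A divergence of 0.1 over a 2,000 bp gene therefore predicts a 0.1% error rate, i.e. about two erroneous bases.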

In Table S3 we give approximate run times for each component of the STRONG pipeline on the synthetic community data sets, using 64 threads on a standard bioinformatics server. The BayesPaths step is the most time consuming part of the analysis (up to 36 hours), but it is still comparable to the initial coassembly. The only part of the pipeline with substantial memory requirements is the initial coassembly with metaSPAdes; the other steps are CPU limited.

Anaerobic digester time series

We next applied the STRONG pipeline to a real metagenomics time series, comprising ten samples taken at approximately five-week intervals from an industrial anaerobic digestion reactor (see Table S4 and Methods for details). This provides an evaluation community of intermediate complexity to test the pipeline's capability to resolve strains and reconstruct intraspecies dynamics. Each sample was sequenced on the NovaSeq platform with 2×150 bp reads at a mean depth of 11.63 Gbp. One sample was also run on a Nanopore MinION flow cell, producing 43.78 Gbp of reads with a read N50 of 6,727 bp and a maximum length of 108 kbp. CONCOCT binning produced 905 bins, of which 309 had 75% of SCGs present in single copy; these we designate MAGs. In total 11 of these MAGs exhibited overlapping SCG graphs and were merged into 6 composite MAGs (see Methods - STRONG Pipeline), so that 304 MAGs were actually used in the strain decomposition. We calculated coverage depth per sample for each bin and then normalised by sample size to obtain a community profile at each time point. Overall the reactor exhibited a clear shift in community structure over time, despite consistent operating conditions, with sample time explaining 48% of the variation in community structure (p = 0.001; Figure S2). Of the MAGs, 110 had an abundance that changed significantly over time (Bonferroni-adjusted p-value < 0.05 from Pearson's correlation of log-transformed normalised abundance) and these were evenly split between those that increased (55) or decreased in abundance (55).

We used the STRONG algorithm to resolve strains in the 304 MAGs. This is a complex data set and running the complete pipeline took over 16 days, of which roughly 60% of the time was spent on the BayesPaths strain resolution (see Table S3).
The number of strains found varied between 1 and 7, with a mean of 1.7, shown as a function of coverage depth in Figure 5. In total 121 (39.8%) of these MAGs had more than one strain, and there was a significant positive association between MAG coverage depth and number of strains (r = 0.36, p = 1.004e-10), which is expected, as low-coverage MAGs will be under-sampled. This correlation disappears, though, when we restrict to MAGs with a coverage greater than thirty (r = 0.19, p = 0.1023). On average 20.9 SCGs were used after filtering for strain haplotype predictions.

Figure 6. MAG summary for anaerobic digester time series. For the 114 MAGs with aggregate coverage > 20 we give their phylogeny constructed using concatenated marker genes together with their normalised coverages in the ten samples. We also indicate which MAGs significantly increased (SigUp) or decreased (SigDown) in total abundance (adjusted p < 0.05), their GTDB phylum assignment, the no. of strains resolved by STRONG and whether the strain abundances changed significantly over time (adjusted p < 0.05) using permutation ANOVA (SigStrainChange).
For the 108 MAGs that had at least two strains with relative frequencies determined in five or more samples, we used permutation ANOVA to determine whether strain proportions depended on sampling time. In total 13 of the MAGs (12.0%) had an adjusted p-value < 0.05. Of these same 108 MAGs, 37 (34.2%) had a total coverage that changed significantly over time with an adjusted p-value < 0.05. The intra-species dynamics are therefore more stable than the inter-species dynamics, with strain proportions remaining fixed as the MAG coverages vary; this was true for 33 of the 37 MAGs that changed significantly in coverage.
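The permutation logic can be sketched for a single strain (our own simplified version; the pipeline's permutation ANOVA tests all of a MAG's strain proportions jointly): shuffle the time labels and ask how often a correlation as strong as the observed one arises by chance.

```python
import random

def permutation_pvalue(times, proportions, n_perm=999, seed=0):
    """Permutation test for association between one strain's proportion and
    sampling time. Statistic: absolute Pearson correlation."""
    def corr(x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        vx = sum((a - mx) ** 2 for a in x) ** 0.5
        vy = sum((b - my) ** 2 for b in y) ** 0.5
        return cov / (vx * vy) if vx and vy else 0.0

    rng = random.Random(seed)
    observed = abs(corr(times, proportions))
    hits = 0
    for _ in range(n_perm):
        shuffled = proportions[:]
        rng.shuffle(shuffled)  # break any time association
        if abs(corr(times, shuffled)) >= observed:
            hits += 1
    # add-one correction keeps the p-value away from exactly zero
    return (hits + 1) / (n_perm + 1)
```

A strain whose proportion trends strongly with time yields a small p-value, which would then be multiplicity-adjusted across MAGs as in the text.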

In Figure 6 we use the Anvi'o program [16] to summarise information on phylogeny, taxonomy, normalised coverages in the ten samples, and whether the MAGs changed significantly in total abundance, together with the number of strains resolved by STRONG and whether those strain relative proportions changed significantly with time. This was restricted to just those 114 MAGs with an aggregate coverage greater than twenty, to simplify the diagram.

The Nanopore sequencing provides us with a means to directly test the validity of the STRONG haplotype reconstructions, at least for the most abundant MAGs. The most abundant MAG was Bin 72, classified using GTDB taxonomy [11]. Interestingly, this is an example of a MAG which changes significantly in abundance, decreasing over time (adjusted p = 4.9e-05), but where the proportions of the three strains predicted varied less dramatically (R² = 0.35, adjusted p = 0.089); see Figure S5. We will focus on the longest SCG for which strains were resolved, COG0532, where the three strains are present in only two variants, haplotypes 0 and 2 being identical on this core gene. In Figure S4 we give the short-read variant graph for this gene, which in this case is mostly simple bubbles, together with the assigned haplotypes. In fact, across the 18 SCGs used to decompose strains, haplotypes 0 and 1 were most similar, with 99.7% nucleotide identity. These two strains had 99.4% and 99.1% identity with haplotype 2 respectively. That this pattern was not observed on COG0532 may suggest some recombination in the evolution of these organisms.

Figure 7. Comparison of Nanopore reads to STRONG prediction for COG0532 from Bin 72. Non-metric multidimensional scaling of Nanopore reads that mapped to COG0532 from Bin 72 of the anaerobic digester time series (red) together with the three haplotypes reconstructed from short reads by STRONG (black; 0, 1 and 2).

In Figure 7 we show, for COG0532, both the Nanopore reads that map to this gene and the three haplotypes inferred by BayesPaths, as a Non-metric Multi-dimensional Scaling (NMDS) plot using fractional Hamming distances on the short-read variant positions. These are defined as the Hamming distance between two reads, but only on the intersecting variant positions and ignoring gaps. We then normalise by the number of such non-gap intersecting positions to give a distance between 0 and 1. The Nanopore reads are consistent with the inference of two distinct variants on this gene.

In order to provide a quantitative comparison of the Nanopore reads and the STRONG predictions, we applied the EM algorithm defined in the Methods (Nanopore Sequence Analysis) to the 1,603 Nanopore reads mapping to this COG (cluster of orthologous groups). Examining the negative log-likelihood as a function of number of strains, it flattens at two strains (see Figure S3), and the two strains inferred exactly match (100% identity over 2,313 bp) haplotypes 0/2 and 1 respectively. Furthermore, STRONG in this sample predicted a frequency of 28.0% for haplotype 1. This closely matched the Nanopore haplotype frequency for this strain of 27.6%. We also ran the Nanopore EM algorithm for all 18 filtered COGs in this bin separately. For the 11 COGs where more than one strain was predicted from the Nanopore reads, we compared the STRONG and Nanopore predictions. For haplotypes 0, 1 and 2, exact matches were found for 6, 7 and 4 SCGs respectively, with average nucleotide identities across all genes of 99.89%, 99.89% and 99.82%.
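The fractional Hamming distance defined above is straightforward to implement; a minimal sketch (our own code, using '-' to mark a gap or a variant position not covered by the read):

```python
def fractional_hamming(a, b, gap='-'):
    """Distance in [0, 1] over variant positions where neither sequence has a
    gap; per the text's definition, gapped positions are excluded from both
    the mismatch count and the normalisation."""
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        return 0.0  # no comparable positions; convention choice
    return sum(x != y for x, y in pairs) / len(pairs)
```

Feeding the resulting pairwise distance matrix to an NMDS routine yields an ordination like Figure 7, with reads clustering around the haplotypes they derive from.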

For lower coverage MAGs we generally obtain a reasonable correspondence between the STRONG haplotypes and the Nanopore predictions. In most cases the number of strains is comparable between the two, although the accuracy of matches reduces with decreased Nanopore read counts, as we might expect. As an example, in Figure S7 we compare Nanopore reads with the five STRONG haplotypes from COG0072 of Bin 846, a Firmicutes MAG in the AD time series. The most abundant Nanopore mode clearly matches STRONG haplotype 4, the most abundant strain in this sample, and there is also some support for haplotypes 0 and 2.

There is less evidence for strains 1 and 3, but these are at low abundance in this sample (see Figure S9). This is confirmed by the EM algorithm applied to the Nanopore reads matching this gene, from which we would predict four Nanopore haplotypes (Figure S6). Comparing these four Nanopore strains to the STRONG predictions we find that three closely match: Nanopore haplotype 0 matched best to STRONG strain 4 with 98.8% nucleotide identity, Nanopore haplotype 1 to STRONG 4 with 99.9% identity and Nanopore haplotype 2 to STRONG strain 0 with 99.7% identity. There is also a correspondence in relative abundance, with the most abundant Nanopore haplotype 1 recruiting 82% of the reads, versus 74% relative frequency for the corresponding strain haplotype from STRONG.
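The per-gene EM used for these Nanopore comparisons can be sketched as follows (our own simplified version: reads are reduced to their variant positions and scored under an independent per-position error model with a fixed error rate, which is an assumption; the Methods' actual model may differ):

```python
import math

def em_frequencies(reads, haplotypes, eps=0.1, n_iter=50):
    """EM estimate of haplotype frequencies from long reads.

    reads, haplotypes: equal-length strings over variant positions;
    '-' in a read marks an uncovered position and is ignored.
    """
    K = len(haplotypes)
    freqs = [1.0 / K] * K

    def loglik(read, hap):
        # independent per-position errors at rate eps (3 possible wrong bases)
        ll = 0.0
        for r, h in zip(read, hap):
            if r == '-':
                continue
            ll += math.log(1 - eps) if r == h else math.log(eps / 3)
        return ll

    for _ in range(n_iter):
        resp_sums = [0.0] * K
        for read in reads:
            # E-step: responsibility of each haplotype for this read
            w = [freqs[k] * math.exp(loglik(read, haplotypes[k]))
                 for k in range(K)]
            tot = sum(w) or 1.0
            for k in range(K):
                resp_sums[k] += w[k] / tot
        # M-step: frequencies are mean responsibilities
        freqs = [s / len(reads) for s in resp_sums]
    return freqs
```

Model selection over the number of strains, as in Figure S3, would rerun this for increasing K and look for the point where the negative log-likelihood flattens.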

Discussion
We have demonstrated that on synthetic data the STRONG pipeline and the BayesPaths algorithm are able to accurately infer strain sequences on the SCGs, and their abundances across samples. Performance does improve with increasing sample number in terms of the number of strains resolved, but reassuringly, even when only a small number of samples are available we are still able to predict strains accurately (with a 0.068% per base error rate), and when ten or more samples are available we obtain error rates below 0.05%, i.e. 1 error in every 2,000 bp, from short-read data. This is better performance than the state of the art, and sufficiently accurate for high-resolution phylogenetics. Strains are resolved more accurately as they increase in coverage (see Figure 4), and in fact, when coverages exceed twenty-fold we can resolve strains very reliably, with just a 0.011% error rate averaged across strains in the ten-sample synthetic data set. We believe therefore that this pipeline will be useful whenever high-quality de novo strains are required from metagenome short-read time series.

This is, to our knowledge, the first algorithm capable of constructing strains from metagenomes using assembly graphs from multi-sample coassemblies. Graph-based haplotype resolution has been applied to viruses [4] and to eukaryotic transcripts [5,6], but ours is the first algorithm to resolve strains across multiple gene subgraphs connected through a contig binning procedure. The BayesPaths algorithm is also a substantial algorithmic advance, enabling coverage across multiple samples to be incorporated into a rigorous Bayesian procedure that gives uncertainties in both the paths (i.e. the sequences) and the strain abundances. This algorithm could be utilised outside of the STRONG pipeline in other application areas, for example for finding viral haplotypes.

In addition to the new strain resolution algorithm, BayesPaths, STRONG incorporates a number of useful tools for large-scale variant graph processing, in particular the tools for extraction of subgraphs that correspond to individual coding genes and the spades-gsimplifier tool for error correction on those graphs. These can be applied to any graph in the GFA format, and could therefore find applicability outside of the context of our pipeline. This also means that in the future we could add alternative choices for the coassembly step, for instance replacing metaSPAdes with MEGAHIT [25]. Similarly, we plan to add support for alternative binners to CONCOCT.

Currently, we are restricted to core genes that are single-copy and shared across all strains in a MAG. We can in theory use any such genes, so if a particular MAG is of interest the pipeline could be run with a larger set of COGs that are SCGs for that MAG. There would be a cost in running time, which grows roughly linearly with the number of genes and unitigs.

The analysis of a time series from an anaerobic digester illustrates the practicality of our pipeline on a realistically sized data set. We should note, though, that to resolve strains on these 304 MAGs took nearly 10 days using 64 threads on a standard bioinformatics server (see Table S3). The AD analysis also demonstrates the importance of strain dynamics in a real microbial community, with nearly 40% of MAGs exhibiting strain variation, although this variation was relatively stable compared to the MAG dynamics themselves. If strains were functionally redundant to one another we would expect significant neutral fluctuations over time, so this stability could be evidence for intra-species niche partitioning.

In general, we found a good correspondence between haplotypes inferred from Nanopore reads and the STRONG predictions in the AD data set. For the most abundant MAG, Bin 72, they matched very closely. In addition, the relative abundances of strains were consistent across the two sequencing technologies, despite the use of different DNA extraction protocols and the different biases inherent in library preparation and sequencing platforms. These technical elements in the data generation process are known to introduce bias at the species level [12], but our findings suggest that intraspecies abundance may generally be robust against such biases, which makes sense in that all the strains of a species will have similar physical properties and genomic traits.

STRONG is an effective strategy to resolve subpopulations de novo at high phylogenetic resolution within MAGs, but as discussed in the Introduction, it is important to add the caveat that the haplotype sequences obtained are not equivalent to those from sequencing cultured isolates, where we can identify the resulting genome with a single organism present in the original community. The metagenome strains, in the best case, will correspond to different modal sequences of the target species, about which substantial unresolved variation may exist. They will correspond to peaks in the probability distribution of all possible sequence configurations, and as such will provide important insights into the naturally occurring variation, but there remains the question of how to identify and quantify the unresolved variation surrounding those peaks. In the worst case, when STRONG is applied to rapidly recombining microbes, such as those found in the oceans [30], the resulting sequences may not even be real in the sense of characterising any true individual. An additional unaddressed question is how to determine when this has occurred; for now we would simply urge caution when using STRONG in cross-sectional studies of rapidly evolving microbes, and suggest that the term 'metagenome strain' or 'metagenome haplotype' be used when referring to the output sequences. The same caveat does of course apply to any current purely bioinformatic strategy for de novo resolution of genomes from metagenomes. Even if a single sample is used for binning and there are no subpopulations, the resulting MAG is still a composite and not a strain in the traditional microbiological sense [39].

An obvious extension of our algorithm would be to resolve the accessory genome into strain genomes.
This could be done on a per gene basis by relaxing the requirement that every strain passes through every gene, but an approach that incorporates the path structure in the full metagenomic assembly would be more powerful. Use of the full assembly may be possible in an efficient manner by factorising the variational approximation on a per gene basis and allowing the solutions for one gene to depend on the expectations across their neighbours. Or it may be that more computationally tractable versions of the algorithm can be developed that will scale to larger graphs. In any case, the issues discussed above of our inferred 'strains' containing unresolved variation will become more pertinent when we extend our algorithm to the full genome, and it will be necessary to consider not just the most likely genome associated with a subpopulation but also its variants.

In the future we also plan to directly incorporate long read information into the strain resolution rather than just using it for validation. It was encouraging, therefore, to see the correspondence in strain frequencies between the two approaches. We are confident that in the near future, through the combination of long reads with methods similar to those we have introduced in STRONG, complete metagenome de novo strain resolution will become a realistic possibility.

We have introduced a complete bioinformatics pipeline, STrain Resolution ON assembly Graphs (STRONG), that is capable of extracting single-copy core gene variant graphs from short read metagenome coassemblies for individual metagenome assembled genomes (MAGs). We demonstrated how these graphs and associated per-sample unitig coverages can be used in a novel Bayesian algorithm, BayesPaths, to find MAG strain number, haplotypes and abundances. This approach achieves superior accuracy to variant-based methods on synthetic communities, and its predictions on real data match those from long Nanopore reads.

Synthetic data set generation

The in silico synthetic communities were generated by first downloading a list of complete bacterial genomes from the NCBI and selecting species with multiple strains present. Genomes were restricted to those that were full genome projects, possessed at least 35 of the 36 single-copy core genes (SCGs) identified in [3], and with relatively few contigs (< 5) in the assemblies. Communities were created by specifying species from this list and the number of strains desired. The strains were then chosen at random from the candidates for each species, with the extra restriction that all strains in a species were at least 0.05% and no more than 5% nucleotide divergent on the SCGs from any other strain in the species. This corresponds to a minimum divergence of approximately 15 nucleotides over the roughly 30 kbp region formed by summing the SCGs. The genomes used are given in Tables S1 and S2.

Each species, indexed i, was then given an abundance, y_{i,s}, in each sample, s = 1, . . . , S, drawn from a lognormal distribution with a species-dependent mean and standard deviation, themselves drawn from a normal and a gamma distribution respectively:

y_{i,s} ~ Lognormal(μ_i, σ_i)

and:

μ_i ~ Normal(μ_p, σ_p),  σ_i ~ Gamma(k_p, θ_p)

For all four community configurations (S equal to 3, 5, 10 and 15) we used μ_p = 1, σ_p = 0.125, k_p = 1 and θ_p = 1. The species abundances were then normalised to one (y_{i,s} → y_{i,s} / Σ_i y_{i,s}). For each strain g within a species, its proportion ρ_{g,s} in each sample was then drawn from a symmetric Dirichlet:

ρ_{·,s} ~ Dirichlet(α)

with α = 1.

This allowed us to specify a copy number for each genome g in species i in each sample as y_{i,s} ρ_{g,s}. We then generated 150 million paired-end 2x150 bp reads in total across all samples with Illumina HiSeq error distributions using the ART read simulator [21]. The code for the synthetic community generation is available from https://github.com/chrisquince/STRONG_Sim.
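As an illustration, the abundance model above can be sketched in Python with NumPy. Function and argument names are ours (not from STRONG_Sim); the hyper-parameter defaults follow the text.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_abundances(n_species, strains_per_species, n_samples,
                        mu_p=1.0, sigma_p=0.125, k_p=1.0, theta_p=1.0,
                        alpha=1.0):
    """Sketch of the synthetic abundance model described above.

    Species means and standard deviations are drawn from Normal/Gamma
    hyper-priors, per-sample species abundances from a lognormal, and
    within-species strain proportions from a symmetric Dirichlet.
    """
    mu = rng.normal(mu_p, sigma_p, size=n_species)
    sigma = rng.gamma(k_p, theta_p, size=n_species)
    # species abundance y[i, s], then normalised per sample
    y = rng.lognormal(mu[:, None], sigma[:, None],
                      size=(n_species, n_samples))
    y /= y.sum(axis=0)
    # strain proportions rho[g, s] for each species, Dirichlet(alpha)
    rho = [rng.dirichlet(np.full(g, alpha), size=n_samples).T
           for g in strains_per_species]
    return y, rho
```

The genome copy number for strain g of species i in sample s is then simply `y[i, s] * rho[i][g, s]`, as in the text.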

Synthetic data set evaluation

We can determine which contig derived from which reference genome by considering the simulated reads that map onto it. We know which reference each of these reads came from, enabling us to assign a contig to the genome from which the majority of its reads derive. We can then assign each MAG generated by STRONG to a reference species as the one from which the majority of its contigs derive, weighted by contig length.

Anaerobic digester sampling and sequencing

AD sample collection

We obtained ten samples from a farm anaerobic digestion bioreactor across a period of approximately one year. The sampling times, metadata and accession numbers are given in Table S4. The reactor was fed on a mixture of slurry, whey and crop residues, and operated between 35-40 °C, with mechanical stirring. Biomass samples were taken directly from the AD reactor by the facility operators and shipped in ice-cooled containers to the University of Warwick. Upon receipt, they were stored at 4 °C and then sampled into several 1-5 mL aliquots within a few days. DNA was usually extracted from these aliquots immediately but some were first stored in a -80 °C freezer until subsequent thawing and extraction.

STRONG processes co-assembly graph regions for multiple metagenomic datasets in order to simultaneously infer the composition of closely related strains for a particular MAG and their core gene sequences. Here, we provide an overview of STRONG. We start from a series of S related metagenomic samples, e.g. samples of the same (or highly similar) microbial community taken at different time points or from different locations.

The Snakemake-based pipeline begins with the recovery of metagenome-assembled genomes (MAGs) [22]. We perform co-assembly of all available data with the metaSPAdes assembler [28], and then bin the contigs obtained by composition and coverage profiles across all available samples with CONCOCT [3]. Each bin is then analyzed for completeness and contamination based on single-copy core genes, and poor quality bins are discarded. The default criterion is that a MAG requires greater than or equal to 75% of the SCGs in a single copy. While we currently focus on this combination of software tools, in principle we could use any other software or pipeline for MAG recovery, e.g. we could use MEGAHIT as the primary assembler [25] or alternative binning tools or their combination. For each MAG we then extract the full or partial sequences of the core genes, which we further refer to as single-copy core gene (SCG) sequences.
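The default MAG quality criterion can be sketched as a simple filter. This is a hypothetical helper, not the pipeline's actual code; the 36-SCG set and the 75% threshold follow the text.

```python
def is_mag(scg_counts, total_scgs=36, threshold=0.75):
    """Apply the default MAG quality criterion described above:
    at least 75% of the single-copy core genes present in exactly
    one copy. scg_counts maps SCG name -> copy number in the bin.
    """
    # duplicated SCGs (contamination) do not count towards the quota
    single_copy = sum(1 for c in scg_counts.values() if c == 1)
    return single_copy >= threshold * total_scgs
```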

The final coassembly graph produced by metaSPAdes cannot be used for strain resolution because, as with other modern assembly pipelines, variants between closely related strains will be removed during the graph simplification process. Instead, we output the initial graph for the final k-mer length used in the (potentially) iterative assembly, following processing by a custom executable (spades-gsimplifier, based on the SPAdes codebase) to remove likely erroneous edges using a 'mild' coverage threshold and a tip clipping procedure. We refer to the resulting graph as the high-resolution assembly graph or HRAG.

The graph edges are then annotated with their corresponding sequence coverage profiles across all available samples. As is typical in de Bruijn graph analysis, the coverage values are given in terms of k-mer rather than nucleotide coverage. Profile computation is performed by a second tool that aligns reads onto the HRAG: unitig-coverage. The potential advantage of this approach, in comparison to estimation based on k-mer multiplicity, is that it can correctly handle the results of any bubble removal procedure that we might want to add to the preliminary simplification phase in future.

For each detected SCG sequence (across all MAGs) we next try to identify the subgraph of the HRAG encoding the complete sequences of all variants of the gene across all strains represented by the MAG. The procedure is described in more detail in the next section. During testing we faced two types of problem here: (1) related strains might end up in different MAGs and (2) some subgraphs might consist of fragments corresponding to several different species. We take several steps to mitigate those problems. Firstly, we compare SCG graphs between all bins, not just MAGs. If an SCG graph shares unitigs between bins, then it is flagged as overlapping. If many SCG graphs (> 10) overlap between MAGs, then we merge those MAGs, combining all graphs and processing them for strains together. Following merging, any MAG SCG graphs with remaining overlaps are filtered out and not used in the strain resolution.
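The merging rule can be illustrated with a small sketch. Names and data layout are ours, not the pipeline's; the > 10 overlap threshold follows the text.

```python
from collections import defaultdict

def find_mags_to_merge(scg_unitigs, min_overlaps=10):
    """Find pairs of bins to merge, as described above.

    scg_unitigs maps (bin, cog) -> set of unitig ids in that SCG
    subgraph. Two SCG graphs overlap if they share any unitig; a
    pair of bins is merged when more than min_overlaps of their
    SCG graphs overlap.
    """
    overlaps = defaultdict(int)
    keys = list(scg_unitigs)
    for i, (bin_a, cog_a) in enumerate(keys):
        for bin_b, cog_b in keys[i + 1:]:
            if bin_a != bin_b and \
                    scg_unitigs[(bin_a, cog_a)] & scg_unitigs[(bin_b, cog_b)]:
                overlaps[frozenset((bin_a, bin_b))] += 1
    return [pair for pair, n in overlaps.items() if n > min_overlaps]
```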

After MAG merging and COG subgraph filtering we process the remaining MAGs one by one. Before the core 'decomposition' procedure is launched on the set of SCG subgraphs corresponding to a particular MAG, they are subjected to a second round of simplification, parameterised by the mean coverage of the MAG, to filter nodes that are likely to be noise, again using the spades-gsimplifier program. This module is described in more detail below. The resulting set of simplified SCG subgraphs of the HRAG for a MAG is then passed to the core graph decomposition procedure, which uses the graph structure constraints, along with the coverage profiles associated with unitig nodes, to simultaneously predict: the number of strains making up the population represented by the MAG; their coverage depths across the samples; and the paths corresponding to each strain within every subgraph (each path encodes a sequence of the particular SCG instance).

A fraction of the SCGs in a MAG may properly derive from other organisms due to the possibility of incorrect binning, i.e. contamination. In fact, the default 75% single-copy criterion allows up to 25% contamination. In addition, the subgraph extraction is not always perfect. We therefore add an extra level of filtering to the BayesPaths algorithm, iteratively running the program for all SCGs, but then filtering those with mean error rates, defined by Equation 18, that exceed a default of 2.5 times the median deviation from the median gene error. Filtering on the median deviation is in general a robust strategy for identifying outliers. As a result of this filtering the pipeline only infers strain sequences on a subset of the input SCGs. We have found, however, that the number of SCGs for which strain haplotypes are inferred is sufficient for phylogenetics.
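The median-deviation filter can be sketched as follows. This is an illustrative helper rather than the actual BayesPaths code; the 2.5 factor is the stated default.

```python
from statistics import median

def filter_outlier_scgs(gene_errors, factor=2.5):
    """Keep SCGs whose mean error rate is not an outlier, using the
    median-deviation rule described above: a gene is dropped if its
    deviation from the median gene error exceeds `factor` times the
    median absolute deviation. gene_errors maps gene -> error rate.
    """
    med = median(gene_errors.values())
    mad = median(abs(e - med) for e in gene_errors.values())
    return {g for g, e in gene_errors.items() if abs(e - med) <= factor * mad}
```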

Relevant subgraph extraction
The procedure is provided with the predicted (partial) gene sequence, T̂, and an upper bound on the length of the coding sequence, L = 3αŪ_n, where Ū_n is the average length in amino acids of that SCG in the COG database and α = 1.5 by default. The procedure for relevant HRAG subgraph extraction involves the following steps. First, the sequence T̂ is split into two halves, T' and T'', keeping the correct frame (the lengths of both T' and T'' are forced to be divisible by 3). T' and T'' are then processed independently. Without loss of generality we describe the processing of T':

1. Identify the path P corresponding to T' in the HRAG. We denote its length as L_P.

¹Due to the properties of the procedure and the fact that it deals with DBGs, the actual implementation encodes frame state as an integer [0,2] rather than the string of the last partially traversed codon.
²Actually, Dijkstra runs are initiated from the ends/starts of the corresponding edges and the distances are later corrected.

6. After the sets of graph edges potentially encoding the gene sequence are gathered for T' and T'', the union of the two sets, ER, is taken and augmented by the edges connecting the 'inner' ER vertices (vertices which have at least one outgoing and at least one incoming edge in ER) to the rest of the graph.

Initial splitting of T̂ into T' and T'' is required to detect relevant stop codons which are not reachable from the last position of T̂ in the HRAG (or from which the first position of T̂ in the HRAG cannot be reached). In addition to the resulting component in GFA format, we also store the positions of the putative stop codons, and the ids of edges connecting the component to the rest of the graph (further referred to as 'sinks' and 'sources').

Subgraph simplification

While processing SCG subgraphs from a particular MAG we use the available information on the coverage of the MAG in the dataset. In particular, we set up the simplification module to remove tips (nodes with no successors) below a certain length, and edges with coverage below a fraction of the total coverage across all samples. If a tip is not removed it is labelled as a 'dead-end' to distinguish it from potential connections to the rest of the graph.
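A minimal sketch of the tip-removal step, assuming the subgraph is held as plain adjacency dictionaries. This is illustrative only, not the spades-gsimplifier implementation.

```python
def remove_short_tips(successors, predecessors, lengths, max_tip_len):
    """Remove tips (nodes with no successors) shorter than max_tip_len,
    mirroring the simplification step described above. Longer tips
    would instead be labelled 'dead-end'. The graph is given as
    successor/predecessor adjacency dicts plus node lengths.
    """
    removed = set()
    for node in list(lengths):
        if not successors.get(node) and lengths[node] < max_tip_len:
            removed.add(node)
            # detach the tip from each of its predecessors
            for p in predecessors.get(node, []):
                successors[p] = [v for v in successors[p] if v != node]
    return removed
```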

While simplifying a COG subgraph, edges connecting it to the rest of the assembly graph should be handled with care (in particular, they should be excluded from the set of potential tips). This is because in the BayesPaths algorithm they form potential sources and sinks of the possible haplotype paths. Moreover, during the simplification the graph changes, and such connecting edges need to be kept track of.

We then assume normally distributed counts for each node in each sample, giving a joint density for observations and latent variables, in which [·] indicates the Iverson bracket, evaluating to 1 if the condition is true and zero otherwise. We assume an exponential prior for the γ_{g,s} with a rate parameter that is strain dependent, such that:

P(γ_{g,s} | λ_g) = λ_g exp(−γ_{g,s} λ_g)

We then place gamma hyper-priors on the λ_g. This acts as a form of automatic relevance determination (ARD), forcing strains with low intensity across all samples to zero in every sample [8]. We use a standard Gamma prior for the precision. For the biases θ_v we use a truncated normal prior, in which Ψ is the standard normal cumulative distribution. The mean of this prior is set to one, μ_0 = 1, so that a priori the coverage on any given node is unbiased, with a fairly high precision, τ_0 = 100, to reflect the assumption that the observed coverage should be the summation over the strains. Finally, we assume a uniform prior over the possible discrete values of the η^g_{v,u}. If the assembly graph is a directed acyclic graph (DAG) then η^g_{v,u} ∈ {0, 1}. We have found that for most genes and typical k-mer lengths this is true, but we do not need to assume it.

Our starting point for inference is to assume the following factorisation for the variational approximation, in which A is the set of edges in the assembly graph and V = {1, . . . , V} the set of unitig sequence vertices.
Note that we have assumed a fully factorised approximation except for the η^g_{v,u}, the paths for each strain through the graph. There we assume that the path for each strain forms a separate factor, allowing strong correlations between the different elements of the path. This is therefore a form of structured variational inference [20].

To obtain the CAVI updates we use the standard procedure of finding the log of the optimal distribution q* for each set of factorised variables as the expectation of the log joint distribution (Equation 2) over all the other variables, except the one being optimised. Using an informal notation we will denote these expectations as ⟨ln P⟩_{−q_j}, where q_j is the variable being optimised.

The mean field update for each set of {η^g_{v,u}}_{u,v∈A} then follows from this procedure. It is apparent from Equation 7 that q*_g({η^g_{v,u}}_{u,v∈A}) takes the form of a multivariate discrete distribution with one dimension for each edge (u, v) ∈ A. The first term in Equation 7 enforces the flow constraints and does not separate across nodes; the next two terms are effectively coefficients on the total flow through a unitig and its square. The updates for the other variables below depend on the expected values of the total flow through each of the unitig nodes for strain g, φ^g_v, which themselves depend on the η^g_{v,u}. These expected values can be efficiently obtained for all v by representing Equation 7 as a factor graph, with factors corresponding to both the constraints and the flow probabilities through each node, and variables η^g_{v,u}. We can then find the marginal probabilities for both the η^g_{v,u} and the φ^g_v using the Junction Tree algorithm [41], and from these we can calculate the required expectations.

Next we consider the mean field update for the γ_{g,s}: ln q*(γ_{g,s}) = ⟨ln P⟩_{−γ_{g,s}}. With the restriction γ_{g,s} > 0 this gives a normal distribution, truncated to (0, ∞), for γ_{g,s}, with mean and precision given in Equation 10. Derivations for the other updates follow similarly, giving a Gamma posterior for the τ, where Ω = V S and we have used λ_{v,s} as a shorthand for the predicted count number. The τ then have the standard Gamma expectations and log expectations, where ψ is the digamma function. The biases θ_v have a truncated normal distribution and their updates can be derived similarly to the above.
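For reference, the posterior expectation of a variable with a normal distribution truncated to (0, ∞), as arises for the γ_{g,s}, can be computed as below. This is a standard result for the truncated normal; function and argument names are illustrative, not BayesPaths internals.

```python
import math

def truncnorm_pos_mean(mu, tau):
    """Mean of a Normal(mu, 1/tau) distribution truncated to (0, inf):
    mu + sigma * phi(-mu/sigma) / (1 - Phi(-mu/sigma)),
    where phi/Phi are the standard normal pdf/cdf.
    """
    sigma = 1.0 / math.sqrt(tau)
    alpha = -mu / sigma
    pdf = math.exp(-0.5 * alpha * alpha) / math.sqrt(2.0 * math.pi)
    sf = 0.5 * math.erfc(alpha / math.sqrt(2.0))  # P(Z > alpha)
    return mu + sigma * pdf / sf
```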

Evidence lower bound (ELBO)

Iterating the CAVI updates defined above will generate a variational approximation that is optimal in the sense of maximising the evidence lower bound (ELBO), so called because it bounds the log evidence, log p(x) ≥ ELBO(q(z)). It is useful to calculate the ELBO whilst performing CAVI updates to verify convergence, and the ELBO itself is sometimes used as a Bayesian measure of model fit, although as a bound that may be controversial [7]. The ELBO can be calculated from the relationship:

ELBO(q) = E_q[ln P(x|z)] + E_q[ln P(z)] − E_q[ln q(z)]

The first term is simply the expected log-likelihood of the data given the latent variables, where Ω = V S and the expectations are over the optimised distributions q.

The second term is the expectation under q(z) of the log prior distributions. In our case, with standard distributions, it is easy to calculate; for instance, for each of the γ_{g,s}, the μ_{g,s} and τ_{g,s} are given by their current values derived from Equation 10, and the moments of γ_{g,s} are calculated from a truncated normal distribution with those current parameters. After a maximum fixed number of iterations, or if the ELBO converges, we stop iterating.

Variational inference can be sensitive to initial conditions as it can only find local maxima of the ELBO; we therefore use a previously published variational Bayesian version of non-negative matrix factorisation [8] to find an initial approximate solution.

Empirical modelling of node variances

For low-coverage MAGs a precision that is identical for all nodes performs satisfactorily, but since the true distribution of read counts is Poisson, this overestimates the precision for nodes with high counts x_{v,s}. To address this we developed an empirical procedure where we first calculate ⟨log τ_{v,s}⟩ for each node using Equation 14 as:

⟨log τ_{v,s}⟩ = ψ(α_0 + 1/2) − log( β_0 X_{v,s} + (1/2)(x_{v,s} − ⟨λ_{v,s}⟩)² )

This quantity exhibits high variability, so we then smooth it over log(x_{v,s}) using generalised additive models as implemented in pyGAM [36] to give ⟨log τ_{v,s}⟩*. The term β_0 X_{v,s} gives a prior which is effectively Poisson. We then obtain ⟨τ_{v,s}⟩ as exp(⟨log τ_{v,s}⟩*). This procedure has no theoretical justification but gives good results in practice. This approach of modelling a non-Gaussian distribution as a Gaussian with empirical variances is similar to that used in voom for RNA-seq [23].
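A sketch of the smoothing step, with a running median over nodes ordered by log count standing in for the pyGAM generalised additive model. This substitution is an assumption for illustration only; names are ours.

```python
import numpy as np

def smooth_log_precision(log_x, log_tau, window=51):
    """Smooth noisy per-node log precisions over log read counts.

    A stand-in for the GAM smoother used in STRONG: sort nodes by
    log count and take a running median of their log precisions,
    then map the result back to the original node order.
    """
    order = np.argsort(log_x)
    sorted_tau = log_tau[order]
    half = window // 2
    smoothed = np.empty_like(sorted_tau)
    for i in range(len(sorted_tau)):
        lo, hi = max(0, i - half), min(len(sorted_tau), i + half + 1)
        smoothed[i] = np.median(sorted_tau[lo:hi])
    out = np.empty_like(smoothed)
    out[order] = smoothed  # undo the sort
    return out
```

The smoothed values would then be exponentiated to give the per-node precisions, as in the text.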

Cross-validation to determine optimum number of strains

The ARD procedure usually converges on the correct number of strains except for high-coverage MAGs, where overfitting may occur and too many strains can be predicted. We therefore additionally implemented a cross-validation procedure, splitting the entire data matrix x_{v,s} into test and train folds (default ten folds), training the model on the train fold and then testing on the held-out data. The correct number of strains was then taken to be the one that maximised the log predictive posterior with an empirical variance reflecting the Poisson nature of the data, the exact test statistic being:

Σ_{(v,s)∈E} ln N(x_{v,s} | ⟨λ_{v,s}⟩, 1/τ_{v,s})

where τ_{v,s} = 1/(0.5 + x_{v,s}) and E indicates data points in the test set; this down-weights high read count nodes, reflecting approximately Poisson noise.
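The cross-validation score can be computed as below. This is a sketch of the statistic described above; function and argument names are ours.

```python
import numpy as np

def log_predictive_score(x_test, lam_pred):
    """Gaussian log predictive density with an empirical,
    approximately Poisson variance of 0.5 + x, summed over
    held-out unitig coverages x_test given predictions lam_pred.
    """
    var = 0.5 + x_test  # empirical variance, down-weights high counts
    return float(np.sum(-0.5 * np.log(2.0 * np.pi * var)
                        - 0.5 * (x_test - lam_pred) ** 2 / var))
```

The strain number maximising this score over the test folds would be selected.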

Nanopore sequence analysis

To enable a qualitative comparison between haplotypes obtained from the Nanopore reads and the BayesPaths predictions we developed the following pipeline, applied at the level of individual single-copy core genes (SCGs) from MAGs:

1. We mapped all reads using minimap2 [26] against the SCG contig consensus ORF sequence and selected those reads with alignments that spanned at least one third of the gene with a nucleotide identity > 80%.

2. We then extracted the aligned portion of each read, reverse complementing if necessary, so that all reads ran in the same direction as the SCG ORF.

The aligned read fragments were then compared to the predicted COG sequences from BayesPaths. From these comparisons we generated NMDS plots indicating sequence diversity, and they provided an input to the hybrid Nanopore strain resolution algorithm below.

EM algorithm for hybrid Nanopore strain resolution

We also developed a simple EM algorithm for direct prediction of paths, and their abundances, on the short read assembly graph that are consistent with a set of observed long reads. We began by mapping the set of n = 1, . . . , N Nanopore sequences, denoted {S_n}, onto the corresponding simplified SCG graph generated by STRONG using GraphAligner [33]. This provided us with N optimal alignment paths as walks in our SCG graph. We denote this graph G, comprising unitig vertices v and edges e ∈ {u, v} defining overlaps.

We assume, as is almost always the case, that the graphs contain no unitigs in both forward and reverse configurations, and that there are no cycles, so that each SCG graph is a directed acyclic graph (DAG) with one copy of each unitig, and we only need to track the direction in which each overlap enters and leaves each unitig. The best alignment walks then comprise a sequence of edges, e^n_1, . . . , e^n_{W_n}, where W_n is the number of edges in the walk of read n, that traverse the graph.

Given these observed Nanopore reads we aim to reconstruct G haplotypes comprising paths from a dummy source node s, which has outgoing edges to all the true source nodes in the graph, through the graph to a dummy sink t, which connects all the true sinks. We further assume that each haplotype has relative frequency π_g. Each such haplotype path P_g = {s, e^g_1, . . . , e^g_{W_g}, t} will translate into a nucleotide sequence T_g. We assume that these haplotypes generate Nanopore reads with a fixed error rate ε, which gives a likelihood:

P(S_n | T_g, ε) = ε^{m_{n,g}} (1 − ε)^{M_{n,g}}

where m_{n,g} is the number of basepair mismatches between S_n and T_g, counting insertions, deletions and substitutions equally, and M_{n,g} is the number of matches.

To maximise this likelihood we used an Expectation-Maximisation algorithm, iterating the following steps until convergence:

1. E-step: Calculate the responsibility of each haplotype for each sequence as:

z_{n,g} = π_g P(S_n | T_g, ε) / Σ_h π_h P(S_n | T_h, ε)

Alignments of reads against haplotypes were performed using vsearch [34].

2. M-step: We update each haplotype by finding the most likely path on the short read graph given the current expectations. These are calculated by assigning a weight w^g_e to each edge e in the graph as:

w^g_e = Σ_{n∈e} z_{n,g} L_e

where n ∈ e are the set of reads whose optimal alignment contains that edge and L_e is the unique length of the unitig the edge connects to, i.e. ignoring the overlap length. We then find for haplotype g the maximal weight path through this DAG using a topological sort. The error rates are updated as:

ε = Σ_{n,g} z_{n,g} m_{n,g} / Σ_{n,g} z_{n,g} L_{n,g}    (22)

where L_{n,g} are the alignment lengths.
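The E-step and the maximum-weight path search of the M-step can be sketched as below. These are illustrative, stand-alone versions; the actual pipeline computes likelihoods from GraphAligner and vsearch alignments.

```python
import math

def e_step(log_lik, pi):
    """Responsibilities z[n][g] from per-read log-likelihoods
    log_lik[n][g] and haplotype frequencies pi (the E-step above),
    computed stably via the log-sum-exp trick."""
    z = []
    for row in log_lik:
        logs = [math.log(p) + ll for p, ll in zip(pi, row)]
        m = max(logs)
        w = [math.exp(l - m) for l in logs]
        s = sum(w)
        z.append([x / s for x in w])
    return z

def best_path(edges, weight, source, sink, topo_order):
    """Maximum-weight source-to-sink path in a DAG via dynamic
    programming over a topological order (the M-step path update).

    edges: node -> list of successors; weight: (u, v) -> float.
    """
    score = {source: 0.0}
    back = {}
    for u in topo_order:
        if u not in score:
            continue
        for v in edges.get(u, []):
            s = score[u] + weight.get((u, v), 0.0)
            if s > score.get(v, float("-inf")):
                score[v] = s
                back[v] = u
    # backtrack from sink to source
    path, node = [sink], sink
    while node != source:
        node = back[node]
        path.append(node)
    return path[::-1]
```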

As is often the case with EM algorithms, convergence depends strongly on initial conditions. We therefore initialise using partitioning around medoids clustering with the distances calculated in Methods - Nanopore sequence analysis. We can estimate the number of haplotypes from the negative log-likelihood as a function of haplotype number.