SemiBin: Incorporating information from reference genomes with semi-supervised deep learning leads to better metagenomic assembled genomes (MAGs)

Metagenomic binning is the step in building metagenome-assembled genomes (MAGs) when sequences predicted to originate from the same genome are automatically grouped together. The most widely-used methods for binning are reference-independent, operating de novo and allowing the recovery of genomes from previously unsampled clades. However, they do not leverage the knowledge in existing databases. Here, we propose SemiBin, an open source tool that uses neural networks to implement a semi-supervised approach, i.e. SemiBin exploits the information in reference genomes, while retaining the capability of binning genomes that are outside the reference dataset. SemiBin outperforms existing state-of-the-art binning methods in simulated and real microbiome datasets across three different environments (human gut, dog gut, and marine microbiomes). SemiBin returns more high-quality bins with larger taxonomic diversity, including more distinct genera and species. SemiBin is available as open source software at https://github.com/BigDataBiology/SemiBin/.


Generating must-link and cannot-link constraints
SemiBin uses taxonomic annotation results to generate cannot-link constraints and breaks up contigs to build must-link constraints. To generate must-link constraints, SemiBin artificially breaks up long contigs into two fragments of equal length and generates must-link constraints between the two shorter contigs. By default, SemiBin uses a heuristic method to automatically find the minimum size of contigs to break up (alternatively, the user can specify the threshold): it breaks up contigs that are at least as long as the longest contig such that the basepairs contained in all contigs that are as long (or longer) encompass ≥ 98% of the total number of basepairs (with an additional minimum size of 4,000bps).
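The break-up heuristic above can be sketched as follows. This is a minimal illustration, not SemiBin's actual implementation; the function names and the restriction of candidate thresholds to observed contig lengths are our choices.

```python
def breakup_threshold(lengths, frac=0.98, min_size=4000):
    """Find the minimum contig length above which contigs are broken up.

    Picks the largest observed contig length t such that contigs of
    length >= t still contain >= `frac` of all basepairs, then enforces
    a floor of `min_size` bp (4,000 bp in the text).
    """
    total = sum(lengths)
    best = min_size
    # Walk candidate thresholds from the longest contig downwards; the
    # covered fraction grows monotonically as t decreases.
    for t in sorted(set(lengths), reverse=True):
        covered = sum(l for l in lengths if l >= t)
        if covered >= frac * total:
            best = max(t, min_size)
            break
    return best

def split_contig(seq):
    """Break a long contig into two equal halves (a must-link pair)."""
    half = len(seq) // 2
    return seq[:half], seq[half:]
```

For example, with contig lengths [10000, 9000, 100, 100], contigs of length ≥ 9,000 bp already contain ≥ 98% of all basepairs, so the threshold is 9,000 bp and only the two long contigs are broken up.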

As it is semi-supervised, this neural network is trained with two loss functions. The supervised loss is a contrastive loss which is used to classify pairs of inputs as must-link or cannot-link:

\mathcal{L}_{\text{supervised}} = \sum_{(x_1, x_2) \in M_x \cup C_x} \left[ y \, d(x_1, x_2)^2 + (1 - y) \max\left(0,\ 1 - d(x_1, x_2)\right)^2 \right]

where M_x denotes the must-link pairs and C_x the cannot-link pairs in the training set, d(x_1, x_2) is the Euclidean distance of the embedding of (x_1, x_2), and y is an indicator variable for the (x_1, x_2) pair (with the value 1 if (x_1, x_2) ∈ M_x, and 0 if (x_1, x_2) ∈ C_x). The goal of the supervised embedding is to transform the input space so that contigs have a smaller distance if they are in the same genome, compared to pairs of contigs from different genomes. To ensure that the embedding learns structure shared by all genomes, we also used an autoencoder (Hinton and Zemel, 1994) to reconstruct the original input from the embedding representation with an unsupervised mean square error (MSE) loss function:

\mathcal{L}_{\text{unsupervised}} = \frac{1}{|X|} \sum_{x \in X} \lVert x - \hat{x} \rVert^2

where x is the original input, x̂ is the reconstructed input, and X is the set of all contigs in the dataset.
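The two loss terms can be sketched numerically as follows. This is a minimal, framework-free illustration: the margin of 1 in the contrastive term and the function names are our assumptions, and the real model optimizes both losses jointly by backpropagation through the network.

```python
def contrastive_loss(distances, labels, margin=1.0):
    """Supervised loss over pairs: labels[i] == 1 for must-link pairs
    (pulled together in the embedding), 0 for cannot-link pairs
    (pushed at least `margin` apart)."""
    total = 0.0
    for d, y in zip(distances, labels):
        total += y * d**2 + (1 - y) * max(0.0, margin - d)**2
    return total / len(distances)

def reconstruction_loss(x, x_hat):
    """Unsupervised MSE between the original input x and the
    autoencoder's reconstruction x_hat."""
    return sum((a - b)**2 for a, b in zip(x, x_hat)) / len(x)
```

A must-link pair at distance 0 and a cannot-link pair at distance beyond the margin both contribute zero loss, which is the desired embedding geometry.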

Similarity between contigs
The similarity between two contigs is defined as

\mathrm{Sim}(x_1, x_2) = e^{-d(x_1, x_2)} \cdot a(x_1, x_2)

where d(x_1, x_2) is the Euclidean distance of the semi-supervised embedding and a(x_1, x_2) is an abundance similarity term. When there are fewer than 5 samples, the embedding distance only contains k-mer information. In this case, we modeled the number of reads per base of a contig as a normal distribution and used the Kullback-Leibler divergence (Kullback and Leibler, 1951) to measure the divergence between the normal distributions of two contigs, which yields a(x_1, x_2) above.
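The KL divergence between two normal read-depth distributions has a closed form, sketched below. The symmetrization in `abundance_divergence` is an illustrative choice of ours; the text does not specify how the two directions of the (asymmetric) KL divergence are combined.

```python
import math

def kl_normal(mu1, sigma1, mu2, sigma2):
    """Closed-form KL divergence KL(N(mu1, sigma1^2) || N(mu2, sigma2^2))."""
    return (math.log(sigma2 / sigma1)
            + (sigma1**2 + (mu1 - mu2)**2) / (2 * sigma2**2)
            - 0.5)

def abundance_divergence(contig1, contig2):
    """Symmetrized KL between per-base read-depth distributions of two
    contigs, each summarized as (mean, std) of reads per base."""
    mu1, s1 = contig1
    mu2, s2 = contig2
    return 0.5 * (kl_normal(mu1, s1, mu2, s2) + kl_normal(mu2, s2, mu1, s1))
```

Two contigs with identical depth distributions give a divergence of 0; the divergence grows as their means or variances separate.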

Clustering
The binning problem is modeled as clustering on a graph. First, SemiBin considers the fully connected graph with contigs as nodes and the similarity between contigs as edge weights. To convert the community detection task into an easier one, the fully connected graph is converted to a sparse graph. A parameter (max_edges, defaulting to 200) controls the sparsity of the graph, but the results are robust to different values of this parameter (see Supplementary Figs. 6 and 10). For each node, only the max_edges edges with the highest weights are kept. To remove any potential artefacts introduced by the embedding procedure, SemiBin builds another graph with the same procedure using the original features, and edges in the embedding-based graph that do not exist in the original-feature graph are also removed.
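The sparsification and intersection steps can be sketched as follows. This is a minimal illustration on a dense similarity matrix; the function names are ours, and SemiBin's implementation details may differ.

```python
def sparsify(weights, max_edges=200):
    """Given a symmetric similarity matrix (list of lists, zero diagonal),
    keep for each node only its `max_edges` highest-weight edges."""
    n = len(weights)
    keep = [[False] * n for _ in range(n)]
    for i in range(n):
        # Rank all other nodes by edge weight, highest first.
        neighbours = sorted((j for j in range(n) if j != i),
                            key=lambda j: weights[i][j], reverse=True)
        for j in neighbours[:max_edges]:
            keep[i][j] = True
    return keep

def intersect(mask_embedding, mask_original):
    """Drop embedding-graph edges absent from the graph built on the
    original features (guards against embedding artefacts)."""
    n = len(mask_embedding)
    return [[mask_embedding[i][j] and mask_original[i][j]
             for j in range(n)] for i in range(n)]
```

With max_edges=1, each node keeps only its single strongest neighbour, leaving a very sparse graph for community detection.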

After building the sparse graph, Infomap, an algorithm that reveals community structure in weighted graphs based on information theory (Rosvall and Bergstrom, 2008), is used to reconstruct bins from the graph. If the user requests it, SemiBin can use single-copy genes of the reconstructed bins to independently re-bin bins whose mean number of single-copy genes is greater than one.

The default pipeline of SemiBin is to (1) generate must-link/cannot-link constraints, (2) train the semi-supervised deep learning model for the sample, and (3) bin based on the embeddings. Because contig annotation and model training require significant computational time, and because k-mer frequencies can be transferred between samples or even projects, we proposed SemiBin(pretrain) for single-sample binning: (1) we trained a model with constraints from one or several samples, and (2) we directly applied this model to other samples. To use SemiBin(pretrain), users can train a model from their own datasets or use one of our built-in pretrained models for the human gut, dog gut, and ocean environments.

Binning modes
We have evaluated our model in three binning modes: single-sample, co-assembly, and multi-sample binning (Nissen et al.).

Single-sample binning means binning each sample into inferred genomes after independent assembly. This mode allows for parallel binning of samples, but it does not use information across samples.

Co-assembly binning means samples are co-assembled first and contigs are then binned with abundance information across samples. This mode can generate longer contigs and use co-abundance information, but co-assembly may lead to inter-sample chimeric contigs.

Multi-sample binning means the resulting bins are sample-specific (as in single-sample binning), but information is aggregated across samples (in our case, abundance information). This mode requires more computational resources, as it requires mapping reads back to the concatenated FASTA file.

In single-sample and co-assembly binning, we calculate the k-mer frequencies and abundance for every sample and bin contigs from every sample independently. For multi-sample binning, we first concatenate the contigs from every sample into a single FASTA file.

For the benchmarking of binners, we used 5 simulated datasets from CAMI I and CAMI II and 5 real metagenomic datasets. From CAMI I, we used the low, medium, and high complexity datasets (the high complexity dataset comprises 596 genomes with 5 samples). We also used the skin and oral cavity datasets from the toy Human Microbiome Project dataset of CAMI II: the Skin dataset has 610 genomes with 10 samples and the Oral dataset has 799 genomes with 10 samples. We used the low complexity dataset to evaluate the single-sample binning mode of our method, the medium and high complexity datasets to evaluate the co-assembly binning mode, and the Skin and Oral datasets to evaluate the multi-sample binning mode. We used fastANI (Jain et al., 2018) (version 1.32, default parameters) to calculate the ANI value between genomes for every sample of the CAMI II datasets.

We also used five real microbiome projects from different environments to evaluate the proposed method. We used the first three datasets to evaluate the single-sample and multi-sample binning modes, and the last two human gut projects as hold-out datasets to evaluate the pretrained model in SemiBin.
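The per-contig k-mer frequency features mentioned above can be sketched as follows. This is a simplified illustration: it does not canonicalize reverse complements, which binning tools (including, we assume, SemiBin) typically do when counting tetranucleotides.

```python
from itertools import product

def kmer_freq(seq, k=4):
    """Normalized k-mer frequency vector of a contig (k=4 by default,
    i.e. tetranucleotides), ignoring windows with non-ACGT characters."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {km: i for i, km in enumerate(kmers)}
    counts = [0] * len(kmers)
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:
            counts[j] += 1
    total = sum(counts) or 1
    return [c / total for c in counts]
```

Each contig is thus mapped to a fixed-length (4^k) vector that is comparable across contigs regardless of their length.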

For the simulated datasets, we used the gold standard contigs provided as part of the challenge. Reads from every sample were then mapped to the concatenated contigs to obtain BAM files for every sample.

We compared SemiBin to other methods in three binning modes. For single-sample and co-assembly binning of the CAMI I datasets, the comparisons included SolidBin-SFS-CL (which generates must-link constraints from feature similarity and cannot-link constraints from reference genomes). We also added SolidBin-naive (without additional information) to show the influence of the different semi-supervised modes.

For multi-sample binning of the CAMI II datasets, we compared to the existing multi-sample binning tool VAMB, which clusters the concatenated contigs based on co-abundance across samples and then splits the clusters according to the original samples, and to Metabat2 with default settings. For more comprehensive benchmarking, we also converted Metabat2 to a multi-sample mode: we used jgi_summarize_bam_contig_depths (with default parameters) to calculate depth values using the BAM files from every sample mapped to the concatenated contigs, and then ran Metabat2 to bin contigs for every sample with abundance information across samples, which follows the same idea as the multi-sample mode in SemiBin.

We also benchmarked the single-sample and multi-sample binning modes on real datasets. For single-sample binning, we compared the performance of SemiBin to Maxbin2, Metabat2 and VAMB, and for multi-sample binning, we compared to VAMB. These tools have been shown to perform well in real metagenomic projects (see Supplementary Table 3). For VAMB, we set the minimum contig length to 2,000bp. For SolidBin, we ran SolidBin with constraints generated from annotation results with CheckM. For SemiBin, we ran the whole pipeline described in the Methods (with default parameters, except for the inclusion of the reclustering step). For the other benchmarked methods, we ran the tools with default parameters.

To evaluate the effectiveness of the semi-supervised learning in SemiBin, we also benchmarked modified versions of SemiBin (NoSemi, SemiBin_m, SemiBin_c and SemiBin_mc; see below). In the simulated datasets, we used recall, purity (precision), and F1-score to evaluate the performance of the different methods. In the real datasets, as ground truth is not available, we evaluated the completeness and contamination of the predicted bins with CheckM.

To evaluate the generalization of the learned models, we selected three models as training sets from the human gut, dog gut, and ocean microbiome datasets. In each dataset, we selected a model from the sample that generated the highest, median, and lowest number of high-quality bins. For the human gut dataset, we termed them human_high, human_median and human_low, with models from the other environments named analogously. For every environment, we also randomly sampled 10 samples from the remaining samples as testing sets (no overlap between training and testing sets). Then, we transferred these models to the testing sets from the same environment or different environments and used the embeddings from these models to bin the contigs.

To evaluate the pretraining approach, we used another two human gut datasets. We termed SemiBin with a pretrained model from the human gut dataset used before SemiBin(pretrain; external). We also trained a model from 20 randomly chosen samples from the hold-out datasets and applied it to the same dataset, an approach we termed SemiBin(pretrain; internal). We benchmarked SemiBin(pretrain; external), Metabat2, original SemiBin, and SemiBin(pretrain; internal).

Must-link constraints obtained from taxonomic annotation covered only a small part of the genomes in the environment, which would bias the learning of the model.

Owing to the noise and bias of the must-link constraints obtained from taxonomic annotations, we chose not to use them and instead generated must-link constraints by breaking up long contigs.

To evaluate the learning ability of SemiBin, namely that it can learn the underlying genome structure from the must-link and cannot-link constraints rather than just reproduce its inputs, we compared the full pipeline to NoSemi (no constraints are used) as well as SemiBin_m, SemiBin_c and SemiBin_mc, which directly use the must-link and cannot-link constraints to generate the graph without the semi-supervised learning step (see Methods). Compared to the second best binner, the complete SemiBin pipeline performed similarly or better (on average 12.4% more high-quality bins) in the low and medium complexity datasets and showed large improvements (on average 34.6% more high-quality bins) in the high complexity dataset (see Supplementary Fig. 11).

SemiBin could also reconstruct on average 7.0% and 16.0% more distinct high-quality strains in the Skin and Oral datasets, respectively (see Supplementary Fig. 12). These results show the ability of the semi-supervised model to learn the genome structure of the environment beyond what was present in the inputs, especially in complex environments. The learned embedding also showed a clearer separation between genomes and better aggregation within genomes (see Fig. 2d and Supplementary Fig. 7).

When comparing the different versions of SolidBin on the CAMI I datasets, in most situations SolidBin with additional information performed worse than SolidBin-naive, which showed that the semi-supervised Ncut algorithm used in SolidBin could not leverage the additional information very well, perhaps due to the noise in these annotations (see Supplementary Table 2).

Applying SemiBin to real data
In the human gut dataset, SemiBin with multi-sample binning reconstructed more high-quality bins than SemiBin with single-sample binning, but these came from fewer distinct species, genera and families. This showed that multi-sample binning (which uses abundance across several samples) might lead to the recovery of more genomes from species occurring in multiple samples, while overlooking rare species (see Supplementary Fig. 16).

As expected, models trained on the sample itself achieved the best result (see Supplementary Fig. 13). Nonetheless, it is noteworthy that SemiBin with a pretrained model from the same environment could still perform better than Metabat2, reconstructing up to 26.0%, 59.2% and 60.0% more high-quality bins on the human gut, dog gut, and ocean testing datasets, respectively. In most situations, the pretrained model that generated the highest number of high-quality bins performed better than the models that generated the median and lowest numbers from the same environment. Furthermore, transferring models between different environments could still improve results compared to the NoSemi version of SemiBin, and in some situations transferring across environments performed better than Metabat2 (see Supplementary Fig. 13).

The results of the model transfer indicated that the semi-supervised model learned high-level structure shared by microorganisms across environments. However, transferred models still underperformed models learned for each sample, and there was a dependency on the sample used for training. Thus, we attempted to mitigate this by learning models on multiple samples simultaneously. This approach achieved the best results, while not requiring computationally-costly per-sample training (see Fig. 2 and main text).

We observed that a total of 13 genes were differentially present in dog and human gut strains (see Fig. 2, main text, and Supplementary Fig. 21b).

SemiBin also outperforms all other methods in the medium and high complexity datasets (see Supplementary Fig. 5a). c, SemiBin reconstructed a larger number of distinct high-quality genera, species and strains in the CAMI II Skin and Oral datasets compared to either Metabat2 or VAMB. A high-quality strain is considered to have been reconstructed if any bin contains the strain with completeness > 90% and contamination < 5% (see Methods). If at least one high-quality strain is reconstructed for a particular genus or species, then those are considered to have been reconstructed. d, Semi-supervised embedding separates contigs from different genomes. Shown is a two-dimensional visualization of the embedding of the low complexity dataset from CAMI I, with contigs colored by their original genome (using t-SNE: sklearn.manifold.TSNE(perplexity=50, init='pca')) (Pedregosa et al., 2011).

In the dog gut dataset, SemiBin(pretrain) always produced more high-quality bins than any other per-sample method. In the human gut and ocean datasets, other methods occasionally outperformed SemiBin(pretrain) (observed for 2 human gut samples and 9 ocean samples), but the difference was never large (at most, two extra high-quality bins were produced).
Results of VAMB in multi-sample binning mode are compared to SemiBin(multi), with SemiBin(multi) producing more high-quality bins overall (although not in every sample). P-values shown are from a Wilcoxon signed-rank test (two-sided) on the counts for each sample. c, We identified the overlap between the bins in the human gut from Metabat2 and SemiBin(pretrain) using Mash (see Methods). While some high-quality bins from Metabat2 were matched to lower-quality SemiBin(pretrain) bins, the reverse was much more common (i.e. SemiBin(pretrain) recovered higher-quality versions of bins that were only medium-quality in the Metabat2 outputs as well as some that were completely absent). Within bins that are present at high quality in the output of both binners, SemiBin(pretrain) achieved higher completeness (recall) (P = 2.236 × 10⁻⁶⁷) without a statistically-significant increase in contamination (1 − precision) (P = 0.167). The overall F1 statistic (2·(recall·precision)/(recall + precision)) is significantly better (P = 9.287 × 10⁻⁶⁷; all P-values were computed using a two-sided Wilcoxon signed-rank test). Results in the dog gut and ocean datasets are qualitatively similar (see Supplementary Fig. 15). (HQ-HQ: high-quality in both; HQ-MQ: high-quality in one and medium-quality in the other; HQ-LQ: high-quality in one and low-quality or worse in the other; HQ-Miss: high-quality in one and missed in the other.) d, Strains of B. vulgatus recovered from the human and dog gut microbiomes and the type strain B. vulgatus ATCC 8482 cluster according to their host. Shown is the maximum-likelihood phylogenetic tree based on core genes; branches with bootstrap values higher than 70 are marked in pink at the nodes. Clustering based on ANI or gene presence also showed a separation between the hosts (see Supplementary Fig. 21).
The gene content of the strains was also statistically different between the two hosts (P < 0.05 using Fisher's exact test after FDR correction with the Benjamini-Hochberg method, see Methods).

Supplementary Fig 1. Overview of the SemiBin pipeline. a, Generating must-link constraints by breaking up contigs artificially and generating cannot-link constraints based on contig taxonomic annotations (i.e. against GTDB reference genomes). b, Calculating the abundance value (average and variance of the number of reads per base) and the k-mer frequency of every contig. c, Training a semi-supervised siamese neural network with the cannot-link and must-link constraints as inputs. The learned embedding is used in step e for binning. d, Based on the assumption that the number of reads per base follows a normal distribution, calculating the Kullback-Leibler divergence between the normal distributions of two contigs. SemiBin uses this value as the abundance similarity when the number of samples is smaller than 5. e, Generating a sparse graph from the embedding distance and abundance similarity; the Infomap algorithm is used to obtain the preliminary bins. Finally, SemiBin uses weighted k-means to recluster bins whose mean number of single-copy genes is greater than one to obtain the final bins.

Supplementary Fig 2. Semi-supervised neural network model used in SemiBin. A pair (input1, input2) is fed into a shared-weight neural network. Input1 can be the k-mer frequencies and the abundance distribution (n ≥ 5) or just the k-mer frequencies (n < 5). During training, the unsupervised loss and the contrastive loss are optimized at the same time. After training, the embedding of the inputs is used for the subsequent binning.

Genomes in the CAMI I datasets are defined as common strains or unique strains according to genome similarity. Common strains are defined as genomes with an ANI (average nucleotide identity) ≥ 95% to the most similar genome in the environment, and unique strains are defined as genomes with an ANI < 95% to any other genome. Shown is the number of reconstructed genomes per method above a given completeness and with contamination < 5% for a, the medium and high complexity datasets considering all strains; b, the three datasets considering common strains; and c, the three datasets considering unique strains. SemiBin reconstructed more high-quality bins (considering all strains, common strains and unique strains), especially for common strains, which are a major challenge for binning in environments with multiple strains.

Supplementary Fig 8. SemiBin outperformed Metabat2 and VAMB in CAMI II datasets with multi-sample binning. a, Comparison of Metabat2 with single-sample binning (Metabat2(default)) and our adapted multi-sample binning (Metabat2(multi)). The adapted multi-sample Metabat2 led to only modest improvements. b, and c, Shown are the overlaps of the reconstructed distinct high-quality strains, species and genera for the Skin and Oral datasets. SemiBin reconstructed more distinct strains, species and genera compared to VAMB and Metabat2.

Supplementary Fig 9.
SemiBin outperformed Metabat2 and VAMB across almost all ANI intervals in the CAMI II datasets. We stratified the datasets according to the ANI value, defined as the ANI of a genome to the most similar genome in the same sample, and calculated the number of distinct high-quality strains in every interval. Shown are a, the number of reconstructed distinct high-quality strains in every interval, and b, the genome distribution according to the ANI values for every sample of the Skin and Oral datasets.

Supplementary Fig 11. To evaluate whether the semi-supervised learning model could learn the underlying structure of the environment (not just memorize the must-link and cannot-link constraints), we compared SemiBin to the NoSemi version (removing the semi-supervised learning part of SemiBin), SemiBin_m (directly using must-link constraints to generate the sparse network for clustering, no semi-supervised learning), SemiBin_c (directly using cannot-link constraints to generate the sparse network for clustering, no semi-supervised learning), and SemiBin_mc (directly using must-link and cannot-link constraints to generate the sparse network for clustering, no semi-supervised learning) (see Methods). Shown is the number of high-quality bins in the low, medium and high complexity datasets from CAMI I.

Supplementary Fig 12. Semi-supervised deep learning in SemiBin learnt the underlying structure of the environment in CAMI II datasets. Shown is the number of high-quality distinct strains, species and genera in the Skin and Oral datasets. The methods used here are the same as those in Supplementary Fig. 11.

Supplementary Fig 13. We transferred the learned semi-supervised models from one sample between the human gut, dog gut and ocean datasets. We chose three models from the samples that reconstructed the highest, median and lowest number of high-quality bins for each environment, and termed these models human_high, human_medium, human_low, dog_high, dog_medium, dog_low, ocean_high, ocean_medium and ocean_low. For every environment, we randomly chose 10 samples that were not used before as testing sets (no overlap between training and testing samples). The transfer results, compared to Metabat2, the original SemiBin and the NoSemi version of SemiBin, are shown as the number of high-quality bins for each environment (the darker the color, the higher the number of high-quality bins).

Supplementary Fig 14. SemiBin outperformed other binners in real datasets with single-sample and multi-sample binning, evaluated by CheckM. A high-quality bin is defined as a bin with completeness > 90% and contamination < 5% as evaluated by CheckM. Shown is the number of high-quality bins generated by Maxbin2, VAMB, Metabat2, SemiBin and SemiBin(pretrain) with single-sample binning, and by VAMB and SemiBin with multi-sample binning, in the human gut, dog gut and ocean datasets. Based on the CheckM results, SemiBin(pretrain) reconstructed 55.7%, 94.6% and 49.7% more high-quality bins than Metabat2 with single-sample binning, and 44.7%, 25.0% and 41.0% more high-quality bins than VAMB with multi-sample binning, in the human gut, dog gut and ocean datasets, respectively.

Dog gut. Results here are qualitatively similar to those of the human gut datasets (see Fig. 2c).

Supplementary Fig 19. SemiBin with the pretrained model outperformed Metabat2 on two hold-out human gut datasets from African and German populations. We transferred the pretrained model from the human gut dataset to two hold-out human gut datasets from African and German populations. We also benchmarked Metabat2, SemiBin, and SemiBin with a pretrained model trained from the hold-out datasets (using 20 samples). (SemiBin(pretrain; internal): SemiBin with the pretrained model from the hold-out human gut datasets; SemiBin(pretrain; external): SemiBin with the pretrained model from the human gut datasets used in Fig. 2b.) a, African human gut; b, German human gut.