Metagenomic Binning through Multi-resolution Genomic Binary Patterns

Motivation: High-throughput sequencing has facilitated the analysis of complex microbial communities, generating an enormous number of sequences containing various regions of bacterial and viral genomes. Image processing offers a rich source of descriptors for data analysis. Here, we introduce a feature space from image processing, multi-resolution local binary patterns (MLBP), as a feature descriptor to extract local 'texture' changes from nucleotide sequences, and demonstrate its applicability to the alignment-free binning of metagenomic data.

Results: The effectiveness of our approach is tested using both simulated and real human gut microbial communities. We compared the performance of our method with several existing k-mer-frequency-based techniques and show that it outperforms them. In addition, we provide a time-series study of the abundance pattern of each bin to help refine the formed clusters automatically and to find relations that may exist among the clusters. Although the main aim is to introduce the use of genomic signatures in an alternative feature space (MLBP), our results show its application to the analysis of contigs from a metagenomic study.

Availability: The source code for our Multi-resolution Genomic Binary Patterns method can be found at https://github.com/skouchaki/MrGBP


I. INTRODUCTION
High-throughput (so-called next-generation) sequencing technologies are generating an enormous volume of biological data. Sequence analysis therefore plays an important role in studying the genetic information present in the sequenced samples. In metagenomic studies the sequence reads can come from the same or different genomes within a community of viruses and bacteria; reconstructing individual genomes from this mixed data can therefore be problematic. Moreover, sequencing errors, sequence repetition, insufficient coverage, and genetic diversity can give rise to fragmented assemblies. Consequently, alignment-free techniques [1], [2] have been introduced as an alternative way of analysing the species composition of metagenomic data [3]. A main category of alignment-free techniques is based on species-specific genomic signatures extracted by calculating the normalised frequency of k-mers of a specific size, e.g., k = 4. The signatures are obtained by counting the occurrences of each k-mer combination, so that the k-mer frequencies of each sequence represent a feature vector in a high-dimensional space.
Here our aim is to introduce an alternative feature space, local binary patterns (LBP), to extract the local changes in a sequence. LBP is a feature descriptor capturing local texture changes, first introduced for segmenting an image into several meaningful partitions [4], [5]. Its one-dimensional implementation has also found application in other signal processing areas, including speech processing [6]. Moreover, it has a multi-resolution version, called multi-resolution LBP (MLBP), which considers texture changes at various scales [7]. Here, we assume that each genomic contig or sequence read has 'texture' patterns at various scales that can be extracted using MLBP. Moreover, the arbitrary location of each pattern does not affect the extracted feature vector. However, calculating MLBP requires numerical data as input, so genomic sequences need to be mapped into one or several numerical representations [8], [9]. Some of these representation methods are based on biochemical or biophysical properties of DNA molecules, while others use arbitrarily assigned numbers. MLBP features can then be extracted from these numerical representations and used to analyse the metagenomic data.
To demonstrate an application of this feature vector and its effectiveness, we consider the problem of automatically grouping reconstructed genomic contigs into species-level groups ('binning'). Binning plays an important role in metagenomic analysis as it groups related reads or contigs for further analysis. Unsupervised binning and visualisation of metagenomic data is especially helpful when there are no related reference genomes or any other prior information about the taxonomic structure of the data.
A number of metagenomic binning techniques have employed genomic signatures. Genomic signatures, across-sample coverage profiles, or a hybrid of the two are commonly used to describe genomic fragments [10], [11]. Binning based on emergent self-organising maps (ESOM) is one example that uses contour boundaries to visualise the clusters [11]; unfortunately, ESOM plots are computationally very expensive. There are also methods that consider coverage across multiple samples, e.g., CONCOCT [10] and MetaBAT [12]; however, they require a high number of samples to perform well, e.g., 50 or more. VizBin [13] is another reference-independent visualisation approach that considers a single sample, but it requires manual selection of the cluster centroids for binning.
Here, for extracting the features from the numerical mapping of nucleotide contigs, we use MLBP in one dimension. We also consider the effect of including across-sample coverage information in a hybrid approach to maximise the performance on longitudinal metagenomic samples. For visualisation, the feature vectors need to be projected from a higher-dimensional to a lower-dimensional space (e.g., two dimensions). The common feature reduction techniques are: (1) linear, such as singular value decomposition (SVD) [14], [15], and (2) non-linear, such as ESOMs [16] and Barnes-Hut t-distributed stochastic neighbor embedding (BH-tSNE) [17]. Although the non-linear techniques preserve the underlying structure of the data, they are computationally expensive. Therefore, we have used randomised SVD (RSVD) [18] to first reduce the dimensionality and obtain 'eigengenome' information [19] in a shorter time. Finally, these eigengenome features are passed as input to BH-tSNE for visualising and binning the metagenomic dataset.

II. METHODS
In this section we present our methodological pipeline (Figure 1). We numerically represent the genomic contigs using a nucleotide mapping (Table I). After that, MLBP is used to extract features from these numerical representations. In addition, across-sample coverage information (mean and standard deviation) is extracted separately using Bowtie2 [20]. RSVD is used to reduce the dimensions of the MLBP feature vectors by capturing the eigengenome information. BH-tSNE is then used to map the RSVD features to a two-dimensional space for visualisation. To evaluate the visualisation performance quantitatively, we cluster the BH-tSNE-projected data using DBSCAN, a density-based spatial clustering algorithm [21], and calculate the precision, recall, and F1 score between the DBSCAN-assigned labels and the original labels.

A. The Nucleotide Mapping
Methods to numerically represent genomic reads can be categorised into two groups: (1) assigning an arbitrary value to each letter A, C, G, or T of the nucleotide sequence: the Voss [22] and the two- or four-bit binary representations [23], [24] are examples of this group; (2) defining numerical representations that correspond to certain biochemical or biophysical properties of the DNA molecules: the electron-ion interaction potential (EIIP) [25], paired nucleotide [26], and atomic [27] representations are examples of this group.
Various representations capture different properties (texture patterns) of each sequence. Here, we compare the EIIP, atomic, paired, real, and integer nucleotide representations. Table I shows the value assigned to each nucleotide in each of the representations, and Figure 2 shows an example of mapping a nucleotide sequence to two numerical vectors.
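As an illustration, such a table lookup takes only a few lines of Python. The EIIP and atomic values below are the published constants for these representations [25], [27] and are assumed here to match Table I; the function name `encode` is ours, not part of the released MrGBP code.

```python
# Illustrative sketch of the nucleotide mapping step (not the released
# MrGBP code). The EIIP and atomic values are the published constants
# for these representations and are assumed to match Table I.

EIIP = {'A': 0.1260, 'C': 0.1340, 'G': 0.0806, 'T': 0.1335}
ATOMIC = {'A': 70, 'C': 58, 'G': 78, 'T': 66}

def encode(seq, table):
    """Map a nucleotide string to a numerical vector; unknown symbols
    (e.g., N) are skipped."""
    return [table[base] for base in seq.upper() if base in table]

print(encode("ACGT", ATOMIC))
```

Any of the other representations in Table I can be substituted by swapping the lookup table.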

B. Multi-resolution Local Binary Patterns
LBP and its extensions (e.g., MLBP) have gained significant popularity in image, speech, and signal processing [28]. Using LBP, each two-dimensional window is mapped to a binary number of fixed length. The LBP codes capture the data patterns (e.g., textural changes in images and frequency changes in speech), while the histogram of codes shows how often each pattern appears. These histograms are the feature vectors which essentially extract the species-specific genomic signatures.
LBP assigns a binary code to each sample by examining its neighbouring points. Considering x(t) as the t-th sample of the numerical representation of a genomic segment, LBP is defined as

LBP(t) = Σ_{i=1}^{p/2} Sign(x(t − i) − x(t)) 2^(i−1) + Σ_{i=1}^{p/2} Sign(x(t + i) − x(t)) 2^(p/2+i−1),

where p is the number of neighbouring points and Sign indicates the sign function,

Sign(y) = 1 if y ≥ 0, and 0 otherwise,

which assigns a binary number by thresholding the difference between each neighbouring point and the centre point t.
Consequently, it assigns a p-bit binary number to each window of length p + 1. Each binary number is converted to an LBP code using binomial weights. An example of the LBP operator for p = 6 is shown in Figure 3, where the value of the centre point (in the square) is compared with its six neighbouring points to produce the LBP code. This code describes the local data changes in a compressed format. Finally, considering all the obtained codes, the distribution of the LBP codes can be defined as

H(k) = Σ_t I(LBP(t) = k) / (N − p), k = 1, 2, ..., 2^p,

where I(·) is the indicator function and N is the genomic fragment length.

Fig. 3. Calculating the LBP code. A threshold of the atomic numerical representation of the sequence is determined by comparing the centre point (in the square) and its neighbours. The LBP code is then obtained by using binomial weights.
Considering the distribution makes the feature space independent of the location at which each pattern occurs. MLBP is an extension of LBP that combines the LBP distributions obtained for various values of p ≤ P. Consequently, pattern changes at different resolution levels are considered, improving the description of the input data. Here, we apply MLBP to one-dimensional linear sequences to capture pattern changes of various lengths.
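A minimal pure-Python sketch of the one-dimensional LBP operator and the resulting MLBP feature vector is given below; the neighbour bit ordering and the histogram normalisation are implementation choices and may differ from the released code.

```python
# Sketch of 1-D LBP/MLBP (illustrative, not the released MrGBP code).

def lbp_codes(x, p):
    """Compare each point with p/2 neighbours on each side and pack the
    thresholded sign bits into a p-bit integer code (bit order is a
    convention choice)."""
    h = p // 2
    offsets = list(range(-h, 0)) + list(range(1, h + 1))
    codes = []
    for t in range(h, len(x) - h):
        bits = 0
        for i, off in enumerate(offsets):
            if x[t + off] >= x[t]:   # thresholded sign of the difference
                bits |= 1 << i
        codes.append(bits)
    return codes

def lbp_histogram(x, p):
    """Normalised distribution of the LBP codes for one resolution p."""
    codes = lbp_codes(x, p)
    hist = [0.0] * (2 ** p)
    for c in codes:
        hist[c] += 1.0 / len(codes)
    return hist

def mlbp_features(x, P=6):
    """MLBP: concatenate the LBP histograms for p = 2, 4, ..., P."""
    feats = []
    for p in range(2, P + 1, 2):
        feats.extend(lbp_histogram(x, p))
    return feats
```

For P = 6 this yields a 4 + 16 + 64 = 84-dimensional feature vector per contig, regardless of contig length.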

C. Across-Samples Coverage Information
To obtain the coverage profile for contigs across the longitudinal samples, the Illumina reads were mapped to the contigs with Bowtie2 [20] for each time point. Samtools [29], [30] was then used to produce a per-base depth file. The coverage feature vector for each genomic contig is then the average and standard deviation of its per-base depth.
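Assuming the per-base depth file follows the three-column `samtools depth` layout (contig, position, depth), the coverage features can be aggregated as in this sketch; the paper does not state whether a population or sample standard deviation is used, so the population form is assumed here.

```python
# Sketch of turning a `samtools depth`-style stream into per-contig
# coverage features (illustrative; assumes tab-separated
# contig/position/depth lines and a population standard deviation).
import io
import statistics
from collections import defaultdict

def coverage_features(depth_lines):
    """Return {contig: (mean depth, population std of depth)}."""
    depths = defaultdict(list)
    for line in depth_lines:
        contig, _pos, depth = line.split()
        depths[contig].append(int(depth))
    return {c: (statistics.mean(d), statistics.pstdev(d))
            for c, d in depths.items()}

demo = io.StringIO("c1\t1\t10\nc1\t2\t14\nc2\t1\t3\n")
print(coverage_features(demo))
```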

D. Randomised Singular Value Decomposition
A metagenomic community can be considered as a linear combination of genomic variables. The sequence of MLBP codes for each genomic fragment captures the changes in the pattern (the "texture") of each distinct fragment. By representing each fragment as a vector of MLBP codes, low-rank matrix approximations can be used for efficient analysis of the metagenomic data. Our assumption in using SVD is that the MLBP codes of the contigs from each species have a distinct energy contribution; therefore, the data can be represented as a linear combination of mutually independent components. The SVD decomposition of a matrix X is defined as

X = U Σ V^T,

where U and V contain the left and right singular vectors, Σ is the diagonal matrix of singular values, and (·)^T denotes the transpose operator. SVD can be time-consuming when dealing with large-scale problems such as metagenomic data analysis. Therefore, RSVD is used as an accurate and robust solution to estimate the dominant eigen components quickly [31].
RSVD estimates the first i eigen components of the data by mapping X to a smaller matrix using QR decomposition. A random matrix Ω = randn(N, i) is drawn, where randn generates a random matrix of the given size and N is the number of contigs; the product of X and Ω is orthonormalised by QR decomposition to give a basis Q, and X is projected onto this basis to obtain the smaller matrix B. After decomposing B using SVD, the final factors are obtained using Q and the eigen factors of B.
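The steps above can be sketched with NumPy as follows. Note one assumption: the text draws Ω = randn(N, i) against the number of contigs, whereas this sketch stores X as contigs × features and therefore draws the test matrix against the feature dimension; it is a generic randomised SVD, not the authors' implementation.

```python
# Generic randomised SVD sketch (illustrative, not the MrGBP code).
# X is assumed to be an (n_contigs x n_features) matrix.
import numpy as np

def rsvd(X, i, seed=0):
    """Approximate the leading i singular triplets of X."""
    rng = np.random.default_rng(seed)
    Omega = rng.standard_normal((X.shape[1], i))  # random test matrix
    Q, _ = np.linalg.qr(X @ Omega)                # orthonormal range basis
    B = Q.T @ X                                   # small (i x n_features)
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return Q @ Ub, s, Vt                          # approx. U, Sigma, V^T
```

A power-iteration step (multiplying by X X^T before the QR) is often added to sharpen the approximation; it is omitted here for brevity.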
E. Barnes-Hut t-Distributed Stochastic Neighbor Embedding

BH-tSNE has become a common technique for high-dimensional data visualisation in several applications [17]. It is based on minimising the divergence between two distributions: the pairwise similarities of the input objects and those of the corresponding low-dimensional points. As a result, the data in the final lower dimension retains the original local data structure.
The ordinary similarity measure of the data points is defined using normalised Gaussian kernel values, which scales quadratically with the number of data points. The main objective function is approximated by defining the similarity function over a limited number of neighbouring points [17]. In addition, a vantage-point tree is employed to find the neighbouring points rapidly. BH-tSNE is thus a more efficient (O(N log N)) data reduction approach and is used in this paper for data visualisation.

F. Datasets
To validate the effectiveness of our methodology we consider both simulated and real datasets. Simulated metagenomic data of Illumina sequences for 10 genomes was downloaded from http://www.bork.embl.de/~mende/simulated_data/. The data were assembled by Ray Meta [32] into contigs (k = 31). The %GC and genome size of each genome are listed in supplementary Table 1. Using this dataset, various aspects of our method, including the MLBP window length and the number of RSVD eigen components, are analysed.
For the real data analysis, a time-series metagenomic human gut dataset comprising 11 samples taken over nine days from the faeces of a newborn infant [11] was analysed. The authors assembled the data into 2329 contigs; this assembly and binning information is provided at http://ggkbase.berkeley.edu/carrol/. The corresponding Illumina reads can be downloaded from the NCBI, SRA052203, which consists of 18 Illumina sequencing runs (SRR492065-66 and SRR492182-97). For the real data, we mapped the reads to the contigs using Bowtie2 and obtained coverage profiles using SAMtools. For both datasets, only contigs with a length equal to or longer than 1000 bp have been considered.

G. Performance Evaluation
In order to assess the performance of our Multi-resolution Genomic Binary Patterns method, DBSCAN [21] has been used to cluster the final results. The precision, recall, and F1 score are calculated between the DBSCAN-assigned labels and the original labels as a measure of cluster "purity". Assuming there are m genomes in the dataset and it is binned into k clusters, the precision, recall, and F1 score can be calculated as

precision = Σ_{i=1}^{k} max_j s_ij / Σ_{i=1}^{k} Σ_{j=1}^{m} s_ij,

recall = Σ_{j=1}^{m} max_i s_ij / Σ_{i=1}^{k} Σ_{j=1}^{m} s_ij,

F1 = 2 · (precision · recall) / (precision + recall),

where s_ij is the total length of the contigs in cluster i that correspond to genome j.
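One common way of computing these measures from the s_ij matrix is sketched below; the exact normalisation used in the paper (e.g., the treatment of contigs left unassigned by DBSCAN) may differ, so this is illustrative only.

```python
# Sketch of cluster-purity scores from the s_ij matrix (illustrative).
import numpy as np

def binning_scores(S):
    """S[i, j]: total contig length of genome j placed in cluster i.
    Returns (precision, recall, F1)."""
    S = np.asarray(S, dtype=float)
    total = S.sum()
    precision = S.max(axis=1).sum() / total  # best genome per cluster
    recall = S.max(axis=0).sum() / total     # best cluster per genome
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```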

H. Selecting DBSCAN parameters
DBSCAN does not need the number of clusters to be specified, but it has two parameters that must be determined: epsilon, which indicates how close the points of a cluster must be to each other, and minPts, the minimum number of neighbours a point should have to be assigned to a cluster. Usually these values are not known prior to analysis, and there are several ways to select them. One way is to calculate the distance of each point to its nearest neighbour and use the histogram of these distances to select epsilon. After selecting epsilon, a histogram can be obtained of the average number of neighbours of each point within epsilon. Some of the samples do not have enough neighbouring points and can be considered as noise. An implementation of this parameter selection is included in spark_dbscan (https://github.com/alitouka/spark_dbscan). DBSCAN may leave some samples unclustered; our implementation assigns the unclustered points to the closest cluster.
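The distance heuristic described above can be sketched as a k-distance curve; this brute-force NumPy version is fine for a few thousand contigs but would not scale to very large datasets, and reading epsilon off the knee of the curve (with k related to minPts) is one common convention, not necessarily the paper's exact procedure.

```python
# Sketch of the k-distance heuristic for choosing DBSCAN's epsilon
# (illustrative; brute-force pairwise Euclidean distances).
import numpy as np

def kdist(points, k=4):
    """Sorted distance from each point to its k-th nearest neighbour.
    Epsilon is typically read off where this curve bends sharply."""
    P = np.asarray(points, dtype=float)
    D = np.sqrt(((P[:, None, :] - P[None, :, :]) ** 2).sum(axis=-1))
    D.sort(axis=1)           # per row: column 0 is the zero self-distance
    return np.sort(D[:, k])
```

Points whose k-distance sits far above the knee have too few neighbours within epsilon and would be flagged as noise.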

III. RESULTS AND DISCUSSION
In this section we present our analysis and results on both simulated and real data, discuss the results, and provide the run times using a 2.8 GHz Intel Xeon processor with 4 GB RAM. First, various aspects of our proposed method are analysed and discussed; then, the real data is analysed and compared with some existing tools.

A. Effect of Selecting the Nucleotide Mapping
Since the various representations assign different values to each letter, they can lead to different texture patterns. Consequently, the choice of representation can affect the final binning performance and the data visualisation. This can be considered an advantage of using numerical representations, as it provides the option of having various feature spaces that can be selected based on the application or the final results.
Here, the performance of our method is illustrated for different numerical mappings (EIIP, atomic, paired, real, and integer nucleotide representations) for MLBP lengths p ≤ 6 (supplementary Figure 1, Figure 4, and Table II). The EIIP representation is selected for the following subsections as it has high performance and better discrimination compared to the other representations. As demonstrated, all mapping methods except the paired numbers provide slightly different data visualisations and quite similar performance. This is because different numerical representations can differ in the relative location of the clusters in the two-dimensional data space; therefore, different applications may favour different representations. Here, contigs of different species form visually separate clusters with very limited overlap with the clusters of other species. Average run time in seconds is also provided in Table III; it includes loading the data, numerically representing the data, MLBP feature extraction, and dimension reduction using BH-tSNE.

B. Effect of MLBP Window Length
The number of features is related to the window length of the MLBP method and can affect the final performance. Consequently, to check its effect, we considered various MLBP window lengths (Table IV and supplementary Figure 2). Here, the run time only includes the time to numerically represent the data and to extract the MLBP features.
With small window lengths, the feature space cannot describe the underlying structure of the metagenomic dataset, while a large feature vector increases the time complexity (Table IV). Hence, the window size should be sufficiently large to maintain the distinctness of the signal (information regarding texture changes across the various contigs).

C. Effect of RSVD
The computational complexity of our method increases as the dimension of the feature space increases. Therefore, we considered how keeping different numbers of eigen factors affects the performance and run time of our method (Figure 5). Here, EIIP is used for the nucleotide mapping and p ≤ 6 for the MLBP window length (Table V). These results show that the MLBP method can analyse a small metagenomic dataset in a short time. Moreover, it performs well considering only one sample.

D. Real Data: Infant Human Gut
A relatively low-complexity infant human gut dataset was analysed to test the performance of the Multi-resolution Genomic Binary Patterns method with real data. Here, EIIP was used for the nucleotide mapping, p ≤ 8 for feature selection, and the first 60 eigen components in the dimension reduction stage (RSVD).
The algorithm binned the data into 19 clusters with a precision and recall of 88.34 and 97.22, respectively. The BH-tSNE representation of the data demonstrates that genomic contigs from the same or very similar genomes are binned together (Figure 6). While some of the plasmids and viruses (bacteriophages) clustered with their associated host clusters, most species formed their own cluster. The bacterial species tend to form separate clusters; for example, Anaerococcus sp. and C. albicans form clusters 1 and 3 (Figure 6). However, separating a plasmid or virus from its host is less straightforward due to their closer genome composition. Nonetheless, our method manages to bin the S. aureus strains, their plasmid, and virus into two groups: (1) S. aureus strain 1 and its plasmid and (2) S. aureus strain 2 and its virus. Propionibacterium sp. appears as a separate bin. E. faecalis and one of its plasmids form one cluster. S. epidermidis has three strains, three viruses, one integrated virus (prophage), and several plasmids, and the algorithm managed to bin them into five clusters, where S. epidermidis strains 1 and 3 clustered together (including viruses 13 and 14), with strain 4 forming a separate cluster (including virus 46).
To investigate the relationship between clusters, the abundance patterns of each cluster were calculated based on the number of reads mapped to contigs at the different sampling time points (Figure 7). Pairwise correlation coefficients were then calculated to check for any pattern among the clusters.
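This correlation check can be sketched as follows; the 0.9 threshold is an arbitrary illustrative value, not one taken from the paper.

```python
# Sketch of finding strongly co-varying cluster abundance profiles
# (illustrative; the correlation threshold is arbitrary).
import numpy as np

def correlated_pairs(abundance, threshold=0.9):
    """abundance[i]: per-time-point abundance of cluster i. Returns the
    cluster pairs whose Pearson correlation meets the threshold."""
    R = np.corrcoef(np.asarray(abundance, dtype=float))
    n = R.shape[0]
    return [(i, j, R[i, j]) for i in range(n) for j in range(i + 1, n)
            if R[i, j] >= threshold]
```

Highly correlated pairs (e.g., a bacterium and its phage, or split strain clusters) are candidates for merging or joint re-clustering.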
The results suggest that there is a strong correlation between clusters of related species (Figure 7). For example, the clusters of Propionibacterium and Peptoniphilus species have similar abundance patterns (clusters 9-10). Similar results were found in [11], where both species proliferate in the later stages and hence are well-adapted to the gut. Moreover, two clusters have been formed for F. magna with very similar coverage patterns (clusters 5-6); this similarity could be analysed further to join some of the clusters. A similar pattern can be observed in the clusters of S. aureus, confirming the relationship between each bacterium and its virus (clusters 11-12). The five clusters of S. epidermidis also share similar coverage patterns (clusters 13-17). A further step could be to cluster all the contigs of these five clusters separately to have a better separation of the related strains and viruses.

Fig. 7. Longitudinal abundance patterns of the 19 identified clusters, see Figure 6. The associated species or groups of species are indicated for each cluster. The x-axis corresponds to the longitudinal sampling over nine days [11]. The y-axis corresponds to normalised read coverage. The red boxes indicate the correlated clusters across longitudinal samples.
Our results compare favourably with those of CONCOCT [10], MetaBAT [12], and MaxBin2 [33], [34]. CONCOCT bins the data by employing sequence composition and across-sample coverage. The method has been compared with a range of methods, including MetaWatt [35], SCIMM [36], and CompostBin [37], to show its advantage over composition-based techniques. However, a shortcoming of CONCOCT is that it produces many small bins as output (a large group of contigs may break into many clusters). MetaBAT bins the metagenomic data using probabilistic distances of genome abundance together with sequence composition; it is an efficient method for analysing complex metagenomic data. However, both CONCOCT and MetaBAT need large numbers of samples to perform well. MaxBin was originally introduced for single-sample data, binning the data based on tetra-nucleotide frequencies, and was extended to MaxBin2 to support multiple samples. However, MetaBAT and MaxBin2 produce many unclassified contigs; consequently, they have higher precision but lower recall. Our proposed MLBP method introduces a new feature space and compares well to these three methods. The results show better performance on this dataset, with its small sample size (11 samples), in comparison with the other techniques (Table VI).
Finally, we checked the run time of our method. It takes about three minutes (182.71 s) to analyse this dataset (2293 contigs with a total length of 27,594,702 bp). Although the code is relatively fast, it could be further optimised in terms of both time and memory.

IV. CONCLUSION
Here we have demonstrated that image processing techniques can be applied to nucleotide sequence data comparisons. Specifically, a metagenomic visualisation and binning approach has been implemented by representing the nucleotide genomic contigs numerically. MLBP was employed to describe the genomic signature changes, followed by a dimensionality reduction step to visualise the data in a lower dimension. Our results on simulated genomic fragments reveal the underlying taxonomic structure of the metagenomic data and verify the advantage of using signal processing approaches for metagenomic data analysis. As illustrated in Section III, our method can be used for the visualisation and clustering of human gut metagenomic data at the genus or species level.
In addition, only a limited number of contigs overlap with the clusters of other species.