CuBi-MeAn: A Customized Pipeline for Metagenomic Data Analysis

Whole genome shotgun sequencing is a powerful tool for studying the microbial community in a given environment, and metagenomic binning offers a genome-centric approach to study microbiomes. Although several tools are available to process metagenomic data from raw reads through interpretation, there is still no standard approach for processing metagenomic data step by step. In this study, CuBi-MeAn (Customizable Binning and Metagenomic Analysis), a customizable and flexible processing pipeline, was created to process metagenomic data and generate results for further interpretation. This study aims to perform metagenomic binning to enhance taxonomic classification and to reveal functional potentials and interactions among microbial populations in environmental systems. The pipeline is comprised of a series of genomic/metagenomic tools designed to recover better-quality results and support reliable interpretation of system dynamics. It was developed and evaluated on metagenomic data from three environmental engineering projects; the three datasets differed in size, sequencing platform, and environmental source. By designing and developing a flexible, customized pipeline, this study shows how to process large metagenomic datasets with limited resources. The results not only help uncover new information from environmental samples but are also applicable to other metagenomic studies across various disciplines.


Introduction
Advances in molecular biology techniques such as next-generation sequencing (NGS), PCR, molecular cloning, DNA microarrays, and protein mass spectrometry have improved our knowledge of microbiomes (Jovel, Patterson et al. 2016, Mendes, Braga et al. 2017, Quince, Walker et al. 2017, Heyer, Schallert et al. 2019, Sun, Liao et al. 2020). Among them, NGS has become one of the most popular techniques for generating targeted amplicon sequencing, shotgun metagenomic, and meta-transcriptomic data to study microbial communities. NGS platforms are capable of generating millions (and sometimes billions) of short (e.g., Illumina and SOLiD) or long (e.g., PacBio and Oxford Nanopore) DNA sequence fragments (Mardis 2008, Quail, Kozarewa et al. 2008, Rhoads and Au 2015, Jain, Olsen et al. 2016). These NGS technologies can be used to produce sequencing libraries that contain substantial information about the entire microbial population living in an environment. Depending on the research question, scientists may choose between two popular approaches for generating libraries from DNA isolated from samples: amplicon sequencing and metagenomic sequencing. Amplicon sequencing analysis aims to explore microbial community composition based on targeted sequencing of PCR amplicons of certain conserved regions (16S rRNA, ITS, or 18S) of the genome that can serve as unique markers for phylogenetic classification of the organisms present in a sample. Since this method relies on existing databases for classification, it is inherently biased towards known organisms (Walsh, Crispie et al. 2018). In addition, taxonomic classification based on marker genes alone fails to address the functional variation within closely related genomes (Hiergeist, Gläsner et al. 2015, Tremblay, Singh et al. 2015, Gohl, Vangay et al. 2016).
Isolated DNA from a mixed microbial community can also be used for metagenomic shotgun sequencing, which has become increasingly utilized to characterize both the taxonomy and the function of microbial communities (Woloszynek, Zhao et al. 2018). Through metagenomic shotgun sequencing, it is possible to generate sequence libraries containing the genetic information of hundreds or even millions of cells in a sample, in order to understand their taxonomic classification, potential functional capabilities, and physiological traits. Metatranscriptomics is another NGS-based approach, commonly used to study how genes are regulated in response to environmental factors and stimuli (Ranjard, Poly et al. 2000). The rest of this chapter focuses on developing a pipeline for processing metagenomic data and producing outputs that can be used in downstream analysis and interpretation, such as the taxonomic classification, interactions, and potential functions of the microbial communities in a microbiome. For more information on targeted amplicon sequencing and meta-transcriptomics, see the article by Rausch et al. (2019) and the reviews by Hodkinson and Grice (2015) and Bashiardes et al. (2016).
Recovering information from the many fragments of DNA sequence generated by NGS facilities (known as "reads") is not an easy task (Tyson, Chapman et al. 2004). Currently, there are two approaches for processing metagenomic libraries: gene-centric and genome-centric. Gene-centric approaches (Venter, Remington et al. 2004, Tringe, Von Mering et al. 2005) involve the recovery and investigation of the entire microbiome as a "supra-organism", regardless of the function of its individual members (Juengst and Huss 2009, Juengst 2009). In gene-centric approaches, individual genes are regarded as selfish units and as the central keys to carrying out functions, while genomes are nothing more than vessels for the genes (Dawkins 2016).
Here, genes are the fundamental framework of molecular biology for decoding the blueprint of life and evolution (Venter, Adams et al. 2001, Tishkoff and Verrelli 2003, Schloss and Handelsman 2004, Guénet 2005). Gene-centric approaches rely heavily on existing databases and often overlook novel genes (Jaenicke, Ander et al. 2011, Wong, Zhang et al. 2013). In a gene-centric approach, certain functions are attributed to a gene or a gene cluster, and these genes are then used as references for annotating unknown genes. Therefore, any variation in these genes may increase annotation errors. Another problem is the confusion of homologous genes: genes with very similar sequences may have different functions. For example, ammonia monooxygenase is very similar to methane monooxygenase, which can confound annotation and its interpretation. In addition, especially for short-read sequencing libraries, this approach fails to address questions related to the function of individual genes, neglecting that metabolic and functional traits can depend on multiple genes and on how those genes are regulated. For example, a certain pathway may appear complete in a gene-centric analysis, yet it remains unspecified whether all of the genes belong to one organism or are distributed across different organisms. For the proper function of some pathways, intermediates/metabolites may need to be transported into and out of the cell (Strambio-De-Castillia, Niepel et al. 2010, Villegas and Zaphiropoulos 2015). The same is true of the co-expression or co-regulation of genes. Cells respond to environmental changes by reprogramming the expression of specific genes throughout the genome.
The transcription rate of a particular gene is determined by the interaction of diverse regulatory proteins (transcriptional activators and repressors) with specific DNA sequences in the gene's promoter. How a collection of regulatory proteins accomplishes the task of regulating a set of genes can be described as a regulatory network (Wyrick and Young 2002). These networks might be present in a dataset, but their arrangement, along with transporters, controls the proper function of the genes involved. Therefore, in this approach, metabolic interactions among genes are difficult to prove (Heng 2009, Vanwonterghem, Jensen et al. 2016).
These shortcomings of gene-centric approaches in metagenomic studies led to the development of the genome-centric concept, which has revealed the functional properties of individual genomes, leading to a more detailed comprehension of the microbial interactions occurring in the microbiome (Kougias, Campanaro et al. 2018). While gene-centric approaches focus on the function of individual genes and correlate it to the biochemical activities of the system, genome-centric approaches decipher the complexity of the genome by considering gene functions and their interplay within a genome. Genome-centric approaches add a dimension to the functional analysis of metagenomic data by correlating the interactions of the products of the different genes within a genome with environmental factors (Raghoebarsing, Pol et al. 2006, Wrighton, Castelle et al. 2014). The genome-centric concept is based on the premise that a microbial community is composed of taxonomically and functionally related bacterial populations that can interact. Each bacterial population is comprised of a 'core genome', consisting of genes that are always present and carry out major functions, and a 'pan-genome', which contains genes that are variably present (Tettelin, Masignani et al. 2005). The pan-genome is a holistic snapshot of the collective genomes of closely related organisms and thus includes the specific and specialized functions and adaptations of the divergent taxonomic units (species or strains) that compose it. This provides valuable genetic data for understanding the evolutionary processes that affect the structure and dynamics of related bacterial populations in relation to environmental factors (Holmes, Gillings et al. 2003, Whitaker and Banfield 2006).
Furthermore, integrating functional and taxonomic results using genome-centric methods, and coupling them to existing databases, enables a deeper and more comprehensive insight into the dynamics of a biological system. While gene-centric analysis is heavily dependent on assembling fragmented DNA reads, genome-centric analysis of metagenomic data depends on the clustering of reads into bins.
"Binning" is a method for clustering reads based on certain characteristics, used as an alternative to full metagenome assembly, which is essentially the assembly of all reads into a supra-organism for downstream analysis (Teeling, Waldmann et al. 2004, Woyke, Teeling et al. 2006, Albertsen, Hugenholtz et al. 2013, Cotillard, Kennedy et al. 2013, Le Chatelier, Nielsen et al. 2013). The development of new algorithms has improved the metagenomic tools used in processing metagenomic data, including binning tools (Anantharaman, Brown et al. 2016, Parks, Rinke et al. 2017, Almeida, Mitchell et al. 2019, Pasolli, Asnicar et al. 2019). Several metagenomic binning tools are available; they use k-mer frequency, codon content, and read coverage across multiple datasets to cluster short reads into bins (Kang, Froula et al. 2015, Wu, Simmons et al. 2015, Graham, Heidelberg et al. 2017, Lu, Chen et al. 2017). However, due to the different algorithms used in these tools, the generated bins can differ for the same dataset.
Therefore, refinement tools have been developed to improve the quality of the bins (Sieber, Probst et al. 2018). These improved bins can be used in taxonomic classification to approximate their position in the phylogenetic tree and to identify the potential metabolic pathways they could carry out.
There are many methods to go from raw reads to bins, but there are no methods that make use of these bins to gain information about the function of the different populations identified in a microbial community and how those populations may interact. Therefore, the purpose of the pipeline described here is to take the sequencing data from the starting point and generate data that are ready for downstream analysis and interpretation. Although many powerful tools are available to process metagenomic data from raw reads through interpretation, there is still no standard approach that users can follow to process metagenomic data step by step. Existing cloud services have limitations such as database size restrictions, high dependency on internet access, and a lack of flexible options. Uritskiy et al. (2018) developed a pipeline, called MetaWrap, to process metagenomic data.
MetaWrap is an automated pipeline comprised of several metagenomic tools; it processes raw reads from metagenomic samples, clusters them into metagenomic bins, and then generates outputs for final interpretation such as taxonomic classification and functional annotation.
This pipeline includes several tools that all need to be installed and run to generate results; the user is able to customize individual tools, but the overall processing steps are essentially fixed and must all be run.
Another metagenomic data processing and interpretation pipeline, named Sunbeam, was created by Clarke et al. (2019). This pipeline includes a series of metagenomic tools for quality control, decontamination, assembly, taxonomic classification, and functional annotation. Unlike MetaWrap, Sunbeam uses pre-processed metagenomic reads for taxonomic classification rather than clustering reads into "genome" bins, and it uses mapped reads for functional annotation.
The advantage of this pipeline lies in its parallel configuration of tools, which makes the steps independent and the pipeline highly flexible and customizable. In this pipeline, taxonomic classification and annotation are derived directly from the pre-processed reads and are independent of each other. Therefore, the user is not able to investigate assigned functions and taxonomy together to correlate the interactions of genes within a genome, which is the basis of the genome-centric approach.
The aim of this work is to create a more customizable and flexible processing pipeline to process metagenomic data and generate results for further interpretation. This pipeline, called CuBi-MeAn (Customizable Binning and Metagenomic Analysis), generates taxonomic classifications and functional annotations that can be used for genome-centric as well as gene-centric investigation of a given microbiome. CuBi-MeAn is comprised of a series of metagenomic tools that can be customized by the user. The flexibility of this pipeline allows users to add new tools to each step (e.g., different assembly or binning tools). Since the tools in CuBi-MeAn are installed independently, the user is able to install and use them on separate systems such as shared clouds or local systems. This flexibility is advantageous when handling large metagenomic datasets under system limitations (RAM, storage, etc.). In the following sections we review the details, steps, and tools used in CuBi-MeAn; then we discuss the performance of this pipeline in processing three different metagenomic datasets.

Methodology
CuBi-MeAn is comprised of a series of metagenomic tools that take raw metagenomic reads as input and generate bins for functional annotation and taxonomic classification.
The overall workflow is summarized in Figure 2.1. The modules used in this pipeline may require different dependencies that can be installed and run separately. Anaconda installation packages and module instructions are also available for most of the modules. Detailed instructions and functions are covered in the following sections.

Data preparation
Quality Filtering: To improve the quality of the raw data, the Sickle tool is used to trim low-quality read ends (Joshi and Fass 2011). Users can customize Sickle's options based on the sequencing technology and the input and output formats (https://github.com/najoshi/sickle).
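As an illustration of what window-based quality trimming does, the sketch below scans per-base qualities with a sliding window and truncates the read where the mean quality first drops below a threshold. The window size and threshold are illustrative values, not Sickle's actual defaults.

```python
def trim_read(quals, window=5, threshold=20):
    """Return the index at which to truncate a read whose quality
    falls below the mean-quality threshold (sliding-window scan,
    similar in spirit to window-based trimmers such as Sickle)."""
    for start in range(len(quals) - window + 1):
        if sum(quals[start:start + window]) / window < threshold:
            return start  # cut the read here
    return len(quals)  # keep the whole read

# Example: Phred qualities deteriorating towards the 3' end
quals = [38, 37, 36, 35, 34, 30, 22, 15, 10, 8, 5, 3]
keep = trim_read(quals)  # index where the low-quality tail starts
```

In practice Sickle reads FASTQ files directly and trims both ends; this sketch only shows the underlying idea.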

Data processing
Assembly: Assembly of the entire metagenomic library is the first step for metagenomic binning. De novo metagenomic assembly tools, which do not require reference genomes, are used to assemble the raw reads into contigs.
Here, the IDBA de novo assembly tool was used to assemble the whole metagenomic dataset (Peng, Leung et al. 2010). In this step, the whole metagenomic library is assembled to create a "supra-genome", which represents the entire metagenome and is used as the "reference" for the subsequent binning step.
Selected assembly tools were compared for their computational resource requirements, and among them IDBA outperformed the others, so it was used for metagenomic assembly in this pipeline. The user may customize IDBA parameters to optimize the assembly (https://github.com/loneknightpy/idba).
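The core idea behind de novo assemblers such as IDBA can be illustrated with a toy De Bruijn graph: reads are decomposed into k-mers, and an unambiguous path through the graph is walked to produce a contig. This is a deliberately minimal sketch under simplifying assumptions (no sequencing errors, no branching); real assemblers iterate over multiple k values and resolve branches.

```python
from collections import defaultdict

def debruijn_contig(reads, k=4):
    """Toy De Bruijn assembly: build a (k-1)-mer graph from reads
    and walk the single unambiguous path into one contig."""
    edges = defaultdict(list)
    indeg = defaultdict(int)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            left, right = kmer[:-1], kmer[1:]
            if right not in edges[left]:
                edges[left].append(right)
                indeg[right] += 1
    # start node: a (k-1)-mer with no incoming edge
    start = next(n for n in edges if indeg[n] == 0)
    contig = start
    while edges.get(contig[-(k - 1):]):
        nxt = edges[contig[-(k - 1):]][0]
        contig += nxt[-1]  # extend contig by one base
    return contig

# Overlapping fragments of the sequence "ATGGCGTGCA"
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
contig = debruijn_contig(reads)
```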

Binning:
In this study, the metagenomic binning approach was utilized to investigate the subject metagenomic datasets. The two main approaches for binning metagenomic data are supervised and unsupervised binning. In supervised binning, reference genomes are used for clustering the metagenomic reads. Supervised binning is suitable when there are specific targeted species in the dataset. In this method, the reference genomes are aligned to the query, and binning is based on GC content and k-mer frequency (Mohammed, Ghosh et al. 2011, Mande, Mohammed et al. 2012). However, the accuracy of supervised methods is questionable, especially for environmental samples with higher diversity (Sedlar, Kupkova et al. 2017). In addition, supervised methods can be biased towards the reference genomes and leave out new species (Cole, Brosch et al. 1998).
The other binning approach is unsupervised binning, in which reference genomes are not required for read clustering. Instead, unsupervised binning relies on sequence composition, the abundance of genome fragments, or a hybrid of the two. Nucleotide composition methods are based on the observation that oligonucleotide, dinucleotide, or G+C content shows species-specific patterns within the DNA of the same genome (Sandberg, Winberg et al. 2001, Pride, Meinersmann et al. 2003, Wu and Ye 2011). This notion became the main pillar in the design of algorithms for tools such as TETRA (Teeling, Waldmann et al. 2004), MetaCluster (Woyke, Teeling et al. 2006), and MetaCluster (Yang, Peng et al. 2010). However, there are still some disputes about the accuracy of sequence composition methods due to sequence variation within a single genome, which makes it challenging to accurately classify very short reads (Yang, Peng et al. 2010, Wu and Ye 2011). It has been suggested that the abundance of certain genes (so-called marker genes) is constant across copies of the same genome (Wang, Leung et al. 2012, Albertsen, Hugenholtz et al. 2013, Nielsen, Almeida et al. 2014). In other words, for a given sample, the abundance of a gene in a specific genome will be the same as in other genomes of the same species. Therefore, coverage-based binning was introduced as an alternative to composition-based binning in the unsupervised setting (Albertsen, Hugenholtz et al. 2013, Cotillard, Kennedy et al. 2013, Le Chatelier, Nielsen et al. 2013). These two principles were later integrated to create hybrid binning tools such as BinSanity (Graham, Heidelberg et al. 2017), MaxBin2 (Wu, Simmons et al. 2015), MetaBAT (Kang, Froula et al. 2015), COCACOLA (Lu, Chen et al. 2017), and CONCOCT, which outperform the individual approaches. Therefore, unsupervised hybrid binning tools are the ideal option for metagenomic analysis of samples with higher diversity, such as environmental samples.
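The composition signal used by these hybrid binners can be sketched as a tetranucleotide-frequency vector computed per contig; binning tools then cluster such vectors, typically together with per-sample coverage. This is a simplified illustration of the feature computation, not any specific tool's implementation.

```python
from itertools import product

def tetra_freq(seq):
    """Tetranucleotide (4-mer) frequency vector for one contig,
    in a fixed order over all 256 possible 4-mers."""
    kmers = ["".join(p) for p in product("ACGT", repeat=4)]
    counts = {k: 0 for k in kmers}
    total = 0
    for i in range(len(seq) - 3):
        window = seq[i:i + 4]
        if window in counts:  # skip windows containing N etc.
            counts[window] += 1
            total += 1
    return [counts[k] / total for k in kmers] if total else [0.0] * 256

vec = tetra_freq("ATGCATGCATGC")  # toy contig
```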
In this study, the metagenomic contigs from the assembly step were used as the reference for the binning tools. CuBi-MeAn utilizes the following five hybrid binning tools: BinSanity (Graham, Heidelberg et al. 2017), MaxBin2 (Wu, Simmons et al. 2015), MetaBAT (Kang, Froula et al. 2015), COCACOLA (Lu, Chen et al. 2017), and CONCOCT. As these tools can be run in parallel within this pipeline, the user is able to add new binning tools or opt out of any of the aforementioned ones.
More information regarding these binning tools can be found in their references and webpages.

Bins refinement:
The results of the binning tools can be used for downstream analysis.
However, since these binning tools use different parameters and approaches (i.e., different algorithms) for processing metagenomic data, the same tool may yield low-quality, incomplete bins for one dataset and perform well on another. Therefore, finding an appropriate tool for each dataset is another challenge in obtaining high-quality bins. DASTool (Sieber, Probst et al. 2018) addresses this by aggregating the best bins produced by multiple binning tools.
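The refinement idea can be sketched as a greedy selection over candidate bins from several binners: rank them by a quality score and discard bins whose contigs are already claimed. The scoring weight and overlap threshold below are illustrative; DASTool's actual algorithm scores bins using single-copy genes.

```python
def refine_bins(candidate_bins):
    """Greedy bin dereplication in the spirit of DAS Tool:
    keep the highest-scoring bin for each set of contigs and
    drop later bins that mostly reuse already-claimed contigs."""
    def score(b):
        # completeness minus a contamination penalty (illustrative weight)
        return b["completeness"] - 5 * b["contamination"]

    claimed, selected = set(), []
    for b in sorted(candidate_bins, key=score, reverse=True):
        novel = set(b["contigs"]) - claimed
        if len(novel) / len(b["contigs"]) > 0.5:  # mostly new contigs
            selected.append(b["name"])
            claimed |= set(b["contigs"])
    return selected

# Hypothetical bins from three different binners (toy values)
bins = [
    {"name": "maxbin.001", "contigs": ["c1", "c2", "c3"],
     "completeness": 92.0, "contamination": 1.0},
    {"name": "metabat.007", "contigs": ["c1", "c2"],
     "completeness": 80.0, "contamination": 0.5},
    {"name": "concoct.012", "contigs": ["c4", "c5"],
     "completeness": 75.0, "contamination": 4.0},
]
best = refine_bins(bins)
```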

Downstream Data Assessment and Analysis
Quality Assessments: For quality assessment of the bins, the CheckM software (Parks, Imelfort et al. 2015) was used to evaluate the bins generated by the binning tools and by DASTool. The quality of bins generated from metagenomic data is a major factor in judging the performance of binning tools. In metagenome-assembled genomes, unlike single-isolate genome assemblies, the genomes are recovered from a diverse group of microorganisms, so there is always the potential to introduce DNA fragments into a bin that do not actually belong to it. Identification and quantification of the universal single-copy genes (USCGs) present in the bins is one of the most common approaches to evaluating the quality of MAGs. For taxonomic classification, CheckM relies on single-copy marker genes that are specific to a genome's lineage within a reference genome tree; it uses 104 lineage-specific marker sets.
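A simplified version of the single-copy-marker calculation is sketched below: completeness counts how many markers from a lineage-specific set are present at least once, and contamination counts extra copies. CheckM's real implementation additionally uses collocated marker sets; the marker gene names here are just examples.

```python
def marker_quality(marker_hits, marker_set):
    """Estimate completeness and contamination from single-copy
    marker genes. marker_hits maps each marker gene to the number
    of copies found in the bin (simplified CheckM-style logic)."""
    found = sum(1 for m in marker_set if marker_hits.get(m, 0) >= 1)
    extra = sum(max(marker_hits.get(m, 0) - 1, 0) for m in marker_set)
    completeness = 100.0 * found / len(marker_set)
    contamination = 100.0 * extra / len(marker_set)
    return completeness, contamination

# Example marker set and hit counts (illustrative, not CheckM's sets)
markers = ["rpoB", "gyrA", "recA", "dnaK", "ftsZ"]
hits = {"rpoB": 1, "gyrA": 2, "recA": 1, "dnaK": 0, "ftsZ": 1}
comp, cont = marker_quality(hits, markers)
```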
PhyloPhlAn uses the 400 most conserved proteins to extract the phylogenetic signal.
The marker gene identification step first selects the most relevant and largest possible set of phylogenetic markers for the input sequences and then identifies those markers in the input sequences. The selection of markers depends on the type of phylogeny considered, ranging from the 400 universal proteins to a variable number of core genes and species-specific genes.
CAT/BAT uses the DIAMOND protein aligner (Buchfink, Xie et al. 2015) to assign taxonomy to contigs and bins.

Results and Discussions
CuBi-MeAn was developed and applied to carry out a three-pronged analysis of metagenomic data from three environmental engineering projects (TNT-contaminated soil, an EBPR reactor, and an algae-bacteria bioreactor). The three prongs are as follows: Approach 1, to understand the dynamics of the microbial community structure; Approach 2, to understand microbial function in the given environment; and Approach 3, to explain how environmental factors affect the microbial communities. In this section, the performance of CuBi-MeAn in processing the three metagenomic datasets is evaluated. The data types and other metrics are summarized in Table 2.1.
The metagenomic data were generated from different origins. The first dataset, TNT, was generated by collecting contaminated soil samples from an old TNT manufacturing site. The soil contains high concentrations of nitroaromatic compounds such as TNT and DNTs. The site had been under aeration treatment, with periodic tilling, for six years. The second dataset, EBPR, was collected from aqueous samples: bench-scale enhanced biological phosphorus removal (EBPR) reactors were designed and operated to study organic phosphorus uptake from synthetic wastewater by certain microorganisms. The metagenomic samples in this study are from the EBPR reactor after 23 days of aerobic-anaerobic cycles. Finally, the Algae dataset, also sequenced from aqueous samples, was collected from an algae-bacteria bioreactor designed to enhance nitrogen removal from wastewater. As shown in Table 2.1, these metagenomic samples have different properties; we therefore expect different performance from the pipeline across them.

Assembly:
In this study, three de novo assembly tools, Velvet, IDBA, and Celera, were tested initially to evaluate their performance and computational requirements; then one assembly tool was selected for downstream analysis. Assembly of metagenomic data is one of the most computationally resource-intensive steps, requiring large amounts of storage and RAM. This is a major constraint that needs to be considered when developing pipelines or simply processing metagenomic datasets. Therefore, four different subsampled read sets extracted from the TNT dataset were used to compare the performance of the selected assemblers. These read sets were generated by mapping raw reads to selected reference genomes. The results of this test are summarized in Table 2.2. Among these assemblers, IDBA had the best overall performance, with Velvet performing better than Celera. Also, the Celera assembler was not specifically designed to handle metagenomic samples, while the other two had options for processing metagenomic data. Therefore, for the next test in this study, we compared only the assemblies generated by IDBA and Velvet for the unfiltered EBPR and TNT datasets.
Both IDBA and Velvet generate large "k-mer" files, which tabulate the number of occurrences of each fixed-length word of length k in a DNA dataset. Generating the k-mer file is an extremely time-consuming task, and it usually produces an intermediate file that requires a large amount of storage. In our tests, TNT generated a ~170 GB and EBPR a ~310 GB k-mer file. The next step after generating the k-mer hash tables is contig generation, the most RAM-intensive task. IDBA used up almost 220 GB and 240 GB of RAM for the two datasets, while Velvet never managed to complete the assembly. It has also been confirmed previously that IDBA is one of the least memory-intensive of the popular assembly tools (Abbas, Malluhi et al. 2014, van der Walt, van Goethem et al. 2017). Since the RAM at our disposal for this project was 256 GB, this memory limitation made IDBA our choice for metagenomic assembly. The running time for IDBA to complete and generate the final contigs file was almost 36 hours for the TNT data and 42 hours for the EBPR data. It would have been possible to choose Velvet or another assembler such as MetaSPAdes (Nurk, Meleshko et al. 2017) or MEGAHIT (Li, Liu et al. 2015) if we had had better computing and memory resources at our disposal.

Raw data quality filtering
Quality filtering of the raw reads removes low-quality reads produced at the sequence-reading step by sequencing facilities, such as undetermined bases called as "N" (Figure 2.2). Reads with deteriorating quality are observed more often towards the 3' end, but they can be observed towards the 5' end as well. These incorrectly called bases negatively impact assembly, mapping, and downstream bioinformatics analysis (Young, Abaan et al. 2010). The assembled metagenome from the assembly step is used as the reference for clustering reads during binning. Shorter contigs can cause inaccurate, ambiguous binning due to low-complexity repetitive sequence (Chaisson and Pevzner 2008). Thus, in this study, the effect of quality filtering of the raw reads on assembly performance was tested.
Initially, no quality filtering of the raw metagenomic libraries was performed, which resulted in shorter and less complete contigs. Then, the metagenomic libraries were filtered to remove the low-quality sequences. To demonstrate the benefits of quality filtering, we compared the summary statistics of the contigs generated from filtered and unfiltered reads when assembled with IDBA. The results for the TNT data are shown in Table 2.3. Quality filtering of the raw reads with Sickle improved the quality of the assembled metagenome by increasing the length statistics (e.g., N50, N90, and N95) of the contigs generated by the assembly tool (Table 2.3).
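The length statistics reported in Table 2.3 belong to the Nx family, which can be computed as follows (the contig lengths here are illustrative, not from our assemblies):

```python
def nx_stat(lengths, x):
    """Nx statistic (e.g., N50, N90): the length of the shortest
    contig such that contigs at least that long together contain
    x percent of the total assembled bases."""
    target = sum(lengths) * x / 100.0
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return 0

contigs = [9000, 5000, 3000, 2000, 1000]  # toy contig lengths (bp)
n50 = nx_stat(contigs, 50)
n90 = nx_stat(contigs, 90)
```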

Binning and bin refinement
As discussed in the first chapter, we adopted a genome-centric approach to study the systems in this work; thus, we processed our metagenomic datasets using binning approaches. Binning of metagenomic reads approximates the functions and taxonomy of the assigned genomes while bypassing the challenges of full genome assembly (Ribeca and Valiente 2011, Imelfort, Parks et al. 2014). Metagenome-assembled genomes (MAGs), or bins, include the core genes of closely related taxa, which share common genes and functions, alongside pan-genes that are variably present in the bins (Tettelin, Masignani et al. 2005).
Pan-genes carry the specific and specialized functions and adaptations of divergent taxonomic units. Therefore, binning can appropriately address the challenges of genome-centric analysis of diverse metagenomic samples.
Here, five different binning tools were used to process the three subject metagenomic datasets. These tools use different approaches (i.e., different algorithms) for processing metagenomic data, which can result in low-quality, incomplete bins for some datasets and better performance for others with the same tool. Thus, finding an appropriate tool for each dataset would be another challenge in obtaining high-quality bins. DASTool (Sieber, Probst et al. 2018) offers a solution for improving bin quality. The binning tools selected in this study were BinSanity, MaxBin2, MetaBAT, COCACOLA, and CONCOCT, which are all hybrid clustering methods that use both k-mer frequency and co-abundance across samples. DASTool was then used as a refinement tool to select the best-quality contigs among the outputs of the different binning tools.
In this study, the quality of the bins was evaluated with CheckM. As discussed earlier, CheckM is an automated tool that uses a broad range of marker genes for quality assessment and taxonomic classification, and it offers several options for evaluating MAGs. Completeness and contamination are the two factors most extensively used to assess MAG quality: contamination reflects "false positive" and completeness "true positive" detections of single-copy marker genes in a given MAG. In our study, we relied mostly on these two parameters for quality assessment of our bins. The CheckM results indicate that the bins generated by DASTool outperformed those of the individual binning methods (Figure 2.3). However, there is some variation in the quality of the DASTool bins among the different datasets (Figure 2.4). The results suggest that the quality of the DASTool bins depends on the quality of the bins generated by the individual binning tools. This can be explained by the approach DASTool takes to generate its bins: DASTool is not an assembly or clustering tool; instead, it generates new bins by evaluating and aggregating the best contigs from the bins produced by the other binning tools.
Data source, coverage, read length, and the sequencing technology used also affect the quality of the final results. Among our datasets, the Algae data generated the best-quality bins, with less contamination and more completeness (Figures 2.3 and 2.4). This can be explained by the fact that the raw Algae data have the longest reads (Table 2.1); longer reads generate longer contigs, which can be aligned and mapped with less ambiguity (Chaisson and Pevzner 2008). In addition, the origin of the sample (soil vs. water, etc.) can impact bin quality. For example, the TNT data come from soil samples, which are inherently very diverse compared to aqueous samples, making clustering more difficult; accordingly, the TNT samples yielded the fewest high- and medium-quality bins compared to the other, water-derived datasets. Taxonomic classification: phylogenetic trees could be generated simply by adding a single script to the CheckM data processing steps used in this pipeline for quality assessment of the bins. PhyloPhlAn, in contrast, needs more steps to generate phylogenetic trees. CAT/BAT uses a different approach, as mentioned earlier, and has a few simple steps, which makes it an easier and faster tool compared to PhyloPhlAn. Overall, users can apply these classification tools, or add new ones, and choose the best results from the bins' classification. Quantification of the microbial profile: To characterize the microbial community profile, the abundance of taxa was investigated in addition to their taxonomic classification. The abundance of certain groups of microorganisms in a system can explain why those groups are more successful in that system over a given period of time, how they interact with the chemical and biological composition of the environment, and so on. In this study, the raw metagenomic reads were mapped to the contigs in the MAGs to evaluate the alignment rate of the contigs in the bins. The overall microbial community composition profile can explain the dynamics of our systems.
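The read-mapping quantification described above can be sketched as a coverage-normalized relative abundance per bin. The mapped-read counts and contig lengths below are toy values; real workflows would also account for read length and unmapped reads.

```python
def relative_abundance(mapped_reads):
    """Relative abundance of each bin: reads mapped to the bin's
    contigs divided by the bin's total contig length (a per-base
    depth proxy), normalized so the values sum to one."""
    depth = {name: reads / length
             for name, (reads, length) in mapped_reads.items()}
    total = sum(depth.values())
    return {name: d / total for name, d in depth.items()}

# (reads mapped, summed contig length in bp) per bin -- toy numbers
counts = {"bin.1": (50000, 2_500_000), "bin.2": (30000, 1_000_000)}
abund = relative_abundance(counts)
```

Note that without the length normalization, the larger bin would appear more abundant simply because it recruits more reads.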
For example, Figure 2.5 shows the microbial community profile of the Algae reactors. The results show a strong presence of predatory microorganisms in the system, which play an important role in the dynamics and community composition of the reactors.

[Figure 2.5 legend: Nitrogen Metabolism; Predators; Macromolecule Degraders/Algae Associated; Others]

For the TNT project, a complete degradation pathway was constructed using these annotation tools. Certain genes were not detected in the SEED database in this study; therefore, an alternative approach, alignment of those genes against the bins with the NCBI BLAST tool, was used for the investigation.

Figure 2.6: TNT degradation pathway constructed by using CuBi-MeAn

In addition, online genome databases such as GenBank were used as benchmarks to compare and investigate our constructed MAGs. For example, in the Algae project, certain bacterial guilds in the reactor were suggested to have some sort of defense mechanism against predation. We hypothesized that this could be why those groups of bacteria are more abundant in our reactors. Previous studies suggested that these defense mechanisms involve certain genes and specific DNA structures known as clustered regularly interspaced short palindromic repeats (CRISPR) elements, whose presence in the MAGs could indicate a defense mechanism. However, these defense elements were not detected in the SEED database using the KBase and RAST platforms. Instead, the defense-mechanism genes were obtained from the GenBank database, and the BLAST tool was used to test for their presence in our bins; BLAST aligned these genes to our bins with high scores. For CRISPR detection, we also used another tool designed specifically for this purpose (Edgar 2007). The gene annotation of the bins helps to investigate and understand the dynamics of the entire systems in the TNT, EBPR, and Algae projects. These results are discussed extensively in the project-specific sections that follow.
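Purpose-built tools such as PILER-CR (Edgar 2007) should be used for CRISPR detection in practice, but the signal they look for, a short direct repeat recurring with spacer-sized gaps between copies, can be sketched in a few lines. This is a toy illustration of the pattern, not the published algorithm, and all parameter defaults are illustrative:

```python
from collections import defaultdict

def find_crispr_like_arrays(seq, repeat_len=25, min_copies=3,
                            min_spacer=20, max_spacer=50):
    """Toy scan for CRISPR-like arrays: an exact direct repeat of
    `repeat_len` bp occurring `min_copies` or more times, with each
    pair of copies separated by a spacer-sized gap. Real detectors
    handle inexact repeats, variable repeat lengths, and scoring.
    """
    # Index every k-mer position in the sequence.
    positions = defaultdict(list)
    for i in range(len(seq) - repeat_len + 1):
        positions[seq[i:i + repeat_len]].append(i)

    arrays = []
    for kmer, hits in positions.items():
        if len(hits) < min_copies:
            continue
        # Keep runs of hits whose consecutive gaps look like spacers.
        run = [hits[0]]
        for p in hits[1:]:
            gap = p - (run[-1] + repeat_len)  # bp between repeat copies
            if min_spacer <= gap <= max_spacer:
                run.append(p)
            else:
                if len(run) >= min_copies:
                    arrays.append((kmer, run))
                run = [p]
        if len(run) >= min_copies:
            arrays.append((kmer, run))
    return arrays
```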

Conclusions
In this study, a customized pipeline was developed to process and analyze metagenomic libraries. The CuBi-MeAn pipeline was used to investigate microbial community profiles and functional annotations. This study demonstrated how a genome-centric approach, using this pipeline, can explain the functions of environmental systems and answer questions about the dynamics underlying those systems. The CuBi-MeAn pipeline clusters the metagenomic reads to approximate the genomes of the microorganisms present in a system for downstream analysis. In addition to clustering the metagenomic reads, this study showed how the tools selected for the CuBi-MeAn pipeline improved the quality of the raw data, which in turn enhanced metagenomic assembly, bin generation, and downstream analysis. By designing and developing a flexible and customized pipeline, this study showed how to process large metagenomic data sets with limited resources.
This proof of concept can be applied to similar metagenomic datasets built from short-read sequencing of environmental samples. Users of CuBi-MeAn can update, customize, replace, or skip specific software or steps. Since the pipeline is composed of several metagenomics tools, the steps can be performed sequentially on different platforms and machines, depending on the resources available.
Despite the successful demonstration of CuBi-MeAn, there is still room for further development of this pipeline. The authors therefore plan to test the performance of CuBi-MeAn on a wider variety of datasets, such as human microbiome data, and with different sequencing technologies, such as the long reads generated by PacBio sequencers.