## Abstract

High-throughput sequencing (HTS) of metagenomes is proving essential in understanding the environment and diseases. State-of-the-art methods for discovering the species and their abundances in an HTS metagenomic sample are based on genome-specific markers, which can lead to skewed results, especially at species level. We present MetaFlow, the first method based on coverage analysis across entire genomes that also scales to HTS samples. We formulated this problem as an NP-hard matching problem in a bipartite graph, which we solved in practice by min-cost flows. On synthetic data sets of varying complexity and similarity, MetaFlow is more precise and sensitive than popular tools such as MetaPhlAn, mOTU, GSMer and BLAST, and its abundance estimations at species level are two to four times better in terms of *ℓ*_{1}-norm. On a real human stool data set, MetaFlow identifies *B.uniformis* as most predominant, in line with previous human gut studies, whereas marker-based methods report it as rare. MetaFlow is freely available at http://cs.helsinki.fi/gsa/metaflow

## 1 Introduction

Microbes—microscopic organisms that cannot be seen by the eye, which include bacteria, archaea, some fungi, protists and viruses—are found almost everywhere, in the human body, air, water and soil, and play a vital role in maintaining the balance of ecosystems. For example, diazotrophs are solely responsible for the nitrogen fixation process on earth [17], and half of the oxygen on earth is produced by marine microbes [19].

Metagenomic sequencing allows studying microbes sampled directly from the environment without prior culture. A fundamental analysis of a metagenomic sample is finding what species it contains (the sample richness) and what are their relative abundances. This is a challenging task due to the similarity between the species' genomes, sequencing errors, and the incompleteness of reference microbial databases.

One of the first methods, applicable only at high taxonomic levels [15], focused on 16S ribosomal RNA marker genes. Its limitations have been mitigated by high-throughput sequencing of the entire genomes in a metagenomic sample, and a number of methods dealing with this data have been proposed. The simplest of them is to align the reads to a reference database, using e.g. BLAST [1], and to choose the best alignment for every read. This approach cannot break the tie between multiple equally-good alignments of a read, and it cannot detect false positive alignments. One way of avoiding alignment ties, but not false positive alignments, is to estimate the sample structure only at a high taxonomic level. For example, MEGAN [7] assigns each read with ties to the lowest common ancestor of its alignments in the reference taxonomic tree. Another method for breaking ties was proposed in PhymmBL [2], based on Interpolated Markov Models (IMMs). For each read, it combines the BLAST alignment score to a particular genome, with another score based on the probability of the read being generated by an IMM of that genome. This results in a single maximum-scoring alignment for every read, which improves over basic BLAST alignments. This method still cannot eliminate false alignments, and does not scale to high-throughput sequencing samples. For example, it takes more than one hour to classify 5,730 reads of length 100bp [2].

The current state-of-the-art approach is to construct a small curated reference database of genomic markers. These markers can be clade-specific marker genes (MetaPhlAn [20]), universal marker genes (mOTU [22]), or genome-specific markers not restricted to coding regions (GSMer [23]). The reads are aligned only to these marker regions, which makes the process fast, and the estimations are more accurate at the species level because of the marker curation process. However, due to the short length of the marker regions, the abundance estimations can be extremely skewed in some cases. In addition, the markers uniquely identify the microbial genomes only among the currently known genomes, meaning that the entire database of markers must be re-computed with the addition of each new reference microbial genome.

In this paper, we propose a new method which addresses the problems of equally-good alignments and of false alignments, with accurate predictions at the species level. Our method takes into account the entire read alignment landscape inside all genomes in the reference database. The main idea is to exploit the assumption that if enough reads are sampled, then these reads will cover most of the genomic regions. In addition, since our method does not rely on specific genomic markers, there is no need to curate a reference database. Our problem formulation is based on a matching problem applied to a bipartite graph constructed from read alignments. This problem is NP-hard, but we give a practical strategy for solving it with min-cost network flows. See Fig. 1 for an overview of all methods.

We performed experiments on synthetic and real data sets, and compared MetaFlow with popular tools MetaPhlAn [20], mOTU [22], GSMer [23], and standard BLAST alignments. MetaFlow outperforms the other methods in its ability to correctly identify the species and their abundance levels. On synthetic data, MetaFlow's predictions are more precise and sensitive, and its abundance estimations are two to four times better in terms of *ℓ*_{1}-norm. On a real fecal metagenomic sample from the Human Microbiome Project, MetaFlow reports *B.uniformis* as the most predominant species, in line with previous human gut studies [16]. However, marker-based methods MetaPhlAn and mOTU assign a very low abundance to it.

## 2 Methods

We take as input a set of BLAST hits of the metagenomic reads inside a collection of reference genomes, which we call *known genomes*. The output is the richness of the sample and the relative abundance of each known species. This is obtained by selecting, for every read, exactly one hit in a reference genome, or classifying the read as originating from a species not in the reference database (we call such species *unknown*).

The optimal selection is the one simultaneously achieving the following three objectives: (1) few coverage gaps, (2) uniform coverage, and (3) agreement between BLAST scores and final mappings. Objective (1) allows the detection of outlier genomes that have only a few regions covered. Objective (2) breaks ties between read alignments, and is also based on detecting abnormal read coverage patterns.

We model the above-mentioned input for this problem as a bipartite graph, such that the reads form one part of the bipartition, and the reference genomes form the other part of the bipartition. Objectives (1)-(3) are not modeled independently, but combined in a single objective function, as discussed in the next section. Our modeling is inspired by the interesting Coverage-sensitive many-to-one min-cost bipartite matching problem, introduced in [11] for mapping reads to complex regions of a reference genome. We extended this model to our metagenomic context, since reads can have mappings to more than one reference genome, or can originate from unknown species.

### 2.1 Problem formulation and computational complexity

Assume that the reads have BLAST hits in the collection of reference genomes 𝒢 = {*G*^{1},…, *G*^{m}}. We partition every genome *G*^{i} into substrings of a fixed length *L*, which we call *chunks*. Denote by *s*_{i} the number of chunks that each genome *G*^{i} is partitioned into. We construct a bipartite graph *G* = (*A* ∪ *B, E*), such that the vertices of *A* correspond to reads, and the vertices of *B* correspond to the chunks of all genomes *G*^{1},…, *G*^{m}. More specifically, for every chunk *j* of genome *G*^{i}, we introduce a vertex *y*_{i,j}, and we add an edge between a read *x* ∈ *A* and chunk *y*_{i,j} if there is a BLAST mapping of read *x* starting inside chunk *j* of genome *G*^{i}. This edge is assigned the cost of the mapping (BLAST scores can be trivially transformed to costs), which we denote here by *c*(*x*, *y*_{i,j}). In order to model the fact that reads can originate from unknown species (whose genome is not present in the collection 𝒢), we introduce an 'unknown' vertex *z* in *B*, with edges from every read *x* ∈ *A* to *z*, and with a fixed cost *c*(*x, z*) = *γ*, where *γ* is appropriately initialized.

In the *coverage-sensitive metagenomic mapping* problem stated below, the tasks are: first, for each *G*^{i}, to find the number of reads sequenced (i.e., originating) from it, which we denote by *r*_{i} (*r*_{i} is 0 if *G*^{i} is an outlier); second, to select an optimal subset *M* ⊆ *E* such that for every read *x* ∈ *A* there is exactly one edge in *M* covering it (i.e., a mapping of *x*). These must minimize the sum of the following two costs:

– the sum, over all chunks of every genome *G*^{i}, of the absolute difference between its due read coverage, namely *r*_{i}/*s*_{i}, and the number of read mappings it receives from *M* (this corresponds to Objective (2));
– the sum of all edge costs in *M* (this corresponds to Objective (3)).
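As an illustration, the graph construction can be sketched in code. This is a minimal sketch, not MetaFlow's implementation: the `hits` tuple format, the chunk length value, and the function names are assumptions chosen for the example.

```python
from collections import defaultdict

L = 1000  # chunk length; an illustrative value, not the paper's parameter


def build_bipartite_graph(hits, genome_lengths, gamma=10.0):
    """Build the read/chunk bipartite graph from BLAST hits.

    hits: list of (read_id, genome_id, start_position, cost) tuples,
          where cost is the BLAST score already converted to a cost.
    genome_lengths: dict genome_id -> genome length in bp.
    Returns (edges, s): edges maps each read to its candidate chunk
    vertices (genome_id, chunk_index) with edge costs, and s maps
    each genome to its number of chunks s_i.
    """
    # s_i = number of chunks of genome G^i (ceiling division)
    s = {g: -(-length // L) for g, length in genome_lengths.items()}
    edges = defaultdict(dict)
    for read, genome, start, cost in hits:
        chunk = start // L  # the hit starts inside this chunk
        key = (genome, chunk)
        # keep the cheapest edge if a read hits the same chunk twice
        if key not in edges[read] or cost < edges[read][key]:
            edges[read][key] = cost
        # every read is also connected to the 'unknown' vertex z
        edges[read]["z"] = gamma
    return edges, s
```

For example, a read with hits at position 120 of `G1` and position 3300 of `G2` gets edges to chunk 0 of `G1`, chunk 3 of `G2`, and the unknown vertex `z`.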

Our formal problem definition is given below. We use the following notation: *n* is the number of reads, *m* is the number of different genomes where the reads have BLAST hits, *s*_{i} is the number of chunks of each genome *G*^{i}, *i* ∈ {1,…, *m*}. In a graph *G* = (*V, E*), *d*_{M}(*v*) denotes the number of edges of a set *M* ⊆ *E* incident to a vertex *v*.

#### Coverage sensitive metagenomic mapping:

**Input:**

– a bipartite graph *G* = (*A* ∪ *B, E*), where *A* is the set of *n* reads, and *B* = {*y*_{i,j} : 1 ≤ *i* ≤ *m*, 1 ≤ *j* ≤ *s*_{i}} ∪ {*z*} is the set of all genome chunks plus *z*, the 'dummy' node,
– a cost function *c*: *E* → ℚ,
– constants *α* ∈ (0,1), *β*, *γ* ∈ ℚ_{+}.

**Tasks:**

– find a vector *R* = (*r*_{1},…, *r*_{m}) containing the number of reads sequenced (i.e., originating) from each genome *G*^{i}, *i* ∈ {1,…, *m*}, and
– find a subset *M* ⊆ *E* such that *d*_{M}(*x*) = 1 holds for every *x* ∈ *A* (i.e., each read is covered by exactly one edge of *M*),

which together minimize:

*α* · *β* · Σ_{i=1}^{m} Σ_{j=1}^{s_i} |*r*_{i}/*s*_{i} − *d*_{M}(*y*_{i,j})| + (1 − *α*) · Σ_{e∈M} *c*(*e*).
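To make the objective concrete, the following sketch evaluates a candidate solution (*R*, *M*). The dictionary representation of *M* (read → (chunk vertex, edge cost)) is an assumption chosen for this example; *α* and *β* play the roles they have in the problem statement.

```python
def mapping_cost(R, M, s, alpha=0.5, beta=1.0):
    """Evaluate the coverage-sensitive objective for a candidate solution.

    R: dict genome -> r_i (number of reads attributed to that genome)
    M: dict read -> (chunk_vertex, edge_cost), where chunk_vertex is
       (genome, chunk_index) or "z" for the unknown vertex.
    s: dict genome -> s_i (number of chunks of the genome).
    """
    # d_M(y_{i,j}): how many reads M assigns to each chunk vertex
    deg = {}
    for vertex, _cost in M.values():
        deg[vertex] = deg.get(vertex, 0) + 1
    # Objective (2): deviation of every chunk from its due coverage r_i/s_i
    coverage_term = sum(
        abs(R[g] / s[g] - deg.get((g, j), 0))
        for g in R for j in range(s[g])
    )
    # Objective (3): total cost of the selected edges
    edge_term = sum(cost for _vertex, cost in M.values())
    return alpha * beta * coverage_term + (1 - alpha) * edge_term
```

A perfectly uniform mapping incurs only the edge-cost term; piling both reads onto the same chunk additionally pays *α* · *β* per unit of coverage deviation.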

We first show that this problem is computationally intractable.

#### Theorem 1. The coverage-sensitive metagenomic mapping problem is NP-hard.

*Proof*. We reduce from the Exact cover with 3-sets (X3C) problem (see e.g., [5]). In this problem, we are given a collection 𝒮 of 3-element subsets *S*_{1},…, *S*_{m} of a set *U* = {1,…, *n*}, and we are required to decide if there is a subset 𝒮′ ⊆ 𝒮 such that every element of *U* belongs to exactly one *S*_{j} ∈ 𝒮′.

Given an instance of Problem X3C, we construct the bipartite graph *G* = (*A* ∪ *B, E*), where *A* = *U* = {1,…, *n*}, and *B* corresponds to *S* in the following sense:

– *s*_{i} = 3, for every *i* ∈ {1,…, *m*};
– for every *S*_{j} = {*i*_{1} < *i*_{2} < *i*_{3}}, we add to *B* the three vertices *y*_{j,1}, *y*_{j,2}, *y*_{j,3} and the edges {*i*_{1}, *y*_{j,1}}, {*i*_{2}, *y*_{j,2}}, {*i*_{3}, *y*_{j,3}}, each with cost 0.

For completeness, we also add vertex *z* to *B*, and edges of some positive cost between it and every vertex of *A*.

We now show that an instance of Problem X3C is a 'yes' instance if and only if the coverage-sensitive metagenomic mapping problem admits on this input a solution set *M* ⊆ *E* of cost 0. Observe that, in any solution *M* of cost 0, the numbers *r*_{i} of reads sequenced from each genome are either 0 or 3, and vertex *z* has no incident edges in *M*.

For the forward implication, let 𝒮′ ⊆ 𝒮 be an exact cover with 3-sets. We choose the vector *R* = (*r*_{1},…, *r*_{m}) as *r*_{j} = 3 if *S*_{j} ∈ 𝒮′, and *r*_{j} = 0 otherwise, and we construct *M* as containing, for every *S*_{j} = {*i*_{1} < *i*_{2} < *i*_{3}} ∈ 𝒮′, the three edges {*i*_{1}, *y*_{j,1}}, {*i*_{2}, *y*_{j,2}}, {*i*_{3}, *y*_{j,3}}. Clearly, these *R* and *M* give a solution of cost 0 to the coverage-sensitive metagenomic mapping problem.

Vice versa, if the coverage-sensitive metagenomic mapping problem admits a solution (*R, M*) of cost 0, that is, one in which the numbers *r*_{i} of reads sequenced from each genome are either 0 or 3, and *z* has no incident edges, then we can construct an exact cover with 3-sets by taking 𝒮′ = {*S*_{j} : *r*_{j} = 3}. ∎

Due to the hardness result from Thm. 1, we opt for the common *iterative refinement strategy*, akin to the strategy behind *k*-means clustering [21], or Viterbi training strategies with Hidden Markov Models [4]. In Sec. 2.3 we present the details of this approach, but the main ideas are:

– If the unknown vector *R* = (*r*_{1},…, *r*_{m}) is fixed to some value *R* = (*a*_{1},…, *a*_{m}), then finding only the optimal mapping *M* can be solved in polynomial time with min-cost flows. We show this in Sec. 2.2.
– For finding the optimal *R* (and *M*), we start with a vector *R*^{0} = (*a*_{1},…, *a*_{m}), where *a*_{i} equals the number of reads with BLAST hits to *G*^{i}. We repeat the following process, until the vector *R* converges to a stable value. For each iteration *j*:
  – compute the optimal read mapping *M*^{j} by min-cost flows (as we will show in Sec. 2.2), using *R*^{j} as input;
  – update *R*^{j} to *R*^{j+1}, a vector whose *i*-th component equals *mean*_{i} · *s*_{i}; here *mean*_{i} is the 20%-trimmed mean read coverage of the chunks of genome *G*^{i}, obtained from *M*^{j}.
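The refinement loop can be sketched as follows. Here `solve_mapping` is a placeholder for the min-cost flow solver of Sec. 2.2, and the dictionary-based representation is an assumption of this example.

```python
def trimmed_mean(values, trim=0.2):
    """20%-trimmed mean: drop the lowest and highest 20% of the values."""
    v = sorted(values)
    k = int(len(v) * trim)
    kept = v[k:len(v) - k] or v  # keep everything for tiny inputs
    return sum(kept) / len(kept)


def refine(R0, s, solve_mapping, max_iter=50):
    """Iterative refinement of the read-count vector R.

    R0: dict genome -> a_i (initial read counts from BLAST hits).
    s:  dict genome -> s_i (number of chunks).
    solve_mapping(R) must return, for each genome, the list of
    per-chunk read counts of the optimal mapping M^j; in MetaFlow
    this comes from the min-cost flow, here it is a black box.
    """
    R = dict(R0)
    for _ in range(max_iter):
        chunk_counts = solve_mapping(R)  # mapping M^j for the current R^j
        # R^{j+1}_i = (20%-trimmed mean chunk coverage of G^i) * s_i
        R_next = {g: trimmed_mean(chunk_counts[g]) * s[g] for g in R}
        if R_next == R:  # converged to a stable value
            return R_next
        R = R_next
    return R
```

The trimmed mean makes the update robust to a few chunks with abnormally high or low coverage, which is the point of Objective (2).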

### 2.2 The reduction to a min-cost flow problem

We refer the reader to Sec. A.1 of the Supplementary Material for some standard notions and definitions of flow networks and min-cost flows. Given an input bipartite graph *G* = (*A* ∪ *B, E*) for the coverage-sensitive metagenomic mapping problem, with a fixed vector *R*^{j} = (*a*_{1},…, *a*_{m}) containing the number of reads sequenced from the *m* genomes (at an iteration *j*), we construct a flow network *N* = (*G**, *ℓ*, *u*, *c*, *q*) (see Fig. 2(b)). Here, *ℓ*, *u*: *E*(*G**) → ℕ are the *demand* and *capacity* of every arc, respectively; *c*: *E*(*G**) → ℚ is the function giving the cost per unit of flow for every arc of *G**; *q* is the required value of the flow, that is, the value of the flow that must exit the unique source *s** of *G**, which equals the value of the flow that must enter the unique sink *t** of *G**.

The construction of *G** starts with *G*, orients all its edges from *A* towards *B*, and sets their demand to 0, their capacity to 1, and their cost as in the input metagenomic mapping instance, multiplied by (1 − *α*). We add a global source *s** with arcs toward all vertices in *A*, with demand and capacity 1, and 0 cost, and a global sink *t**. Since we pay the fixed penalty *α* · *β* for every read below or above the given coverage *a*_{i}/*s*_{i} of each genome chunk *y*_{i,j}, we add the following arcs for every chunk vertex *y*_{i,j}:

– an arc (*y*_{i,j}, *t**) with demand and capacity ⌊*a*_{i}/*s*_{i}⌋ and cost 0, accounting for the due coverage of the chunk;
– a parallel arc (*y*_{i,j}, *t**) with demand 0, infinite capacity, and cost *α* · *β*, through which any reads in excess of the due coverage must flow;
– an arc (*s**, *y*_{i,j}) with demand 0, capacity ⌊*a*_{i}/*s*_{i}⌋, and cost *α* · *β*, supplying compensatory flow when the chunk receives fewer reads than its due coverage.

We also add arcs from every vertex *x ∈ A* to the unknown vertex *z ∈ B*, with demand *ℓ(x, z)* = 0, *u(x,z)* = 1, and cost *c(x,z)* = *γ*. From *z* we add an arc to *t** with *ℓ*(*z, t**) = 0, *u*(*z, t**) = ∞, *c*(*z, t*^{*}) = 0.

Finally, we set the required flow value *q* = *n* + Σ_{i=1}^{m} *s*_{i} · ⌊*a*_{i}/*s*_{i}⌋, and also add the arc (*s**, *t**) with 0 demand and cost, and infinite capacity. The optimal matching for the coverage-sensitive metagenomic mapping problem consists of the edges of *G* whose corresponding arcs in *G** have non-zero flow value in the integer-valued min-cost flow over *N*. See Fig. 2 for an example.

The correctness of this reduction can be seen as follows. As long as a genome chunk *y*_{i,j} has exactly ⌊*a*_{i}/*s*_{i}⌋ reads mapped to it, then no cost is incurred; this is represented by the arc (*y*_{i,j}, *t**) with demand and capacity ⌊*a*_{i}/*s*_{i}⌋ and 0 cost. If *y*_{i,j} receives more reads than *a*_{i}/*s*_{i}, then these additional reads will flow along the parallel arc of cost *α* · *β*. If *y*_{i,j} receives fewer reads than *a*_{i}/*s*_{i}, then some compensatory flow comes from *s** through the arc (*s**, *y*_{i,j}), where these fictitious reads again incur cost *α* · *β*. We set the capacity of (*s**, *y*_{i,j}) to ⌊*a*_{i}/*s*_{i}⌋, since at most ⌊*a*_{i}/*s*_{i}⌋ fictitious reads are required to account for the lack of reads for genome chunk *y*_{i,j}.

Requiring flow value *q* ensures that there is enough compensatory flow to satisfy the demands of the arcs of type (*y*_{i,j}, *t**). Having added the arc (*s**, *t**) with 0 cost ensures that there is a way for the exogenous flow which is not needed to satisfy any demand constraint to go to *t**, without incurring an additional cost.
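Putting the construction together, the arcs of *N* can be generated as plain tuples. This is an illustrative sketch only (no flow solver is included); the tuple representation and the function name are assumptions of this example.

```python
INF = float("inf")


def build_network(reads, edges, a, s, alpha, beta, gamma):
    """Emit the arcs of the flow network of Sec. 2.2 as
    (tail, head, demand, capacity, cost) tuples, plus the flow value q.

    reads: list of read ids; edges: dict read -> {chunk_vertex: cost};
    a: dict genome -> a_i (fixed read counts); s: dict genome -> s_i.
    """
    arcs = []
    for x in reads:
        arcs.append(("s*", x, 1, 1, 0.0))  # each read must be routed
        for v, cost in edges[x].items():
            if v == "z":
                arcs.append((x, "z", 0, 1, gamma))  # unknown-species edge
            else:
                arcs.append((x, v, 0, 1, (1 - alpha) * cost))
    arcs.append(("z", "t*", 0, INF, 0.0))
    for g in a:
        due = a[g] // s[g]  # due coverage per chunk, floor(a_i / s_i)
        for j in range(s[g]):
            y = (g, j)
            arcs.append((y, "t*", due, due, 0.0))        # free due coverage
            arcs.append((y, "t*", 0, INF, alpha * beta))  # excess reads
            arcs.append(("s*", y, 0, due, alpha * beta))  # compensatory flow
    arcs.append(("s*", "t*", 0, INF, 0.0))  # absorbs the surplus flow
    # required flow value q = n + sum_i s_i * floor(a_i / s_i)
    q = len(reads) + sum(s[g] * (a[g] // s[g]) for g in a)
    return arcs, q
```

Feeding these tuples to any integral min-cost flow solver and keeping the read-to-chunk arcs with non-zero flow yields the matching *M*.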

### 2.3 Overview of the implementation

Our practical implementation is divided into five stages. These depend on some parameters, whose complete list we give in Sec. B.3 of the Supplementary Material.

**Stage 1: Removing outlier species** A genome *G*^{i} ∈ 𝒢 is considered an outlier if at least one of the following conditions holds.

– The average read coverage of *G*^{i} (i.e., the number of reads with BLAST hits to *G*^{i}, multiplied by the average read length and divided by the length of *G*^{i}) is lower than a given parameter.
– The average number of read mappings per chunk (i.e., the number of reads with BLAST hits to *G*^{i}, divided by *s*_{i}) is lower than a given parameter.
– The percentage of chunks without any BLAST hit is more than a given parameter.

In this stage, we remove each outlier genome *G*^{i} and the reads that have BLAST hits only to *G*^{i}.
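The three conditions above can be expressed as a simple predicate. The threshold values below are illustrative assumptions, not MetaFlow's default parameters (those are listed in Sec. B.3 of the Supplementary Material).

```python
def is_outlier(n_hits, avg_read_len, genome_len, s_i, empty_chunks,
               min_coverage=0.3, min_per_chunk=2.0, max_empty=0.4):
    """Stage 1 outlier test for a genome G^i.

    n_hits: number of reads with BLAST hits to G^i;
    empty_chunks: fraction of chunks of G^i with no BLAST hit;
    the three keyword thresholds are illustrative parameter values.
    """
    coverage = n_hits * avg_read_len / genome_len  # average read coverage
    per_chunk = n_hits / s_i                       # average mappings per chunk
    return (coverage < min_coverage
            or per_chunk < min_per_chunk
            or empty_chunks > max_empty)
```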

**Stage 2: Breaking ties inside each genome** A read can have BLAST hits to different chunks of the same genome. In this stage, for each read remaining after Stage 1, we select only one BLAST hit in each genome, as follows. For each remaining genome *G*^{i}, we create a sub-problem instance *G* = (*A* ∪ *B, E*), where *A* consists only of the reads that have BLAST hits to *G*^{i}, and *B* consists only of the chunks of *G*^{i} (excluding the unknown vertex *z*, which will be dealt with in Stage 4). We fix the one-element vector *R* as *R* = (*r*_{1}) = (|*A*|), and solve this sub-problem using the min-cost flow reduction described in Sec. 2.2. After this stage, every read has at most one hit to each genome, but it can still have hits to multiple genomes.

**Stage 3: Breaking ties across all genomes** A read can be mapped to different species, due to the similarity between their genomes. In order to select only one read mapping across all genomes, we solve the coverage-sensitive metagenomic mapping problem on a graph *G* = (*A* ∪ *B, E*), as follows. The set *A* consists of all remaining reads, and the set *B* of all chunks of the remaining genomes. The set of edges *E* is the one obtained after the filtration done in Stage 2. Since this problem is NP-hard, we employ the iterative refinement strategy, coupled with min-cost flows, mentioned at the end of Sec. 2.1. After each iteration *j*, we use the resulting mapping *M ^{j}* to remove outlier genomes, as in Stage 1. After this stage, each read is mapped to exactly one genome, and to only one of its chunks.

**Stage 4: Identifying reads from unknown genomes** In this stage we identify reads originating from species whose reference genomes are not present in the reference database. We run the same min-cost flow reduction as in Stage 2, to which we add the unknown vertex *z*. If a read is mapped to *z*, then it will be marked as coming from an unknown genome and removed from the graph. Finally, we again remove outlier genomes, as in Stage 1.

**Stage 5: Estimating richness and relative abundances** For every genome *G*^{i}, we compute its average read coverage *read_coverage*(*i*) and its relative abundance *rel_abundance*(*i*) as:

*read_coverage*(*i*) = *n*_{i} · *R* / *length*(*i*),
*rel_abundance*(*i*) = *read_coverage*(*i*) / Σ_{j} *read_coverage*(*j*),

where *n*_{i} is the number of reads mapping to *G*^{i} after Stages 1-4, *R* is the average read length, and *length*(*i*) is the length of *G*^{i}.
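The Stage 5 formulas translate directly to code; a minimal sketch, with the function name and dictionary inputs chosen for this example:

```python
def relative_abundances(mapped_reads, genome_lengths, avg_read_len):
    """Stage 5: average read coverage and relative abundance per genome.

    mapped_reads: dict genome -> number of reads mapped after Stages 1-4;
    genome_lengths: dict genome -> genome length in bp;
    avg_read_len: average read length R.
    """
    # read_coverage(i) = n_i * R / length(i)
    coverage = {g: mapped_reads[g] * avg_read_len / genome_lengths[g]
                for g in mapped_reads}
    # rel_abundance(i) = read_coverage(i) / sum_j read_coverage(j)
    total = sum(coverage.values())
    abundance = {g: coverage[g] / total for g in coverage}
    return coverage, abundance
```

Note that the abundances are relative to the known genomes only, since reads assigned to the unknown vertex were removed in Stage 4.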

## 3 Experimental setup

Our experimental setup measures the effect of the following three factors: (1) genotypic homogeneity between the species, (2) complexity of the sample (i.e. the number of species), and (3) the presence of species unknown to the methods. We simulated two types of data sets: a *low complexity* type (LC), consisting of 4M reads sampled from 15 different species, and a *high-complexity* type (HC), consisting of 40M reads sampled from 100 different species. The goal of the simulated LC data sets is to evaluate the methods under different levels of genotypic homogeneity between the species. For example, species with high genotypic homogeneity are difficult to precisely identify and their presence usually leads to incorrect predictions. We explain these levels of similarity in Table 4 of the Supplementary Material. The simulated HC data sets test how a large number of randomly chosen species influences the accuracy of the methods. As opposed to the LC data sets, they do not test the ability of the tools to deal with similar species. This experimental setup is in line with previous studies [20,13].

In both the LC and HC data sets, we had two experimental scenarios: one in which all species in the sample are known to the methods (LC-Known, HC-Known), and one in which a fraction of them are unknown (LC-Unknown, HC-Unknown). LC-Known is the “perfect-information” scenario, which though not realistic, shows the performance of the tools in the best possible conditions. HC-Unknown is the most realistic scenario. In total, these simulated experiments contain 48 data sets.

The abundances of the species in each data set were chosen to be log-normally distributed (with mean=1, standard deviation=1), also in line with previous experiments [20]. This presents a challenge in finding less abundant species, since the ratio between the most abundant species and the least abundant is 100 in most cases, and the top 10% most abundant species represent about 35% of the sample. We selected in total 817 bacterial species from the NCBI microbial genome database, and used MetaSim [18] to create the data sets using the default 454-pyrosequencing error model (with mean read length=250 and standard deviation=1).
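Such an abundance profile can be generated with Python's standard library; a sketch, with the seed and function name chosen for this example:

```python
import random


def simulate_abundances(n_species, mu=1.0, sigma=1.0, seed=42):
    """Draw log-normal relative abundances (mean=1, sd=1 on the
    underlying normal distribution), as in the simulated data sets."""
    rng = random.Random(seed)
    raw = [rng.lognormvariate(mu, sigma) for _ in range(n_species)]
    total = sum(raw)
    return [x / total for x in raw]  # normalize to relative abundances
```

With these parameters, a handful of species typically dominate the sample, which is exactly what makes the low-abundance species hard to detect.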

In order to evaluate the accuracy of the richness estimations, we evaluated the *sensitivity* (also called *recall*) and the *precision*. Sensitivity is defined as *TP/NS*, and precision is defined as *TP/(TP + FP)*, where *TP* is the number of species correctly identified by the tool, *FP* is the number of species not present in the sample but reported by the tool, and *NS* is the true number of species in the sample. In order to obtain a single measure of accuracy, we also computed the harmonic mean of precision and sensitivity, known as *F-measure*, and defined as 2 · *precision · sensitivity/(precision + sensitivity)*. To evaluate the accuracy of the relative abundance predictions, we measured the *ℓ*_{1}-norm of these abundances for each data set, expressed in percentage points. For the data sets with unknown genomes, we excluded the unknown genomes from the abundance vectors and we re-normalized the resulting relative abundances to the known genomes. Note that some methods, such as MetaPhlAn [20] and mOTU [22], give some abundance estimations for the unknown genomes at a higher taxonomic level (e.g. genus). Our method is focused instead on the core problem of analyzing the known species, without attempting to estimate the relative abundances of the unknown genomes.
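The three richness measures are a direct transcription of the definitions above; a straightforward sketch:

```python
def richness_metrics(true_species, predicted_species):
    """Sensitivity (recall), precision and F-measure of a richness
    prediction, given the true and the predicted sets of species."""
    true_set, pred_set = set(true_species), set(predicted_species)
    tp = len(true_set & pred_set)   # correctly identified species
    fp = len(pred_set - true_set)   # reported but not in the sample
    sensitivity = tp / len(true_set)            # TP / NS
    precision = tp / (tp + fp) if tp + fp else 0.0
    f_measure = (2 * precision * sensitivity / (precision + sensitivity)
                 if precision + sensitivity else 0.0)
    return sensitivity, precision, f_measure
```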

We compared the performance of MetaFlow against BLAST [1], MetaPhlAn [20], mOTU [22] and GSMer [23]. In the BLAST analysis, we always selected the best alignment; in case of multiple equally-good alignments, we randomly selected one of them; if the read coverage of a species was below 0.3X, then we considered it an outlier. GSMer does not provide relative abundances, so we compared only the accuracy of the richness estimations. For mOTU, some species among our known genomes were not covered in its database, so in our evaluation it received full marks for these species. On the other hand, some of the species chosen as unknown in our experiments already existed in mOTU's database. For these species, we removed mOTU's correct prediction.

We also ran our tool using a real data set. We merged 6 G_DNA_Stool samples of a female from the Human Microbiome Project (5 samples were generated using Illumina, and one sample using LS454). Their accession numbers are in the Supplementary Material. The read length of all reads was normalized to 100bp. The total number of reads from all samples was 287,565,377, out of which 98,223,162 BLAST mapped to one or more species. Only alignments with identity ≥ 97% were selected as an input for MetaFlow. In addition to the full reference genomes in NCBI's microbial database, we also used two other references: a supercontig of *B.uniformis* (accession number NZ_JH724260.1), because *B.uniformis* was previously reported as most abundant in fecal samples [16]; and the longest scaffold of *B.plebeius* (accession number NZ_DS990130.1), because MetaPhlAn and mOTU report *B.plebeius* as most abundant in this sample.

See Table 2 for the running times of the methods tested in this paper.

## 4 Discussion

### 4.1 Synthetic data sets

We summarize the experimental results on simulated data in Table 1 and in Fig. 3. The complete results on each LC data set are in Tables 5 and 6 of the Supplementary Material. Since BLAST reports alignments in all the reference genomes, its sensitivity is always the maximum achievable. However, this comes at the cost of low precision, since there is no proper strategy for breaking ties among alignments, or for properly removing outlier genomes. MetaPhlAn and mOTU have better precision and F-measure than BLAST, confirming that marker genes are a good way of distinguishing between similar species. However, the sensitivity of the marker-based methods suggests that such an approach is not always accurate in identifying all species present in the sample. For example, MetaPhlAn's sensitivity is not maximum in the LC12-, LC13- and LC14-Known data sets, even though these are perfect-information scenarios.

These results also confirm our hypothesis that taking into account the coverage across the entire genome improves the abundance estimation. For example, even though BLAST has a lower F-measure, it has a better *ℓ*_{1}-norm than marker-based methods. This is due to the fact that marker regions are much shorter compared to the genome length, and thus slight variations in coverage in these regions can easily skew the abundance estimation. MetaFlow achieves the maximum sensitivity of BLAST, but it substantially improves the precision, and it also obtains the best F-measure among all the tested methods. Comparing the LC and HC scenarios, we can also observe that MetaFlow's richness estimations are robust with the increase in sample size. Moreover, since MetaFlow is also based on whole-genome read alignments, it gives a much better abundance estimation than marker-based methods, with an improvement of 2-4× in average *ℓ*_{1}-norm. Our problem formulation also filters out outlier species and false alignments, thus improving the abundance estimations over BLAST.

Finally, the results are always better on the HC data sets than on the LC data sets for all tools, because a small variation is more severe in a sample with 15 species than in a sample with 100 species. Recall also that the LC data sets were constructed to have similar species, whereas in the HC data sets the species were randomly chosen. The data sets with high genotypic homogeneity (LC13-LC15) show that such a scenario remains a difficult one: even though MetaFlow improves both the richness and abundance estimation of the competing methods, its precision drops to an average of 0.85 and its *ℓ*_{1}-norm increases to an average of 40.

### 4.2 Real data set

On the fecal metagenomic sample (see Fig. 5 and Table 7 in the Supplementary Material), the most abundant species reported by MetaFlow is *B.uniformis* (23.6% relative abundance), which was also reported as the most abundant species in human feces [16]. This high abundance is also supported by the fact that 15,418,699 reads are mapped by BLAST only to *B.uniformis*. Out of these, 10,721,492 are finally assigned by MetaFlow to *B.uniformis*, because of the uneven read coverage. This corresponds to an average read coverage of 220. To put this number into perspective, the 10th most abundant species according to MetaFlow, *A.shahii*, has relative abundance 2.3% and average read coverage 21. MetaPhlAn and mOTU assign to *B.uniformis* abundances 1.7% and 6.4%, respectively.

The second most abundant species reported by MetaFlow is *B.vulgatus*, another common species in human feces [16]. MetaFlow's predicted abundance is 22.3% (average read coverage 208), which is in line with MetaPhlAn's prediction of 17.7% and, to an extent, mOTU's prediction of 11.9%. In Table 7 of the Supplementary Material we give the list of the top 10 predictions of MetaFlow, and their abundances reported by MetaPhlAn and mOTU. Four of the top six species have also been reported by [16] as predominant in human feces, and they constitute 59% of the sample according to MetaFlow (relative to the species known to MetaFlow in this experiment).

The most abundant species in MetaPhlAn's and mOTU's predictions is *B.plebeius*, with 25% and 16% relative abundance, respectively. MetaFlow's reported abundance is 5.2%. Note also that 62 species reported by MetaPhlAn are not available in the database of reference genomes given to MetaFlow (NCBI's database plus *B.uniformis* and *B.plebeius*). Since our predicted abundances are relative to the known genomes only (average read coverages are also output by MetaFlow), the abundance of *B.uniformis* relative to all species in the sample may be lower than 23.6%, but it cannot be significantly lower than the one of *B.vulgatus*, as MetaPhlAn and mOTU predict.

## 5 Conclusion

Analysis of high-throughput sequencing samples of metagenomes is crucial for understanding diseases and the environment. This involves discovering what bacterial species are present in a sample, and what are their relative frequencies. State-of-the-art methods are based only on marker regions of known genomes, which can heavily skew the abundances, especially at the species level.

MetaFlow is the first method based on coverage analysis across entire known genomes which also scales to HTS samples. MetaFlow shows consistent results across a wide range of synthetic data sets, with abundance estimations two to four times better, in terms of *ℓ*_{1}-norm, than popular tools. On a real human stool data set, MetaFlow also proves robust, whereas marker-based methods report highly skewed results. Finally, our whole-genome coverage method removes the need to curate and continuously update a database of genomic markers.

## Acknowledgement

We thank Romeo Rizzi for discussions about the computational complexity of our problem formulation. This work was partially supported by the Academy of Finland under grants 284598 (CoECGR) to A.S. and V.M. and 274977 to A.T.

## Footnotes

A shorter version of this manuscript will appear in the Proceedings of RECOMB 2016, 20th Annual International Conference on Research in Computational Molecular Biology.