The GMD-biplot and its application to microbiome data

Exploratory analysis of human microbiome data is often based on dimension-reduced graphical displays derived from similarities based on non-Euclidean distances, such as UniFrac or Bray-Curtis. However, a display of this type, often referred to as the principal coordinate analysis (PCoA) plot, does not reveal which taxa are related to the observed clustering because the configuration of samples is not based on a coordinate system in which both the samples and variables can be represented. The reason is that the PCoA plot is based on the eigen-decomposition of a similarity matrix and not the singular value decomposition (SVD) of the sample-by-abundance matrix. We propose a novel biplot that is based on an extension of the SVD, called the generalized matrix decomposition (GMD), which involves an arbitrary matrix of similarities and the original matrix of variable measures, such as taxon abundances. As in a traditional biplot, points represent the samples and arrows represent the variables. The proposed GMD-biplot is illustrated by analyzing multiple real and simulated data sets which demonstrate that the GMD-biplot provides improved clustering capability and a more meaningful relationship between the arrows and the points.


Introduction
A biplot simultaneously displays, in two dimensions, rows (samples) of a data matrix as points, and columns (variables) as arrows. Based on a matrix decomposition of the data matrix, the biplot is a useful graphical tool for visualizing the structure of large data matrices. It displays a dimension-reduced configuration of samples, as in a PCoA plot, and the variables with respect to the same set of coordinates. If meaningful sample groupings are observed, this allows for visualizing which variables contribute most to the separation. The traditional biplot, as first introduced in [1], displays the first two left and right singular vectors of the singular value decomposition (SVD) of the data matrix as points and arrows, respectively. This biplot, which we hereafter refer to as the SVD-biplot, uses the SVD to find the optimal least-square representation of the data matrix in a low-dimensional space. The SVD-biplot can show Euclidean distances between samples and also display approximate variances and correlations of the variables. It also has the appealing property that the singular values obtained from the SVD are non-increasing, indicating that the decomposition of the total variance of the data matrix into each dimension is non-increasing.
In many scenarios, the Euclidean distance may not be optimal for characterizing dissimilarities between samples. An important example arises in the analysis of microbiome data, in which marker gene sequences (e.g., 16S rRNA) are often grouped into taxonomic categories using bioinformatics pipelines such as QIIME [2] or Mothur [3]. These taxon counts can be summarized into a data matrix with rows and columns representing samples and taxon abundances, respectively. A variety of non-Euclidean distance measures, including nonlinear measures, are then used to quantify the dissimilarity between samples. One common measure of dissimilarity is the UniFrac distance (weighted or unweighted), which is a function of the phylogenetic dissimilarity of a pair of samples [4; 5].
Other non-phylogenetic, non-Euclidean dissimilarities include Jaccard or Bray-Curtis distances (see, e.g., [6] and the references therein). Plotting the samples in the space of the first few principal components (PCs) of the similarity matrix obtained from such non-Euclidean distance matrices, often referred to as principal coordinates analysis (PCoA), may reveal an informative separation between samples. However, the configuration of samples yielded by PCoA only keeps pairwise distances between samples and lacks a coordinate system that relates to the taxa which constitute each sample. Hence, it does not shed any light on which taxa may play a role in this separation.
One approach for addressing this problem is to simply plot an arrow for each taxon based on its correlation with the first two PCs of the non-Euclidean similarity matrix [7]. However, in such a "joint plot" [8], the direction and length of an arrow do not represent the taxon's true contribution to the dissimilarity between samples. In addition, due to the lack of a coordinate system, one cannot add sample points for future observations to this "joint plot".
Three main approaches have been recently proposed to extend the SVD-biplot to more general distances defined on the samples. The first, implemented in the R package "ade4" [9], provides a biplot that can handle weighted Euclidean distances but does not apply to non-Euclidean distances. The second approach, proposed by [10], aims to approximate the non-Euclidean distance by a weighted Euclidean distance.
Weights are estimated for variables and the biplot can subsequently be constructed using a weighted least-squares approximation of the matrix. This approach has a straightforward interpretation.
However, the estimated weighted Euclidean distance may not capture all the information from the original non-Euclidean distance. A recent proposal in [11] appears to be the first to address the lack of mathematical duality between the samples' locations (points) and the variables' contribution (arrows) to those locations. This approach seeks an approximate SVD-like decomposition of the data matrix, which directly takes the non-Euclidean distance into consideration. This SVD-like decomposition has the following two advantages. First, the left singular vectors are the eigenvectors of the similarity measure derived from the non-Euclidean distance, which preserve the role of the non-Euclidean distance in classifying the samples. Second, an approximate matrix duality (AMD) between the left and right singular vectors is restored, which simply means that each set of vectors can be immediately obtained from the other. To emphasize this connection, we hereafter refer to this decomposition as the AMD. Unfortunately, the AMD also suffers from two drawbacks. First, the AMD is only an approximate decomposition of the data matrix, and hence may not capture all the variation of the original data. In particular, the configuration of samples displayed in an AMD-biplot is independent of the data matrix, since the left singular vectors of the AMD only depend on the non-Euclidean distances. Ignoring the data matrix for classifying samples seems non-intuitive since the data matrix is typically assumed to contain some information on the sample similarities.
Second, the AMD may produce "singular values" that are not ordered from largest to smallest. While these seem like minor technical issues, the second drawback can have important practical implications: which of the left and right singular vectors should be displayed in the resulting biplot? The authors of [11] suggest constructing the AMD-biplot based on the two left and right singular vectors that correspond to the two largest singular values. This AMD-biplot ensures that the arrows for variables are as meaningful as possible, but may fail to reveal meaningful sample clusters if the information on sample clusters is only associated with the first several left singular vectors. An alternative approach may be to simply display the first and second left and right singular vectors of the AMD (as done for the SVD). Unfortunately, this strategy does not solve the problem either: although we may observe meaningful sample clusters, the arrows may not be meaningful due to the small singular values.
There is thus a lack of clarity regarding which singular vectors should be used to construct the AMD-biplot.
The drawbacks of the AMD motivate our proposal which is based on the generalized matrix decomposition (GMD) [12]. The GMD is a direct generalization of the SVD that accounts for structural dependencies among the samples and/or variables. This approach has several advantages.
First, as with the AMD, it directly handles any non-Euclidean distance matrix. Specifically, the full information from that distance matrix is used. Second, unlike the AMD, which provides an approximate decomposition of the data matrix, the GMD provides an exact decomposition of the original data matrix without losing any information. Third, the GMD restores the matrix duality in a mathematically rigorous manner, unlike the approximate matrix duality obtained with the AMD; it naturally extends the duality inherent in the SVD and allows one to plot both the configuration of samples and the contribution of individual variables with respect to a new coordinate system. Fourth, the GMD gives non-increasing GMD values and so the resulting GMD-biplot can be directly constructed based on the first two left and right GMD vectors. Lastly, unlike the AMD-biplot whose sample clusters only depend on the distance, the GMD-biplot uses both the non-Euclidean distance and the data matrix for classifying samples, which more directly connects the contribution of the individual variables to the configuration of samples. Additionally, besides accounting for the non-Euclidean distances between samples, the GMD can also account for auxiliary information on (dis)similarities between the variables.
In the following, we first summarize the GMD-biplot framework and then compare the GMD, AMD and SVD biplots in three numerical studies. We then discuss advantages of the proposed GMD-biplot and further extensions.

Materials and Methods
We denote the data matrix by $X \in \mathbb{R}^{n \times p}$, where $n$ is the number of samples and $p$ is the number of variables (taxa). We assume that the columns of $X$ are centered to have mean 0 and that $\mathrm{rank}(X) = K \le \min(n, p)$. For any matrix $M$, we denote its $i$-th row (sample) and its $(i, j)$ entry by $m_i$ and $m_{ij}$, respectively. We denote the transpose of $M$ by $M^T$.

Biplot, distance measure and the AMD
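A biplot is a graphical method to simultaneously represent, in two dimensions, both the rows (as points) and columns (as arrows) of the matrix $X$ on the same coordinate axes. Given a decomposition of $X$ as $X = AB^T$, a biplot displays two selected columns of $A$ and $B$. The SVD-biplot is based on the SVD of $X$, i.e. $X = USV^T$, where $U^TU = I_K$, $V^TV = I_K$ and $S = \mathrm{diag}(\sigma_1, \ldots, \sigma_K)$ with $\sigma_1, \ldots, \sigma_K$ being a sequence of non-increasing and positive singular values. Here $I_K$ is the $K \times K$ identity matrix. The SVD-biplot displays the first two columns of $US$ and $V$, which can explain $(\sigma_1^2 + \sigma_2^2)/\mathrm{tr}(XX^T)$ of the total variance of $X$.
The SVD of $X$ is closely related to the eigen-decomposition of the similarity kernel $XX^T$, as we can write $XX^T = US^2U^T$. Thus, the eigen-decomposition of $XX^T$ provides a way to calculate $U$ and $S$. Once $U$ and $S$ are calculated, one can calculate $V$ from the duality between $U$ and $V$; that is, $VS = X^TU$.
The similarity kernel $XX^T$ characterizes the Euclidean distance between samples. To see this, we define the Euclidean squared distance between the $i$-th and $j$-th samples as $d_{ij} = \sum_{k=1}^{p} (x_{ik} - x_{jk})^2$, let $\Delta = (d_{ij}) \in \mathbb{R}^{n \times n}$, and let $J = I_n - \frac{1}{n} 1_n 1_n^T$, where $1_n$ is an $n \times 1$ vector of ones. It can then be seen that $XX^T = -\frac{1}{2} J \Delta J$. For a general, possibly non-Euclidean, squared distance matrix $\Delta$, the corresponding similarity kernel $H = -\frac{1}{2} J \Delta J$ need not equal $XX^T$, so its eigenvectors, denoted $U_H$, are in general not left singular vectors of $X$ and the duality above is lost. The AMD addresses this problem by fixing $U_H$ and then seeking a matrix $V_H$ with orthonormal columns and a diagonal matrix $D_H$ with non-negative elements that minimize the objective function $\|X - U_H D_H V_H^T\|_F^2$, where $\|\cdot\|_F$ denotes the Frobenius norm.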
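The SVD-biplot quantities described above can be sketched in a few lines of NumPy. This is a minimal illustration on random toy data, not the authors' code; the function name `svd_biplot` and the toy matrix are assumptions made for the example. It returns the sample points (first two columns of $US$), the arrows (first two columns of $V$) and the proportion of variance explained, and then checks the duality $VS = X^TU$ numerically.

```python
import numpy as np

def svd_biplot(X, k=2):
    """Return sample points (first k columns of US), arrows (first k columns of V)
    and the proportion of total variance explained by the first k components."""
    Xc = X - X.mean(axis=0)                      # center columns, as assumed in the text
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    points = U[:, :k] * s[:k]                    # one row per sample
    arrows = Vt[:k].T                            # one row per variable
    explained = np.sum(s[:k] ** 2) / np.trace(Xc @ Xc.T)
    return points, arrows, explained

rng = np.random.default_rng(0)                   # toy data, purely illustrative
X = rng.normal(size=(10, 4))
points, arrows, explained = svd_biplot(X)

# Duality check: V S = X^T U, so V follows immediately from U and S.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
duality_holds = np.allclose(Vt.T * s, Xc.T @ U)
```

Because $XV = US$, the sample points are also the projections of the centered rows of $X$ onto the arrow directions, which is what places points and arrows in one coordinate system.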

GMD and the GMD-biplot
The concept of the generalized matrix decomposition (GMD) was introduced by Escoufier [13] and further developed in [12]. It is a generalization of the SVD with additional structural dependencies taken into consideration. We briefly review the key ideas behind the GMD. Let $H \in \mathbb{R}^{n \times n}$ and $R \in \mathbb{R}^{p \times p}$ be two positive semi-definite matrices, which, respectively, characterize the similarities between samples and between variables. The $H, R$-norm of $X$ is defined as $\|X\|_{H,R} = \sqrt{\mathrm{tr}(XRX^TH)}$.
For any $q \le K$, the GMD solution $(\hat{U}, \hat{V}, \hat{S})$ finds the best rank-$q$ approximation to $X$ with respect to the $H, R$-norm, that is,
$(\hat{U}, \hat{V}, \hat{S}) = \arg\min_{U, V, S} \|X - USV^T\|_{H,R}^2$, (1)
subject to $U^THU = I_q$, $V^TRV = I_q$ and $\mathrm{diag}(S) \ge 0$. Here, $\hat{U}$ and $\hat{V}$ are the left and right GMD vectors, respectively, and $\hat{S}$ is a diagonal matrix containing the GMD values. Note that $\hat{U}$ and $\hat{V}$ are orthogonal with respect to $H$ and $R$ respectively, but they may not be orthogonal with respect to the Euclidean norm unless $H = I_n$ and $R = I_p$. The following property of the GMD provides a way to calculate the GMD components; the proof can be found in [13].
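Proposition 1: The GMD solutions $(\hat{U}, \hat{V}, \hat{S})$ satisfy:
(a) $XRX^TH\hat{U} = \hat{U}\hat{S}^2$;
(b) $\hat{V}\hat{S} = X^TH\hat{U}$.
Proposition 1(a) suggests that the squared diagonal elements of $\hat{S}$ and the corresponding columns of $\hat{U}$ are eigenvalues and corresponding eigenvectors of $XRX^TH$, respectively. Proposition 1(b) establishes the duality between $\hat{U}$ and $\hat{V}$, meaning that $\hat{V}$ can be immediately obtained given $\hat{U}$ and $\hat{S}$.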
Alternatively, an efficient algorithm for finding the solution to Eq. (1) was proposed in [12], which is less computationally intensive than finding the eigenvalues and eigenvectors of $XRX^TH$.
The algorithm also ensures that the diagonal elements of $S$ are non-increasing.
Note that the GMD can handle the non-Euclidean similarity kernel $H$ just by taking $R = I_p$.
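To make the properties above concrete, the following sketch computes a GMD for small dense matrices by taking an ordinary SVD of $H^{1/2} X R^{1/2}$ and mapping the factors back. This is a textbook-style route chosen here for illustration, not the efficient algorithm of [12]; the names `sym_power` and `gmd_dense` are hypothetical, and the toy kernels are made positive definite so that the decomposition is exact.

```python
import numpy as np

def sym_power(M, power, tol=1e-10):
    """Power of a symmetric positive semi-definite matrix via its eigen-decomposition
    (directions with near-zero eigenvalues are dropped, as in a pseudo-inverse)."""
    w, Q = np.linalg.eigh((M + M.T) / 2)
    keep = w > tol * w.max()
    return (Q[:, keep] * w[keep] ** power) @ Q[:, keep].T

def gmd_dense(X, H, R=None, q=2):
    """Illustrative GMD: returns U (U^T H U = I), GMD values s, and V (V^T R V = I)."""
    p = X.shape[1]
    R = np.eye(p) if R is None else R            # R = I_p when only a sample kernel is used
    H_half, H_inv_half = sym_power(H, 0.5), sym_power(H, -0.5)
    R_half, R_inv_half = sym_power(R, 0.5), sym_power(R, -0.5)
    P, s, Qt = np.linalg.svd(H_half @ X @ R_half, full_matrices=False)
    return H_inv_half @ P[:, :q], s[:q], R_inv_half @ Qt[:q].T

rng = np.random.default_rng(1)                   # toy data, purely illustrative
n, p = 8, 5
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)
A = rng.normal(size=(n, n))
H = A @ A.T + np.eye(n)                          # positive definite sample kernel
C = rng.normal(size=(p, p))
R = C @ C.T + np.eye(p)                          # positive definite variable kernel

U, s, V = gmd_dense(X, H, q=p)                   # sample kernel only (R = I_p)
U2, s2, V2 = gmd_dense(X, H, R, q=p)             # kernels for samples and variables
```

With these toy inputs one can check numerically that $U^THU = I$, that $XX^THU = US^2$ and $VS = X^THU$ as in Proposition 1, and that the GMD values come out in non-increasing order because they are ordinary singular values of the transformed matrix.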
Based on the GMD of $X$ with respect to $H$, the GMD-biplot can be constructed with respect to the coordinate system provided by the first two columns of $\hat{V}$. More specifically, letting $v_j$ be the $j$-th column of $\hat{V}$, the $i$-th sample point can be configured by the coordinates of $x_i$, given by $(x_i^T v_1, x_i^T v_2)$. To plot the arrow for the $j$-th variable, we consider the vector $e_j \in \mathbb{R}^p$, which has a 1 in the $j$-th element and 0's elsewhere. Then, the arrow for the $j$-th variable can be configured by the coordinates of $e_j$, given by $(e_j^T v_1, e_j^T v_2)$. This coordinate system also allows the configuration of future samples. Letting $x^* \in \mathbb{R}^p$ be a future sample, we can add $x^*$ into the GMD-biplot as a point located at $(x^{*T} v_1, x^{*T} v_2)$. Similar to the SVD-biplot, the GMD-biplot can explain $(\hat{\sigma}_1^2 + \hat{\sigma}_2^2)/\mathrm{tr}(XX^TH)$ of the total variance of $X$ with respect to the $H, I_p$-norm, where $\hat{\sigma}_k$ is the $k$-th diagonal element of $\hat{S}$ for $k = 1, 2$.
Since the GMD values are non-increasing, for the purpose of constructing the GMD-biplot, we can choose q = 2 in the GMD problem (Eq. (1)), which may save considerable computational time.
In contrast, since the AMD "singular values" need not be ordered from largest to smallest, we have to find the full decomposition of $X$ by the AMD before deciding which singular vectors to plot in the AMD-biplot; this may become computationally intensive for large $n$ and $p$.
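The construction of the GMD-biplot coordinates can be sketched as follows. The sketch assumes $R = I_p$ and takes the sample coordinates to be the projections $x_i^T v_k$, by analogy with the arrow coordinates $e_j^T v_k$; the function name `gmd_biplot_coordinates` and the `top` argument (how many of the longest arrows to report) are assumptions made for the example. To stay self-contained, the toy call uses $H = I_n$, in which case the GMD reduces to the SVD.

```python
import numpy as np

def gmd_biplot_coordinates(X, V, s, H, x_new=None, top=3):
    """Coordinates for a biplot built on the first two GMD vectors (R = I_p assumed)."""
    axes = V[:, :2]
    points = X @ axes                            # i-th row: (x_i^T v_1, x_i^T v_2)
    arrows = axes                                # j-th row: (e_j^T v_1, e_j^T v_2)
    explained = (s[0] ** 2 + s[1] ** 2) / np.trace(X @ X.T @ H)
    longest = np.argsort(-np.linalg.norm(arrows, axis=1))[:top]
    new_point = None if x_new is None else x_new @ axes
    return points, arrows, explained, longest, new_point

rng = np.random.default_rng(5)                   # toy data, purely illustrative
X = rng.normal(size=(6, 4))
X -= X.mean(axis=0)
H = np.eye(6)                                    # with H = I_n the GMD is the SVD
_, s, Vt = np.linalg.svd(X, full_matrices=False)

# Treat the first sample as if it were a "future" observation.
points, arrows, explained, longest, new_point = gmd_biplot_coordinates(
    X, Vt.T, s, H, x_new=X[0])
```

Projecting a new sample uses only the stored columns of $V$, which is why future observations can be added without recomputing the decomposition.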

Results
In the results below, we compare the GMD, AMD and SVD biplots on three data sets in the manner that each has been proposed recently for microbiome data. In particular, in [11], the AMD-biplot is advocated specifically for relative abundance data, while in [14] the SVD-biplot is advocated for data that have been scaled by the centered log-ratio (CLR) transformation. The GMD-biplot is constructed using the CLR-transformed data. We first examine the performance of all biplots using the smokeless tobacco data set explored in [11]. In the second study, we compare their performances using the human gut microbiome data from [15]. In the third analysis, we simulate a data set based on the smokeless tobacco data to illustrate a dilemma that the AMD-biplot may face.

Analysis of the smokeless tobacco data
Since $H$ is not positive semi-definite, we enforced positive semi-definiteness by removing its negative eigenvalues and the corresponding eigenvectors. The resulting similarity kernel, denoted $H^*$, has rank 27.
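Removing the negative eigenvalues of a symmetric kernel can be sketched as below. The function name `drop_negative_eigenvalues` and the relative tolerance `tol` are assumptions made for this example, and the 3 x 3 matrix is a toy input rather than the kernel from the study.

```python
import numpy as np

def drop_negative_eigenvalues(H, tol=1e-10):
    """Keep only the strictly positive eigenvalues (and their eigenvectors) of a
    symmetric matrix; returns the positive semi-definite result and its rank."""
    H = (H + H.T) / 2                            # guard against tiny asymmetries
    w, Q = np.linalg.eigh(H)
    keep = w > tol * np.abs(w).max()
    H_psd = (Q[:, keep] * w[keep]) @ Q[:, keep].T
    return H_psd, int(keep.sum())

# Toy kernel with one negative eigenvalue.
H = np.diag([2.0, 1.0, -0.5])
H_star, rank = drop_negative_eigenvalues(H)
```

In this toy case the negative direction is zeroed out, so the result is `diag(2, 1, 0)` with rank 2.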
For the GMD-biplot, we consider the CLR transformation of $X$. Specifically, denoting the geometric mean of a vector $z$ by $g(z) = \left(\prod_{k=1}^{p} z_k\right)^{1/p}$, the CLR transformation of $x_i$, $i = 1, \ldots, 45$, is given by $\tilde{x}_i = \log\left(x_i / g(x_i)\right)$, with the logarithm applied elementwise. We denote the resulting data matrix by $\tilde{X} = (\tilde{x}_1, \ldots, \tilde{x}_{45})^T$. For the AMD-biplot, we converted each row of $X$ into empirical frequencies, and further centered the rows and columns to have mean 0, as done in [11]. We denote the resulting data matrix by $\check{X}$.
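A minimal sketch of the CLR transformation follows. The `pseudocount` argument is an assumption added here because the logarithm is undefined for zero counts; how zeros were handled in the study is not stated in the text, so the default leaves the data untouched.

```python
import numpy as np

def clr(X, pseudocount=0.0):
    """Centered log-ratio transform of each row: log(x / g(x)), g = geometric mean."""
    X = np.asarray(X, dtype=float) + pseudocount     # pseudocount is an assumption
    log_X = np.log(X)
    return log_X - log_X.mean(axis=1, keepdims=True)  # subtracting log g(x) row-wise

# Toy counts chosen so the result is easy to read off.
counts = np.array([[1.0, np.e, np.e ** 2],
                   [2.0, 2.0, 2.0]])
X_clr = clr(counts)
```

Each transformed row sums to zero, and rescaling a row by a constant leaves its CLR values unchanged, which is why the transform is applied to compositional data.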
We constructed the GMD-biplot and the AMD-biplot based on $H^*$ using $\tilde{X}$ and $\check{X}$, respectively. Fig. 1(d) displays the proportion of variance captured by each GMD component. It can be seen that the first two GMD components capture more than 80% of the total variance of $\tilde{X}$, which assures that the resulting GMD-biplot (Fig. 1(a)) visualizes the data well. As shown in Fig. 1(a), the GMD-biplot completely separates the different tobacco products (dry, moist and toombak). Furthermore, the replicates corresponding to the same product are tightly clustered.
By examining the arrows for taxa in Fig. 1(a), we see that moist samples may be characterized by elevated levels of alloiococcus and halophilus, while aerococcaceae appears elevated in toombak samples. Fig. 1(e), which is the same as the right bottom panel of Fig. 1 in [11], shows that the AMD singular values are not necessarily decreasing. It should be noted that Fig. 1(b) is slightly different from Fig. 3 in [11]; this difference may be due to the use of $H^*$ here as opposed to $H$ in [11]. We used $H^*$ so that the AMD-biplot is directly comparable to the GMD-biplot, since the GMD requires both $H$ and $R$ to be positive semi-definite. From Fig. 1(b), it can be seen that the AMD successfully separates toombak samples (purple points) from dry (blue) and moist (orange) snuffs, although the separation between dry and moist snuffs in the AMD-biplot is not as definitive as that in the GMD-biplot (Fig. 1(a)).
Additionally, we included the SVD-biplot and its corresponding scree plot in Fig. 1(c) and (f), respectively. As the SVD-biplot assumes the Euclidean distance between samples, it is more appropriate to construct the SVD-biplot using the CLR-transformed data $\tilde{X}$ than the relative abundance data $\check{X}$ [14]. It can be seen from Fig. 1(c) that the SVD-biplot fails to separate moist and toombak samples.
To assess the taxa highlighted by each biplot, we tested each taxon for association with the three tobacco groups (dry, moist and toombak), and obtained p-values representing the strength of association between each taxon and the tobacco groups. We then sorted these p-values in non-decreasing order, and obtained the rank of each taxon based on the sorted p-values. Hence, it is desirable that the taxa with the lowest ranks can be identified by the biplots. Table S1 summarizes the ranks of the top 10 taxa identified by each biplot. It can be seen that the top 10 taxa identified by the GMD-biplot have lower ranks on average than those identified by the AMD and SVD biplots, indicating that the GMD-biplot may identify more meaningful taxa with respect to the separation of the samples than the AMD and SVD biplots.

Analysis of human gut microbiome data
We consider the human gut microbiome data collected in a study of healthy children and adults from the Amazonas of Venezuela, rural Malawi and US metropolitan areas [15]. The original data set $X$ consists of counts for 149 taxa for 100 samples. The squared unweighted UniFrac distance matrix $\Delta \in \mathbb{R}^{100 \times 100}$, computed using the R package phyloseq [16], was used to measure the distance between samples. Here, the distance between two samples is based entirely on the number of branches they share on a phylogenetic tree. The distance hence accounts only for the presence/absence of each taxon (not its abundance). The corresponding similarity kernel $H$ was then derived as $H = -\frac{1}{2} J \Delta J$, which is a positive semi-definite matrix with rank 99. Let $\tilde{X}$ and $\check{X}$, respectively, denote the CLR-transformed data and the relative abundance data. Similar to the first study, the GMD-biplot and the AMD-biplot were constructed based on the similarity kernel $H$ using $\tilde{X}$ and $\check{X}$ respectively, and the SVD-biplot was constructed based on the SVD of $\tilde{X}$.
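The step from a squared distance matrix to a similarity kernel, $H = -\frac{1}{2} J \Delta J$, can be sketched as below. The function name `kernel_from_squared_distances` is hypothetical, and the example uses squared Euclidean distances on toy data (not UniFrac distances, which are not computed here) so that the result can be checked against $XX^T$.

```python
import numpy as np

def kernel_from_squared_distances(Delta):
    """Double-center a matrix of squared distances: H = -1/2 * J @ Delta @ J."""
    n = Delta.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix I_n - (1/n) 1 1^T
    return -0.5 * J @ Delta @ J

rng = np.random.default_rng(2)                   # toy data, purely illustrative
X = rng.normal(size=(6, 3))
X -= X.mean(axis=0)                              # column-centered, as assumed in the text

diff = X[:, None, :] - X[None, :, :]
Delta = (diff ** 2).sum(axis=2)                  # squared Euclidean distances
H = kernel_from_squared_distances(Delta)
```

For squared Euclidean distances on column-centered data the kernel recovers $XX^T$ exactly; for a non-Euclidean $\Delta$ the same formula applies, but the result may have negative eigenvalues.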
As concluded in [15], shared features of the functional maturation of the gut microbiome are identified during the first three years of life. We thus define a binary outcome $h_i$, $i = 1, \ldots, 100$, based on the age of each sample. Approximately 70% of the samples are assigned to group 0 and the remaining 30% are assigned to group 1.
In all biplots, the $i$-th sample is colored by its age and its plotting symbol is determined by $h_i$. Fig. 2(d) indicates that more than 80% of the total variance is explained by the GMD-biplot in Fig. 2(a), which provides a good visualization of sample clusters across age. By examining the relationship between the arrows and the color of the sample points in Fig. 2(a), we see that prevotella may be elevated in adults, while parabacteroides appears to be elevated in infants. In contrast, Fig. 2(e) shows that less than 15% of the total variance is explained by the AMD-biplot in Fig. 2(b) and the AMD values are not decreasing. As shown in Fig. 2(b), the AMD-biplot also displays potential clusters across age, but the sample points are not as tightly clustered as those in Fig. 2(a). Odoribacter appears to be elevated in adults in Fig. 2(b), while lactobacillus appears associated with infants. As a reference, Fig. 2(c) shows the SVD-biplot of $\tilde{X}$, which looks very similar to Fig. 2(a).
To further quantify the classification accuracy, for each biplot we predicted the probability that each sample belongs to group 1 based on leave-one-out cross validation using the binary logistic regression of the group index $h_i$ on the two selected components. We then plotted an ROC curve for each biplot based on the predicted probabilities (Fig. S1) and calculated the area under the ROC curve (AUC): the GMD, AMD and SVD biplots, respectively, yield an AUC of 0.989, 0.976 and 0.990. The AUC results indicate that the GMD-biplot provides a better separation of age groups than the AMD-biplot, but there is not a clear difference between the GMD-biplot and the SVD-biplot. This may be because, for the CLR-transformed data $\tilde{X}$, the unweighted UniFrac distance is not as informative with respect to age as the weighted UniFrac distance was in the tobacco data with respect to product groups.
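The evaluation described above (leave-one-out predicted probabilities from a logistic regression on two components, followed by the AUC) can be sketched in plain NumPy. This is an illustration under several assumptions: the logistic fit uses fixed-step gradient ascent instead of a standard statistical package, the function names are hypothetical, and the two "components" are simulated toy scores rather than biplot coordinates from the study.

```python
import numpy as np

def fit_logistic(Z, y, n_iter=500):
    """Logistic regression with intercept, fitted by gradient ascent on the log-likelihood."""
    A = np.column_stack([np.ones(len(Z)), Z])
    step = 1.0 / (0.25 * np.linalg.norm(A, 2) ** 2)   # 1 / Lipschitz bound of the gradient
    beta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-A @ beta))
        beta += step * (A.T @ (y - prob))
    return beta

def loo_probabilities(Z, y):
    """Predicted probability of group 1 for each sample, fitted without that sample."""
    n = len(y)
    out = np.empty(n)
    for i in range(n):
        train = np.arange(n) != i
        beta = fit_logistic(Z[train], y[train])
        out[i] = 1.0 / (1.0 + np.exp(-(beta[0] + Z[i] @ beta[1:])))
    return out

def auc(y, score):
    """Area under the ROC curve as the fraction of correctly ordered (1, 0) pairs."""
    pos, neg = score[y == 1], score[y == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

rng = np.random.default_rng(4)                   # toy scores, purely illustrative
y = np.repeat([0, 1], 15)
Z = rng.normal(size=(30, 2)) + 2.5 * y[:, None]
probs = loo_probabilities(Z, y)
auc_value = auc(y, probs)
```

The pairwise form of the AUC makes its interpretation explicit: it is the probability that a randomly chosen group-1 sample receives a higher predicted probability than a randomly chosen group-0 sample.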
We emphasize that both the GMD-biplot and the SVD-biplot identify prevotella and parabacteroides as top taxa, while the AMD-biplot identifies completely different ones. As [15] confirms that the trade-off between prevotella and bacteroides (including parabacteroides) considerably drives the variation of microbiome abundance in adults and children between 0.6 and 1 year of age in all studied populations, the GMD and SVD biplots may thus identify more biologically meaningful taxa than the AMD-biplot. It should, however, be noted that these bacteria are "identified" based on circumstantial, not statistical, evidence, and more work is needed to examine statistical associations in this context.

Incorporating a kernel for variables into the GMD-biplot
The GMD problem defined in Eq. (1) allows not only the similarity kernel for samples, but also a kernel for the variables. Including both kernels may further improve the accuracy of sample classification as well as the identification of important variables. We illustrate this advantage by incorporating a kernel for variables in the analysis of the human gut microbiome data. More specifically, we first calculate a matrix $\Delta_R \in \mathbb{R}^{149 \times 149}$ of squared patristic distances between the tips of the phylogenetic tree for each pair of taxa and then derive a similarity matrix $R$ as $R = -\frac{1}{2} J \Delta_R J$, where $J$ here denotes the $149 \times 149$ centering matrix.
Fig. 3(a) shows the GMD-biplot with the additional kernel $R$ incorporated. The ROC analysis based on the leave-one-out cross validation for Fig. 3(a) yields an AUC of 0.984, which is higher than that of the AMD-biplot (Fig. 2(b)) but slightly lower than those of Fig. 2(a) and Fig. 2(c). This may be because both $H$ and $R$ depend heavily on the phylogenetic tree. Thus, incorporating $R$ may be redundant and may reduce the accuracy of the sample clustering in this case. The top 3 taxa identified in Fig. 3(a) include prevotella but not parabacteroides, which may explain the lower clustering accuracy.
Including an additional kernel for variables in the GMD-biplot is related to the method of double principal coordinate analysis (DPCoA) [17]. DPCoA, as shown in [18], is equivalent to a generalized PCA which essentially incorporates an additional similarity kernel for variables into the analysis, as described in Proposition 1, but for $H = I_n$. As suggested in [19], DPCoA can provide a biplot representation of both samples and meaningful taxonomic categories. Hence, the GMD-biplot can also be viewed as an extension of DPCoA biplots because the GMD allows kernels for both samples and variables, while DPCoA only allows a kernel for variables.

Simulation
In this section, we conduct a simulation study based on the smokeless tobacco data to illustrate a scenario in which the AMD-biplot may fail to separate the samples, whereas the GMD-biplot performs well. Let $H^*$ and $X$ be the similarity kernel and data matrix from the smokeless tobacco data, respectively. We consider the eigen-decomposition of $H^*$ as $H^* = B \Lambda B^T$, where $B$ is a $45 \times 27$ matrix whose columns are eigenvectors of $H^*$ and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_{27})$ is a diagonal matrix whose elements are the eigenvalues of $H^*$. Then, the AMD-biplot is based on the following approximate orthogonal decomposition of $X$: $X \approx BDV^T$, where $D = \mathrm{diag}(d_1, \ldots, d_{27})$ and $V$ is a $271 \times 27$ matrix with orthonormal columns. As shown in Fig. 1(e), $d_1, \ldots, d_{27}$ may not be decreasing. For $j = 1, \ldots, 27$, we define new diagonal entries $d_{j,S}$ and construct the simulated data set $X_S$ as
$X_S = B D_S V^T$, (3)
where $D_S = \mathrm{diag}(d_{1,S}, \ldots, d_{27,S})$. For $i = 1, \ldots, 45$, we define a binary outcome $w_i$ that indicates the group index of the $i$-th sample and depends only on the first column of $B$.
The GMD-biplot and the AMD-biplot of $X_S$ with similarity measure $H^*$ are presented in Fig. 4(a) and 4(b), respectively. It can be seen that the two groups are completely mixed in the AMD-biplot because the first column of $B$ is not selected for visualization. In contrast, the GMD-biplot successfully visualizes the sample groups by displaying the first and second GMD components.
To see why this occurs, we summarize the first three diagonal elements of $\Lambda$, $D_S$ and $D_S^2 \Lambda$ in Table 1 and notice that $d_{1,S} < d_{2,S} < d_{3,S}$. Consequently, the AMD-biplot displays the second and third columns of $BD_S$, and hence it completely fails to classify the samples because the group index $w_i$ only depends on the first column of $B$. In contrast, Proposition 1(a) shows that the GMD-biplot is based on the two largest eigenvalues and the corresponding eigenvectors of $X_S X_S^T H^*$. It can be further seen that Eq. (3) implies that the diagonal elements of $D_S^2 \Lambda$ are the eigenvalues of $X_S X_S^T H^*$ and the columns of $B$ are the corresponding eigenvectors. Hence, it can be seen from Table 1 that $d_{1,S}^2 \lambda_1 > d_{2,S}^2 \lambda_2 > d_{3,S}^2 \lambda_3$, even though $d_{1,S} < d_{2,S} < d_{3,S}$. Therefore, the GMD-biplot displays the first and second columns of $B D_S \Lambda^{1/2}$ as sample points, which successfully captures the sample classification.
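The ordering argument above can be reproduced on a toy example. All numbers below are hypothetical (they are not the values in Table 1, and the matrix sizes are not those of the tobacco data); they are chosen only so that the AMD-type values increase while the products $d_{j,S}^2 \lambda_j$ decrease.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, r = 12, 9, 3                               # toy sizes, purely illustrative

B, _ = np.linalg.qr(rng.normal(size=(n, r)))     # orthonormal eigenvectors of the kernel
V, _ = np.linalg.qr(rng.normal(size=(p, r)))     # orthonormal right vectors
lam = np.array([100.0, 10.0, 1.0])               # hypothetical eigenvalues of the kernel
d_S = np.array([1.0, 2.0, 2.5])                  # hypothetical increasing AMD-type values

H_star = (B * lam) @ B.T                         # kernel  B Lambda B^T
X_S = (B * d_S) @ V.T                            # simulated data  B D_S V^T

# Components selected by the two largest values under each criterion.
amd_choice = np.sort(np.argsort(-d_S)[:2])
gmd_choice = np.sort(np.argsort(-(d_S ** 2 * lam))[:2])

# Nonzero eigenvalues of X_S X_S^T H*, sorted from largest to smallest.
eigvals = np.sort(np.linalg.eigvals(X_S @ X_S.T @ H_star).real)[::-1][:r]
```

Here the selection based on $d_{j,S}$ picks the second and third columns of $B$, while the selection based on $d_{j,S}^2 \lambda_j$ picks the first and second, so a grouping carried by the first column of $B$ is visible only under the second criterion.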

Discussion
Biplots have gained popularity in the exploratory analysis of high-dimensional microbiome data.
The traditional SVD-biplot is based on Euclidean distances between samples and cannot be directly applied when more general dissimilarities are used. Since Euclidean distances may not lead to an optimal low-dimensional representation of the samples, we have extended the concept of the SVD-biplot to allow for more general similarity kernels. The phylogenetically informed UniFrac distance, used in our examples, defines one such kernel. In settings where a general (possibly nonlinear) distance matrix is appropriate, our approach provides a mathematically rigorous and computationally efficient method, based on the GMD, that allows for plotting both the samples and variables with respect to the same coordinate system.
Our first data example with the smokeless tobacco data set from [11] demonstrates the merits of the proposed GMD-biplot. We found that the GMD-biplot successfully displays different types of products, while the AMD-biplot is not able to completely separate dry and moist snuffs and the SVD-biplot fails to separate moist and toombak samples. As shown in Table S1, the GMD-biplot is also able to identify biologically more meaningful taxa that are related to the different types of products, compared to the AMD-biplot and the SVD-biplot.
In practice, we typically do not know what the true configuration of samples looks like, so it is impossible to determine whether $H$ or $XX^T$ contains more information about sample clusters. Also, it is sensible to assume that $XX^T$ and $H$ are "co-informative" in the sense that they exhibit a shared eigenstructure; for instance, both may be informative for clustering samples. The co-informativeness can be quantified precisely using the Hilbert-Schmidt independence criterion (HSIC) [20]. For any two kernels $K_1$ and $K_2$, the empirical HSIC is proportional to $\mathrm{tr}(K_1 K_2)$. Hence, by definition, the GMD problem (1) is equivalent to minimizing the HSIC between $(X - USV^T)(X - USV^T)^T$ and $H$ over $U$, $S$ and $V$. In other words, if we consider $X - USV^T$ as the residual matrix of $X$, then the GMD solutions can be interpreted as the best approximation to $X$ in the sense that the HSIC between $H$ and the Euclidean kernel of the residual matrix is minimized. Thus, the GMD-biplot considers the co-informativeness of $XX^T$ and $H$. Therefore, in many cases it would be a more robust way to display the sample points compared to the AMD-biplot or the SVD-biplot.
Another advantage of the GMD-biplot over the AMD-biplot is illustrated in our simulation study. Since the AMD may not give decreasing singular values, the AMD-biplot may not be able to display the most informative eigenvectors of $H$, and may thus fail to cluster the samples. In contrast, the GMD assures that the resulting singular values are non-increasing.
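The equivalence between the GMD problem and HSIC minimization stated above can be written out in one line. The display below is a sketch that assumes $R = I_p$ and uses the text's statement that the empirical HSIC of two kernels is proportional to the trace of their product.

```latex
\| X - USV^T \|_{H, I_p}^2
  = \operatorname{tr}\!\left( (X - USV^T)(X - USV^T)^T H \right)
  \;\propto\; \mathrm{HSIC}\!\left( (X - USV^T)(X - USV^T)^T,\; H \right)
```

Minimizing the left-hand side over $U$, $S$ and $V$ therefore minimizes the HSIC between the Euclidean kernel of the residual matrix and $H$.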
Our discussion in this paper has focused on the form biplot, which aims to visualize the relationship between variables and the sample clustering. In other scenarios, where the variation of the data matrix explained by each variable is of particular interest, the covariance biplot may be more appropriate. This biplot considers the GMD of $X$ with respect to $H$; i.e. $X = USV^T$, where $U^THU = I_q$ and $V^TV = I_q$.

Supplemental Material
Table S1. Ranks of the top 10 taxa identified by the GMD, AMD and SVD biplots in the analysis of the smokeless tobacco data.
Figure caption (fragment): Both biplots display the top 6 taxa with the longest arrows. The sample points are colored by the group index (1 = red; 0 = black).