Generalization of the minimum covariance determinant algorithm for categorical and mixed data types

The minimum covariance determinant (MCD) algorithm is one of the most common techniques to detect anomalous or outlying observations. The MCD algorithm depends on two features of multivariate data: the determinant of a matrix (i.e., geometric mean of the eigenvalues) and Mahalanobis distances (MD). While the MCD algorithm is commonly used, and has many extensions, the MCD is limited to analyses of quantitative data and more specifically data assumed to be continuous. One reason why the MCD does not extend to other data types such as categorical or ordinal data is because there is not a well-defined MD for data types other than continuous data. To address the lack of MCD-like techniques for categorical or mixed data we present a generalization of the MCD. To do so, we rely on a multivariate technique called correspondence analysis (CA). Through CA we can define MD via singular vectors and we can compute the determinant from CA's eigenvalues. Here we define and illustrate a generalized MCD on categorical data and then show how our generalized MCD extends beyond categorical data to accommodate mixed data types (e.g., categorical, ordinal, and continuous). We illustrate this generalized MCD on data from two large-scale projects: the Ontario Neurodegenerative Disease Research Initiative (ONDRI) and the Alzheimer's Disease Neuroimaging Initiative (ADNI), with data such as genetics (categorical), clinical instruments and surveys (categorical or ordinal), and neuroimaging (continuous). We also make R code and toy data available in order to illustrate our generalized MCD.


Introduction
The minimum covariance determinant (MCD; Hubert & Debruyne, 2010) algorithm is a robust estimator for multivariate location (mean) and scatter (covariance). Given a matrix of observations and variables, MCD approaches generally find a subset of individuals with minimum scatter, where scatter is defined as the determinant of the covariance matrix.
Robust estimates of mean and covariance are computed from the subset of individuals with minimum scatter (i.e., the smallest determinant from subsample covariance matrix).
In general, MCD techniques rely on two key features in order to detect the smallest cloud of observations: (1) the determinant of a given covariance matrix and (2) the Mahalanobis distances (MDs) of all observations with respect to that covariance matrix.
Because MCD searches for the smallest and most homogeneous cloud of observations, it is also a common technique used for detection of multivariate outliers (Hadi, Imon, & Werner, 2009; Magnotti & Billor, 2014; Verity et al., 2017). Since the introduction of the MCD, there have been many improvements and variations on the technique, such as robust PCA (Hubert, Rousseeuw, & Branden, 2005) and deterministic MCD (Hubert, Rousseeuw, & Verdonck, 2012), where the most commonly used variant is probably the FAST-MCD approach (Rousseeuw & Van Driessen, 1999); see also Hubert, Debruyne, and Rousseeuw (2017) for an overview. Recently the MCD approach has been extended to address high-dimensional data through regularization (Boudt, Rousseeuw, Vanduffel, & Verdonck, 2017).
Though the MCD and its variants are standard techniques to identify both (1) robust structures and (2) outliers, there is one substantial gap: MCD approaches generally work only for data assumed to be continuous. The lack of a non-quantitative MCD is problematic because many multivariate data sets are inherently non-quantitative (e.g., categorical or ordinal), such as clinical rating scales, questionnaires, and genetics. So how can we apply MCD approaches to non-quantitative data? The major barrier to applying the MCD to non-quantitative data is the lack of a standard Mahalanobis distance (MD) estimate for non-quantitative data. Goodall (1966) noted that there are generally two issues with the application of standard computations and rules of MDs to non-quantitative data: (1) "If [. . . ] quantitative and binary attributes are included in the same index, these procedures will generally give the latter excessive weight." (p. 883), and (2) "indices appropriate to quantitative [data have] been applied to ordered, non-quantitative attributes [. . . ] by arbitrarily assigning metric values to their different ordered states." (p. 883). Clearly, we do not want categorical data to provide undue influence on MD estimates, nor do we want to arbitrarily apply metric values to categorical or ordinal data. While there are many Mahalanobis-like distances and alternative measures of similarity for non-quantitative data (Bar-hen & Daudin, 1995; Bedrick, Lapidus, & Powell, 2000; Boriah, Chandola, & Kumar, 2008; Leon & Carrière, 2005; McCane & Albert, 2008), many of these still have the drawbacks noted by Goodall (1966).
Therefore, if we want to develop an MCD algorithm for non-quantitative data, we must first define an MD that neither imposes undue influence nor relies on arbitrarily assigned values.
Arguably the most well understood and defined metric for non-quantitative data is the χ²-distance, typically used for contingency and categorical data (Greenacre, 2017; Greenacre & Hastie, 1987). In our work here we first show that the MCD can be generalized to categorical data by defining MD under the assumptions of χ² (i.e., independence) through correspondence analysis (CA), which is a singular value decomposition (SVD)-based technique. We then show how our MCD approach generalizes to almost any data type, including mixed sets of variables, because of CA.
Our paper is outlined as follows. In Notation and software we provide the notation set used in this paper, the software used in and to write this paper, and a description of the code we provide for our generalized MCD. In Determinant and Mahalanobis distances via the SVD we first show how the SVD provides the two key requirements for the MCD algorithm: the determinant of a covariance matrix and MDs (specifically, through principal components analysis; PCA). Then, we show how these properties of the SVD generalize in order to compute MDs for categorical data through CA. In Establishing the MCD algorithm for categorical data we then show that we only require PCA to perform the standard MCD technique, but that a categorical version of the MCD requires a specific form of CA (via generalized CA, specifically "subset" CA) in order to achieve the MCD for categorical data. Next, in Applications and Extensions, we illustrate our newly defined MCD to identify robust structures and outliers. We then show that particular recoding schemes allow for the use of other data types such as ordinal or continuous via our categorical MCD, and thus our approach generalizes the MCD to any data type or even mixed data types (i.e., a matrix of categorical, ordinal, and continuous variables). In Establishing the MCD algorithm for categorical data we use toy (simulated) genetics data from the supplemental material of Beaton, Dunlop, and Abdi (2016) in order to formalize a categorical version of the MCD. In Applications and Extensions we illustrate our generalized MCD with data from the Ontario Neurodegenerative Disease Research Initiative (ONDRI) and the Alzheimer's Disease Neuroimaging Initiative (ADNI). Finally, we discuss the technique with respect to the results, as well as provide concluding remarks and future directions for a generalized MCD, in Discussion.

Notation
Bold uppercase letters denote matrices (e.g., X), bold lowercase letters denote vectors (e.g., x), and italic lowercase letters denote specific elements (e.g., x). Uppercase italic letters denote cardinality, size, or length (e.g., I), where a lowercase italic denotes a specific index (e.g., i). A generic element of X would be denoted as x_{i,j}. Common letters of varying typefaces, for example X, x, and x_{i,j}, come from the same data structure. Vectors are assumed to be column vectors unless otherwise specified. Two matrices side-by-side denotes standard matrix multiplication (e.g., XY), where ⊙ denotes element-wise (Hadamard) multiplication. The matrix I denotes the identity matrix. Superscript T denotes the transpose operation, superscript −1 denotes standard matrix inversion, and superscript + denotes the Moore-Penrose pseudo-inverse. Recall that for the generalized inverse: XX^+X = X, (X^+)^+ = X, and (X^T)^+ = (X^+)^T. The diagonal operation, diag{}, when given a vector will transform it into a diagonal matrix, or, when given a matrix, will extract the diagonal elements as a vector.
Say D is some diagonal matrix and recall that the transpose of a diagonal matrix is itself, and thus D^T = D. We denote ⌊·⌋ as the floor function. Finally, we reserve some letters to have very specific meanings: (1) F denotes component (sometimes called factor) scores where, for example, F_I denotes the component scores associated with the I items; (2) O and E mean "observed" and "expected", respectively, as used in standard Pearson's X² analyses, where for example O_X denotes the observed matrix derived from X; and (3) blackboard bold letters denote a priori or known values, for example 𝔼_X would reflect some previously known expected values associated with X but not necessarily derived from X.
Furthermore, when we use a previously defined element as a subscript, for example Z_X, it denotes that the Z of Z_X was derived from the element in its subscript; for example, Z_X is a centered and/or normalized version of X.

Determinant and Mahalanobis distances via the SVD
In this section we show that we only need principal components analysis (PCA), or more specifically the SVD, to compute both the determinant and MD. Through the SVD we can compute MD by way of "leverage" through the singular vectors. Thus the SVD provides the two key features we require for the MCD. Next we show how these ideas extend to categorical data via multiple correspondence analysis (MCA), and therefore we establish the fundamental features (MDs and the determinant) required for the MCD applied to categorical data.

Overview of MCD algorithm
Say we have a matrix X with I rows and J columns where J < I and we assume that X is full rank. The MCD requires an initialization of a subset of size H, which is a random subsample from X where (I + J + 1)/2 ≤ H ≤ I. In general, the MCD algorithm can be described as:

1. From X_H compute the column-wise mean as µ_H and compute the covariance as S_H = (X_H^T X_H) × (H − 1)^{-1}, where X_H has been centered (i.e., subtract the column-wise mean).

2. Compute the determinant of S_H.

3. Compute the squared MD for each observation in X as m_i = (x_i − µ_H) S_H^{-1} (x_i − µ_H)^T.

4. Set X_H as the subset of H observations with the smallest MDs computed from Step 3.

The size of H is controlled by α, which determines the subsample size; α lies between .5 and 1. These steps are repeated until we find a minimum determinant, either through an optimal search or limited to a given number of iterations. We denote the H subset with the minimum determinant as Ω. For the Ω subset, we can obtain a robust mean vector µ_Ω, a robust covariance S_Ω = (X_Ω^T X_Ω) × (H − 1)^{-1} (where X_Ω is column-wise centered), and a set of robust squared MDs for each observation i as m_i = (x_i − µ_Ω) S_Ω^{-1} (x_i − µ_Ω)^T. The two most important features for MCD algorithms are: (1) the determinant of the subsample covariance matrix, which helps identify the smallest scatter of data, and (2) the squared MDs for all observations derived from the subsample covariance matrix, which help us find the minimum determinant.
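To make these steps concrete, here is a minimal, naive R sketch of the algorithm on simulated continuous data (a single random start with a fixed number of iterations; real implementations such as FAST-MCD use many starts and additional refinements):

```r
## A naive, single-start MCD sketch for continuous data (illustration only).
set.seed(123)
X <- matrix(rnorm(200 * 3), 200, 3)
X[1:10, ] <- X[1:10, ] + 5                           # plant a few outliers

I <- nrow(X); J <- ncol(X)
alpha <- 0.75
H <- floor(alpha * I)                                # subsample size, (I + J + 1)/2 <= H <= I
h_idx <- sample(I, H)                                # random initial subsample

for (iter in 1:100) {
  mu_H  <- colMeans(X[h_idx, ])                      # Step 1: subsample mean ...
  S_H   <- cov(X[h_idx, ])                           # ... and covariance
  det_H <- det(S_H)                                  # Step 2: determinant (decreases over iterations)
  md2   <- mahalanobis(X, center = mu_H, cov = S_H)  # Step 3: squared MDs for all observations
  h_idx <- order(md2)[1:H]                           # Step 4: keep the H smallest squared MDs
}

## Robust estimates from the final (minimum-determinant) subsample Omega
mu_robust  <- colMeans(X[h_idx, ])
S_robust   <- cov(X[h_idx, ])
md2_robust <- mahalanobis(X, mu_robust, S_robust)
```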
Both the determinant of a matrix and the squared MDs for observations can be computed from the eigen- or singular value decomposition (SVD): (1) the determinant can be computed as the geometric mean of the eigenvalues (Boudt et al., 2017; SenGupta, 1987) and (2) squared MD can be computed as the sum of the squared row elements of the singular vectors (Barkmeijer, Bouttier, & Van Gijzen, 1998; Brereton, 2015; Stephenson, 1997). In the next section we show how squared MD can be computed from the SVD in order to show how the singular vectors provide a generalized estimate of squared MD.

Mahalanobis distance(s) for continuous data
Let X be a centered data matrix with I rows and J columns. We will assume that X is full rank on its columns where generally I > J. First we define the sample covariance matrix of X, per usual, as S = (X^T X) × (I − 1)^{-1}. The squared MD is defined as:

m = [(X S^{-1}) ⊙ X] 1,    (1)

where 1 is a conformable J × 1 vector of 1s and m is an I × 1 column vector (where m^T is a row vector) of squared MDs. We can reformulate Eq. (1) in two ways. The first reformulation is:

M = X S^{-1} X^T,    (2)

where m = diag{M}. If we relax the definition of squared MD to exclude the degrees of freedom scaling factor, the second reformulation is:

M = X (X^T X)^{-1} X^T,    (3)

where m = diag{M} × (I − 1). Henceforth when we refer to squared MD it is the definition we provide in Eq. (3), specifically, m = diag{X (X^T X)^{-1} X^T}.

Principal components analysis
Let us now define the principal components analysis (PCA) of X which has I rows and J columns, where I > J. If X is only column-wise centered then it is a "covariance" PCA. If X is column-wise centered and scaled (e.g., z-scores or sums of squares equal to 1) it is a "correlation" PCA. For the following formulation we assume only a column-wise centered X.

Assume that X is full rank, where L is the rank of X and L ≤ J < I. We can perform the PCA of X through the singular value decomposition as:

X = UΣV^T,    (4)

where the following properties hold:

1. U and V are orthonormal left and right singular vectors of sizes I × L and J × L, respectively, with U^T U = I = V^T V;

2. Σ is the L × L diagonal matrix of singular values, where Λ = Σ² is the diagonal matrix of eigenvalues;

3. the component scores for the rows (observations) are F_I = UΣ.

Component scores can also be defined through projection (or rotation) as:

F_I = XV.    (5)

In the regression and PCA literature there exists an influence measure called "leverage", which is bounded between 0 and 1 (Wold, Esbensen, & Geladi, 1987). Leverage is extracted from an I × I matrix called a "hat matrix" in the regression literature and is more generally called an "orthogonal projection matrix" in the multivariate literature (Yanai, Takeuchi, & Takane, 2011). The hat or projection matrix is defined generally as X(X^T X)^{-1} X^T, where leverage is defined as diag{X(X^T X)^{-1} X^T}. Note that the definition of leverage is equivalent to our definition of squared MD in Eq. (3).

With respect to PCA there are two types of leverage: one for the columns (typically variables) and one for the rows (typically observations) of a matrix. Here we only discuss leverage for the rows (observations). Both Wold et al. (1987) and Mejia, Nebel, Eloyan, Caffo, and Lindquist (2017) described leverage for the observations (rows) as based on the component scores F_I:

G_I = F_I (F_I^T F_I)^{-1} F_I^T,    (6)

where each row item's leverage is an element in g_I = diag{G_I}. If we expand Eq. (6) and substitute UΣ for F_I:

G_I = UΣ(ΣU^T UΣ)^{-1} ΣU^T = UΣΣ^{-2}ΣU^T = UU^T,    (7)

which indicates that g_I can be computed as:

g_I = diag{UU^T} = (U ⊙ U)1.    (8)

With the definition of squared MD from Eq. (3) it is obvious that leverage and squared MD are the same: m = diag{X(X^T X)^{-1} X^T} = diag{UU^T} = g_I. However, we show this relationship in full through the SVD as follows:

X(X^T X)^{-1} X^T = UΣV^T (VΛV^T)^{-1} VΣU^T = UΣV^T VΛ^{-1}V^T VΣU^T = UΣΛ^{-1}ΣU^T = UU^T,    (9)

because V^T = V^{-1} and thus (VΛV^T)^{-1} = VΛ^{-1}V^T. The relationship established in Eq. (9) is critical to our formulation of MD for categorical data and thus a generalized MCD.
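As a quick numerical check of these equivalences, the following R snippet (simulated data) computes squared MD per Eq. (3) from the hat matrix, from the left singular vectors, and relates it to the classical squared MD that retains the degrees-of-freedom scaling:

```r
## Squared MD (Eq. (3)), leverage via the SVD, and the classical MD with the (I - 1) factor.
set.seed(42)
X <- scale(matrix(rnorm(100 * 4), 100, 4), scale = FALSE)    # column-wise centered

m_hat <- diag(X %*% solve(t(X) %*% X) %*% t(X))              # diag{X (X^T X)^{-1} X^T}
U     <- svd(X)$u
m_svd <- rowSums(U^2)                                        # diag{U U^T}, Eqs. (8)-(9)
m_cls <- mahalanobis(X, center = rep(0, 4), cov = cov(X))    # classical squared MD

all.equal(m_hat, m_svd)                                      # TRUE
all.equal(m_cls, m_hat * (nrow(X) - 1))                      # TRUE: differs only by (I - 1)
```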

Representation of categorical data
Say we have an I × C matrix called C, as in Table 1a, where the rows are observations and the columns are categorical variables, typically assumed to be unordered; thus each cell contains the nominal level of each variable. These data can be represented in disjunctive coding, where each variable is represented by a vector of length N_c, with N_c the number of levels for the c-th variable, as in Table 1b. The disjunctive form of C is an I × J matrix X.
The multivariate analysis of a disjunctive matrix is usually performed with a technique called multiple correspondence analysis.
MCA, like PCA and CA, is an SVD-based technique but with specific preprocessing: it requires constraints (weights) for the rows and columns and decomposes a matrix under the assumption of independence (i.e., χ²). The general premise of Pearson's X² is to estimate how far observed values deviate from expected values. To perform MCA, we first compute the row and column weights:

w_I = X1 × (1^T X 1)^{-1} and w_J = X^T 1 × (1^T X 1)^{-1},    (10)

where w_I and w_J are the row and column marginal sums, respectively, divided by the total sum. In the case of purely disjunctive data, each value in w_I is identical because each row contains the same number of 1s (see Table 1b): each element in w_I is simply the total number of columns in C divided by the sum of all elements of X. Similarly, w_J contains the column sums of X divided by the total sum.

Table 1: (a) example categorical data with Variable 1 and Variable 2 having 3 and 2 levels, respectively; (b) the disjunctive code for these data. Note that one observation has missing data (NA); missing data can be coded as barycentric (i.e., the mean of the variable).

Next, just as in Pearson's X² analyses, we compute observed (O_X) and expected (E_X) matrices of X as follows:

O_X = X × (1^T X 1)^{-1} and E_X = w_I w_J^T,    (11)

with deviations computed as

Z_X = O_X − E_X.    (12)

MCA is performed with the generalized SVD (GSVD) of Z_X:

Z_X = PΣQ^T, with P^T W_I P = I = Q^T W_J Q,    (13)

where W_I = diag{w_I^{-1}} and W_J = diag{w_J^{-1}}. We obtain the results of the GSVD of Z_X through the SVD of

Z̃_X = W_I^{1/2} Z_X W_J^{1/2} = UΣV^T.    (14)

We can see the relationship between Z_X and Z̃_X via substitution: Z_X = W_I^{-1/2} Z̃_X W_J^{-1/2}. The generalized singular vectors are computed as P = W_I^{-1/2} U and Q = W_J^{-1/2} V. For simplicity, we will henceforth refer to the GSVD decomposition, where needed, in triplet notation, e.g., GSVD(W_I, Z_X, W_J). To note, we have taken liberty with the standard GSVD triplet notation (see Holmes, 2008) and present the triplet more akin to its multiplication steps. The row and column profile matrices, W_I O_X and W_J O_X^T, can be thought of as each element of X divided by its respective row or column sum, respectively. We can define the MCA row component scores via projection as:

F_I = W_I Z_X W_J Q = W_I PΣ,    (15)

and similarly the column component scores are computed as

F_J = W_J Z_X^T W_I P = W_J QΣ.    (16)
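To make the preprocessing and decomposition concrete, the following R sketch carries out Eqs. (10)-(15) on a small, hypothetical disjunctive matrix; it is a minimal illustration, not our packaged implementation:

```r
## A minimal MCA sketch on a small, hypothetical disjunctive matrix (2 variables: 3 + 2 levels).
X <- rbind(c(1,0,0, 1,0),
           c(0,1,0, 1,0),
           c(0,0,1, 0,1),
           c(1,0,0, 0,1),
           c(0,1,0, 1,0),
           c(0,0,1, 0,1))

N   <- sum(X)
O_X <- X / N                                   # observed proportions, Eq. (11)
w_I <- rowSums(O_X)                            # row weights, Eq. (10)
w_J <- colSums(O_X)                            # column weights, Eq. (10)
E_X <- w_I %o% w_J                             # expected under independence, Eq. (11)
Z_X <- O_X - E_X                               # deviations, Eq. (12)

## GSVD(W_I, Z_X, W_J) via the SVD of the weighted deviations, Eq. (14)
Z_tilde <- diag(1/sqrt(w_I)) %*% Z_X %*% diag(1/sqrt(w_J))
dec     <- svd(Z_tilde)
keep    <- dec$d > 1e-10                       # disjunctive data are rank deficient

## Generalized singular vectors and row component scores, Eq. (15)
P   <- diag(sqrt(w_I)) %*% dec$u
F_I <- diag(1/w_I) %*% P %*% diag(dec$d)
```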

Mahalanobis distance(s) for categorical data
In Principal components analysis we showed, for continuous data, the equivalence between leverage and squared MD (via the hat matrix). Both leverage and squared MD are obtained as the diagonal of the crossproduct of the left singular vectors, i.e., diag{UU^T}, or in a computationally simpler way: the row sums of the squared singular vectors, (U ⊙ U)1, where 1 is a conformable vector of 1s. Because leverage via the singular vectors is squared MD for continuous data, we use the same premise to define a squared MD for categorical data. Just as we can define squared MD for continuous data via PCA, we can define squared MD for categorical data via MCA.
As noted in Eqs. (13) and (14), MCA is performed as GSVD(W_I, Z_X, W_J), wherein we apply the SVD as Z̃_X = UΣV^T. Thus in MCA the GSVD via the SVD provides U. Recall the relationship between Z_X and Z̃_X: Z̃_X = W_I^{1/2} Z_X W_J^{1/2} (Eq. (14)). In the case of MCA, Z̃_X is the analog of X in PCA (see Eq. (4)). Thus in MCA we know that all columns that represent a single variable are collinear, and the rank of Z̃_X is less than J (for purely disjunctive data it is at most J − C). We have shown previously for continuous data that leverage is squared MD. We argue that leverage in MCA, i.e., Z̃_X (Z̃_X^T Z̃_X)^+ Z̃_X^T, is the analog of squared MD for categorical data just as in PCA:

m = diag{Z̃_X (Z̃_X^T Z̃_X)^+ Z̃_X^T} = diag{UU^T} = (U ⊙ U)1.    (17)

Finally, and most importantly: because we use the "leverage is Mahalanobis" definition, there is no ambiguity nor arbitrary decisions in the definition of squared MD for categorical data (à la Goodall, 1966); squared MD is defined by the singular vectors for the observations (i.e., U).
From Mahalanobis to χ²-distance via generalized Euclidean distance. While our primary focus in the previous section was, and for the basis of this paper is, to define squared MDs for categorical and mixed data, it is worth noting a much more common distance measure in the CA and MCA literature: the χ²-distance. To define the χ²-distance we use the same matrices and notation as seen in the definition of MCA (cf. Eqs. (11), (12), and (13)). We start with a "row profile" matrix Φ_I = W_I O_X, which is simply each element of X divided by its respective row sum. The average profile, φ_{Iµ}, is defined as the column means of Φ_I, where Φ_{Iµ} = diag{φ_{Iµ}} is the diagonal matrix that contains the average profile on the diagonal. We can define the χ²-distance much in the same way as in Eq. (2) (which is the matrix algebra expression of the generalized squared Euclidean distance formula):

Ξ = (Φ_I − 1φ_{Iµ}^T) Φ_{Iµ}^{-1} (Φ_I − 1φ_{Iµ}^T)^T,    (18)

where the χ²-distances are the diagonal elements of Ξ. We show that Ξ can be obtained directly through the component scores where, for ease, we use the projection version of the component scores (Eq. (15)):

diag{Ξ} = diag{F_I F_I^T} = (F_I ⊙ F_I)1.    (19)

Greenacre (2017) and Greenacre and Hastie (1987) argue that the χ²-distance is a type of Mahalanobis distance. However, we disagree and believe that, while both are forms of generalized Euclidean distances, they are distinct: squared MD is defined by the singular vectors whereas the χ²-distance is defined by the component scores. To note, the "χ²-distance" for PCA would be defined as diag{F_I F_I^T} = diag{XVV^T X^T} = diag{XX^T} (i.e., the squared Euclidean distance of each observation from the column-wise means). In short, for categorical data squared MD is m = diag{UU^T} and the χ²-distance is diag{F_I F_I^T}, and both can be expressed in their generalized Euclidean forms (à la Eq. (2)) or in their SVD forms.
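Continuing the MCA sketch above, squared MD and χ²-distance for the rows differ only in whether they are built from the singular vectors or from the component scores; the following lines illustrate this numerically:

```r
## Continuing the MCA sketch above (same X, dec, F_I, keep):
m_cat       <- rowSums(dec$u[, keep, drop = FALSE]^2)     # squared MD: singular vectors
chi2_scores <- rowSums(F_I[, keep, drop = FALSE]^2)       # chi-squared distance: component scores

## The same chi-squared distances directly from the row profiles and the average profile
Phi_I  <- X / rowSums(X)
phi_mu <- colSums(X) / sum(X)
chi2_profiles <- rowSums(sweep(Phi_I, 2, phi_mu)^2 /
                           matrix(phi_mu, nrow(X), ncol(X), byrow = TRUE))

all.equal(chi2_scores, chi2_profiles)                     # TRUE
```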

Establishing the MCD algorithm for categorical data
In this section we will refer to and use a particular toy data set. These toy data are single nucleotide polymorphisms (SNPs), which are categorical variables. The toy data were simulated from real genetic data from the Alzheimer's Disease Neuroimaging Initiative (ADNI). For a description of the data see the supplemental section of Beaton et al. (2016).
Each SNP contains three possible genotypic categories: "AA", "Aa", and "aa" which are, respectively, the major homozygote, the heterozygote, and the minor homozygote. Both "A" and "a" represent alleles where "A" is the major (more frequent) allele and "a" is the minor allele. Each genotype is a level (category) within a SNP variable. We provide more details where necessary.
Recall that the MCD algorithm requires two essential parts: the determinant (via the eigenvalues) and squared MD. We now have a formal definition of both for continuous and categorical data. However, one key feature of the MCD algorithm is that it requires that we estimate squared MDs on sub- or out-of-sample observations. The squared MDs for excluded samples are computed conditional on the mean and covariance of a sub-sample. While this can be done with the standard generalized Euclidean formula, we opt for a more computationally efficient approach through the SVD and projections. We first show this with continuous data in PCA to establish squared MD for excluded samples. We then show how this can also be done with categorical data via MCA.

However, categorical data pose a unique problem: levels of variables (i.e., categories) are not guaranteed to exist in particular sub-samples. We show that the absence of categories leads to an underestimate of both squared MD and χ²-distances. Following the illustration of these issues we provide a solution for obtaining accurate distances (squared MD, robust squared MD, and χ²-distance) for excluded observations.

In the CA and PCA literatures, "active" data refer to the data submitted to the SVD that form the components, whereas "supplementary" data are excluded observations we want to project onto existing components. For squared MD, we will consider as "active" data those observations used to compute the mean and covariance matrix (e.g., S or S_H). The out-of-sample (or sub-sample) data will be referred to as supplementary.

Active Mahalanobis and χ 2 -distances for continuous data
In order to establish the basis of squared MD for sub- and out-of-sample data, we must revisit squared MD through projections à la Eq. (5). We show how to retrieve the singular vectors through the projection of data onto the space spanned by the components, because only U is necessary to compute the squared MD. For quantitative data via PCA:

m = diag{UU^T} = diag{(XVΣ^{-1})(XVΣ^{-1})^T} = diag{F_I Λ^{-1} F_I^T},    (20)

because U = XVΣ^{-1} (cf. Eq. (4)). Similarly, we can also compute the χ²-distances for quantitative data via PCA through just a simple rotation of X, as already shown in Eq. (5) and recreated here for convenience:

F_I = XV, with χ²-distances diag{F_I F_I^T}.    (21)

Supplementary Mahalanobis distance
Let us refer to a K × J matrix Y as a new or excluded set of observations with the same variables (columns) as X. For Y, we would obtain the squared MDs as

m_Y = diag{Y (X^T X)^{-1} Y^T} = diag{(YV) Λ^{-1} (YV)^T},    (22)

where V are the right singular vectors (loadings) from Eq. (4). More simply we can interpret this as: V works as a rotation of Y, or that we project Y onto the components of X spanned by V to give supplementary component scores for the rows. We can also compute m_Y via the generalized Euclidean formula (cf. Eq. (3)). As shown in Eq. (20), we can obtain the squared MDs of X either directly from the left singular vectors U or through projections. If we substitute Y for X:

F_K = YV and m_Y = diag{F_K Λ^{-1} F_K^T}.    (23)

Thus the squared MDs of Y are m_Y = diag{F_K Λ^{-1} F_K^T}. Therefore to compute the squared MDs of Y we need only V and Σ from the SVD of X (see Eq. (4)).
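As a small, hypothetical R illustration of these projections, the following lines compute supplementary squared MDs for new rows Y from only V and Σ of the SVD of X:

```r
## Supplementary squared MDs for new observations Y, using only V and Sigma from the SVD of X.
set.seed(7)
X <- scale(matrix(rnorm(50 * 4), 50, 4), scale = FALSE)     # active data, column-wise centered
Y <- matrix(rnorm(10 * 4), 10, 4)
Y <- sweep(Y, 2, attr(X, "scaled:center"))                  # center Y by the column means of X

dec <- svd(X)
F_K <- Y %*% dec$v                                          # supplementary component scores, Eq. (23)
m_Y <- rowSums((F_K %*% diag(1/dec$d))^2)                   # diag{F_K Lambda^{-1} F_K^T}, Eqs. (22)-(23)

## The same values from the generalized Euclidean form with X's crossproduct
all.equal(m_Y, diag(Y %*% solve(t(X) %*% X) %*% t(Y)))      # TRUE
```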

Mahalanobis distances (via the SVD) for sub-samples.
Let us now consider an I × J matrix Y that is comprised of both active and supplementary data: the active data X, a K × J subsample treated as before, and a new (supplementary) data set Z of size (I − K) × J, where Y is (1) centered by the column means of X and (2) scaled by the same scaling factor as X. The hat (or projection) matrix of Y with respect to the components of X is obtained through the same projections as in Eqs. (22) and (23). Thus we can compute the squared MDs for both active and supplementary sub-samples through the SVD and projections.

Active Mahalanobis and χ 2 -distances for categorical data
In the same way as with continuous data, we can compute both Mahalanobis and χ²-distances for active data through projections for categorical data. As shown previously, we can obtain the component scores for the active data in MCA through Eq. (15), where F_I = W_I Z_X W_J Q. Therefore, if we want to obtain the squared MD through projections and still use the component scores, we can do so because U = W_I^{1/2} Z_X W_J Q Σ^{-1} = W_I^{-1/2} F_I Σ^{-1}, and thus m = diag{UU^T} = diag{W_I^{-1/2} F_I Λ^{-1} F_I^T W_I^{-1/2}} (see also Lebart et al., 1984; Greenacre, 1984). Recall, however, that levels of variables are not guaranteed to appear in a given sub-sample; as noted above, the absence of categories leads to an underestimate of both squared MD and χ²-distances for excluded observations, and so we require a way to maintain the full structure of the data.

Generalized and subset correspondence analysis
The solution to compute accurate MDs is through a specific form of MCA called subset MCA (Greenacre, 2017; Greenacre & Pardo, 2006). Furthermore, subset MCA is a specific case of generalized correspondence analysis (GCA; Escofier, 1983, 1984). Subset MCA via GCA preserves the shape (rows and columns) of the data and thus can identify the subset of observations within the multivariate space and with the smallest possible scatter.

Escofier's Generalized Correspondence Analysis.
Hiding in the broader CA literature is "generalized correspondence analysis" (as referred to by Grassi and Visentin, 1994), established by Escofier (1983, 1984). GCA has many uses well beyond the standard applications in CA and MCA, including those for missing data and assumptions that deviate from the typical assumptions of CA (e.g., quasi-CA with "quasi-independence" and "quasi-margins"; Leeuw & Heijden, 1988; Van der Heijden, De Falguerolles, & Leeuw, 1989).
Escofier established the idea of GCA as: "(DATA − MODEL) / MARGINS". Here we define the idea of "(DATA − MODEL) / MARGINS" more formally, and specifically for categorical data. With a disjunctive data set X, the generalized MCA of X would be defined as follows. As noted in Grassi and Visentin (1994), we can characterize each cell of the matrix Z_X as derived from the generalized model of CA with the generic element z_{i,j} = o_{i,j} − e_{i,j}, where the e_{i,j} come from the model 𝔼. The model (𝔼) and the margins (𝕎_I and 𝕎_J) could be defined in almost any way, and it is not required to compute the expected values or margins from the data themselves. We can represent the generalized CA approach in a simpler form with the GSVD, where Z_X = O_X − 𝔼 and GSVD(𝕎_I, Z_X, 𝕎_J) (see Eqs. (13) and (14)). While this is a generalized form, not all choices of 𝔼, 𝕎_I, and 𝕎_J provide estimates that conform to Pearson's X² (i.e., independence; hence why Leeuw and Heijden (1988) refer to some variants as "quasi-independence").
In this generalized MCA, the expected values and weights can be derived from any prespecified models that conform to the rows and columns of the given data. For example, let us refer to our toy SNPs data set transformed into disjunctive coding. We would typically use the standard MCA approach where we define 𝔼 = E_X, 𝕎_I = W_I, and 𝕎_J = W_J, all of which are derived from the row and column frequencies of X. However, for genetic data we could use population estimates of the allele frequencies (from, e.g., dbGaP, HapMap, or 1000 Genomes) for the columns. Let us refer to the population allele probabilities as 𝕨_A, where 𝕨_A spans the same J columns as X. Our generalized MCA with population allele probabilities would then be

GSVD(W_I, O_X − 𝔼, 𝕎_J),    (24)

where 𝔼 and 𝕎_J are derived from 𝕨_A rather than from X itself.

GCA provides a way to maintain the row and column structure of the data, and this is what we require for an MCD on categorical data. Recall the matrices from Eqs. (13) and (14): O_X, E_X, w_I, and w_J. Let us frame O_X, E_X, and w_I as constructed by the H and H̄ (complement) subsets as:

O_X = [O_{X_H} ; O_{X_H̄}],  E_X = [E_{X_H} ; E_{X_H̄}],  w_I = [w_{I_H} ; w_{I_H̄}],    (25)

where the semicolon denotes row-wise (vertical) concatenation. To maintain the structure of the data and to be able to perform the necessary steps for MCD we must use a generalized expected matrix: that is, we replace part of the expected matrix with part of our observed matrix. Thus, for the H̄ observations, our assumption that the expected values equal the observed values is correct, and therefore their assumed deviation (i.e., from independence) is 0. The GSVD notation for this is:

GSVD(W_I, O_X − 𝔼, W_J), with 𝔼 = [E_{X_H} ; O_{X_H̄}].    (26)

Any random subsample partition into H and H̄ guarantees that we maintain the same shape (rows and columns) as X when we use this 𝔼. Like in Eqs. (13) and (14) we have:

Z = O_X − 𝔼 = [Z_{X_H} ; 0] and Z̃ = W_I^{1/2} Z W_J^{1/2},    (27)

where the H̄ rows of Z and Z̃ contain only 0s. Just as in Eq. (14) we would apply the SVD as:

Z̃ = UΣV^T = [U_H ; 0] Σ V^T,    (28)

where U^T U = I = V^T V and the H̄ rows of U contain only 0s. Through GCA we maintain the structure of the original data, but now the data from the excluded sample are comprised of 0s through each step and thus do not contribute to the overall structure. Hence V and Q are the sets of singular vectors and generalized singular vectors, respectively, for only the H subsample, with the same number of columns as X.
However, it is possible to simplify this approach through a derivation of this application of GCA called "subset MCA" (Greenacre, 2017;Greenacre & Pardo, 2006).

Subset Multiple Correspondence Analysis.
Our specific use of GCA is designed for use with the MCD algorithm. As shown in Eqs. (25) and (26), we can set the expected values of the excluded subset H̄ equal to their observed values, so that their deviations are 0 and we only analyze the active subset H.
Our specific form of GCA reduces to subset MCA via Eq. (27): because the H̄ rows of Z (and Z̃) contain only 0s, when we compute the SVD of Z̃ we do not require all of the observations; we only require those rows that are not 0. Our use of subset MCA can be characterized as follows.
To compute subset MCA for MCD we use the matrices defined in Eqs. (13) and (14). We take Z̃_{X_H}, which is the H subsample (of rows) of Z̃_X, and perform the SVD as

Z̃_{X_H} = U_H Σ_H V_H^T.    (29)

Let us define the subset MCA of the complement H̄ as

Z̃_{X_H̄} = U_H̄ Σ_H̄ V_H̄^T.    (30)

In the CA literature the sum of the eigenvalues (i.e., the sum of diag{Λ}) is called "inertia", which we denote as ℐ. Note that

ℐ_X = ℐ_{X_H} + ℐ_{X_H̄}.    (31)

In essence, subset MCA partitions the full MCA space into additive subspaces.
Our particular implementation of GCA and our use of subset MCA produce identical results via the SVD because Z̃_{X_H} = W_{I_H}^{1/2} Z_{X_H} W_J^{1/2}, which is exactly the set of non-zero rows of Z̃ from Eq. (28).
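The following R sketch (hypothetical data and placeholder names, not our packaged implementation) illustrates subset MCA as used for the categorical MCD: the SVD of the H rows of the weighted deviations matrix, the determinant-like criterion, robust squared MDs for all observations, and the additive partition of inertia:

```r
## Hedged sketch: subset MCA for the categorical MCD (hypothetical data, placeholder names).
set.seed(3)
I <- 40
dat <- data.frame(v1 = sample(c("AA", "Aa", "aa"), I, replace = TRUE),
                  v2 = sample(c("AA", "Aa", "aa"), I, replace = TRUE),
                  v3 = sample(c("yes", "no"),      I, replace = TRUE))

disjunctive <- function(f) {                        # one categorical variable -> 0/1 indicators
  f <- factor(f)
  Z <- matrix(0, length(f), nlevels(f), dimnames = list(NULL, levels(f)))
  Z[cbind(seq_along(f), as.integer(f))] <- 1
  Z
}
X <- do.call(cbind, lapply(dat, disjunctive))

## Full-data MCA preprocessing (Eqs. (10)-(14))
O  <- X / sum(X)
wI <- rowSums(O); wJ <- colSums(O)
Zt <- diag(1/sqrt(wI)) %*% (O - wI %o% wJ) %*% diag(1/sqrt(wJ))

## Subset MCA: SVD of the H rows only (Eq. (29)), keeping the full-data weights
H_idx <- sample(I, floor(0.75 * I))
dec_H <- svd(Zt[H_idx, ])
keep  <- dec_H$d > 1e-10

## Determinant-like criterion: geometric mean of the subset eigenvalues
gm_det <- exp(mean(log(dec_H$d[keep]^2)))

## Robust squared MDs for all I observations, conditional on the H subset (Eq. (33))
U_sup <- Zt %*% dec_H$v[, keep, drop = FALSE] %*% diag(1/dec_H$d[keep], sum(keep))
m_rob <- rowSums(U_sup^2)

## Inertia partitions additively across the H and complement subsets (Eq. (31))
all.equal(sum(svd(Zt)$d^2), sum(dec_H$d^2) + sum(svd(Zt[-H_idx, ])$d^2))   # TRUE
```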

Subset MCA for categorical MCD.
For MCD algorithms we require (1) the singular values so we can compute the determinant and (2) robust squared MD estimates.
We can obtain the singular values in subset MCA (see Eqs. (29) and (31)) and thus we can compute the determinant for an H subsample via the geometric mean of its eigenvalues, (∏_{l=1}^{L_H} λ_{H,l})^{1/L_H}, where L_H is the number of non-zero eigenvalues of the subset decomposition. Robust squared MDs can be computed in subset MCA through the row profiles.
Recall that we can compute component scores from the full MCA via projection as F_I = W_I Z_X W_J Q (Eq. (15)). We can also compute component scores for the full I set from the subset MCA as

F_I^{(H)} = W_I Z_X W_J Q_H,    (32)

where the χ²-distances of the H observations are equivalent between the subset and the full MCA (i.e., the corresponding elements of diag{F_I^{(H)} F_I^{(H)T}} and diag{F_I F_I^T} are equal). With this formulation, we can compute supplementary squared MDs for the full set of observations conditional on the subset MCA results as:

m = diag{(Z̃_X V_H Σ_H^{-1})(Z̃_X V_H Σ_H^{-1})^T},    (33)

where the U_H defined in Eqs. (28) and (29) is recovered exactly by the H rows of Z̃_X V_H Σ_H^{-1} in Eq. (33).

We illustrate our formalization of a categorical MCD with our toy data set. The toy data are of size I = 60 × C = 9 (observations by variables). These data were transformed into disjunctive coding (see Table 1) of size I = 60 × J = 25 (observations by columns). Figure 1 shows the GMCD applied to the toy data set over the course of 1,000 iterations. Figure 1a shows the search for a minimum determinant; note that the initial determinant is quite high by comparison to later determinants. Eventually the algorithm stops at 1,000 iterations (the number we set) on a much smaller determinant. Figure 1b shows the observed squared MDs and χ²-distances. The observed distances alone (Fig. 1b) suggest the presence of possible outliers. We have denoted two individuals with an "X" circumscribed by a circle; these two individuals are the only two with "aa" for one particular SNP. In Figure 1c we show the observed distances again, but now individuals in red are those within the H subsample and individuals in blue are those excluded from the "best" subsample (i.e., the subsample with the smallest determinant at convergence). Finally, like with the standard MCD algorithm, we can also compute robust squared MDs, which help reveal "masking" effects by exaggerating the more extreme individuals. Figure 1d shows the observed vs. robust squared MDs. Figure 1e is the same as Figure 1d except that the two most extreme individuals are excluded, to provide a better view of the observed vs. robust squared MDs.

Applications and Extensions
Now that we have established an MCD algorithm for categorical data we show in this section examples of its use, including how our approach generalizes the MCD to allow for virtually any data type (i.e., categorical, ordinal, continuous). Because our approach is based on CA, we can capitalize on particular properties of CA and X² to allow for the analyses of non-categorical data such as ordinal, continuous, or mixed data. We show how this is possible through a data recoding technique called data "doubling" (a.k.a. fuzzy coding, bipolar coding, or the "Escofier transform").
In this section we illustrate our generalized MCD with two data sets: one from the Ontario Neurodegenerative Disease Research Initiative (ONDRI) and one from the Alzheimer's Disease Neuroimaging Initiative (ADNI). First we apply our approach to the miniature survey of autobiographical memory (MSAM; adapted from Palombo, Williams, Abdi, & Levine, 2013) and treat the MSAM as categorical data. We then apply our approach to two recoded versions of the MSAM: one recoded to eliminate rare categories and another that treats the data as ordinal responses. We compare and contrast the results from all three applications to the MSAM. Finally, we use a mixture of data types from ADNI data to demonstrate how our technique applies to data sets with mixed variables. The ADNI data include categorical data (genotypic markers from single nucleotide polymorphisms), ordinal data (clinical dementia rating), and continuous data (volumetric estimates from brain regions). We explain each data set and application in its own section.
As part of the quality control process, various subsets of data are subjected to outlier analyses. However, one of the major challenges is that many of the variables and instruments are not quantitative and we therefore could not use existing techniques such as the MCD, robust PCA (Candès, Li, Ma, & Wright, 2011), or principal orthogonal complement thresholding (Fan, Liao, & Mincheva, 2013).
Here, we illustrate our generalized MCD on one such data set: the MSAM. We use MSAM data from N = 300 participants across two cohorts (vascular cognitive impairment, n = 161, and Parkinson's disease, n = 139). We focus our illustration on the identification of outliers and a robust subset of individuals in the MSAM regardless of disease, age, sex, and other factors.

MSAM data.
The MSAM is a short version of the survey of autobiographical memory (SAM). The MSAM is comprised of 10 questions and each question has five possible responses: two indicate disagreement with a question ("strongly disagree", "disagree somewhat"), two indicate agreement with a question ("agree somewhat", "strongly agree"), and one is neutral ("neither"). Here, we analyze the MSAM in three ways so that we can compare and contrast the results of each. In our first example, we treat the data as categorical and apply our technique as previously derived and illustrated (see Figure 1). As noted in Figure 1, very rare responses can greatly exaggerate the squared MD, robust squared MD, and even χ²-distances. Furthermore, in many practical applications rare responses would likely be recoded so that they are combined with similar responses. For example, if a response of "strongly agree" occurred in less than approximately 5% of responses for a particular question, it would be recoded with "agree somewhat" to make a single "agreement" response. Therefore, for our second example we recode rare (≈ 5%) responses, where we combine a rare response (e.g., "strongly agree") with its most similar response (e.g., "agree somewhat"). We continue to treat the data as categorical and apply our technique again. Finally, because the responses could be treated as ordinal data, we use a specific recoding scheme that has many names such as "bipolar coding" (Greenacre, 1984), fuzzy coding, or "doubling" (Greenacre, 2014; Lebart et al., 1984). We refer to this coding henceforth as "fuzzy coding". First we explain fuzzy coding. Next we compare and contrast the results of the analyses with respect to which observations were identified as the robust subsample across each technique and several α levels (i.e., α = .5, α = .75, α = .9). Finally, we also illustrate how to use the bootstrap procedure to define thresholds for outliers given the robust squared MDs.

Fuzzy coding.
With fuzzy coding each original variable is represented by two columns, where the "−" column is the observed value minus the minimum of the variable for that question, and the "+" column is the maximum of the variable for that question minus the observed value. For example, the MSAM has 5 possible levels (1, 2, 3, 4, 5), so each "−" column will use 1 as the minimum and each "+" column will use 5 as the maximum, even if there are no observed values of 1 or 5 in the data themselves. The data are then normed so that each pair of columns (i.e., "+" and "−") sums to 1; consequently, the row sums equal the total number of original variables. Because the data behave as disjunctive data, we can apply MCA, and thus our GMCD approach also works to find a robust subset and help identify outliers. Fuzzy coding is illustrated in Table 2.
Fuzzy coding transforms variables into "pseudo-disjunctive" form in that, from the perspective of CA and MCA, the data tables behave like disjunctive tables (see Table 1) even though the data are not disjunctive: the sums of the rows equal the number of (original) variables, the sum of each variable (each pair of columns) equals the number of rows, and the sum of the table is equal to the number of rows × the number of (original) variables.
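A minimal R sketch of this coding, assuming ratings on a known 1-5 scale and following the min/max convention described above, is:

```r
## Fuzzy ("bipolar"/doubled) coding for ordinal ratings on a known 1-5 scale.
## "-" = (observed - min), "+" = (max - observed), with each pair normed to sum to 1.
fuzzy_code <- function(x, min_level = 1, max_level = 5) {
  minus <- (x - min_level) / (max_level - min_level)
  plus  <- (max_level - x) / (max_level - min_level)
  cbind(`-` = minus, `+` = plus)
}

ratings <- c(1, 3, 5, 2, 4)          # hypothetical responses to one MSAM-like item
fuzzy_code(ratings)
## Each row sums to 1, so a table of V doubled items has row sums equal to V,
## and the CA/MCA machinery applies as if the table were disjunctive.
```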

ONDRI Examples.
In the following examples we refer to the original MSAM data as "as is", the version recoded for rare responses as "recode", and the ordinal version as "ordinal". We applied the generalized MCD to each version with three alpha levels (α = .5, α = .75, and α = .9), each run for a fixed number of iterations of the GMCD algorithm. Figure 2 shows all nine applications of the generalized MCD applied to the MSAM. To note, in Figure 2 we show the MDs and robust MDs (not squared). Part of the reason we show these is because, like with the toy data in the previous section, some observations excluded from the H subsample (denoted in blue) have very high robust MDs by comparison to the H subsample (denoted in red). Though these are the same data, we see interesting behaviors when we treat the data differently. First, when α is low, we see a greater exaggeration of outlying individuals with respect to the "inliers" and the H subset.
However this behavior essentially disappears as α increases. Finally, we also see that the "ordinal" version of the data appears to behave quite differently from the other versions as α increases. In general, the MD and robust MD estimates are highly similar, with only a small gap between the H subsample and the observations excluded from H. The robust MD estimates across techniques tend to vary as a consequence of both data representation (as is, recoded, ordinal) and α. However, the H subsamples across each of these are actually highly similar.
The overlap of the H subsample between all applications of the MSAM analyses is shown in Table 3. When α is small there is more variability between the different versions.
As α increases the H subset becomes much more similar across all versions. However, it is important to note that, of all of these analyses, the "as is" and the "ordinal" versions generally find the same observations as part of the H subset, even though the data are represented in two completely different ways. The largely common subsamples between the "as is" and "ordinal" analyses show that both retain essentially the same information.

Identifying inliers and outliers.
Besides the identification of robust subsamples (and covariance structures), MCD algorithms reveal "masking" effects and identify outliers. In some of our MSAM examples in Figure 2 it is fairly easy to identify "inliers" and the extreme outliers that were previously subject to "masking" effects. These extreme outliers have highly unique response patterns and differ greatly from the rest of the sample.
As previously described, MCD algorithms aim to find the most robust H subsample, wherein the scatter (covariance) is minimized. Sometimes observations in the H subsample have higher robust squared MDs than those outside of the H subsample, like in the "as is" MSAM at α = .5. Other times, like in the ordinal MSAM at α = .9, there is no clear division between the H subsample and the rest of the observations. Therefore, to help identify inliers and outliers we recommend using the bootstrap (Efron, 1982, 1992) to generate a distribution of robust squared MDs. We refer to "inliers" here as those individuals that are not identified as outliers. We refer to outliers here as those above some threshold. The bootstrap-derived distribution of robust squared MDs can be used to identify observations in the sample that fall inside or outside of particular thresholds. The bootstrap procedure we used is as follows. First we bootstrapped the data (in disjunctive or fuzzy format) to create a new MSAM data set of the same size. Next we preprocessed the data just as in standard MCA and projected the preprocessed data onto the robust singular vectors to compute robust squared MDs. We repeat this process many (e.g., 100) times.
Each bootstrapped version of the data generates a set of robust squared MDs. All of the robust squared MDs across all bootstrapped sets are then used to create a distribution of robust squared MDs. We use the distribution of bootstrapped robust squared MDs to identify any of the originally observed robust squared MDs that are outside of a given threshold. For this example we used two thresholds: (1) a fixed threshold at the 95th percentile of the bootstrapped robust squared MD distribution, and (2) a threshold proportional to the H subsample size for each analysis (as determined by α). In Figure 3 we highlight individuals below both thresholds (red), between the thresholds (gold), and above the 95% threshold (blue). For the bootstrap procedure we refer to "inliers" as those individuals that are always under both thresholds. We show two types of outliers in this example: the less extreme outliers between the two thresholds and the extreme outliers beyond the 95% threshold. We use both outlier thresholds to highlight the behavior of the bootstrap procedure across the different applications of the MSAM analyses.
With the bootstrap procedure, especially in the α = .5 analyses, we can see that the "inliers" constitute a more homogeneous set than those identified as the H subsample (Fig. 3). This is in part because the MCD algorithm identifies a subset of fixed size with the minimum determinant, whereas the bootstrap-based inliers are not constrained to a fixed size. As with the H subsamples, the "as is" and the "ordinal" analyses generally find the same observations as part of the inliers, even though the data are represented in two completely different ways.
Finally, we highlight the overlap between the H subsamples and the inliers identified via bootstrap in Table 5. The diagonal in Table 5 shows the overlap between the H subsample and bootstrap inliers for the same analysis. Off-diagonal elements show the relationships between the analyses (i.e., different coding approaches and α levels). While this table reflects the same general story as the previous tables, it does provide a clearer sense of similarity between the H subsamples and the bootstrap-based inliers. For the identification of inliers and outliers we strongly recommend using the bootstrap procedure instead of just the H subsample (e.g., red individuals in Fig. 2) or, for example, quantiles of the robust squared MDs. The bootstrap procedure provides a more stable assessment of individuals and their likelihood of classification as "inlier" or "outlier".
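The following R sketch outlines the bootstrap thresholding described above. The objects X_coded, V_H, d_H, and w_J are placeholders (for example, X, dec_H$v[, keep], dec_H$d[keep], and wJ from the subset-MCA sketch earlier), not our packaged functions:

```r
## Hedged sketch: bootstrap distribution of robust squared MDs.
## X_coded: I x J disjunctive or fuzzy-coded data; V_H, d_H: robust right singular
## vectors and singular values; w_J: full-data column weights (placeholder names).
robust_md <- function(Xb, V_H, d_H, w_J) {
  O  <- Xb / sum(Xb)
  wI <- rowSums(O)
  Z  <- O - wI %o% w_J                                 # deviations w.r.t. the fixed column weights
  Zt <- diag(1/sqrt(wI)) %*% Z %*% diag(1/sqrt(w_J))
  U  <- Zt %*% V_H %*% diag(1/d_H, length(d_H))        # project onto the robust singular vectors
  rowSums(U^2)                                         # supplementary (robust) squared MDs
}

set.seed(1)
n_boot   <- 100
boot_mds <- replicate(n_boot, {
  Xb <- X_coded[sample(nrow(X_coded), replace = TRUE), ]
  robust_md(Xb, V_H, d_H, w_J)
})

observed_md <- robust_md(X_coded, V_H, d_H, w_J)
cut_95      <- quantile(boot_mds, 0.95)                # upper threshold (95th percentile)
outliers_95 <- which(observed_md > cut_95)
## A second, lower threshold (e.g., quantile(boot_mds, alpha)) can flag the less
## extreme outliers, as in Figure 3.
```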

ADNI Data
Data used in the preparation of this article come from Phase 1 of the ADNI database. The ADNI sample used here was N = 613 across three diagnostic groups: Alzheimer's disease (AD) = 264, mild cognitive impairment (MCI) = 166, and a control group (CON) = 183. We used several data sets of different data types from ADNI: single nucleotide polymorphisms (SNPs) as categorical data, the clinical dementia rating (CDR) as ordinal data, and volumetric estimates from five brain regions as continuous data. Like ordinal data, continuous data can also be transformed in a way such that they behave like disjunctive data. These data can be acquired in ADNI via the genome-wide data and the ADNIMERGE R package.
The CDR is a structured interview that covers six domains: memory, orientation, judgement, communication, home and hobbies, and self-care (Morris, 1993; see also http://alzheimer.wustl.edu/cdr/cdr.htm). After the interview, ratings are applied to each category with the following possible responses: 0 (normal), 0.5 (very mild), 1 (mild), 2 (moderate), and 3 (severe). The CDR has ordinal responses and therefore we recoded the CDR with fuzzy coding.
We also included five brain volumetric estimates known to be impacted by AD: ventricles, hippocampus, entorhinal cortex, fusiform gyrus, and medial temporal regions.
The volumetric estimates here are treated as continuous data. Similar to ordinal data, we code the data as "doubled" continuous data so that they, too, behave like disjunctive data.

We refer to the transformation of continuous to pseudo-disjunctive data specifically as the "Escofier transform" (Beaton et al., 2016) because it was proposed by Escofier (1979). The transformation is similar to "fuzzy coding" in that it, too, is "bipolar". The "doubling" of continuous data is simpler than with ordinal data and only requires that we add or subtract 1 from the centered and scaled data, and divide each value by 2. An example of "doubling" for continuous data can be seen in Table 6. With "doubled" continuous data we can apply the GMCD to continuous data because the data now behave like disjunctive data. Our mixed data set was thus a matrix of I = 613 × J = 57 (13 categorical SNPs that spanned 35 columns, 6 CDR domains that spanned 12 columns, and 5 brain regions that spanned 10 columns).
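A minimal sketch of this transform for a single continuous variable (assuming z-scored input and the +1/−1 convention described above) is:

```r
## Escofier transform ("doubling") of a continuous variable: center and scale, then
## represent each value z as the pair ((1 - z)/2, (1 + z)/2); each pair sums to 1.
escofier_double <- function(x) {
  z <- as.numeric(scale(x))
  cbind(`-` = (1 - z) / 2, `+` = (1 + z) / 2)
}

volumes <- c(6.2, 7.1, 5.8, 6.6, 7.4)     # hypothetical volumetric estimates
escofier_double(volumes)                   # rowSums(...) == 1 for every observation
```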
Figure 4 shows the GMCD applied to the ADNI mixed data. Figure 4c shows the observed squared MDs and χ²-distances; individuals in red are those that comprise the "best" determinant (the subsample at convergence). Finally, like with the standard MCD algorithm, we can also compute robust MDs, which help reveal "masking" effects by exaggerating the more extreme individuals. Figure 4d shows the observed vs. robust MDs; the robust MDs were computed with respect to the individuals identified as the minimum determinant set in Figure 4c. Figure 4e shows outliers (in blue) as identified through our bootstrap procedure. Like before, we used two cutoffs: the lower bound was proportional to the H subsample size and the upper bound was the 95th percentile of the bootstrap distribution. In the mixed data case, 335 observations were identified as the H subsample and 432 were identified as inliers of the total N = 613.

Discussion
We presented the first MCD-based approach that generalizes to virtually any data type (i.e., categorical, ordinal, continuous). Our approach relies entirely on Mahalanobis distances derived from CA via the SVD and, quite importantly, avoids many of the issues of Mahalanobis-like distances that Goodall (1966) noted (i.e., excessive weight or arbitrary metrics). Furthermore, because our generalized MCD relies on CA we can use some simple recoding approaches that allow data tables to behave as disjunctive tables (see Tables 2 and 6). Our approach, like many MCD-based and -inspired approaches (Boudt et al., 2017; Fritsch, Varoquaux, Thyreau, Poline, & Thirion, 2012; Mejia et al., 2017; Rousseeuw & Van Driessen, 1999), identifies a robust substructure. Also like many other approaches, our generalized MCD requires that we identify outliers in a way that is more reliable than simply using the H subset. Here we have opted to use a bootstrap procedure; this bootstrap procedure is easily adapted, can accommodate any threshold, and provides frequency estimates of outlierness (or inlierness) per observation.
Our generalized MCD was designed around the necessity to identify robust structures and outliers in complex non-quantitative data. Within the ONDRI study we realized that many, if not all, existing MCD-like and related techniques could not accommodate our needs. We required a multivariate outlier technique that could work on categorical, ordinal, or mixed data with ease. While there are other possible algorithms (e.g., Candès et al., 2011; Fan et al., 2013), we chose to extend the MCD because it is a reliable and widely used technique (Rousseeuw & Van Driessen, 1999, had been cited 1,740 times at the time we wrote this paper) available in languages routinely used for multivariate analyses (e.g., https://cran.r-project.org/web/packages/robustbase/index.html, https://cran.r-project.org/web/packages/rrcov/index.html, and https://wis.kuleuven.be/stat/robust/LIBRA).
We have made our software available in R as part of a package focused on outliers and robust structures (see https://github.com/derekbeaton/ours).

Generalized MCD beyond our examples
We established an MCD generalized across data types. In most practical and simple cases, data matrices typically consist of one type of data. However, with studies such as ONDRI and ADNI, the boundaries of what constitutes a "data set" are blurry: researchers analyze multiple variables from a variety of sources. Often these variables are of different types. Our example with the ADNI data highlights a typical data set that stems from these types of studies. However, in the ADNI example we treated each variable as an independent measure, which was somewhat unrealistic: each individual variable was part of a larger related set of variables (e.g., SNPs, the CDR, and brain imaging). A family of techniques has been designed specifically to handle such data sets: multiple factor analysis (MFA; Abdi, Williams, & Valentin, 2013; Escofier & Pagès, 1994). The MFA technique was designed for many variables that stem from specific data partitions measured on the same observations. Furthermore, the MFA technique has been extended, based on the work by Escofier (1979), to accommodate mixed data types (Bécue-Bertaut & Pagès, 2008). Our GMCD approach thus applies not only to single data tables but also to MFA-type problems.
Our approach also inherently allows for a priori defined covariance structures à la Eq. (24). GCA, as defined by Escofier (1983, 1984), allows for known sets of weights (e.g., 𝕎_I) or expected values (e.g., 𝔼_X). Because our generalized MCD is based on GCA, we do not depend on the H̄ sample being set to 0s in the iterative process. Rather, if some known weights and/or some known structure exist, then we can use those to identify the most robust set, and thus outliers, from the observed values.
Furthermore, our approach stems broadly from CA but was presented here via MCA. In MCA the typical assumptions are that the columns comprise all levels of all variables, that the rows are observations, and that generally all observations are assigned an equal weight.
However, in standard CA rows do not have equal weights: in standard CA the row weights, much like the column weights, are proportional to the elements in that row (cf. Eq. (10)).
Our approach would still work for data with various weights assigned to the rows, though we have presented only the equally weighted (MCA) case here. Finally, our approach is based on MCA. However, in MCA it is standard to discard many of the low-variance components (Abdi & Valentin, 2007). The removal of low-variance components stems from the fact that the number of components is actually an overestimate of the subspace and variance, because the number of columns is greater than the actual number of variables. The "Benzécri correction" is the most common approach to identify which low-variance components to discard (Benzécri, 1979). Note that in MCA it is easy to find the rank of the data and the actual components associated with the data: with C nominal variables the subspace is defined by components with eigenvalues greater than 1/C.
In our GMCD approach we use the full set of components and thus rely on data that are full rank and where I > J, not just I > C. However the notion that many low-variance components can be discarded suggests a simple approach for a high-dimensional generalized MCD.

Limitations
While our approach opens many avenues to extend the MCD there are some limitations.
First, the "Benzécri correction" (Benzécri, 1979) discards low-variance components and thus we can compute a Mahalanobis-like distance from some arbitrary number of high-variance components this does not work well in practice. In most cases outliers typically express high values on low-variance components. If we were to discard components we might not identify a proper robust structure and thus outliers. However the dimensionality correction (Benzécri, 1979) of the space is only established for strictly categorical data because the "doubling" approaches alone provide correction dimensionality estimates GENERALIZED MCD 50 (Escofier, 1979). How to determine, and then use, the correct dimensionality of the space is also an open question. For now we have opted for the more conservative approach (require full rank data and retain all components).
Finally, our algorithm does not include one particular feature of the standard MCD algorithm: we are not able to compute a robust center on the original data. Rather, our robust center exists entirely within the χ²-space as defined by the factor scores, because subset MCA "maintains the geometry [. . .] and χ²-distances of the complete MCA [. . .]" (Greenacre & Pardo, 2006). For example, if we were to apply our technique to continuous data (recoded as in Table 6) and compare against the standard MCD, we would obtain slightly different results.

Conclusions
The MCD is a very reliable and widely used approach to identify robust multivariate structures and likely outliers. However, until now the MCD could only be used for data that are (or were assumed to be) continuous. Our approach uses correspondence analysis to define Mahalanobis distances (via the singular vectors) and allows us to do so for virtually any data type. We believe that our generalized MCD via GCA opens many new avenues to develop multivariate robust and outlier detection approaches for the complex data sets we face today. Not only does GCA provide the basis for a new family of MCD techniques, but our technique also suggests ways to approach other robust techniques (Candès et al., 2011; Fan et al., 2013). With complex studies that contain many mixed variables, techniques like our generalized MCD will be essential to maintain data quality and integrity.
Acknowledgements
Funding. Some of this research was conducted with the support of the Ontario Brain Institute, an independent non-profit corporation, funded partially by the Ontario government. The opinions, results, and conclusions are those of the authors and no endorsement by the Ontario Brain Institute is intended or should be inferred.