## Abstract

Current large scale studies of brain and behavior typically involve multiple populations, diverse types of data (e.g., genetics, brain structure, behavior, demographics, or “mutli-omics,” and “deep-phenotyping”) measured on various scales of measurement. To analyze these heterogeneous data sets we need simple but flexible methods able to integrate the inherent properties of these complex data sets. Here we introduce partial least squares-correspondence analysis-regression (PLS-CA-R) a method designed to address these constraints. PLS-CA-R generalizes PLS regression to most data types (e.g., continuous, ordinal, categorical, non-negative values). We also show that PLS-CA-R generalizes many “two-table” multivariate techniques and their respective algorithms, such as various PLS approaches, canonical correlation analysis, and redundancy analysis (a.k.a. reduced rank regression).

## 1 Introduction

Today’s large scale and multi-site studies, such as the UK BioBank (https://www.ukbiobank.ac.uk/) and the Rotterdam study (http://www.erasmus-epidemiology.nl/), collect population level data across numerous types and modalities, (e.g., genetics, neurological, behavioral, clinical, and laboratory measures). Other types of large scale studies—typically those that emphasize diseases and disorders—involve a relatively small of participants but collect a very large number of measures on diverse modalities. Some such studies include the Ontario Neurodegenerative Disease Research Initiative (ONDRI) (Farhan et al. 2016) which includes genetics, multiple types of magnetic resonance brain imaging (Duchesne et al. 2019), a wide array of behavioral, cognitive, clinical, and laboratory batteries, as well as many modalities “between” these types, such as ocular imaging, gait/balance (Montero-Odasso et al. 2017), eye tracking, and neuro-pathology. Though large samples (e.g., UK BioBank) and depth of data (e.g., ONDRI) are necessary to understand typical and disordered samples and populations, few statistical or machine learning approaches exist 1) that can accommodate such large (whether “big” or “wide”), complex, and heterogeneous data sets and 2) that also respect the inherent properties of such data.

In many cases, the analysis of a mixture of data types loses information in part because the data need to be transformed to fit the analytical methods; but this analysis also loses inferential power because the standard assumptions may be inappropriate or incorrect. For example, to analyze categorical and continuous data together, a typical— but inappropriate—strategy is to recode the continuous data into categories such as di-Chotomization, trichotomization, or other (often arbitrary) binning strategies. Furthermore, ordinal and Likert scale data—such as responses on many cognitive, behavioral, clinical, and survey instruments—are often incorrectly treated as metric or continuous values (Bürkner & Vuorre 2019). When it comes to genetic data, such as single nucleotide polymorphims (SNPs), the data are almost always recoded by counting the number of the minor homozygote to give: 0 for the major homozygote (two copies of the most frequent of the two alleles), 1 for the heterozygote, and 2 for the minor homozygote. This {0, 1, 2} recoding of genotypes (1) assumes additive linear effects based on the minor homozygote and (2) is often treated as metric/continuous values (as opposed to categorical or ordinal), even when known effects of risk are neither linear nor additive, such as haplotypic effects (Vormfelde & Brockmöller 2007) nor exclusively based on the minor homozygotes, such as ApoE in Alzheimer’s Disease (Genin et al. 2011). Interestingly other, less restrictive, models (e.g., dominant, recessive, genotypic) perform better (Lettre et al. 2007) than the additive model, but these are rarely used.

Here we introduce partial least squares-correspondence analysis-regression (PLS-CA-R): a latent variable regression modeling approach suited for the analysis of complex data sets. We first show that PLS-CA-R generalizes PLS regression (Wold 1975, Wold et al. 1984, Tenenhaus 1998, Abdi 2010), CA (Greenacre 1984, 2010, Lebart et al. 1984), and PLS-CA (Beaton et al. 2016)—which is the “correlation” companion and basis of PLS-CA-R. We then show that PLS-CA-R is a data-type general PLS regression method based on the generalized singular value decomposition (GSVD) that combines the features of CA (to accommodate multiple data types) with the features of PLS-R (as a regression method generalizing ordinary least squares regression when its assumptions are not met).

We illustrate the multiple uses of PLS-CA-R, with data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI). These data include: diagnosis data (mutually exclusive categories), SNPs (genotypes are categorical), multiple behavioral or clinical instruments (that could be ordinal, categorical, or continuous), and several neuroimaging measures and indices (generally either continuous or non-negative). PLS-CA-R can 1) accommodate different data types in a predictive or fitting framework, 2) regress out (i.e., residualize) effects of mixed and collinear data, and 3) reveal latent variables within a well-established framework.

In addition, we provide a Supplemental section that shows how PLS-CA-R provides the basis for further generalizations. These include different optimization schemes (e.g., covariance as in PLS or correlation as in canonical correlation), transformations for alternate metrics, ridge-like regularization, and three types of PLS algorithms (regression, correlation, and canonical). We also provide an R package and examples at https://github.com/derekbeaton/gpls.

This paper is organized as follows. In Section 2 we introduce PLS-CA-R. Next, in Section 3, we illustrate PLS-CA-R on the TADPOLE challenge (https://tadpole.grand-challenge.org/) and additional genetics data from ADNI across three examples: 1) a simple discriminant example with categorical data, 2) a mixed data example that requires residualization, and 3) a larger example of multiple genetic markers and whole brain tracer uptake (non-negative values). Finally in Section 4 we provide a discussion and conclusions.

## 2 Partial least squares-correspondence analysis-regression

### 2.1 Notation

Bold uppercase letters denote matrices (e.g., **X**), bold lowercase letters denote vectors (e.g., **x)**, and italic lowercase letters denote specific elements (e.g., *x*). Upper case italic letters (e.g., *I*) denote cardinality, size, or length where a lower case italic (e.g., *i*) denotes a specific value of an index. A generic element of matrix **X** would be denoted *x*_{i,j}. Common letters of varying type faces, for example **X, x**, *x*_{i,j} come from the same data structure. A preprocessed or transformed version of a matrix **X** is denoted **Z**_{X}. The sum of matrix will be denoted with _{++} where, for example, the sum of the matrix **X** is *X*_{++}. Vectors are assumed to be column vectors unless otherwise specified. Two matrices side-by-side denotes standard matrix multiplication (e.g., **XY).** The operator ⊙ denotes element-wise (Hadamard) multiplication and ∅ denotes element-wise (Hadamard) division. The matrix **I** denotes the identity matrix, **1** denotes a vector 1’s and **0** is a null matrix (all entries are 0). Superscript ^{T} denotes the transpose operation, superscript ^{−1} denotes standard matrix inversion, and superscript ^{+} denotes the Moore-Penrose pseudo-inverse. The diagonal operator, denoted diag{}, transforms a vector into a diagonal matrix, or extracts the diagonal of a matrix to produce a vector.

### 2.2 The Generalized SVD and Correspondence Analysis

Given an *I* × *J* matrix **X**, the singular value decomposition (SVD) decomposes **X** as
where **X** is of rank *A* (with *A* ≤ min(*I, J*)) and **Δ** is the *A* × *A* diagonal matrix of the singular values, and **Λ** = **Δ**^{2} is the *A* × *A* diagonal matrix of the eigenvalues.

The *generalized* singular value decomposition (GSVD) generalizes the SVD by integrating constraints—also called *metrics* because these matrices define a metric in a generalized Euclidean space—imposed on the rows and columns of the matrix to be decomposed. Specifically, when both an *I* × *I* matrix M and a *J* × *J* matrix **W** are positive definite matrices, the GSVD decomposes **X** as
where **X** is of rank *A* and **Δ** is the *A* × *A* diagonal matrix of the singular values, **Λ** = **Δ**^{2} is the *A* × *A* diagonal matrix of the eigenvalues, and **P** and **Q** are the *generalized* singular vectors. We compute singular vectors as and where **U**^{T}**U** = **I** = **V**^{T}**V.** From the weights, generalized singular vectors, and singular values we can obtain component (a.k.a. factor) scores as **F**_{I} = **MPΔ** and **F**_{J} = **WQΔ** for the *I rows* and *J* columns of **X**, respectively. For convenience, the GSVD is presented as a “triplet” with the row metric (weights), data, and column metric (weights) as GSVD(**M, X, W**). We present the GSVD triplet here in slightly different way than it is typically presented (see, e.g., **?**). Our presentation of the GSVD reflects the multiplication steps taken before decomposition (see our Supplmental Material).

Correspondence analysis (CA) is a technique akin to principal components analysis (PCA)—originally designed for the analysis of contingency tables—and analyzes deviations to independence. See Greenacre (1984), Greenacre (2010), and Lebart et al. (1984) for detailed explanations of CA, and then see Escofier-Cordier (1965) and Benécri (1973) for the origins and early developments of CA. CA analyzes an *I* × *J* matrix **X** whose entries are all non-negative. CA is performed with the GSVD as follows. First the *observed* frequencies matrix is computed as . The marginal frequencies from the observed matrix are then computed as

The *expected* frequencies matrix is computed as . The *deviations* from independence matrix is then

We compute CA from with **M**_{X} = diag{**m**_{X}} and **W**_{X} = diag{**w**_{X}}.

### 2.3 PLS-CA-R

PLS regression (Wold 1975, Wold et al. 1984, 2001) decomposition is an asymmetric and iterative algorithm to find latent variables (a.k.a. components) from a set of predictors from **X** that best explain responses **Y**. PLS-R is asymmetric one data table is privileged (i.e., treated as predictors). PLS-R methods simultaneously decompose **X** and **Y**, generally, through an iterative algorithm. For each iteration, we identify the maximum covariance between **X** mid **Y**, and then deflate each matrix.

Here we present the PLS-CA-R approach which generalizes PLS-R for multiple correspondence analysis (MCA) and correspondence analysis (CA)-like methods that generally apply to categorical (nominal) data. Our formulation of PLS-CA-R provides the basis for generalizations to various data types (e.g., categorical, ordinal, continuous, contingency), as well as different techniques, optimizations, and algorithms (see the Supplemental Material for more details).

For simplicity assume in the following formulation that **X** and **Y** are both complete disjunctive tables as seen in Table 1 (see SEX columns) or Table 2. This formulation also applies generally to non-negative data (see later sections). We define observed matrices for **X** and **Y** as

Next we compute marginal frequencies for the rows and columns. We compute row marginal frequencies as and column frequencies as

We then define expected matrices as and deviation matrices as

For PLS-CA-R we have row and column weights of **M**_{X} = diag{**m**_{X}}, **M**_{Y} = diag{**m**_{Y}}, **W**_{X} = diag{**w**_{X}}, and **W**_{Y} = diag{**w**_{Y}}. PLS-CA-R makes use of the rank 1 SVD solution iteratively and works as cross-product , where
where **U**^{T}**U** = **I** = **V**^{T}**V.** We can obtain the *generalized* singular vectors as and , where . Because we make use of the rank 1 solution iteratively, we only retain the first vectors and values from Eq. 10. We distinguish the retained vectors and values as , and . At each iteration, we compute the component scores, for the *J* columns of **Z**_{X} and the *K* columns of **Z**_{Y}, respectively, as and . We compute the latent variables as

Next we compute , and . We use **t**_{X}, b, and **û** to compute rank 1 reconstructed versions of **Z**_{X} and **Z**_{Y} as

Finally, we deflate **Z**_{X} and **Z**_{Y} as and . We then repeat the iterative procedure with these deflated **Z**_{X} and **Z**_{Y}. The computations outlined above are performed for *C* iterations where: (1) *C* is some pre-specified number of intended latent variables where *C* ≤ *A* where *A* is the rank of or (2) when **Z**_{X} = **0** or **Z**_{Y} = **0.** Upon the stopping condition we would have *C* components, and would have collected any vectors into corresponding matrices. Those matrices are:

two

*C*×*C*diagonal mat rices**B**and with each*b*and on the diagonal,the

*I*×*C*matric es**L**_{X}**L**_{Y}, and**T**_{X},the

*J*×*C*matrices , and , andthe

*K*×*C*matrices .

Algorithm 1 shows the key elements of the PLS-CA-R algorithm, from input to deflation.

### 2.4 Maximization in PLS-CA-R

PLS-CA-R maximizes the common information between **Z**_{X} and **Z**_{Y} such that
where is the first singular value from **Δ** for each *c* step. PLS-CA-R maximization is subject to the orthogonality constraint that when *c* ≠ *c*′. This orthogonality constraint propagates through to many of the vectors and matrices associated with **Z**_{X} where these orthogonality constraints do not apply to the various vectors and matrices associated with **Y**.

### PLS-CA-R algorithm. The results of a rank 1 solution are used to compute the latent variables and values necessary for deflation of Z_{X} and Z>_{Y}.

### 2.5 Decomposition and reconstitution

PLS-CA-R is a “double decomposition” where where . PLS-CA-R, like PLS-R, provides the same values as OLS where

This connection to OLS shows how to residualize (i.e., “regress out” or “correct for”) covariates, akin to how residuals are computed in OLS. We do so with the original **Z**_{Y} as . PLS-CA-R produces both a predicted and a residualized version of **Y**. Recall that . We compute a reconstituted form of **Y** as
which is the opposite steps of computing the deviations matrix. We add back in the expected values and then scale the data by the total sum of the original matrix. The same can be done for residualized values (i.e., “error”) as

Typically, **E**_{Y} is derived the data (as noted in Eq. 8). However, the reconstituted space could come from any model by way of generalized correspondence analysis (GCA) (Escofier 1983, 1984, Grassi & Visentin 1994, Beaton et al. 2018). With GCA we could use any reasonable alternates of **E**_{Y} as the model, which could be obtained from, for examples, known priors, a theoretical model, out of sample data, or population estimates. CA can then be applied directly to either **Ŷ** or **Y**_{ϵ}. The same reconstitution procedures can be applied fitted and residualized versions of **X** as well.

### 2.6 Concluding remarks

Now that we have formalized PLS-CA-R, we want to point out small variations, some caveats, and some additional features. We also briefly discuss here—but show in the Supplemental Material—how PLS-CA-R provides the basis for numerous variations and broader generalizations.

In PLS-CA-R (and PLS-R) each subsequent is not guaranteed to be smaller than the previous, with the exception that all are smaller than the first. This is a by-product of the iterative process and the deflation step. This problem poses two issues: (1) visualization of component scores and (2) explained variance. For visualization of the component scores— which use there is an alternative computation: and . This alternative is referred to as “asymmetric component scores” in the correspondence analysis literature (Abdi & Béra 2018, Greenacre 1993). Additionally, instead of computing the variance per component or latent variable, we can instead compute the amount of variance explained by each component in **X** and **Y**. To do so we require the sum of the eigenvalues of each of the respective matrices per iteration via CA (with the GSVD). Before the first iteration of PLS-CA-R we obtain the full variance (i.e., the sum of the eigenvalues) of each matrix from and , which we respectively refer to as *ϕ*_{X} and *ϕ*_{Y} We can compute the sum of the eigenvalues for each deflated version of **Z**_{X} and **Z**_{Y} through the GSVD just as above, referred to as *ϕ*_{X,c} and *ϕ*_{Y, c}. For each *c* component the proportion of explained variance for each matrix is and .

In our formulation, the weights we use are derived from the χ^{2} assumption of independence. However nearly any choices of weights could be used, so long as the weight matrices are at least positive semi-definite (which requires the use of a generalized inverse). If alternate row weights (i.e., **M**_{X} or **M**_{Y}) were chosen, then the fitted values and residuals are no longer guaranteed to be orthogonal (the same condition is true in weighted OLS).

PLS-CA-R provides an important basis for various extensions. We can use different PLS algorithms in addition to Algorithm 1, for examples, the PLS-SVD and canonical PLS algorithms. Our formulation of PLS-CA-R provides the basis for, and leads to a more generalized approach for other cross-decomposition techniques or optimizations (e.g., canonical correlation, reduced rank regression). This basis also allows for the use of alternate metrics as well as ridge-like regularization. We show how PLS-CA-R leads to these generalizations in the Supplemental Material.

Though we formalized PLS-CA-R as a method for categorical (nominal) data coded in complete disjunctive format (as seen in Table 1—see SEX columns—or Table 2), PLS-CA-R can easily accomodate various data types without loss of information. Specifically, both continuous and ordinal data can be handled with relative ease and in a “pseudodisjunctive” format, also referred to as “fuzzy coding” where complete disjunctive would be a “crisp coding” (Greenacre 2014). We explain exactly how to handle various data types as Section 3 progresses, which reflects more “real world” problems: complex, mixed data types, and multi-source data.

## 3 Applications & Examples

In this section we provide examples with real data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI). These examples illustrate how to approach mixed data with PLS-CA-R and the multiple uses of PLS-CA-R (e.g., for analyses, as a residualization procedure). We present three sets of analyses. First we introduce of PLS-CA-R through an example where we want to predict genotypes (categorical) from groups (categorical). Next we show how to predict genotypes from a small set Of behavioral and brain variables. This second example illustrates how to: (1) how to recode and analyze mixed data (categorical, ordinal, and continuous), (2) how to use PLS-CA-R, and (3) how to use PLS-CA-R as residualization technique to remove effects from data prior to subsequent analyses. Finally, we present a larger analysis with the goal to predict genotypes from cortical uptake of AV45 (i.e., a radiotracer) PET scan for beta-amyloid (“A*β*”) deposition. This final example also makes use of residualization as illustrated in the second example.

### 3.1 ADNI Data

Data used in the preparation of this article come from the ADNI database (adni.loni.usc.edu). ADNI was launched in 2003 as a public-private funding partnership and includes public funding by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and the Food and Drug Administration. The primary goal of ADNI has been to test a wide variety of measures to assess the progression of mild cognitive impairment and early Alzheimer’s disease. The ADNI project is the result of efforts of many coinvestigators from a broad range of academic institutions and private corporations. Michael W. Weiner (VA Medical Center, and University of California-San Francisco) is the ADNI Principal Investigator. Subjects have been recruited from over 50 sites across the United States and Canada (for up-to-date information, see www.adni-info.org).

The data in the following examples come from several modalities from the ADNI-G0/2 cohort. The data come from two sources available from the ADNI download site (http://adni.loni.usc.edu/): genome-wide data and the TADPOLE challenge data (https://tadpole.grand-challenge.org/) which contains a wide variety of data. Because the genetics data are used in every example, we provide all genetics preprocessing details here, and then describe any preprocessing for other data as we discuss specific examples.

For all examples in this paper we use a candidate set of single nucleotide polymorphisms (SNPs) extracted from the genome-wide data. We extracted only SNPs associated with the *MAPT, APP, ApoE*, and *TOMM40* genes because these genes are considered as candidate contributors to various AD pathologies: *MAPT* because it is associated with tau proteins, AD pathology, or cognitive decline (Myers et al. 2005, Trabzuni et al. 2012, Desikan et al. 2015, Cruchaga et al. 2012, Peterson et al. 2014), *APP* because of its association with *β*-amyloid proteins (Cruchaga et al. 2012, Huang et al. 2017, Jonsson et al. 2012), as well as *ApoE* and *TOMM40* because of their strong association with AD pathologies (Linnertz et al. 2014, Roses et al. 2010, Bennet et al. 2010, Huang et al. 2017). SNPs were processed as follows via Purcell et al. (2007) with additional R code as necessary: minor allele frequency (MAF) > 5% and missingness for individuals and genotypes ≤ 10%. Because the SNPs are coded as categorical variables (i.e., for each genotype) we performed an additional level of preprocessing: genotypes > 5% because even with MAF > 5%, it was possible that some genotypes (e.g., the heterozygote or minor homozygote) could still have very few occurrences. Therefore genotypes ≤ 5% were combined with another genotype. In all cases the minor homozygote (^{‘}aa’) fell below that threshold and was then combined with its respective heterozygote (^{‘}Aa^{’}); thus some SNPs were effectively coded as the dominant model (i.e., the major homozygote vs. the presence of a minor allele). See Table for an example of SNP data coding examples. From the ADNI-GO/2 cohort there were 791 available participants with 134 total SNPs across the four candidate genes. These 134 SNPs span 349 columns in disjunctive coding (see Table 2). Other data include diagnosis and demographics, some behavioral and cognitive instruments, and several types of brainbased measures. We discuss these additional data in further detail when we introduce these data.

### 3.2 Diagnosis and genotypes

Our first example asks and answers the question: “which genotypes are associated with which diagnostic category?”. In ADNI, diagnosis at baseline is a categorical variable that denotes which group each participant belongs to (at the first visit): control (CN; *N =* 155), subjective memory complaints (SMC; *N* = 99), early mild cognitive impairment (EMCI; *N* = 277), late mild cognitive impairment (LMCI; *N* = 134), and Alzheimer’s disease (AD; *N* = 126). We present this first example analysis in two ways: akin to a standard regression problem {‘a} la Wold [Wold (1975); Wold et al. (1984); Wold et al. (1987); cf. Eq. 15) and then again in the multivariate perspective of “projection onto latent structures” (Abdi 2010).

For this example we refer to diagnosis groups as the predictors (**X**) and the genotypic data as the responses (**Y**). Both data types are coded in disjunctive format (see Tables 1 and 2). Because there are five columns (groups) in **X**, PLS-CA-R produces only four latent variables (a.k.a. components). Table 4 presents the cumulative explained variance for both **X** and **Y** and shows that groups explain only a small amount of genotypic variance: *R*^{2} = 0.0065.

In a simple regression-like framework we can compute the variance contributed by genotypes or group (i.e., levels of variables) or variance contributed by entire variables (in this example: SNPs). First we compute the contributions to the variance of the genotypes as the sum of the squared loadings for each item: , where **1** is a conformable vector of ones. Total contribution takes values between 0 and 1 and describe the proportion of variance for each genotype. Because the contributions are squared loadings, they are additive and so we can compute the contributions for a SNP. A simple criterion to identify genotypes or SNPs that contribute to the model is to identify which genotype or SNP contributes more variance than expected, which is one divided by the total number of original variables (i.e., SNPs). This criterion can be applied on the whole or component-wise. We show the genotypes and SNPs with above expected variance for the whole model (i.e., high contributing variables a regression framework) in Figure 1.

Though PLS-R was initally developed as a regression approach—especially to handle collinear predictors (see explanations in Wold et al. 1984)—it is far more common to use PLS to find latent structures (i.e., components or latent variables) (Abdi 2010). From here on we show the latent variable scores (observations) and component scores (variables) for the first two latent variables/components in Figure 2. The first latent variable (Fig. 2a) shows a gradient from the control (CN) group through to the Alzheimer’s Disease (AD) groups (CN to SMC to EMCI to LMCI to AD). The second latent variable shows a dissociation of the EMCI group from all other groups (Fig. 2b). Figure 2c and d show the component scores for the variables. Genotypes on the left side of first latent variable (horizontal axis in Figs. 2c and d) are more associated with CN and SMC than with the other groups, whereas genotypes on the right side are more associated with AD and LMCI than with the other groups. Through the latent structures approach we can more clearly see the relationships between groups and genotypes. Because we treat the data categorically and code for genotypes, we can identify the specific genotypes that contribute to these effects. For example the ‘AA’ genotype of rs769449 and the ‘GG’ genotype of rs2075650 are more associated with AD and LMCI than with the other groups. In contrast, the ‘TT’ genotype of rs405697 and the ‘TT’ genotype rs439401 are more associated with the CN group than other groups (and thus could suggest potential protective effects).

This group-based analysis is also a discriminant analysis because it maximally separates groups. Thus we can classify observations by assigning them to the closest group. To correctly project observations onto the latent variables we compute ** where 1 is a 1 × ***K* vector of ones where **O**_{Y} ⊘ (**m**_{Y} *1*^{T}) are “row profiles” of Y (i.e., each element of Y divided by its respective row sum). Observations from **are then assigned to the closest group in ****F**_{J}, either for per component, across a subset of components, or all components. For this example we use the full set (four) of components. The assigned groups can then be compared to the *a priori* groups to compute a classification accuracy. Figure 3 shows the results of the discriminant analysis but only visualized on the first two components. Figures 3a and b respectively show the scores for **F**_{J} mid ** Figure 3c shows the assignment of observations to their closest group. Figure 3d visualizes the accuracy of the assignment, where observations in black are correct assignments (gray are incorrect assignments). The total classification accuracy 38.69% (where chance accuracy was 23.08%). Finally, typical PLS-R discriminant analyses are applied in scenarios where a small set of, or even a single, (typically) categorical responses are predicted from many predictors (Pérez-Enciso & Tenenhaus 2003). However, such an approach appears to be “over optimistic” in its prediction and classification (Rodríguez-Pérez et al. 2018), which is why we present discriminant PLS-CA-R more akin to a typical regression problem (i.e., here a single predictor with multiple responses).**

### 3.3 Mixed data and residualization

Our second example illustrates the prediction of genotypes from brain and behavioral variables: (1) three behavioral/clinical scales: Montreal Cognitive Assessment (MoCA) (Nasreddine et al. 2005), Clinical Dementia Rating-Sum of Boxes (CDRSB) (Morris 1993), and Alzheimer’s Disease Assessment Scale (ADAS13) (Skinner et al. 2012), (2) volumetric brain measures in mm^{3}: hippocampus (HIPPO), ventricles (VENT), and whole brain (WB), and (3) global estimates of brain function via PET scans: average FDG (for cerebral blood flow; metabolism) in angular, temporal, and posterior cingulate and average AV45 (A*β* tracer) standard uptake value ratio (SUVR) in frontal, anterior cingulate, precuneus, and parietal cortex relative to the cerebellum. This example higlights two features of PLS-CA-R: (1) the ability to accomodate mixed data types (continuous, ordinal, and categorical) and (2) as a way to residualize (orthogonalize; cf. Eq. 17) with respect to known or assumed covariates.

Here, the predictors encompass a variety of data types: all of the brain markers (volumetric MRI estimates, functional PET estimates) and the ADAS13 are quantitative variables, whereas the MoCA and the CDRSB can be considered as ordinal data. Continuous and ordinal data types can be coded into what is called thermometer (Beaton et al. 2018), fuzzy, or “bipolar” coding (because it has two poles) (Greenacre 2014)—an idea initially propsosed by Escofier for continuous data (Escofier 1979). The “Escofier transform” allows continuous data to be analyzed by CA and produces the same results as PCA (Escofier 1979). The same principles can be applied to ordinal data as well (Beaton et al. 2018). Continuous and ordinal data can be transformed into a “pseudo-disjunctive” format that behaves like complete disjunctive data (see Table 1) but preserves the values (as opposed to binning, or dichotomizing). Here, we refer to the transform for continuous data as the “Escofier transform” or “Escofier coding” (Beaton et al. 2016) and the transform for ordinal data as the “thermometer transform” or “thermometer coding”. Because continuous, ordinal, and categorical data can all be transformed into a disjunctive-like format, they can all be analyzed with PLS-CA-R.

The next example identifies the relationship between brain and behavioral markers of AD and genetics. We use the brain and behavioral data as the predictors and the genetics as responses. However, both data sets have their own covariates (some of which are confounds): age, sex, and education influence the behavioral and brain data, where sex, and population origins influence the genotypic variables. For genetic covariates, we use proxies of population origins: self-identified race and ethnicity categories. Both sets of covariates are mixtures of data types (e.g., sex is categorical, age is generally continuous). So, in this example, we illustrate the mixed analysis in two ways—unadjusted and then adjusted for these covariates. First we show the effects of the covariates on the separate data sets, and then compare and contrast adjusted vs. unadjusted analyes. For these analyses, the volumetric brain data were also normalized (divided by) by intracranial volume prior to these analyses to create proportions of total volume for each brain structure.

First we show the PLS-CA-R between each data set and their respective covariates. The main effects of age, sex, and education explained 11.17% of the variance of the behavioral and brain data, where the main effects of sex, race, and ethnicity explained 2.1% of the variance of the genotypic data. The first two components of each analysis are shown in Figure 4. In the brain and behavioral data, age explains a substantial amount of variance for Component 1. In the genotypic analysis, self-identified race and ethnicity are the primary overall effects, where the first two components are explained by those that identify as black or African-American (Component 1) vs. those that identify as Asian, Native, Hawaiian, and/or Latino/Hispanic (Component 2). Both data sets were reconstituted (i.e., **Y**_{ϵ} from Eq. 17) from their residuals.

Next we performed two analyses with the same goal: to understand the relationship between genetics and behavioral and brain markers of AD. In the unadjusted analysis, the brain and behavioral data explained 1.6% of variance in the genotypic data, whereas in the adjusted analysis, the brain and behavioral data explained 1.54% of variance in the genotypic data. The first two components of the PLS-CA-R results can be seen in Figure 4.

In the unadjusted analysis (Figures 4a and c) vs. the adjusted analysis (Figures 4b and d), we can see some similarities and differences, especially with respect to the behavioral and brain data. AV45 shows little change after residualization, and generally explains a substantial amount of variance as it contributes highly to the first two components in both analyses. The effects of the structural data—especially the hippocampus—are dampened after adjustment (see Figures 4a vs b), where the effects of FDG and CDRSB are now (relatively) increased (see Figures 4a vs b). On the subject level, the differences are small but noticeable, especially for distinguishing between groups (see Figure 6). One important effect is that on a spectrum from CON to AD, we can see that the residualization has a larger impact on the CON side, where the AD side remains somewhat homgeneous (see Figure 6c) for the brain and behavioral variables. With respect to the genotypic LV, there is much less of an effect (see Figure 6d), wherein the observations appear relatively unchanged. However, both pre- (horizontal axis; Figure 6d) and post- (vertical axis; Figure 6d) residualization shows that there are individuals with unique genotypic patterns that remain unaffected by the residualization process (i.e., those at the tails).

From this point forward we emphasize the results from the adjusted analyses because they are more realistic in terms of how analyses are performed. For this we refer to Figure 6b—which shows the latent variable scores for the observations and the averages of those scores for the groups—and Figures 5b and 5d—which show, respectively, the component scores for the brain and behavioral markers and the component scores for the genotypes. The first latent variable (Fig. 6b) shows a gradient from control (CON) on the left to Alzheimer’s Disease (AD) on the right. Brain and behavioral variables on the right side of the first component (horizontal axis in Fig. 5b) are more associated with genotypes on the right side (Fig. 5d), where brain and behavioral variables on the left side of the horizontal axis are more associated with genotypes on the left side. In particular, the AA genotype of rs769449, GG genotype of rs2075650, GG genotype of rs4420638, and AA genotype of rs157582 (amongst others) are related to increased AV45 (AV45+), decreased FDG (FDG-), and increased ADAS13 scores (ADAS13+), whereas the TT genotype of rs405697, GG genotype of rs157580, and TC+TT genotypes of rs7412 (amongst others) are more associated with control or possibly protective effects (i.e., decreased AV4, increased FDG, and decreased ADAS13 scores).

### 3.4 SUVR and genotypes

In this final example we use all of the features of PLS-CA-R: mixed data types within and between data sets, each with covariates (and thus require residualization). The goal of this example is to predict genotypes from *β*—amyloid burden (“AV45 uptake”) across regions of the cortex. In this case, because the AV45 uptake data are are strictly non-negative, we effectively treat these data as counts; that is, we would normally apply CA directly to this one data table. Of course, this is only one possible way to handle such data: It is possible to treat these data as row-wise proportions (i.e., percentage of total uptake per region within each subject) or even as continuous data.

Because not all subjects have complete AV45 and genotypic data, the sample for this example is slightly smaller: *N =* 778. Ethnicity, race, and sex (all categorical) explains 2.07% of the variance in the genotypic data where age (numeric), education (ordinal), and sex (categorical) together explain 2.22% of the variance in the in the AV45 uptake data. Overall, AV45 brain data explains 9.08% of the variance in the genotypic data. With the adjusted data we can now perform our intended analyses. Although this analysis produced 67 components (latent variables), we only focus on the first (0.57% of genotypic variance explained by AV45 brain data).

The first latent variable in Figure 7a is only associated with the horizontal axes (Component 1) in Figure 7b and c. The horizontal axis in Fig. 7a is associated with the horizontal axis in Fig. 7b whereas the vertical axis in Fig. 7a is associated with the horizontal axis in Fig. 7c. The first latent variable (Figure 7a) shows a gradient: from left to right we see the groups configured from CN to AD. On the first latent variable we do also see a group-level dissociation where AD+LMCl are entirely on one side whereas EMCl+SMC+CN are on the opposite side for both L_{X} (AV45 uptake, horizontal) and **L**_{Y} (genotypes, vertical). Higher relative AV45 uptake for the regions on the left side of Component 1 are more associated with EMCI, SMC, and CN than with the other groups, whereas higher relative AV45 uptake for the regions on the right side of Component 1 are more associated with AD and LMCI (Fig. 7b). The genotypes on the left side are associated with the uptake in regions on the left side; likewise the genotypes on the right side are associated with the uptake in regions on the right side (Fig. 7c). For example, LV/Component 1 shows relative uptake in right and left frontal pole, rostral middle frontal, and medial orbitofrontal regions are more associated with the following genotypes: AA and AG from rs769449, GG from rs2075650, GG from rs4420638, and AA from rs157582, than with other genotypes; these effects are generally driven by the AD and LMCI groups. Conversely, LV/Component 1 shows higher relative uptake in right and left lingual, cuneus, as well left parahippocampal and left entorhinal are more associated with the following genotypes: TT from rs405697, GG from rs6859, TC+TT from rs7412, TT from rs2830076, GG from rsl57580, and AA from rs4420638 genotypes than with other genotypes; these effects are generally driven by the GN, SMC, and EMCI cohorts. In summary, the PLS-CA-R results show that particular patterns of regional AV45 uptake predict particular genotypic patterns across many SNPs, and that the sources these effects are generally driven by the groups. Furthermore the underlying brain and genotypic effects of the groups exist along a spectrum of severity.

## 4 Discussion

Many modern studies, like ADNI, measure individuals with a variety of scales such as: genetics and genomics, brain structure and function, many aspects of cognition and behavior, batteries of clinical measures, and so on. These data are complex, heterogeneous, and more frequently these data are “wide” (many more variables than subjects) instead of “big” (more subjects than variables). But many current strategies and approaches to handle such multivariate heterogeneous data often requires compromises or sacrifices (e.g., the presumption of single numeric model for categorical data such as the additive model for SNPs; z-scores of ordinal values; or “dichotomania” (https://www.fharrell.eom/post/errmed/#catg): the binning of continuous values into categories). Many of these strategies and approaches presume data are interval scale. Because of the many features and flexibility of PLS-CA-R—e.g., best fit to predictors, orthogonal latent variables, accommodation for virtually any data type—we are able to identify distinct variables and levels (e.g., genotypes) that define or contribute to control (CON) vs. disease (AD) effects (e.g., Fig. 2) or reveal particular patterns anchored by the polar control and disease effects (CON → SMC → EMCI → LMCI → AD; see, e.g., Fig. 7).

While we focused on particular ways of coding and transforming data, there are many alternatives that could be used with PLS-CA-R. For example, we used a disjunctive approach for SNPs because they are categorical—a pattern that matches the genotypic model. However, through various disjunctive schemes, or other forms of Escofier or fuzzy coding, we could have used any genetic model: if all SNPs were coded as the major vs. the minor allele (‘AA’ vs. {‘Aa+aa’}), this would be the dominant model, or we could have assumed the additive model —i.e., O, 1, 2 for ‘AA’, ‘Aa’, and ‘aa’, respectively—and transformed the data with the ordinal approach (but *not* the continuous approach). We previously provided a comprehensive guide on how to transform various SNP genetic models for use in PLS-CA and CA elsewhere (see Appendix of Beaton et al. 2016). Furthermore, we only highlighted one of many possible methods to transform ordinal data. The term “fuzzy coding” applies more generally to the recoding of ordinal, ranked, preference, or even continuous data. The “fuzzy coding” approach is itself fuzzy, and includes a variety of transformation schemes, all of which conform to the same properties as disjunctive data. The many “fuzzy” and “doubling” coding schemes are described in Escoher (1979), Lebart et al. (1984), or Greenacre (2014). However, for ordinal data—especially with fewer than or equal to 10 levels, and without excessively rare (≤ 1%) Occurences—we recommend to treat ordinal values as categorical levels. When ordinal data are treated as categorical (and disjunctively coded), greater detail about the levels emerges and in most cases could reveal non-linear patterns of the ordinal levels.

We introduced PLS-CA-R in a way that emphasizes various recoding schemes to accomodate different data types. We designed PLS-CA-R as a mixed-data generalization of PLSR, which provides one strategy to identify latent variables and perform regression when standard assumptions are not met (e.g., “high dimensional low sample size” or high collinearity). PLS-CA-R—and GPLS—addresses the need of many fields that require *data type general* methods across multi-source and multi-domain data sets where we require careful considerations about how we prepare and understand our data (Nguyen & Holmes 2019). We introduced PLS-CA-R in a way that emphasizes various recoding schemes to accomodate different data types all with respect to CA and the χ^{2} model. PLS-CA-R provides key features necessary for data analyses as data-rich and data-heavy disciplines and fields rapidly move towards and depend on fundamental techniques in machine and statistical learning (e.g., PLSR, CCA). Finally, with techniques such as mixed-data MFA (Bécue-Bertaut & Pagès 2008), PLS-CA-R provides a much needed basis for development of future methods designed for such complex data sets.

We provide an `R` package with data, documentation, and examples that implements PLS-CA-R (and other cross-decomposition methods) here: https://github.com/derekbeaton/gpls.

Finally, we want to remind the reader to see the Supplemental Material, where we show (1) how PLS-CA-R generalizes most cross-decomposition techniques, (2) alternate PLS algorithms, (3) suggestions on alternate metrics, and finally (4) a ridge-like regularization approach that applies to PLS-CA-R and all the techniques that PLS-CA-R generalizes.

## 4.1 Acknowledgements

Data collection and sharing for this project was funded by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Lumosity; Lundbeck; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Therapeutic Research Institute at the University of Southern California. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California.

## Footnotes

↵* Data used in preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (http://adni.loni.usc.edu/). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at http://adni.loni.ucla.edu/wpcontent/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf

A reorganization of the structure to move lots of the extra content to a supplemental section