## Abstract

While the promise of electronic medical record and biobank data is large, major questions remain about patient privacy, computational hurdles, and data access. One promising area of recent development is pre-computing non-individually identifiable summary statistics to be made publicly available for exploration and downstream analysis. In this manuscript we demonstrate how to utilize pre-computed linear association statistics between individual genetic variants and phenotypes to infer genetic relationships between products of phenotypes (e.g., ratios; logical combinations of binary phenotypes using ‘and’ and ‘or’) with customized covariate choices. We propose a method to approximate covariate adjusted linear models for products and logical combinations of phenotypes using only pre-computed summary statistics. We evaluate our method’s accuracy through several simulation studies and an application modeling various fatty acid ratios using data from the Framingham Heart Study. These studies show consistent ability to recapitulate analysis results performed on individual level data including maintenance of the Type I error rate, power, and effect size estimates. An implementation of this proposed method is available in the publicly available R package pcsstools.

## 1 Introduction

Researchers now have readily available access to massive quantities of genotypic and phenotypic data (Cox, 2018; Simell et al., 2019). For example, via the Electronic Medical Records and Genomics (eMERGE) Network (https://www.genome.gov/Funded-Programs-Projects/Electronic-Medical-Records-and-Genomics-Network-eMERGE), the UK Biobank (Bycroft et al., 2018), and other initiatives and repositories (e.g., 23andMe, MGI (http://pheweb.sph.umich.edu/; Gagliano Taliun et al., 2020), FINRISK, and CHOP (Diogo et al., 2018), among others), researchers can access a wide variety of phenotypic and genomic data on hundreds of thousands of individuals. However, important questions remain about how to best leverage these repositories. For example, the size of biobank datasets makes it challenging to transfer, store, and analyze data locally. While cloud computing minimizes some of these issues, it brings its own challenges related to cost (storage and computation), transfer, and access. Furthermore, data security and privacy issues are of paramount importance throughout all aspects of the data access, storage, and analysis pipeline (Heatherly, 2016; Jones et al., 2012; Simell et al., 2019).

A key innovation in this field is precomputing non-individually identifiable summary statistics on biobank data and maximizing access to these statistics (Pasaniuc & Price, 2017). For example, GeneAtlas provides basic summary statistics for simple linear regression models of single nucleotide variants (SNVs) with thousands of available phenotypic variables across hundreds of thousands of individuals in the UK Biobank (Canela-Xandri et al., 2018), and also provides access to phenotype-phenotype correlations, single nucleotide polymorphism (SNP) minor allele frequencies (MAFs), and Hardy-Weinberg Equilibrium (HWE) *p*-values. Likewise, PheWeb is a software toolkit which provides access to the UK Biobank and Michigan Genomics Initiative data via a series of easy-to-navigate visualization and summary tools (http://pheweb.sph.umich.edu/; Gagliano Taliun et al., 2020; Neale, B. M., 2018). Others simply provide access to sets of pre-computed summary statistics (PCSS) from large datasets (e.g., https://www.leelabsg.org/resources). These resources mitigate many of the privacy and security concerns mentioned above since no individual participant data (IPD) is shared. In addition, these repositories are only a fraction of the size of the IPD, making transfer and storage of the data much more efficient. Finally, because these services provide statistics that are already computed, they alleviate much of the computational burden on researchers. Despite these advantages, significant limitations currently exist when using these repositories of PCSS.

For example, researchers may want to modify a phenotype with available PCSS to one that is of greater clinical interest or use different sets of covariates than those considered in pre-computed analyses. Recent work is beginning to address these limitations. In two recent papers by our group (Gasdaska et al., 2019; Wolf et al., 2020), we demonstrated how to use standard PCSS (only means, variances, and correlations of all predictors and responses) to calculate the coefficients and standard errors for the linear model for a linear combination of phenotypes with an arbitrary set of covariates. This can then be used to perform Principal Component Analysis (PCA) on a set of phenotypes since principal component scores are just linear combinations with weights derived from the phenotype covariance matrix. Further, we demonstrated that if the phenotype correlation matrix is not available, we can use the correlation of test statistics for each phenotype across all genetic markers in its place with little loss of efficiency. These innovations mean that researchers can, using only PCSS, select the unique set of covariates they wish to adjust for and model a linear combination of phenotypes.

Importantly, these two approaches, which require a priori specification of a phenotype of clinical interest, contrast with other recently developed methods which jointly and simultaneously analyze multiple phenotypes (Dutta, Gagliano Taliun, et al., 2019; Dutta, Scott, et al., 2019; Guo & Wu, 2019; Li et al., 2020; Ray & Boehnke, 2018) without an explicit characterization of the relationship between the phenotypes. These joint phenotype tests aim to simultaneously analyze multiple phenotypes while satisfying statistical objectives such as maximizing power under certain conditions. Furthermore, some of these approaches (Guo & Wu, 2019; Ray & Boehnke, 2018) do so using PCSS readily available from existing repositories.

Currently, our group’s methods for using PCSS to analyze modified phenotypes with flexible covariate choices are limited to PCA and choosing a phenotype that is a linear combination of the phenotypes for which PCSS are available. In this manuscript, we demonstrate how to analyze modified phenotypes which are multiplicative combinations of an arbitrarily large number of phenotypes for which PCSS are available. We also demonstrate how to flexibly adjust for covariates in these modified phenotype models. Importantly, we also show how the multiplication of phenotypes, when applied to binary phenotypes, allows for logical combination (“and” and “or”) of phenotypes (e.g., to do inference on a phenotype *y* that is “*y*_{1} or *y*_{2}”). After presenting a mathematical framework for the method, we validate the method using comprehensive simulations and demonstrate the method on real data from the Framingham Heart Study.

## 2 Methods

Consider the *m* phenotypes *y*_{1}, …, *y*_{m}, where each is an *n*×1 vector of measures across *n* subjects, and the *n*×*p* design matrix **X** = (*x*_{1}, …, *x*_{p}), which consists of variables including genotypic information, covariates, and an intercept column. Moreover, let *w*_{m} = *y*_{1}*y*_{2} ⋯ *y*_{m} denote the element-wise (Hadamard) product of all *m* phenotypes for each subject. Our aim is to approximate the coefficients and standard errors of the covariate adjusted linear regression model for the product of *m* phenotypes,

$$\boldsymbol{w}_m = \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\epsilon},$$

using only readily available PCSS.

### 2.1 Assumed Pre-Computed Summary Statistics and Information

As is typically made available, we assume knowledge of the following PCSS: the means of every predictor (e.g., SNPs and covariates), the means of every phenotype, and the full variance-covariance matrix of all predictors and phenotypes (i.e., the sample covariances $s_{x_i x_j}$, $s_{x_i y_k}$, and $s_{y_k y_l}$ for any *i, j, k, l* where 1 ≤ *i, j* ≤ *p* and 1 ≤ *k, l* ≤ *m*). These are all readily available in standard PCSS repositories. We also assume that the distribution each predictor and phenotype follows is known (e.g., binary, log-normal, etc.). Figure 1 displays the assumed information when modeling via both IPD and PCSS.

However, if some summary statistics are unknown, they can often be derived or approximated. For example, SNPs distributed in HWE can have their mean and variance approximated through a binomial distribution given the MAF. Furthermore, the covariance of a genetic variant and a non-genetic variable is calculated as the single-marker slope coefficient (for the model with the non-genetic variable as the response and the genetic variant as the predictor) multiplied by the variance of the genetic variant, since that slope equals the covariance divided by the variance of the genetic variant. Other published papers (Kim et al., 2015; Zhu et al., 2015) have shown that the correlation of two traits can be approximated by the correlation of *Z* statistics of SNPs not associated with either trait; i.e., $\widehat{\mathrm{Cor}}(y_k, y_l) \approx \mathrm{Cor}(z_k, z_l)$, where *z*_{k} and *z*_{l} are vectors of single-marker test statistics for traits *y*_{k} and *y*_{l} across a genome wide association study, filtered such that the associated *p*-values are above a set threshold for both traits. This approximation method is described in detail in Ray & Boehnke (2018). Two of our previous papers (Gasdaska et al., 2019; Wolf et al., 2020) have demonstrated the accuracy of these three methods through both simulation and real-data applications.
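The first two derivations in this paragraph are short enough to sketch in Python. This is only an illustration under the stated assumptions (a biallelic SNP in HWE, so the minor allele count is Binomial(2, MAF)); the function names are hypothetical and are not part of pcsstools, and the numeric inputs are made up.

```python
def snp_moments_from_maf(maf):
    """Mean and variance of a minor allele count under HWE, i.e. Binomial(2, maf)."""
    if not 0.0 <= maf <= 1.0:
        raise ValueError("maf must lie in [0, 1]")
    mean = 2.0 * maf
    variance = 2.0 * maf * (1.0 - maf)
    return mean, variance


def covariance_from_slope(slope, predictor_variance):
    """Recover cov(x, y) from the single-marker slope of y on x.

    The simple-regression slope is cov(x, y) / var(x), so
    cov(x, y) = slope * var(x).
    """
    return slope * predictor_variance


# Illustrative values only: MAF = 0.25 and a single-marker slope of 0.4.
mean_x, var_x = snp_moments_from_maf(0.25)
cov_xy = covariance_from_slope(0.4, var_x)
```
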

### 2.2 Linear Regression with Covariates using Pre-Computed Summary Statistics

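Given a response vector *w*_{m} and design matrix **X** = (*x*_{1}, …, *x*_{p}), which includes *p* variables including SNPs’ minor allele counts, covariates, and a possible intercept column, the normal error regression model $\boldsymbol{w}_m = \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$, where $\boldsymbol{\epsilon} \sim N(\boldsymbol{0}, \sigma^2 \boldsymbol{I})$, has ordinary least squares estimate $\hat{\boldsymbol{\beta}} = (\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\boldsymbol{w}_m$ for $\boldsymbol{\beta}$. Further, $\mathrm{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2 (\boldsymbol{X}'\boldsymbol{X})^{-1}$. In a recent paper (Wolf et al., 2020), we demonstrated how to calculate these values using only PCSS:

$$\boldsymbol{X}'\boldsymbol{X} = (n - 1)\, S(\boldsymbol{X}) + n\, \bar{\boldsymbol{x}} \bar{\boldsymbol{x}}'$$

and

$$\boldsymbol{X}'\boldsymbol{w}_m = \big( (n - 1)\, s_{w_m x_1} + n\, \bar{x}_1 \bar{w}_m,\ \ldots,\ (n - 1)\, s_{w_m x_p} + n\, \bar{x}_p \bar{w}_m \big)',$$

where $S(\boldsymbol{X})$ is the *p*×*p* variance-covariance matrix of the columns of the design matrix **X**, $\bar{\boldsymbol{x}}$ is the *p*× 1 vector of column means of **X**, $\bar{w}_m$ is the mean of *w*_{m}, and $s_{w_m x_j}$ is the sample covariance between *w*_{m} and *x*_{j}.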

With these methods in mind and assumed access to standard PCSS, in order to approximate $\hat{\boldsymbol{\beta}}$, $\hat{\sigma}^2$, and $\mathrm{Var}(\hat{\boldsymbol{\beta}})$ for this covariate adjusted multiple linear regression model, all that remains is to estimate the mean and variance of *w*_{m} and the covariance $s_{w_m x_j}$ for each *j*. We will first demonstrate how to approximate these values when *m* = 2 and show in Section 2.3.2 how recursion can be used to approximate them when *m* > 2.
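To make the summary-statistic route to the least squares coefficients concrete, here is a minimal Python sketch. It relies on the standard identities X′X = (n − 1)S(X) + n·x̄x̄′ and X′w = (n − 1)s_{wx} + n·x̄·w̄, and always includes an intercept. The function names are hypothetical and not the pcsstools interface, the toy data are made up, and standard errors are omitted for brevity.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def ols_from_pcss(n, x_means, x_cov, w_mean, xw_cov):
    """OLS coefficients for w ~ 1 + x_1 + ... + x_q from summary statistics.

    x_means: means of the q non-intercept predictors
    x_cov:   q x q sample covariance matrix of those predictors
    w_mean:  mean of the response
    xw_cov:  sample covariances between each predictor and the response
    """
    means = [1.0] + list(x_means)  # the intercept column has mean 1
    q = len(means)
    # Covariance matrix of the full design; intercept row and column are 0.
    cov = [[0.0] * q for _ in range(q)]
    for i, row in enumerate(x_cov, start=1):
        for j, value in enumerate(row, start=1):
            cov[i][j] = value
    # X'X = (n - 1) S(X) + n * xbar xbar'
    xtx = [[(n - 1) * cov[i][j] + n * means[i] * means[j] for j in range(q)]
           for i in range(q)]
    # X'w = (n - 1) s_xw + n * xbar * wbar
    s_xw = [0.0] + list(xw_cov)
    xtw = [(n - 1) * s_xw[i] + n * means[i] * w_mean for i in range(q)]
    return solve_linear_system(xtx, xtw)


# Toy individual-level data, used only to produce the summary statistics.
x = [0, 1, 2, 1, 0, 2]
w = [1 + 2 * xi for xi in x]  # exact line with intercept 1 and slope 2
n = len(x)
x_bar, w_bar = sum(x) / n, sum(w) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
s_xw = sum((xi - x_bar) * (wi - w_bar) for xi, wi in zip(x, w)) / (n - 1)
beta = ols_from_pcss(n, [x_bar], [[s_xx]], w_bar, [s_xw])
```

Because the toy response lies exactly on a line, the coefficients recovered from the summary statistics should match the intercept and slope used to build it.
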

### 2.3 Covariance Estimation

#### 2.3.1 Covariance Estimation with the Product of 2 Phenotypes

Let *w*_{2} = *y*_{1}*y*_{2} be the element-wise (Hadamard) product of *y*_{1} and *y*_{2}. Then, if *x*_{j} represents an “intercept” column of the design matrix with all elements unity (i.e. if *x*_{j} = (1, …, 1)′), we set $s_{w_2 x_j} = 0$, since a constant column has zero covariance with any variable. Otherwise, we proceed as follows:

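We first approximate the conditional means and variances of *y*_{1} and *y*_{2} given *x*_{j} = *x* through a linear regression model:

$$\hat{E}(y_k \mid x_j = x) = \bar{y}_k + \hat{\beta}_{kj}\,(x - \bar{x}_j)$$

and

$$\widehat{\mathrm{Var}}(y_k \mid x_j = x) = \hat{\sigma}^2_{kj},$$

where $\hat{\beta}_{kj} = s_{y_k x_j} / s^2_{x_j}$ is the slope and $\hat{\sigma}^2_{kj}$ is the residual variance of the simple linear regression of *y*_{k} on *x*_{j}, for *k* = 1, 2. We note that this conditional variance will be constant at any value of *x*_{j} following from the linear regression assumption of homoscedasticity.

Then, we calculate the sample partial correlation of *y*_{1} and *y*_{2} controlling for *x*_{j}:

$$r_{y_1 y_2 \cdot x_j} = \frac{r_{y_1 y_2} - r_{y_1 x_j}\, r_{y_2 x_j}}{\sqrt{(1 - r^2_{y_1 x_j})(1 - r^2_{y_2 x_j})}},$$

setting $r_{y_1 y_2 \cdot x_j} = 0$ if either $r^2_{y_1 x_j} = 1$ or $r^2_{y_2 x_j} = 1$. As the expectation of the conditional correlation equals the partial correlation under the assumption of a multivariate linear relationship between (*y*_{1}, *y*_{2}) and *x*_{j} (Baba et al., 2004), we use the partial correlation as an estimate of the conditional correlation of *y*_{1} and *y*_{2} at all possible values of *x*_{j}. So, we approximate the covariance of *y*_{1} and *y*_{2} conditional on *x*_{j}:

$$\widehat{\mathrm{Cov}}(y_1, y_2 \mid x_j = x) = r_{y_1 y_2 \cdot x_j}\, \hat{\sigma}_{1j}\, \hat{\sigma}_{2j}.$$

These terms let us approximate the conditional mean of *w*_{2} at a given value *x* of *x*_{j}:

$$\hat{E}(w_2 \mid x_j = x) = \hat{E}(y_1 \mid x_j = x)\, \hat{E}(y_2 \mid x_j = x) + \widehat{\mathrm{Cov}}(y_1, y_2 \mid x_j = x).$$

Then, letting *f*_{j}(*x*) be an assumed probability density or mass function for *x*_{j} with support 𝒮_{j} (e.g., if *x*_{j} is a vector of minor allele counts with MAF *p*, letting $f_j(x) = \binom{2}{x} p^x (1 - p)^{2 - x}$ and 𝒮_{j} = {0, 1, 2}), we approximate the sample covariance of *x*_{j} and *w*_{2}:

$$\hat{s}_{w_2 x_j} = \sum_{x \in \mathcal{S}_j} (x - \bar{x}_j)\, \hat{E}(w_2 \mid x_j = x)\, f_j(x),$$

swapping the sums for integrals across the support when appropriate.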

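We calculate the sample mean of *w*_{2} as

$$\bar{w}_2 = \bar{y}_1 \bar{y}_2 + \frac{n - 1}{n}\, s_{y_1 y_2}.$$

To approximate the variance, we first approximate the conditional variance of *w*_{2} at each level of *x*_{j} from the conditional means, variances, and covariance of *y*_{1} and *y*_{2}. We then approximate the sample variance by combining these conditional variances with the conditional means through the law of total variance:

$$\hat{s}^2_{w_2} = \sum_{x \in \mathcal{S}_j} \Big[ \widehat{\mathrm{Var}}(w_2 \mid x_j = x) + \big( \hat{E}(w_2 \mid x_j = x) - \bar{w}_2 \big)^2 \Big] f_j(x),$$

once again swapping the sum for an integral across 𝒮_{j} when appropriate. This approach leads to a different variance estimate for each predictor *x*_{j}. We treat the median of these estimates across all *j* as the estimated variance.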

Hence, taking the means, variances, and pairwise covariances of *x*_{j}, *y*_{1}, and *y*_{2} and a distributional assumption about *x*_{j}, we approximate the covariance of one variable (*x*_{j}) with the product of the other two (*w*_{2} = *y*_{1}*y*_{2}) as well as the product’s mean and variance.
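The covariance approximation summarized above can be sketched in Python as follows. This is an illustrative reading of the procedure, not the pcsstools implementation: the function names are hypothetical, the residual variance omits any degrees-of-freedom correction (an assumption of this sketch), the predictor is assumed to be a SNP in HWE, and all numeric inputs are made up.

```python
from math import comb, sqrt


def binomial_pmf(maf):
    """P(x = 0, 1, 2) for a minor allele count under HWE."""
    return {x: comb(2, x) * maf ** x * (1 - maf) ** (2 - x) for x in (0, 1, 2)}


def approx_cov_with_product(x_mean, x_var, y_means, y_vars, xy_covs, y_cov, pmf):
    """Approximate cov(x, y1 * y2) from means, variances and covariances.

    y_means, y_vars, xy_covs are pairs for (y1, y2); y_cov is cov(y1, y2);
    pmf maps each support point of x to its assumed probability.
    """
    slopes = [c / x_var for c in xy_covs]
    # Residual variances of y_k given x (constant in x under homoscedasticity).
    resid_vars = [max(v - c * c / x_var, 0.0) for v, c in zip(y_vars, xy_covs)]
    # Partial correlation of y1 and y2 controlling for x.
    r_x = [c / sqrt(x_var * v) for c, v in zip(xy_covs, y_vars)]
    r_12 = y_cov / sqrt(y_vars[0] * y_vars[1])
    denom = sqrt(max(1 - r_x[0] ** 2, 0.0) * max(1 - r_x[1] ** 2, 0.0))
    partial = 0.0 if denom == 0 else (r_12 - r_x[0] * r_x[1]) / denom
    cond_cov = partial * sqrt(resid_vars[0] * resid_vars[1])

    total = 0.0
    for x, prob in pmf.items():
        cond_means = [m + b * (x - x_mean) for m, b in zip(y_means, slopes)]
        # E(y1 y2 | x) = E(y1 | x) E(y2 | x) + cov(y1, y2 | x)
        cond_mean_w = cond_means[0] * cond_means[1] + cond_cov
        total += (x - x_mean) * cond_mean_w * prob
    return total


# Made-up summary statistics: MAF 0.5, so x has mean 1 and variance 0.5.
cov_xw = approx_cov_with_product(
    x_mean=1.0, x_var=0.5,
    y_means=(1.0, 2.0), y_vars=(1.0, 1.0),
    xy_covs=(0.25, 0.0), y_cov=0.0,
    pmf=binomial_pmf(0.5),
)
```

In this toy setting the second phenotype is unrelated to both the SNP and the first phenotype, so the result should equal the mean of the second phenotype times the covariance of the SNP with the first, 2 × 0.25 = 0.5.
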

Repeating this process for each predictor *x*_{j} and following the linear regression equations presented in Section 2.2 allows for calculation of covariate adjusted slope coefficients for the multiple regression model as well as the standard errors of these slope estimates.

#### 2.3.2 Covariance Estimation with the Product of 3 or More Phenotypes

Regression models for larger products of phenotypes can also be approximated by applying the established method recursively: first estimating the covariance of *x*_{j} and *w*_{2}, then leveraging the covariance of *x*_{j} and *w*_{2} and *x*_{j} and *y*_{3} to estimate the covariance of *x*_{j} and *w*_{3}, and so forth. This recursion procedure is described in more detail in the appendix and software to carry it out is discussed in Section 2.8.

### 2.4 Binary Phenotypes

While nothing in the previous sections precludes the use of the method on the product of binary phenotypes, some improvements to the method can be made in these cases.

#### 2.4.1 Changes to Estimations

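The covariance of a predictor with the product of two binary phenotypes is estimated using the same general framework as developed in Section 2.3.1. The only changes are to the variance estimates. Instead of estimating a phenotype’s conditional variance from a linear model’s residual variance, we estimate it as $\widehat{\mathrm{Var}}(y_k \mid x_j = x) = \hat{E}(y_k \mid x_j = x)\,\big(1 - \hat{E}(y_k \mid x_j = x)\big)$. Further, we calculate the product’s sample variance as $s^2_{w_2} = \frac{n}{n - 1}\, \bar{w}_2 (1 - \bar{w}_2)$.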
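For a 0/1 variable, both variance quantities in this section are functions of a mean alone. The Python sketch below assumes the Bernoulli variance p(1 − p) for the conditional variance and the usual n − 1 denominator for the sample variance; the function names are hypothetical and the data are made up.

```python
from statistics import variance


def binary_conditional_variance(conditional_mean):
    """Variance of a Bernoulli variable with the given conditional mean."""
    return conditional_mean * (1 - conditional_mean)


def binary_sample_variance(mean, n):
    """Sample variance (n - 1 denominator) of a 0/1 vector, from its mean."""
    return n / (n - 1) * mean * (1 - mean)


# Toy binary product; individual-level data are used only as a check.
w = [1, 0, 0, 1, 1, 0, 0, 0]
w_bar = sum(w) / len(w)
from_summary = binary_sample_variance(w_bar, len(w))
from_ipd = variance(w)
```
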

#### 2.4.2 Products as Logical Combinations

Binary phenotypes are of particular importance because their products can be interpreted as logical combinations.

We can represent the logical conjunction *y*_{1} ∧ *y*_{2} (read as “*y*_{1} and *y*_{2}”) as the product *y*_{1}*y*_{2}. Likewise, we express the logical disjunction *y*_{1} ∨ *y*_{2} (“*y*_{1} or *y*_{2}”) as **1**_{n} − ((**1**_{n} − *y*_{1})(**1**_{n} − *y*_{2})).

By framing both disjunctions and conjunctions in terms of phenotype multiplication, we can apply our established methods to approximate the covariances of these combinations with predictors and ultimately estimate linear models for these logical combinations.

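While the case of the conjunction is a trivial application of the above methods of multiplying phenotypes, we will briefly describe how to model the disjunction. To do so, we consider the modified phenotypes $y_1^* = \boldsymbol{1}_n - y_1$ and $y_2^* = \boldsymbol{1}_n - y_2$. (These represent the statements “not *y*_{1}” and “not *y*_{2}.”) This gives us $y_1 \vee y_2 = \boldsymbol{1}_n - y_1^* y_2^*$. Then, $\bar{y}_k^* = 1 - \bar{y}_k$, $s_{y_k^* x_j} = -s_{y_k x_j}$, and $s_{y_1^* y_2^*} = s_{y_1 y_2}$. If we set $w_2^* = y_1^* y_2^*$, our method allows us to estimate $s_{w_2^* x_j}$ for each *x*_{j} as well as $\bar{w}_2^*$ and $s^2_{w_2^*}$. Leveraging these estimates, $s_{w_2 x_j} = -s_{w_2^* x_j}$, $\bar{w}_2 = 1 - \bar{w}_2^*$, and $s^2_{w_2} = s^2_{w_2^*}$, where $w_2 = \boldsymbol{1}_n - w_2^*$ is equivalent to the disjunction *y*_{1} ∨ *y*_{2}. Using these terms as inputs for the framework presented in Section 2.2 allows for coefficient and standard error estimation for the linear model $\boldsymbol{w}_2 = \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$.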
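The algebra behind the disjunction can be checked on a small made-up dataset. The Python sketch below uses hypothetical helper names and individual-level data purely to verify that the "or" phenotype is one minus the product of the negated phenotypes, so that its mean is one minus the product's mean and its covariance with a predictor flips sign.

```python
def conjunction(y1, y2):
    """Element-wise 'and' of two 0/1 vectors, written as a product."""
    return [a * b for a, b in zip(y1, y2)]


def disjunction(y1, y2):
    """Element-wise 'or' via 1 - (1 - y1)(1 - y2)."""
    return [1 - (1 - a) * (1 - b) for a, b in zip(y1, y2)]


def sample_cov(u, v):
    """Sample covariance with an n - 1 denominator."""
    n = len(u)
    u_bar, v_bar = sum(u) / n, sum(v) / n
    return sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v)) / (n - 1)


# Made-up binary phenotypes and minor allele counts.
y1 = [1, 0, 1, 0, 0, 1]
y2 = [1, 1, 0, 0, 0, 1]
x = [2, 1, 1, 0, 0, 2]

# Product of the negated phenotypes: "(not y1) and (not y2)".
neither = conjunction([1 - a for a in y1], [1 - b for b in y2])
either = disjunction(y1, y2)
```
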

### 2.5 Simulation

#### 2.5.1 Simulation 1: Type I Error Maintenance

To verify that our linear model with PCSS approach appropriately maintained the Type I error rate at a variety of *α* thresholds, we carried out a simulation under the null hypothesis that the predictor variant has no linear association with any of the phenotypes of interest. This null hypothesis represents a reasonable subset of the exact null hypothesis which is that the *product* of phenotypes has no linear relationship with the predictor. We carried out this simulation with varying sample size, MAF, phenotype means, phenotype correlations, and for continuous phenotypes, phenotype variances, for products of two binary phenotypes, two continuous phenotypes, and three continuous phenotypes with 10^{8} simulations for each collection of continuous phenotypes and 10^{7} simulations for the case of binary phenotypes. Simulation parameters were generated from distributions (details are available in the Appendix in Table S1).

#### 2.5.2 Simulation 2: Comparisons to IPD Models

To evaluate our method’s ability to replicate the results of covariate adjusted linear models fit to IPD, we carried out three 2^{k} factorial simulations—one for the product of two binary phenotypes, one for the product of two positive continuous phenotypes, and one for the product of three positive continuous phenotypes. We carried out 1000 simulations at each possible combination of parameters. In each simulation, we modeled the phenotype product as a function of a SNP and binary covariate. For the simulations with only two phenotypes, we also included a continuous covariate in our models.

In all simulations, we simulated *n* subjects’ SNP minor allele counts *x*_{1} at HWE with varying MAF. We simulated a binary covariate *x*_{2} with log odds of success *α*_{2}*x*_{1}. When generating sets of two phenotypes we also generated a continuous covariate *x*_{3} from a linear regression model with *x*_{1} with correlation *α*_{3}, then centered and standardized. This resulted in a SNP with two covariates (*p* = 3) in our two phenotype simulations, and a SNP with one covariate (*p* = 2) in our three phenotype simulation.
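The first two data-generating steps can be sketched in Python as follows. This is a minimal illustration, not the simulation code used in the study: the function names are hypothetical, and the sample size, MAF, *α*_{2}, and seed are arbitrary values chosen for the example rather than settings from Table S2. The continuous covariate *x*_{3} is omitted.

```python
import math
import random


def simulate_snp(n, maf, rng):
    """Minor allele counts under HWE: two independent Bernoulli(maf) alleles."""
    return [(rng.random() < maf) + (rng.random() < maf) for _ in range(n)]


def simulate_binary_covariate(x1, alpha2, rng):
    """Binary covariate whose log odds of success are alpha2 * x1."""
    probs = [1.0 / (1.0 + math.exp(-alpha2 * x)) for x in x1]
    return [int(rng.random() < p) for p in probs]


rng = random.Random(2021)  # arbitrary seed, for reproducibility only
x1 = simulate_snp(1000, maf=0.2, rng=rng)
x2 = simulate_binary_covariate(x1, alpha2=0.5, rng=rng)
```
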

We generated individual phenotype measures through the model
where *u*(*y*_{ik}) = *y*_{ik} for continuous phenotypes, *u*(*y*_{ik}) = logit(*y*_{ik}) for binary phenotypes, and the vector of error terms follows a multivariate normal distribution with ***µ*** = **0** and **Σ**_{(i,j)} = *σ*_{i}*σ*_{j}*ρ*_{ij}. In all simulations, parameter values were selected such that, under optimal settings, empirical power was roughly 80–90% at a significance threshold of 10^{−8}. Full details of simulation parameters are available in the Appendix in Table S2.

In each simulation, we found coefficients, standard errors and two-sided *p*-values for the null hypothesis that there was no relationship between the product of phenotypes and the SNP (*x*_{1}) after adjusting for covariates. Values were computed using IPD and PCSS.

Additionally, when simulating two binary phenotypes, we fit covariate-adjusted logistic regression models for the log odds that *y*_{1i}*y*_{2i} = 1 using IPD and recorded the relevant two-sided *p*-value, so that the results of the linear model fit using PCSS could be compared to the correctly specified logistic model.

### 2.6 Real Data Application

#### 2.6.1 Fatty-Acid Conversion Ratios

Fatty acids are of broad importance for a wide range of cardiometabolic traits (Imamura et al., 2020), with ratios of fatty acids often used as a proxy for conversion efficiency. Previous genome wide association studies have explored the genetic architecture of fatty acids and their ratios (Kalsbeek et al., 2018; Lemaitre et al., 2011; N. L. Tintle et al., 2015; N. Tintle et al., 2020). We modeled 12 fatty acid ratios using both IPD and PCSS using data from the Framingham Heart Study’s Generation-3 and Offspring cohorts downloaded from dbGaP (Mailman et al., 2007).

The 12 ratios can be found in the first column of Table 3. Appendix Table S3 lists all fatty acids used in at least one of the 12 ratios alongside their abbreviations.

Quality control measures included setting Mendelian inconsistencies as missing and excluding SNPs with HWE *p* < 0.00001, MAF < 0.05, or missing values for over 10% of subjects. We excluded individuals missing over 10% of their genetic data after initial quality control and then took a subset of unrelated participants. After quality control we were left with 362,330 SNPs over 1455 individuals (657 from the Offspring cohort and 888 from the Generation-3 cohort).

In addition to the standard PCSS described in Section 2.1, we assumed access to pre-computed means and variances of the reciprocal of each fatty acid as well as the correlation between any fatty acid reciprocal and any other fatty acid, covariate, or SNP to model these ratios using PCSS.

We analyzed each fatty acid ratio through the linear model: Ratio ∼ SNP + age + sex for each SNP in our sample using both IPD and PCSS and tested each SNP for statistical significance with the Bonferroni adjusted threshold *α* = 1.37 × 10^{−7}.

### 2.7 Statistical Analysis

#### 2.7.1 Simulation Analysis

To analyze the results of our Type I Error simulations we calculated the empirical Type I Error rate when approximating linear models using PCSS at each specified significance threshold.

For all three 2^{k} factorial simulations, we assessed our PCSS method’s errors relative to models fit using IPD when estimating slope coefficients, standard errors, and *t* statistics as well as the test-decision disagreement rate between the IPD and PCSS approaches at a variety of significance thresholds.

We modeled errors in slope coefficients, standard errors, and test statistics through multiple regression models with logical indicators for each of the *k* parameter settings as predictors, testing at the Bonferroni adjusted significance threshold of 0.05/*k*. We also calculated the overall mean bias and variance.

We compared test decisions regarding the significance of the SNP when modeling the phenotype product and adjusting covariates. Test decisions were computed at significance thresholds 10^{−1}, 10^{−2}, …, 10^{−8}. When analyzing binary phenotypes we also compared test decisions between the linear model fit using PCSS and the logistic regression model fit on IPD to demonstrate the robustness of linear models to model binary outcomes. We reported test disagreement rates between the tests using PCSS and IPD at each significance threshold.
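The test-decision comparison amounts to counting, at each threshold, how often the two approaches land on opposite sides of it. A minimal Python sketch, with a hypothetical function name and made-up *p*-values:

```python
def disagreement_rates(p_ipd, p_pcss, thresholds):
    """Share of tests where IPD and PCSS decisions differ at each threshold."""
    rates = {}
    for alpha in thresholds:
        differ = sum((a < alpha) != (b < alpha) for a, b in zip(p_ipd, p_pcss))
        rates[alpha] = differ / len(p_ipd)
    return rates


thresholds = [10.0 ** -k for k in range(1, 9)]  # 1e-1, 1e-2, ..., 1e-8
p_ipd = [0.20, 0.03, 4e-4, 2e-9]   # made-up p-values for illustration
p_pcss = [0.08, 0.04, 6e-4, 5e-8]
rates = disagreement_rates(p_ipd, p_pcss, thresholds)
```

In this toy input the decisions differ for one of four tests at the loosest threshold and for one of four at the strictest, and agree everywhere in between.
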

#### 2.7.2 Real Data Analysis

We measured our overall bias in slope, standard error, and test statistic estimates as well as the variance of each of these errors for each fatty acid ratio evaluated. We recorded test decisions for both the IPD and PCSS models and recorded which SNPs were found to have significant associations with a given fatty acid ratio. When one approach found a SNP to be significant and the other did not, we recorded if the non-significant result was “borderline” significant (*α* ≤ *p* < 10*α*).
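The “borderline” rule is a simple interval check on the non-significant *p*-value. A short Python sketch, with a hypothetical function name and made-up *p*-values, using the threshold reported in Section 2.6.1:

```python
def is_borderline(p_value, alpha):
    """True when a non-significant p-value satisfies alpha <= p < 10 * alpha."""
    return alpha <= p_value < 10 * alpha


alpha = 1.37e-7  # Bonferroni adjusted threshold from Section 2.6.1
flags = [is_borderline(p, alpha) for p in (5e-8, 4e-7, 2e-6)]
```
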

### 2.8 Software

Software to perform these model approximations as well as those developed in Wolf et al. (2020) is available through the R package pcsstools, available on GitHub at https://github.com/jackmwolf/pcsstools.

## 3 Results

### 3.1 Simulation 1

Empirical Type I error rates when using PCSS are displayed in Table 1. In all simulations, the approach’s empirical Type I error rate was below the tested significance threshold.

### 3.2 Simulation 2

The PCSS method’s errors when approximating slope coefficients, their standard errors, and test statistics are available in Table 2. When aggregated over all simulation settings, we observe a small but anti-conservative bias in its slope and test statistic estimates in each simulation. The magnitude of the mean test statistic error is comparable across all three simulations. Figure 2 displays our PCSS method’s approximated slope coefficients compared to slope coefficients calculated using IPD for the SNP while modeling the phenotype product and adjusting for covariates. Similar graphical comparisons of standard error and test statistic estimates are available in the Appendix in Figures S2 and S3.

When modeling estimation errors for two continuous phenotypes through a linear regression model with indicator variables for all of the simulation settings (*k* = 12, *n* = 2^{k} × 10^{3}), our model for the slope error found all settings except the residual phenotype variances to be significantly associated with the error in the PCSS model’s slope estimate at the adjusted significance threshold 0.05/*k*. All settings had significant associations with the error when estimating the standard error of the slope coefficient or the test statistic. In the case of two binary phenotypes (*k* = 14, *n* = 2^{k} × 10^{3}), we found all settings to have significant associations with the error in slope, standard error, and test statistic estimates. For three continuous phenotypes (*k* = 13, *n* = 2^{k} × 10^{3}), we also found all settings to have significant associations with the error when predicting the slope coefficient, its standard error, and its test statistic.

Figure 3 shows comparisons of estimated and calculated *p*-values for a two-sided *t* test under the null hypothesis that the SNP had no linear association with the phenotype product after adjusting for covariates. Figure 4 shows various error rates between the IPD and PCSS models’ test decisions based on these *p*-values at differing significance thresholds. We see that the overall disagreement rates between all PCSS models and their IPD counterparts decrease as the significance threshold becomes more stringent. Likewise, when the IPD model rejected the null hypothesis, the PCSS model rarely failed to reject, with error rates of at most 13%, which again decreased as the significance threshold became more stringent. When the IPD model failed to reject the null hypothesis, the PCSS approach’s conditional error rates varied by the model’s response. When modeling the product of two continuous or binary phenotypes, the error rate stayed relatively constant across all thresholds at around 3% and 15%, respectively. However, when modeling the product of three continuous phenotypes, the error rate increased as the significance threshold became more stringent. Lastly, when compared to the test decisions of a covariate adjusted logistic regression model, our PCSS approximation of the related linear model tends to reach the same conclusions, with a moderate conservative tendency, especially at more stringent significance thresholds.

### 3.3 Real Data Application

Across all fatty acid ratio models we again observed anti-conservative bias in our slope and test statistic estimates. Our mean slope error was −2.93 × 10^{−3} (Mean Squared Error 0.114) while the mean slope estimate when using IPD was −1.3 × 10^{−3}. Our mean test statistic error was −3.34 × 10^{−4} (Mean Squared Error 4.34 × 10^{−2}). Values are broken down by ratio in Table 3.

Table 3.3 summarizes the number of SNPs found significant when modeling using both IPD and PCSS across all 12 × 362,330 models. We see that 93% (58/62) of the time when an IPD model found a SNP to have a significant association with a given fatty acid ratio, the PCSS model also found the SNP to be significant. Moreover, 98% (61/62) of the time when the IPD model found a significant SNP, the PCSS model found the same SNP to have a *p*-value less than 10*α*. Conversely, of the 64 occasions when the PCSS model found a SNP to have a significant association with a given fatty acid ratio, only 6 (9%) occurred when the IPD model did not find the SNP to be significant. On all of these occasions, the IPD model’s *p*-value was less than 10*α*.

## 4 Discussion

We have developed a method that approximates the covariance of products of phenotypes with other variables using only bivariate and univariate pre-computed summary statistics (PCSS). We then demonstrated how this covariance estimation can be used to approximate linear models for products of phenotypes, how these can model logical “and” and “or” statements, and how these models can include researchers’ choice of covariates. We demonstrated our approximation method’s accuracy relative to models fit on individual participant data through multiple simulations and an application to real genetic data.

The approximations shown here show good performance overall. There is a slight tendency towards anti-conservatism; however, the Type I error rate is maintained. Areas of caution in application of the method include potential compounding of errors when applied to products of *m* phenotypes (where *m* is large), multiplying binary phenotypes that exhibit high negative correlation, and phenotypes that take on negative values. Additional simulation studies and methodological improvements are needed in these cases, and caution should be exercised when applying our method to them. We also note that our method makes assumptions about the fit of the linear model to the data. While these assumptions are the same as in the corresponding analysis of IPD (e.g., a true underlying linear relationship between *y* and *x*), these assumptions may be more acutely important in our PCSS method.

Application of our method to real data from the Framingham Heart Study showed good performance. In general, we have tried to formulate this PCSS method to only rely on commonly available or easily estimated PCSS. However, in our application we assumed that we had the PCSS for the reciprocals of fatty acids (Section 2.6.1). This may not always be the case in practice, but it suggests that these PCSS may be important to pre-compute to assist downstream analyses of ratios.

A variety of limitations of our work are worth noting. First, we used linear regression for a binary response. Previous applications of PCSS have taken this approach (Canela-Xandri et al., 2018), and it is generally robust; however, this approach is less precise than when the underlying relationship is truly linear. While some foundations for a logistic modeling approach were recently proposed by Wu et al. (2021), further work is needed to develop a comprehensive model for logistic regression using PCSS. Second, while our simulation study was comprehensive and we demonstrated our method on real data, further testing on simulated and real data is encouraged to explore special cases not considered here (e.g., linear combinations of products, adjusting for clustered/family data, etc.).

The use of PCSS provides numerous advantages over IPD, including computational efficiency and reduced concerns about data privacy. However, substantially improved and flexible methods are needed in order to fully leverage PCSS in customized downstream analyses. Our method allows researchers further customization of analyzed phenotypes by opening the door to multiplicative combinations of phenotypes, including logical combinations of binary phenotypes. The approximations used are reasonable, with near perfect maintenance of the Type I error rate and power in most situations. Further work is needed to apply the method to additional datasets and to expand the method to larger classes of combined phenotypes.

## Conflict of Interests

The authors declare there is no conflict of interests.

## Data Availability Statement

The simulated data that support the findings of this study are available from the corresponding author upon reasonable request. The real data analyses presented in the current publication are based on the use of study data downloaded from the dbGaP web site under dbGaP accessions phs000007.v29.p10 and phs000342.v20.p13.

## Appendix

### Recursive Covariance Estimation

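Let *w*_{l} = *y*_{1}*y*_{2} ⋯ *y*_{l} = *w*_{l−1}*y*_{l}. In order to estimate $s_{w_l x_j}$ through our established method, we use the means, variances, and pairwise covariances of *x*_{j}, *w*_{l−1}, and *y*_{l} as inputs to the method described in Section 2.3.1. That is, we replace *y*_{1} with *w*_{l−1} and *y*_{2} with *y*_{l}. While the summary statistics involving only *x*_{j} and *y*_{l} are assumed to be known, we must estimate those involving *w*_{l−1}, including $s_{w_{l-1} x_j}$ and $s_{w_{l-1} y_l}$.

Continuation of the recursive process starting at *l* − 1 and working down to 2 will yield an estimate for $s_{w_{l-1} x_j}$, eventually reaching the base case of $s_{w_2 x_j}$.

To approximate $s_{w_{l-1} y_l}$, we re-express the term as the covariance of *y*_{l} with the product *w*_{l−1} = *w*_{l−2}*y*_{l−1}. Then, treating *y*_{l} as the predictor (i.e. as we treat *x*_{j}), we approximate this term through the method described in Section 2.3.1.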

A diagram of the start of this recursion is displayed in Figure S1.

This recursive estimation is impacted by the order in which the phenotypes are multiplied. So, any set of more than two phenotypes will render *m*!/2 possible ways to estimate the regression model through this method (with even more possible through different ways of recursion). Hence, we approximate the covariances and means using all permutations of length *m* of *y*_{1}, …, *y*_{m} unique up to the order of the first two terms as the order of our phenotypes, and take the median of each estimate as the predicted value.
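The set of orderings described here can be enumerated directly. The Python sketch below uses a hypothetical function name and placeholder phenotype labels; it keeps one representative of each pair of permutations that differ only in the order of their first two terms, which gives *m*!/2 orderings. The per-ordering estimates would then be summarized with a median, for example via `statistics.median`.

```python
from itertools import permutations


def phenotype_orderings(labels):
    """Orderings of the phenotypes, unique up to swapping the first two."""
    seen = set()
    orderings = []
    for perm in permutations(labels):
        # Two permutations are equivalent when they share the same unordered
        # first pair and the same ordered remainder.
        key = (frozenset(perm[:2]), perm[2:])
        if key not in seen:
            seen.add(key)
            orderings.append(perm)
    return orderings


three = phenotype_orderings(["y1", "y2", "y3"])
four = phenotype_orderings(["y1", "y2", "y3", "y4"])
```
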

## Acknowledgements

The authors of this work were supported by NIH Grant 2R15HG006915-03 and Dordt University. They would like to thank Martha Barnard, Xueting Xia, Nathan Ryder, and Jason Vander Woude for their help with preliminary stages of this project.