Linear regression of sampling distributions of the mean

We show that the simple and multiple linear regression coefficients and the coefficient of determination R^2 computed from sampling distributions of the mean (with or without replacement) are equal to the regression coefficients and coefficient of determination computed from individual data. Moreover, the standard error of estimate is reduced by the square root of the group size for sampling distributions of the mean. The result has applications when formulating a distance measure between two genes in a hierarchical clustering algorithm. We show that the Pearson R coefficient can measure how differential expression in one gene correlates with differential expression in a second gene.

Linear regression coefficients and the Pearson R correlation have long been used to quantify the relationship between dependent and independent variables [1]. However, the "ecological fallacy" has shown that linear regression and correlation coefficients based on group averages cannot be used to estimate linear regression and correlation coefficients based on individual scores [2,3].

It may not be well known that if all possible groups are considered, in the case of sampling distributions of the mean, the Pearson R coefficient computed from the group averages is equal to the Pearson R coefficient computed from the original individual scores for one independent variable [4,5].

We extend this result and show that the linear regression coefficients (for simple and multiple regression) and the coefficient of determination R^2 computed from sampling distributions of the mean (with or without replacement) are the same as the coefficient of determination and linear regression coefficients computed with the original individual data. The sampling distributions of the mean can also be constructed using differences between two groups of different size. The result has implications for hierarchical clustering of gene expression data.

Section 1 reviews the computation of multiple regression coefficients. Section 2 establishes the equivalence of the regression coefficients for sampling distributions of the mean with and without replacement or differences between two groups of sampling distributions. In Section 3, we show that the coefficient of determination is the same whether it is calculated using individual scores or all possible group averages from sampling distributions. Section 4 shows that the standard error of estimate is reduced by the square root of the group size for sampling distributions of the mean. Section 5 performs numerical simulations to illustrate these principles. Section 6 applies these results and shows that the Pearson R coefficient can be used to measure how differential expression in one gene correlates with differential expression in a second gene when the z-statistic is used.
1 Computing regression coefficients

Multiple regression requires one to compute the coefficients {β*_j, j = 0, ..., K} that minimize the sum of squares

Σ_{i=1}^N ( y_i − β*_0 − Σ_{j=1}^K β*_j x_i^(j) )^2,   (1)

where x^(1), x^(2), ..., x^(K) represent K different independent variables, y is the dependent variable, and the i-th realizations of variables x^(j) and y are x_i^(j) and y_i respectively. Working with centered variables, one can equivalently minimize the sum of squares

Σ_{i=1}^N ( y_i − ȳ − β_0 − Σ_{j=1}^K β_j (x_i^(j) − x̄^(j)) )^2,   (2)

and the coefficients are related by

β*_0 = ȳ − β_0 − Σ_{j=1}^K β_j x̄^(j),   (4)

β_j = β*_j, j = 1, 2, ..., K.   (5)
To solve for β_0, we set the partial derivative of (2) with respect to β_0 to zero, which yields

Σ_{i=1}^N ( y_i − ȳ − β_0 − Σ_{j=1}^K β_j (x_i^(j) − x̄^(j)) ) = 0,   (6)

which implies that β_0 = 0, since

Σ_{i=1}^N (y_i − ȳ) = 0,   Σ_{i=1}^N (x_i^(j) − x̄^(j)) = 0.   (7)

Thus one can redefine the problem of computing multiple regression coefficients (1) to be selection of the coefficients {β_j, j = 1, ..., K} that minimizes the sum of squares

Σ_{i=1}^N ( y_i − ȳ − Σ_{j=1}^K β_j (x_i^(j) − x̄^(j)) )^2.   (8)

In the matrix approach to minimizing the sum of squares, the problem (8) can be written as

X β = y − ȳ,   (9)

where y − ȳ denotes the N by 1 vector with entries y_i − ȳ, and X is a N by K matrix whose entries are

X_{ij} = x_i^(j) − x̄^(j).

One can solve for the multiple regression coefficients in the vector β, which can be derived by setting the partial derivatives ∂/∂β_j of (8) to zero, by left multiplying (9) by the transpose X^T and solving the linear system

X^T X β = X^T (y − ȳ).   (10)

The elements in the square K by K matrix X^T X will be sums of the form

Σ_{i=1}^N (x_i^(j) − x̄^(j))(x_i^(k) − x̄^(k)).   (11)

Similarly the entries in the K by 1 vector b ≡ X^T (y − ȳ) will be sums of the form

Σ_{i=1}^N (x_i^(j) − x̄^(j))(y_i − ȳ).   (12)

It should be noted that for each pair of fixed indices j and k, the sum in either expression (11) or (12) can be represented using a sum of the form

S = Σ_{i=1}^N (u_i − ū)(v_i − v̄).   (13)

In the following section, we show that if the variables x^(1), x^(2), ..., x^(K), and y are replaced with all the elements from the sampling distributions of the mean, the system

α X^T X β = α X^T (y − ȳ)   (14)

is obtained for some constant α. Moreover, we obtain a closed form for the constant α. If m is the group size and we account for order, the size of the matrix X will be N^m × K for selections with replacement and N!/(N − m)! × K for selections without replacement.
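As a concrete illustration of the normal equations above, the following sketch (made-up data; the helper names are ours, not the paper's) solves X^T X β = X^T (y − ȳ) on centered variables, recovers the intercept of the uncentered model from the means, and cross-checks against an ordinary least-squares fit.

```python
import numpy as np

# Minimal sketch: solve the centered normal equations X^T X beta = X^T (y - ybar).
# All data below are made up for illustration.
rng = np.random.default_rng(0)
N, K = 50, 3
X_raw = rng.normal(size=(N, K))          # K independent variables, N observations
y = X_raw @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=N)

X = X_raw - X_raw.mean(axis=0)           # center each column: x_i^(j) - xbar^(j)
b = X.T @ (y - y.mean())                 # right-hand side X^T (y - ybar)
beta = np.linalg.solve(X.T @ X, b)       # the K regression coefficients

# Intercept of the uncentered model: beta*_0 = ybar - sum_j beta_j * xbar^(j).
beta0 = y.mean() - X_raw.mean(axis=0) @ beta

# Cross-check against a least-squares fit with an explicit intercept column.
A = np.column_stack([np.ones(N), X_raw])
ref, *_ = np.linalg.lstsq(A, y, rcond=None)
assert np.allclose(ref[0], beta0) and np.allclose(ref[1:], beta)
```

The centered system is K × K regardless of N, which is what makes the sampling-distribution argument below tractable: only the sums (11) and (12) matter.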

However, the resulting system (14) will still be a K × K system. Since the system (14) is equivalent to the system (10), the regression coefficients for the sampling distributions of the mean will be the same as the regression coefficients computed from the original data according to (5) for 1 ≤ j ≤ K. The equivalence of β*_0 follows from β_0 = 0, equation (4), and the fact that the means of the original data (ȳ and x̄^(j)) are the same as the means computed using all the elements from the sampling distributions of the mean (with or without replacement). If we assume that there are N_p elements in the sampling distribution, this can be stated mathematically as

(1/N_p) Σ_P ( (1/m) Σ_{i=1}^m w_{p_i} ) = w̄,

where w can represent y or x^(j) and w̄ = Σ_{i=1}^N w_i / N. The sum over P is a sum over all possible index values in the sampling distribution.
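The equivalence of the coefficients can be checked numerically. The sketch below (made-up data; N and m kept small so that all N^m ordered selections with replacement can be enumerated; `coeffs` is an illustrative helper, not the paper's notation) regresses the group means of every selection and compares the result with the coefficients from the original scores.

```python
import numpy as np
from itertools import product

# Numerical check: regressing the group means from ALL ordered selections
# (with replacement, group size m) gives the same coefficients as regressing
# the original scores. Data are made up for illustration.
rng = np.random.default_rng(1)
N, K, m = 6, 2, 3                         # small N so N**m selections stay cheap
x = rng.normal(size=(N, K))
y = rng.normal(size=N)

def coeffs(xmat, yvec):
    """Centered normal-equation coefficients (no intercept after centering)."""
    Xc = xmat - xmat.mean(axis=0)
    return np.linalg.solve(Xc.T @ Xc, Xc.T @ (yvec - yvec.mean()))

# Every ordered selection of m indices with replacement, then average.
sel = np.array(list(product(range(N), repeat=m)))    # shape (N**m, m)
x_means = x[sel].mean(axis=1)                        # group means of each x^(j)
y_means = y[sel].mean(axis=1)                        # group means of y

assert np.allclose(coeffs(x_means, y_means), coeffs(x, y))
```

Both the matrix X^T X and the vector X^T (y − ȳ) are scaled by the same factor α, so the solution of (14) is unchanged.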
2 Sampling distributions of the mean

Let U = {u_1, u_2, ..., u_N} and V = {v_1, v_2, ..., v_N} denote the original scores of two of the variables appearing in (13). To form the sampling distributions of the mean, groups of sizes m_1 and m_2 are chosen from the sets U and V. We assume without loss of generality that m_1 ≥ m_2. The first m_2 choices are paired, so that the same summation indices p_1, p_2, ..., p_{m_2} appear in both groups. The means of the original scores are ū = Σ_{i=1}^N u_i / N and v̄ = Σ_{i=1}^N v_i / N. The sum over all possible selections P will be denoted Σ_P; for selections with replacement it takes the form

Σ_{p_1=1}^N Σ_{p_2=1}^N ··· Σ_{p_{m_1}=1}^N,   (20)

while for selections without replacement it takes the same form except that each successive sum excludes the index values already chosen (21). The quantity of interest is the sum over all selections of the products of the deviations of the group means,

S_{ū,v̄} = Σ_P (ū_P − ū)(v̄_P − v̄) = (1/(m_1 m_2)) Σ_P ( Σ_{i=1}^{m_1} (u_{p_i} − ū) ) ( Σ_{i=1}^{m_2} (v_{p_i} − v̄) ),   (23)

where ū_P and v̄_P denote the group means for the selection P.

Sections 2.1 and 2.2 show that S_{ū,v̄} will be equal to a factor α times S as defined by (13) for sampling distributions with and without replacement respectively. Section 2.3 generalizes these results to differences of two groups of sampling distributions. In all cases, the elements of the matrix X^T X and the vector X^T (y − ȳ) will be multiplied by the same factor α when elements from the sampling distributions of the mean are used.

2.1 Sampling distribution with replacement

Start with S_{ū,v̄} as defined by (23). Since we are considering sampling distributions with replacement, the values chosen for the summation indices p_i do not need to be different. We will show that

S_{ū,v̄} = (N^{m_1 − 1} / m_1) S.   (24)

If we distribute the sums inside the parentheses in (23), two types of terms are formed. The first type of term takes the form

(u_{p_i} − ū)(v_{p_i} − v̄),   (25)

where the same summation index p_i is used for u_{p_i} and v_{p_i}. The second type of term takes the form

(u_{p_i} − ū)(v_{p_j} − v̄),   (26)

where different summation indices, p_i and p_j, are used.

All the terms of the form shown in (26) are zero, since the sums Σ_{p_i=1}^N (u_{p_i} − ū) and Σ_{p_j=1}^N (v_{p_j} − v̄) are zero as noted by equation (7). Thus we must only consider terms of the form (25). The sum (20) acting on (u_{p_i} − ū)(v_{p_i} − v̄) can be rearranged so that the sum over p_i is applied last, and simplified to

N^{m_1 − 1} Σ_{p_i=1}^N (u_{p_i} − ū)(v_{p_i} − v̄) = N^{m_1 − 1} S,   (28)

since each sum over an unmatched summation index contributes a factor of N. Multiplying the right side of (28) by m_2 to ensure all summation indices p_i, i = 1, ..., m_2 are accounted for, and multiplying by the factor 1/(m_1 m_2) present in (23), yields (24). One can also derive (24) using random variables and expected values.

2.2 Sampling distribution without replacement

We will show that

S_{ū,v̄} = ((N − 2)! / (m_1 (N − m_1 − 1)!)) S   (29)

for sampling distributions created without replacement. If we distribute the sums inside the parentheses of (23), we again distinguish between two types of terms: terms of the form (25) with a single summation index, and terms of the form (26) with two different summation indices. The sum (21) applied to (u_{p_i} − ū)(v_{p_i} − v̄) can be rearranged so that the sum Σ_{p_i=1}^N with summation index p_i is placed first and each remaining sum excludes previously chosen index values and the index value chosen for p_i (30). The right side of (30) can be simplified to

((N − 1)! / (N − m_1)!) Σ_{p_i=1}^N (u_{p_i} − ū)(v_{p_i} − v̄) = ((N − 1)! / (N − m_1)!) S,   (31)

since the choice made for p_i in the first sum leaves N − 1, N − 2, ..., N − m_1 + 1 choices for the remaining sums, each of which contributes the same factor to the sum S_{ū,v̄}.

We now consider terms of the form (u_{p_i} − ū)(v_{p_j} − v̄) with two different summation indices p_i and p_j. The sum (21) applied to (u_{p_i} − ū)(v_{p_j} − v̄), i ≠ j, can be written as

((N − 2)! / (N − m_1)!) Σ_{p_i=1}^N (u_{p_i} − ū) Σ_{p_j=1, p_j ≠ p_i}^N (v_{p_j} − v̄),   (33)

where the sums over the remaining indices contribute the factor (N − 2)!/(N − m_1)!, since the choices made for p_i and p_j in the first two sums leave N − 2, N − 3, ..., N − m_1 + 1 choices for the others. One can take terms of the form ((N − 2)!/(N − m_1)!)(u_{p_i} − ū)(v_{p_i} − v̄) from (31) and add them to (33) to form

((N − 2)! / (N − m_1)!) Σ_{p_i=1}^N (u_{p_i} − ū) Σ_{p_j=1}^N (v_{p_j} − v̄),

which is zero by (27). For each of the m_2 matched terms, (m_1 − 1) such portions are donated to the cross terms, leaving

((N − 1)!/(N − m_1)! − (m_1 − 1)(N − 2)!/(N − m_1)!) S = ((N − 2)!/(N − m_1 − 1)!) S

remaining from (31), which simplifies to (29) after multiplying by m_2 and by the factor 1/(m_1 m_2) present in (23). Since S_{ū,v̄} is a multiple of S as defined by (13), the system of equations (14) will be formed where α = (N − 2)!/(m (N − m − 1)!) when we set m = m_1 = m_2. Thus the multiple regression coefficients β computed from sampling distributions of the mean without replacement will be equal to the multiple regression coefficients computed from the original scores.
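The factor α = (N − 2)!/(m (N − m − 1)!) can be verified by brute force. The sketch below (made-up data; small N and m so that all ordered selections without replacement can be enumerated) compares the sum over all group means with α times the original sum S.

```python
import numpy as np
from itertools import permutations
from math import factorial

# Numerical check of alpha = (N-2)! / (m * (N-m-1)!) for sampling without
# replacement: summing (u_bar_P - ubar)(v_bar_P - vbar) over ALL ordered
# selections of size m equals alpha times S = sum_i (u_i - ubar)(v_i - vbar).
rng = np.random.default_rng(2)
N, m = 6, 3
u = rng.normal(size=N)
v = rng.normal(size=N)

S = np.sum((u - u.mean()) * (v - v.mean()))
S_bar = sum((u[list(p)].mean() - u.mean()) * (v[list(p)].mean() - v.mean())
            for p in permutations(range(N), m))   # N!/(N-m)! ordered selections
alpha = factorial(N - 2) / (m * factorial(N - m - 1))

assert np.allclose(S_bar, alpha * S)
```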
2.3 Differences between two groups of sampling distributions

The elements of the sampling distribution can also be formed as differences between the means of two groups: Group 1 of size m_1 with indices p_1, ..., p_{m_1} collected in the sum P, and Group 2 of size m_2 with indices q_1, ..., q_{m_2} collected in the sum Q. The quantity corresponding to (23), S_d, can be written in the form

S_d = Σ_P Σ_Q (ū_P − ū_Q)(v̄_P − v̄_Q).   (36)

Distributing gives four terms,

S_d = Σ_P Σ_Q [ (ū_P − ū)(v̄_P − v̄) − (ū_P − ū)(v̄_Q − v̄) − (ū_Q − ū)(v̄_P − v̄) + (ū_Q − ū)(v̄_Q − v̄) ].

The two cross terms vanish because each factors into a product of sums of deviations that are zero. After one accounts for the sums Q in the first term and P in the fourth term, each of which contributes a constant factor equal to the number of Group 2 or Group 1 selections respectively, one can write

S_d = C S,

where C is a constant depending on N, m_1, and m_2; its value differs when selections are made with replacement and when selections are made without replacement.
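The difference-of-groups case can also be checked numerically. In the sketch below (made-up data; `coeffs` is an illustrative helper) Group 1 and Group 2 are each drawn without replacement, may share elements, and each variable is replaced by the difference of the two group means over all pairs of selections; the regression coefficients are unchanged.

```python
import numpy as np
from itertools import permutations, product

# Numerical check for differences of two groups: regressing the differences
# (Group 1 mean) - (Group 2 mean), taken over ALL pairs of ordered selections
# without replacement, reproduces the coefficients from the original scores.
rng = np.random.default_rng(3)
N, K, m1, m2 = 5, 2, 2, 2
x = rng.normal(size=(N, K))
y = rng.normal(size=N)

def coeffs(xmat, yvec):
    Xc = xmat - xmat.mean(axis=0)
    return np.linalg.solve(Xc.T @ Xc, Xc.T @ (yvec - yvec.mean()))

g1 = list(permutations(range(N), m1))     # all Group 1 selections
g2 = list(permutations(range(N), m2))     # all Group 2 selections
x_diff, y_diff = [], []
for p, q in product(g1, g2):
    x_diff.append(x[list(p)].mean(axis=0) - x[list(q)].mean(axis=0))
    y_diff.append(y[list(p)].mean() - y[list(q)].mean())

assert np.allclose(coeffs(np.array(x_diff), np.array(y_diff)), coeffs(x, y))
```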

Note that the mean of a difference of two groups of sampling distributions of the mean is zero. When ū and v̄ are set to zero in (19) and a difference of two groups of sampling distributions is used, it is evident that S_d is similar in format to (19). Thus the system of equations (14) will be formed where α = C, and the multiple regression coefficients β computed from a difference of two groups of sampling distributions of the mean will be equal to the multiple regression coefficients computed from the original scores.

We also consider the case where Group 1 and Group 2 do not simultaneously share any elements. We assume the selections are done without replacement. Under these restrictions, one can write (36) as a restricted double sum (39), where the notation Q ≠ P is used to exclude any elements in the sum Q from indices previously selected in the sum P. Turning to the second half of the second term of (39), which we define to be S_u (40), let E_{q,p_l} := {q_l | q_l ≠ q_k, k = 1, 2, ..., l − 1} \ {p_1, p_2, ..., p_{m_1}} denote the index set that excludes previously chosen indices in the Q sum and any indices {p_1, p_2, ..., p_{m_1}} already selected in the P sum. Applying the sum to the specific term (u_{q_i} − ū), the sum Q can be rearranged so that the sum over q_i is applied last. Bear in mind that the remaining sums will contribute the same factor to (u_{q_i} − ū) regardless of the selected value of the summation index q_i. Taking care to avoid selecting an index that has already been chosen, we note that m_1 choices have already been made for the set {p_1, p_2, ..., p_{m_1}}.
In addition, for each choice of q_i there are successively fewer admissible choices for the remaining sums, down to the (m_2 − 1)'th sum. Using the definition of the excluded terms E_{q_i,p} (41) in the sum, one can replace the restricted sum over q_i in (43) with the right hand side of (44). The full sum Σ_{q_i=1}^N (u_{q_i} − ū) is zero by (27). Since there are m_2 terms of the form (u_{q_i} − ū) in (40), S_u can be written in closed form (45). We can apply the same steps to the second half of the third term of (39), which we define to be S_v. Using these results in (39) together with equation (29), (45) simplifies, keeping in mind that the selections are made without replacement. Again S̃_d is a multiple of S. Therefore the system of equations (14) will be formed where α = C̃.

3 Coefficient of determination

The coefficient of determination R^2 is the proportion of variability in the dependent variable that can be accounted for by the independent variables [6]. It is defined using

R^2 = Σ_{i=1}^N (ŷ_i − ȳ)^2 / Σ_{i=1}^N (y_i − ȳ)^2,   (47)

where ŷ_i is the prediction provided by the surface of regression,

ŷ_i = ȳ + Σ_{j=1}^K β_j (x_i^(j) − x̄^(j)).   (48)

Substituting (48) into (47), we again see the presence of sums of the form (13). Since the numerator and denominator are multiplied by the same factor α when all the elements of the sampling distributions of the mean are used, the coefficient of determination is the same whether it is computed from the original scores or from all possible group averages.

4 Standard error of estimate

The sum of squares error SSE is defined to be

SSE = Σ_{i=1}^N (y_i − ŷ_i)^2.

Given this definition, the standard error of estimate s_e can be defined

s_e = sqrt( SSE / N ).

Now by [7],

SST = SSR + SSE,

where SST is the total variation and SSR is the sum of squared regression,

SST = Σ_{i=1}^N (y_i − ȳ)^2,   SSR = Σ_{i=1}^N (ŷ_i − ȳ)^2.

With these definitions, the coefficient of determination (47) can also be written as

R^2 = 1 − SSE/SST.   (56)

Solving (56) for SSE and dividing by N,

s_e^2 = (1 − R^2) σ^2,   (57)

where

σ^2 = SST/N.   (58)

Since R^2 is unchanged for sampling distributions of the mean while the variance of the group means is reduced by a factor of the group size m, it follows from (57) that the standard error of estimate is reduced by the square root of the group size.

5 Numerical simulations

Numerical simulations track the differences between the regression quantities computed from a growing fraction of all possible group selections and the corresponding quantities computed from the original scores. While not entirely obvious due to the density of points, all differences decrease from approximately 10^−6 to less than 10^−13 in the last 0.001% of the total selections. In addition, the differences do not always decrease monotonically as the fraction of total selections increases, and the differences decrease to very small values (less than 10^−5) at certain points during the course of the convergence, as noted by the downward spikes.
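The claims of Sections 3 and 4 can be illustrated together in a small simulation (made-up data; `fit_stats` is an illustrative helper; selections with replacement so the √m reduction is exact over the full enumeration).

```python
import numpy as np
from itertools import product

# Check two claims at once: R^2 computed from all group means (with
# replacement, group size m) equals R^2 from the original scores, while the
# standard error of estimate sqrt(SSE/n) shrinks by sqrt(m).
rng = np.random.default_rng(4)
N, m = 6, 3
x = rng.normal(size=N)
y = 1.5 * x + rng.normal(scale=0.5, size=N)

def fit_stats(xv, yv):
    """Least-squares line, returning R^2 and the standard error sqrt(SSE/n)."""
    b, a = np.polyfit(xv, yv, 1)
    resid = yv - (b * xv + a)
    r2 = 1.0 - np.sum(resid**2) / np.sum((yv - yv.mean())**2)
    return r2, np.sqrt(np.mean(resid**2))

sel = np.array(list(product(range(N), repeat=m)))    # all N**m ordered groups
r2_orig, se_orig = fit_stats(x, y)
r2_grp, se_grp = fit_stats(x[sel].mean(axis=1), y[sel].mean(axis=1))

assert np.allclose(r2_grp, r2_orig)
assert np.allclose(se_grp, se_orig / np.sqrt(m))
```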
6 Gene expression and distance between genes

A useful way of organizing the data obtained from microarrays or RNA-seq is to group together genes that exhibit similar expression patterns through hierarchical clustering. A hierarchical clustering algorithm generates a dendrogram (tree diagram). However, the algorithm requires that a distance be defined to quantify similarities in expression between two individual genes.

Let A_i denote the expression level of gene A for patient i and let B_i denote the expression level of gene B for patient i, 1 ≤ i ≤ N. Distances between genes can be computed using many metrics [9], but two common ones are the Euclidean distance,

D_E = sqrt( Σ_{i=1}^N (A_i − B_i)^2 ),

and the Manhattan distance,

D_M = Σ_{i=1}^N |A_i − B_i|.

Correlation coefficients [10] are also used to measure the similarities between two genes. One measure of distance using the Pearson R coefficient,

R = Σ_{i=1}^N (A_i − Ā)(B_i − B̄) / sqrt( Σ_{i=1}^N (A_i − Ā)^2 Σ_{i=1}^N (B_i − B̄)^2 ),   (64)

is D_R = 1 − R, or D_R' = 1 − R^2 [11] if the sign of R is not important. If R is close to 1, the distance D_R will be close to zero; if R is close to 1 or −1, the distance D_R' will be close to zero.
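The distances above can be sketched directly (the expression values below are made up for illustration).

```python
import numpy as np

# Euclidean, Manhattan, and the two Pearson-based distances between two genes.
A = np.array([2.1, 3.4, 1.8, 4.0, 2.9])   # expression of gene A across patients
B = np.array([2.4, 3.1, 2.0, 4.2, 3.1])   # expression of gene B across patients

d_euclid = np.sqrt(np.sum((A - B) ** 2))
d_manhattan = np.sum(np.abs(A - B))
R = np.corrcoef(A, B)[0, 1]
d_r = 1.0 - R            # small when the genes are strongly positively correlated
d_r2 = 1.0 - R ** 2      # small when |R| is large, sign ignored
```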

The purpose of the next section is to propose a new distance based on the differential expression of two genes. We then show the new measure of distance is the same as the Pearson R coefficient computed from the original scores (64), thus lending support to the use of the Pearson R coefficient in measuring the distance between two genes.

Select m_1 random patients and assign their expression levels for gene A to Group 1, and select m_2 other random patients and assign their expression levels to Group 2. Repeat the process using the same selections for gene B. Since both groups are sampled from a population with a known variance σ^2, the z-statistic [12] for two independent samples can be used to measure differential expression for gene A,

z_A = (Ā_1 − Ā_2) / sqrt( σ_A^2 (1/m_1 + 1/m_2) ),   (65)

where Ā_1 and Ā_2 are the two group means, which if m_1, m_2 ≥ 30 will be approximately normally distributed. Let z_B be the z-statistic for gene B for the same selection of patients using the same equation (65).
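The equivalence with the original-score Pearson R can be checked by enumeration. The sketch below uses made-up data and very small groups (only so that all selections can be enumerated; the normal approximation in the text requires m_1, m_2 ≥ 30, which is irrelevant to this identity), and `z_stat` is an illustrative helper.

```python
import numpy as np
from itertools import permutations

# For every ordered split into two disjoint groups, form z-statistics for
# genes A and B from the same patient selection, then correlate the
# (z_A, z_B) pairs; this Pearson R equals that of the original scores.
rng = np.random.default_rng(5)
N, m1, m2 = 6, 2, 2
A = rng.normal(size=N)
B = rng.normal(size=N)

def z_stat(w, g1, g2):
    """Two-sample z-statistic using the population variance of w."""
    sigma2 = w.var()                       # treated as the known variance
    return (w[g1].mean() - w[g2].mean()) / np.sqrt(sigma2 * (1/len(g1) + 1/len(g2)))

pairs = []
for p in permutations(range(N), m1 + m2):  # disjoint groups, order accounted for
    idx = list(p)
    g1, g2 = idx[:m1], idx[m1:]
    pairs.append((z_stat(A, g1, g2), z_stat(B, g1, g2)))
zA, zB = np.array(pairs).T
R_t = np.corrcoef(zA, zB)[0, 1]

assert np.allclose(R_t, np.corrcoef(A, B)[0, 1])
```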

This process can be repeated multiple times, giving a set of ordered pairs (z_A^k, z_B^k) for each different selection (k) of groups. The Pearson R value R_t can then be computed from these ordered pairs using all possible selections K. The new distance will now be defined as D_T = 1 − R_t, or alternatively D_T' = 1 − R_t^2. Given N total patients, there exist a finite number K of possible selections of the two groups. It follows from the results of the previous sections, applied to the z-statistic (65), that the distance D_R will be equal to D_T and D_R' will be equal to D_T'.

7 Conclusion

We have shown that the linear regression coefficients (simple and multiple) and the coefficient of determination R^2 computed from sampling distributions of the mean (with or without replacement) are equal to the regression coefficients and coefficient of determination computed with the original data. This result also applies to a difference of two groups of sampling distributions of the mean. Moreover, the standard error of estimate is reduced by the square root of the group size for sampling distributions of the mean.

The result has implications for the construction of hierarchical clustering trees or heat maps which visualize the relationships between many genes. These processes require one to define a distance between two genes using their expression levels. We developed a new measure of distance based on how differential expression in one gene correlates with differential expression in a second gene using the z-statistic. We showed that the new measure is equivalent to the Pearson R coefficient computed from the original scores, thus lending support to the use of the Pearson R coefficient for measuring a distance between two genes.