Characterisation of the genetic architecture of immune mediated disease through informed dimension reduction

Genome-wide association studies (GWAS) have uncovered pervasive genetic overlap between common clinically related immune-mediated diseases (IMD). To distinguish axes of IMD risk, and extend genetic knowledge of rare IMDs and subtypes, we developed a Bayesian shrinkage approach to perform a disease-focused decomposition of IMD GWAS summary statistics. We derive 13 components which summarise the multidimensional patterns of IMD genetic risk including those related to raised eosinophil count and serum IP-10. Projection of UK Biobank data demonstrated the IMD-specificity and accuracy of our reduced dimension basis in independent datasets. By projecting 22 rare IMD or IMD subtypes onto the basis we were able to identify disease-discriminating components and suggest novel associations. Requiring only summary level data, our approach allows the genetic architectures across any range of clinically-related traits to be characterised in fewer dimensions, facilitating the analysis of studies with modest sample size, where classical GWAS approaches are challenging.

Genome-wide association studies (GWAS) have elucidated the polygenic component of common human diseases 1 and comparative studies of summary GWAS results have highlighted sharing of genetic variants between different diseases with related aetiology, for example the collection of immune-mediated diseases (IMD) 2 . However, comprehensive overviews of sharing between multiple diseases are made difficult by the dimension of these statistics (100,000s of SNPs) and the complex patterns that exist. Such analyses have typically been approached from one of two angles: a variant-by-variant analysis across multiple diseases focusing on individual variants in turn, 3,4 or pairwise analysis of diseases across multiple variants at a regional or genome-wide level. 5,6 Both approaches have limitations. Different patterns of sharing identified at different variants make generalisations about inter-disease relationships difficult. On the other hand, disease-pairwise approaches make comparison of more than two diseases challenging. Thus, a need exists to represent a multi-dimensional view of shared genetic architectures between multiple diseases.
The GWAS approach explicitly accounts for the number of tests (SNPs) by requiring successively larger samples, of the order of tens of thousands of cases and healthy controls, to identify variants which cumulatively explain greater proportions of disease heritability.
Large samples present an insurmountable barrier for rare diseases, where effort has instead been generally directed to searching for rare variants of high penetrance through whole exome 7 or whole genome 8,9 sequencing studies. Despite this, moderate sized GWAS-style studies of rare diseases find not only polygenic association with common variants 9,10 but also evidence for differential genetic associations between clinical subtypes within these rare diseases, despite the challenges presented by division of already small sample sizes. 11 A need also exists to democratise GWAS to less common diseases, which may be enabled by considering them in the context of more common diseases with related aetiology.
We propose studying multifactorial genetic risks of related diseases in an informed dimension-reduction approach based on matrix decomposition. Matrix decomposition, for example via principal component analysis (PCA), expresses a matrix as the product of two smaller matrices, and has been used extensively in genetics, for example to summarise population structure and address its confounding effects in association studies. 12 It has also been used to explore structure in genetic association with multiple traits, either through aggregated signals across SNPs according to physical proximity to genes 13 or using a linkage disequilibrium (LD) independent subset of SNPs. 14 In either case, the reduced dimensional space was used to explore the same datasets as used to define it, with two implications. First, GWAS summary statistics are a composite of biological signal, technical noise, and sampling variation. Decomposition aims to find axes that maximise variance explained in the input datasets, and cannot distinguish between these three sources of variability. We therefore expect it to magnify technical and random differences as well as biological, a problem related to over-fitting in high-dimensional datasets. Second, there is no treatment of uncertainty in the reduced dimension space, meaning we can measure the distance between diseases, but not test whether that difference is non-zero.
Here we build a genetic basis for IMD, using PCA of GWAS summary statistics augmented by a Bayesian shrinkage approach that mitigates overfitting. Our central aim is to define a reduced dimension space, with axes that describe different patterns of IMD genetic susceptibility corresponding to underlying biological risk factors. In a transfer learning paradigm, we project independent datasets into this space, allowing us to study the distinct and shared genetic contributions to related diseases, and use standard statistical techniques to test for genetic association of rare diseases or genetic differences between disease subtypes.

A genetic basis for immune mediated diseases
We used well powered, publicly available case/control GWAS summary statistics (estimated log odds ratios, $\hat{\beta}$) across 13 IMD (Supplementary Table 1). To increase the disease-relevance of the basis, we wanted to preferentially use information from truly associated SNPs, while avoiding double counting evidence from SNPs in LD. DeGAs 14 deals with this by thinning SNPs by LD and hard thresholding, replacing $\hat{\beta}$ by Z scores, setting these to 0 when the associated p > 0.001. As Z scores are standardised $\hat{\beta}$, this has the effect of shrinking $\hat{\beta}$ towards 0 when uncertainty is high, such as when allele or disease frequencies are low, which means information from more common diseases will dominate. We chose instead to only partially standardise (for the effects of allele frequency), and to deal with LD and remaining noise simultaneously via regularization, adopting ideas from Bayesian fine mapping which jointly models association across neighbouring SNPs.
This allowed us to define a continuous weight which adaptively shrinks $\hat{\beta}$ towards 0 (Fig. 1, Methods). Finally, we report projected results as $\hat{\delta}$, the difference between the projected $\hat{\beta}$ and a projected synthetic control with all entries 0, which allows us to make statistical inference about whether its estimand, $\delta$, differs from control.
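The reporting convention described above (a projected trait minus the projected all-zero synthetic control) can be illustrated with a minimal Python sketch. It assumes a PCA-style projection in which the stored column means are subtracted before multiplying by the rotation matrix. The function names `project` and `delta_hat` and all numbers are hypothetical and are not taken from the authors' software.

```python
def project(x, centre, rotation):
    """Project one trait (length p) onto m components.

    rotation is a p x m matrix stored as a list of p rows; centre holds
    the p column means learnt when the basis was built (assumption).
    """
    m = len(rotation[0])
    scores = [0.0] * m
    for xi, ci, row in zip(x, centre, rotation):
        for k in range(m):
            scores[k] += (xi - ci) * row[k]
    return scores


def delta_hat(x, centre, rotation):
    """Projected trait minus the projected all-zero synthetic control."""
    trait = project(x, centre, rotation)
    control = project([0.0] * len(x), centre, rotation)
    return [t - c for t, c in zip(trait, control)]


# Toy example: p = 3 SNPs, m = 2 components (all values are made up).
rotation = [[0.5, 0.0], [0.0, 1.0], [-0.5, 0.0]]
centre = [0.1, -0.2, 0.0]
shrunk_beta = [0.2, 0.0, -0.2]
print(delta_hat(shrunk_beta, centre, rotation))
```

Under this assumption the centring terms cancel in the difference, so the result equals the uncentred product of the trait vector with the rotation matrix, and a trait with no signal sits at the control.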

[ Figure 1 about here ]
To illustrate the importance of our informed shrinkage procedure, we built four bases, with GWAS summary statistics for the 13 IMDs shrunk differently in each case. We assessed their relative performance by projection of matching self-reported diseases (SRD) from UK Biobank (UKBB) 15 using summary statistics from a compendium provided by the Neale lab [ http://www.nealelab.is/uk-biobank/ ]. While all bases found structure in the input data, in the basis without shrinkage (Fig. 2a), the UKBB SRD clustered with each other rather than their GWAS comparator, suggesting that the structure identified related to between-study differences other than disease. In hard-thresholded, LD-thinned bases using either Z scores ( Fig. 2b) Fig. 1), enabling us to identify 107-373 "driver SNPs" that are required to capture genetic associations on any individual component.
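For reference, the hard-thresholding rule attributed to DeGAs in the text (replace the effect estimate by its Z score, then set it to 0 when p > 0.001) can be sketched in a few lines of Python. This is an illustrative reading of that one sentence, not the DeGAs code. The function names are hypothetical, and a two-sided normal p value is assumed.

```python
import math


def two_sided_p(z):
    """Two-sided p value for a standard normal Z score."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def hard_threshold_z(beta, se, p_cutoff=0.001):
    """Replace beta by its Z score, set to 0 when p exceeds the cutoff."""
    z = beta / se
    return z if two_sided_p(z) <= p_cutoff else 0.0


# The same log odds ratio in a large study (small SE) and a small study
# (large SE): only the large study survives the threshold.
print(hard_threshold_z(0.1, 0.02))
print(hard_threshold_z(0.1, 0.05))
```

The toy call shows the behaviour the text describes: an identical effect estimate is kept or zeroed purely according to its standard error, so better powered studies dominate.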
[ Figure 2 about here ]
We projected data from three classes of study onto the IMD basis with shrinkage. First, we used all self-reported disease and cancer traits from UKBB to characterise the basis components, to examine specificity to IMD, and to assess power as a function of sample size: case numbers for UKBB self-reported IMD range from 41,000 (asthma) to 105 (vitiligo).
Second, we identified IMD GWAS with smaller sample sizes than used in basis construction, focusing on diseases related to basis traits or studied in different ancestral backgrounds (e.g. ankylosing spondylitis). Third, we identified studies of IMD that are too rare (e.g. eosinophilic granulomatosis with polyangiitis, EGPA -a rare form of vasculitis) or clinically heterogeneous (e.g. juvenile idiopathic arthritis, JIA) to build large GWAS cohorts.

Genetic analysis of IMD in reduced dimensions
Across all 312 projected UKBB traits (Supplementary Table 2), 27 had significantly non-zero $\hat{\delta}$ (FDR < 1%). These were overwhelmingly immune-related traits (Supplementary Fig. 2): no significance was observed for traits such as coronary artery disease, stroke, or obstructive sleep apnea, confirming the immune-mediated specificity of our basis.
Significant results were detected with as few as 105 cases for vitiligo, emphasising the potential of this approach to unlock the genetics of rare IMD GWAS.
Some individual components could be biologically interpreted due to their pattern of disease or other trait associations. PC1 (Fig. 4), which explained the greatest variation in the training datasets, appears to represent an autoimmune/(auto)inflammatory axis 16 , also characterised by whether diseases are considered antibody 'seropositive' / 'seronegative', contrasting IBD, AS, primary sclerosing cholangitis (PSC), and IgA nephropathy with rheumatoid arthritis (RA), autoimmune thyroid disease (ATD), Sjögren's disease, systemic lupus erythematosus (SLE), vitiligo and autoimmune diabetes. On the inflammatory/seronegative side, we also saw weaker but still significant signals for atopy, basal cell carcinoma and malignant melanoma. The incidence of both malignant melanoma and non-melanoma skin cancer is increased in IBD, but the relative role of treatment or IBD itself in driving this is hard to determine. 17,18 On the seropositive side, we saw significant results for pernicious anemia, a disease strongly associated with anti-gastric parietal cell and anti-Intrinsic Factor antibodies, as well as with autoimmune thyroiditis, T1D and vitiligo. 19
To help characterise the biology captured by individual components we projected additional datasets: blood counts, 20 immune cell counts 21 and serum cytokine concentrations 22 (Supplementary Tables 5, 6 and 7). We expect that significant results will occur when the projected score is a composite of many small effects working in consistent directions.
However, false positives could also result if a single SNP with a large weight in the basis is in LD with a SNP with a large effect on the projected trait due to chance. To guard against this, we used Spearman rank correlation which is robust to such outlier observations to test the "consistency" of each projection (Supplementary Note). We found, reassuringly, that amongst disease traits, increasing deviation from control correlated with increasing consistency ( Supplementary Fig. 17). Similar results were seen for both serum cytokine concentrations and immune cell counts but in the blood count data, which had been generated from a much larger sample, we found highly significant projected results could occur without any evidence for consistency, and so we additionally filtered on consistency in that dataset. These data aided interpretation of two further components.
PC13 was striking for the general association of many diseases across all four main clusters in a concordant direction, and was the only component for which any projected trait was more extreme than any original basis trait (Fig. 5). EGPA, which showed the most extreme projected values on this component of any disease, is classified as an eosinophilic form of anti-neutrophil cytoplasmic antibody (ANCA)-associated vasculitis (AAV) and both asthma and raised eosinophil count are included in its diagnostic criteria. We found PC13 was strongly associated with higher eosinophil counts in a population cohort 20 (FDR < $10^{-200}$), suggesting that this component describes eosinophilic involvement in IMDs. Mendelian Randomization (MR) analysis of blood cell traits with six IMDs had previously associated eosinophils with RA, celiac disease (CEL), asthma and T1D. 20 We conducted MR analysis twice, first selecting SNPs according to significant association with eosinophil count and second using the driver SNPs for PC13. Results were similar, although estimates using PC13 driver SNPs tended to be larger, which suggests some heterogeneity, for example that only a subset of eosinophil-associated SNPs are also associated with IMD risk. Our analysis thus confirms earlier findings, and extends the list of IMD with genetically supported involvement of eosinophils to include EGPA, JIA subtypes, AS, ATD, MS, hayfever and eczema.
PC3 (Fig. 6) was the only component which showed a significant relationship with any serum cytokine concentration. Higher concentrations of CXCL9 (MIG) and CXCL10 (IP-10), Th1 chemoattractants and ligands to the regulator of leukocyte trafficking CXCR3, were both significant in the same direction as several autoimmune diseases, with strongest signals for myasthenia gravis, several JIA subtypes, as well as IBD, CEL, AS and sarcoidosis. In MR analysis, while PC3 driver SNPs predicted association of IP-10 and MIG with these IMDs, SNPs selected by significant association to cytokine levels themselves did not. This suggests that raised serum IP-10 and MIG are not themselves causally associated with IMD risk, but that these driver SNPs mark a risk factor that contributes to serum IP-10 and MIG.

Genetic distinctions within clinically heterogeneous and rare immune-mediated diseases
Our basis has only 13 dimensions. If the genetic susceptibility of rare IMD and IMD subtypes overlaps that of common IMD, we can increase power by focusing on these dimensions. Of 22 diseases or disease subtypes with < 1000 cases, 12 were significant (FDR<1%), even with as few as 132 cases (IgG+ neuromyelitis optica, NMO).
Most disease subtypes clustered together, even when not significant (Fig. 3). For example, myasthenia gravis, a chronic, autoimmune, neuromuscular disease characterized by muscle weakness, has been shown to have a bimodal incidence pattern by age, and some genetic associations have been identified only for the late onset subtype. 23 However, within the basis, both subtypes fall in very similar locations across all components, and cluster together along with several subtypes of JIA.
EGPA is a rare form of AAV (annual incidence 1-2 cases per million) for which genetic differences relating to autoantibody status have been identified. 11 We included both myeloperoxidase (MPO) ANCA+ and ANCA- cases, as well as a study of non-eosinophilic MPO+ ANCA-associated vasculitis. 24 While all forms of vasculitis fell on the adaptive immunity side on PC1, the EGPA subtypes typically resembled each other much more closely than the MPO+ EGPA resembled MPO+ ANCA-associated vasculitis, with EGPA showing a particularly strong signal on PC13, consistent with the diagnostic criteria which include overt eosinophilia.
For two other diseases, however, subtypes did not consistently cluster together. NMO is a rare (prevalence 0.03-0.4:10,000) disease affecting the optic nerve and spinal cord, for which HLA association is established 9 and which can be divided according to aquaporin 4 autoantibody seropositivity status (IgG+ or IgG-). The projections of seropositive and seronegative NMO showed non-significant differences on several components, leading to differential clustering. While seropositive NMO clustered with the classical autoimmune diseases, most closely with SLE and Sjögren's disease, IgG- NMO clustered away from the classic seropositive diseases, most closely with MS. This finding mirrors analysis which directly compared NMO subtypes to each of SLE and MS via polygenic scores 9 , and strengthens the findings by specifically identifying SLE and MS as the nearest neighbours of IgG+ and IgG- NMO respectively, out of 60 IMDs considered for clustering.
JIA is a heterogeneous paediatric disease, with an overall childhood prevalence in Europe of 20:10,000 25 , and with seven recognised subtypes. 26 While studies have begun to identify distinct genetics of the systemic subtype 27 and have shown subtype-specific differences in the MHC, 28 systematic comparison between subtypes has been underpowered. Although the systemic and enthesitis-related arthritis (ERA) subtypes were not significant despite relatively moderate sample sizes (219 and 267 cases respectively), they clustered together with MS and AS respectively, apart from the other JIA subtypes which clustered with the other autoimmune diseases.

Mapping component-level associations to SNPs
Given that most of the IMD and subtypes with small GWAS studies have only a few established genetic associations, we sought to exploit the component-level associations above to detect new disease associations. We found a strong enrichment for small GWAS p values at driver SNPs on trait-significant components (Supplementary Fig. 18). Using a "subset-selected" FDR approach, 29 we analysed driver SNPs for 22 significant trait-component pairs (12 unique traits), and identified 25 trait-SNP associations (subset-selected FDR < 1%, Table 1) after pruning SNPs in LD. Twelve of these were genome-wide significant (p < $5\times10^{-8}$) either in this study (4 associations) or in other published data (8 associations) and a further five were significant in other published analysis that levered external data. These included, for example, the non-synonymous PTPN22 SNP rs2476601 which was associated with myasthenia gravis (overall and the late onset subset) by subset-selected FDR < 0.01. This SNP was previously associated with myasthenia gravis in a different study, 30 and lack of clear replication in the data analysed here (SNP P = $6\times10^{-5}$) was attributed to differences in population structure. Eight associations (five variants) were not previously reported to our knowledge, including associations near IRF1/IL5 for myasthenia gravis, near TNFSF11 for rheumatoid factor negative (RF-) JIA and near CD2/CD28 for EGPA.

Discussion
Our motivation in this work was threefold. First, to overcome the problems of dimensionality to allow an overview of genetic association patterns from multiple related diseases without over-simplification. While previous efforts to relate different traits through GWAS statistics have focused on large studies of a wide variety of diseases, and shown that they can distinguish broad classes of IMD, cardiovascular and metabolic diseases, 5,13 we have tackled the problem of finding structure within a single class of diseases. Unlike other applications of PCA to genetics, we split our datasets into "training" and "test" sets, enabling standard statistical hypothesis testing and providing robustness against overfitting.
Our second motivation was to extract different axes underlying IMD genetic risk. Work in metabolic 31 and psychiatric 32 diseases has taken related approaches to attempt to learn composite factors underlying risk of these related diseases through deeper phenotyping of patients before testing these factors for genetic association. Alternatively, decomposition of estimated effects at 94 type 2 diabetes risk variants, together with their effects on 46 metabolic traits, was used to cluster these variants into 5 groups, three focused on insulin resistance and two on beta cell function. 33 Here, we hoped to learn the same sorts of factors by decomposing only summary GWAS data on clinical disease endpoints. Our continuous shrinkage weight learnt across all 13 training datasets appears to enable us to extract disease-relevant structure, with projected traits lying close to their training data counterparts, something achieved with disease-specific hard thresholded weights 14 for only the largest datasets.
Three factors we identified had clear interpretations. The autoimmune/(auto-)inflammatory axis in IMD represented by PC1 is well documented, with the gradient along PC1 corresponding to a shift from auto-antibody seronegative to seropositive diseases. The exception is vitiligo, in which, despite strong evidence of T cell autoimmunity, autoantibodies are reported but are not consistent features of disease. 34 Weaker but significant association of psoriatic arthritis (PsA) among the other seropositive IMD is also consistent with a recent report of novel pathogenic antibodies in PsA. 35 Our basis offered alternative viewpoints of this collection of diseases. For example, significant IMD on the MIG/IP-10-associated PC3 included both 'seropositive' and 'seronegative' diseases, although not atopy, while all three groups were represented on the eosinophil associated PC13. Eosinophils are pro-inflammatory leukocytes with an established role in atopic diseases such as asthma 36 and inflammatory diseases such as IBD. 37 The PC13-associated IMD we identify include these, as well as autoimmune diseases which have been previously noted to have an eosinophil relationship, such as RA. 38 Our results suggest eosinophilic involvement in a wide variety of autoimmune diseases, in addition to inflammatory diseases, in agreement with other recent findings. 39 IP-10 and MIG are chemokines, secreted by epithelial and dendritic cells (amongst others), which act as chemoattractants for immune cells which express the receptor CXCR3, including Th1 cells. Both MIG and IP-10 expression at the site of autoimmune target have been implicated in the development of autoimmunity 40,41 and IP-10 has been observed to be upregulated in follicular cells of patients with myasthenia gravis. 42 Serum IP-10 has also been found to be raised in patients with recent onset T1D, 43,44 and Graves' disease (hyperthyroidism) 41 and to correlate with increased disease activity in SLE 45 and AS. 46
While these observations support a link between certain IMD and serum cytokine levels, our results do not directly implicate these cytokines as causal. Both cytokines and blood count data were measured in unselected population cohorts which will include individuals with IMD, such that the association with IMD may be causal or consequential. We suggest that PC3 represents an IMD-related process that contributes to serum cytokine levels.
Nonetheless, clinical efficacy of MDX1100, a monoclonal antibody to IP-10, has been demonstrated in RA 47 and a dose-response relationship observed in UC 48 and our results suggest IP-10 blockade might also be considered in patients with myasthenia gravis, JIA, AS, CEL and sarcoidosis.
Our final motivation was to exploit the lower dimensional representation to generate new knowledge in rare IMDs. The number of polymorphic human genetic variants together with our understanding that genetic effects on human disease are generally modest has led to massive GWAS in order to overcome the penalty that must be applied for multiple testing. This is simply not possible for rare diseases. One of the tools which has enhanced rare disease GWAS is the borrowing of information from larger GWAS of aetiologically related diseases 11 and our basis serves a similar function here. By levering information about a SNP's potential to be IMD-associated through both the regularization and the PCA, we can both increase genetic discovery and place less common diseases in the context of their more prevalent counterparts. More generally, studies of SRD are being enabled on massive scale by UKBB 49 and 23andMe, 50 although studies of such cohorts tend to focus on the more common diseases such as type 2 diabetes and coronary heart disease. Our results provide reassurance that SRD associations are consistent with those from targeted GWAS studies, and extend their utility to IMD and other diseases which are generally found at a lower frequency.
While we have focused on IMD, this approach has potential to be applied to other groups of clinically related traits, such as metabolic or psychiatric, and may increase understanding of both the underlying components of disease risk as well as placing lower prevalence diseases in context of their related common diseases.

Summary statistic datasets
We constructed a compendium of publicly available GWAS summary statistics across a wide range of traits including UKBB traits ( http://www.nealelab.is/uk-biobank , http://geneatlas.roslin.ed.ac.uk/ - Supplementary Tables 2 and 4) and IMD relevant GWAS studies (Supplementary Table 3). Disease GWAS data were obtained from the URL given in Supplementary Tables 1, 3, or via request to study authors, with the exception of those listed below.

Vasculitis GWAS analysis
AAV belongs to a group of IMD characterised by inflammation of the small and medium-sized blood vessels with evidence of circulating pathogenic autoantibodies. It comprises three main syndromes: granulomatosis with polyangiitis (GPA), microscopic polyangiitis (MPA) and EGPA. The two primary antigenic targets of ANCA are proteinase 3 (PR3) and myeloperoxidase (MPO). Although PR3-ANCA is the predominant serotype in GPA and MPO-ANCA is more commonly found in MPA, there is a significant overlap between these syndromes.
The vasculitis cohort used to construct the basis was part of the discovery cohort from the AAV GWAS performed by the European Vasculitis Genetics Consortium, 24

JIA and PsA GWAS analysis
The JIA and PsA GWAS datasets were generated and QC'd using the same strategies. Imputation: Prior to imputation SNPs with ambiguous alleles (C/G and A/T) were excluded and remaining SNPs were aligned to the Haplotype Reference Consortium (HRC) panel (version 1.1) using the HRC imputation preparation tool (https://www.well.ox.ac.uk/~wrayner/tools/). Imputation was performed using the Michigan Imputation server where phasing was performed with Shapeit2 and the HRC panel.
Following imputation SNPs were excluded based on a MAF < 0.01 and imputation accuracy ($r^2$) < 0.4.
Association testing, PsA: case-control association testing was performed using the SNPTEST software package (version 2.5.2) using the score method to account for imputation uncertainty. Three principal components, calculated as described above, were included as covariates to account for any residual population structure.
Association testing, JIA: case-control association testing was performed using the snp.rhs.estimates function in the R package SnpStats, comparing in turn overall, or JIA subtypes to the control group. Three principal components, calculated as described above, were included as covariates to account for any residual population structure.

Construction of basis
There are three particular challenges with performing PCA on GWAS summary statistics.
First, the SNP effect estimates must be on the same scale; second, we must deal with variable correlation between input dimensions (SNPs) due to LD; and third, while all SNPs are expected to show small deviations between studies due to random noise, different genotyping platforms and data processing decisions, only a minority of SNPs will be truly related to the diseases of interest.
The uncertainty attached to $\hat{\beta}$ depends on both study sample size and SNP minor allele frequency (MAF). We adjusted for the variance due to MAF, $\sigma^2_{MAF}$, as this varies between SNPs, but not variance due to sample size, as this would overly shrink smaller studies relative to larger. We dealt with the second two challenges simultaneously, using a Bayesian fine mapping technique which calculates the posterior probability that each SNP is causal for each trait, under the assumption that at most one causal variant exists in each recombination hotspot-defined block of SNPs 54,55 . At each SNP, we computed a weighted average of the posterior probabilities across input studies to create an overall weight for that SNP, $w$. This weight will be close to zero when there is no association in a region, limiting the effects of technical noise between studies, and will otherwise act to weight associated SNPs according to the extent of LD in a region. The final input for basis creation is a matrix of
We identified 13 IMD GWAS studies with >6,000 samples of European ancestry for which full summary statistics were publicly available (Supplementary Table 1). We selected SNPs present in all 13 studies, with MAF>1% in the 1000 Genomes Phase 3 EUR data.
Additionally we excluded SNPs overlapping the MHC region (GRCh37 Chr6:20-40Mb) or for which the unambiguous assignment of the effect allele was impossible (e.g. palindromic SNPs). We harmonised all effect estimates to be with respect to the alternative allele relative to the reference allele as defined by the 1000 Genomes reference genotype panel. After filtering, harmonised effect estimates were available for 265,887 SNPs across all 13 selected 'basis' traits. In order to provide a baseline for subsequent analyses we created an additional synthetic 'control' trait, for which effect sizes across all SNPs were set to zero. We used these to construct two matrices $M$ and $M'$ where elements reflect raw ($\hat{\beta}$) and shrunk effect sizes respectively, such that rows and columns reflect traits (n=14) and SNPs (p=265,887). After mean centring columns we used the R command prcomp to carry out PCA of both $M$ and $M'$ to generate naive and "shrunk" IMD bases, retaining m=n-1=13 components, which corresponded to the fewest components needed to minimise the mean squared reconstruction error (Supplementary Fig 19).
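The SNP filtering and allele harmonisation rules listed above can be sketched as follows. This is a minimal Python illustration of those rules only, not the authors' pipeline. The record fields (`chrom`, `pos`, `maf`, `effect_allele`, `other_allele`, `beta`) and the function names are hypothetical, and strand flips are deliberately left unhandled.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_palindromic(a1, a2):
    """True for A/T or C/G SNPs, whose strand cannot be told from alleles."""
    return COMPLEMENT.get(a1) == a2


def in_mhc(chrom, pos, start=20_000_000, end=40_000_000):
    """True if the SNP lies in the excluded chr6:20-40Mb window (GRCh37)."""
    return str(chrom) == "6" and start <= pos <= end


def harmonise(snp, ref, alt):
    """Effect estimate with respect to the panel's alt allele, or None
    when the SNP should be dropped."""
    if in_mhc(snp["chrom"], snp["pos"]) or is_palindromic(ref, alt):
        return None
    if snp["maf"] <= 0.01:  # keep only MAF > 1%
        return None
    alleles = (snp["effect_allele"], snp["other_allele"])
    if alleles == (alt, ref):
        return snp["beta"]
    if alleles == (ref, alt):
        return -snp["beta"]  # flip sign so the effect refers to alt
    return None  # alleles do not match the panel


snp = {"chrom": "1", "pos": 1000, "maf": 0.2,
       "effect_allele": "A", "other_allele": "G", "beta": 0.1}
print(harmonise(snp, ref="A", alt="G"))  # effect allele is ref, so flipped
```

Returning `None` for every excluded SNP keeps the drop rules in one place, so intersecting the surviving SNPs across studies is a simple filter afterwards.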
We noted that the majority of entries in the p x m PCA rotation matrix, R, were close to 0, and chose to hard threshold these to 0 for computational efficiency and to identify which driver SNPs were relevant to each component. To do this, using $R_k$ to represent the kth column of R, we define $R_k(\alpha) = R_k \times I(|R_k| > \alpha)$ where $I()$ is an indicator function and "$\times$" represents element-wise multiplication. We quantify the distance between projection with $R_k$ and $R_k(\alpha)$ by $D_k(\alpha)$. We chose the threshold for each component, $\alpha_k$, as the largest value such that $D_k(\alpha) < 0.001$. Finally, we defined the sparse basis rotation matrix Q as the matrix constructed from the column vectors $R_k(\alpha_k)$, k=1,...,m. This both identified driver SNPs which define the support for each component, and enabled computationally efficient examination of many traits in the reduced dimension space defined.
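The per-component thresholding described above can be sketched in Python as below. The threshold is written `alpha` here. The definition of the distance between the full and thresholded projections is not recoverable from this text, so the sketch takes it as a user-supplied function. The `mean_sq_projection_gap` helper is purely a hypothetical stand-in for illustration and should not be read as the authors' distance.

```python
def threshold_column(column, alpha):
    """Keep loadings with |value| > alpha and set the rest to 0."""
    return [v if abs(v) > alpha else 0.0 for v in column]


def choose_alpha(column, distance, tol=0.001):
    """Largest candidate alpha whose thresholded column stays within tol.

    The thresholded column only changes at the absolute loadings
    themselves, so those (plus 0) are the only candidates checked.
    """
    candidates = sorted({0.0} | {abs(v) for v in column})
    passing = [a for a in candidates
               if distance(column, threshold_column(column, a)) < tol]
    return max(passing) if passing else 0.0


def mean_sq_projection_gap(traits):
    """Hypothetical distance: mean squared change in projected scores
    over a set of toy trait vectors (an assumption, not from the text)."""
    def distance(full, sparse):
        gaps = [sum(x * (f - s) for x, f, s in zip(row, full, sparse))
                for row in traits]
        return sum(g * g for g in gaps) / len(gaps)
    return distance


column = [0.9, 0.01, -0.4, 0.002]  # made-up loadings for one component
traits = [[1.0, 1.0, 1.0, 1.0], [0.5, -1.0, 0.2, 2.0]]
alpha = choose_alpha(column, mean_sq_projection_gap(traits))
sparse = threshold_column(column, alpha)
drivers = [i for i, v in enumerate(sparse) if v != 0.0]
print(alpha, sparse, drivers)
```

In this toy case the two small loadings are zeroed and the two large ones are retained as the "driver" entries. With a different distance the chosen threshold would differ, which is why the distance is a parameter.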

Projection
Prior to projection, effect alleles were aligned to the 1000 Genomes reference genotype panel.
For traits sensitive to missing data (studies of NMO 9 Table 8. We used the hclust() function in R to cluster diseases in the basis using agglomerative hierarchical clustering according to Ward's criterion (method="ward.D2") on the Euclidean distance between projected locations of each disease in the basis.
We calculated p values for the null hypothesis that the vector $\delta$ across all 13 components equals 0 using a chi-squared test (Supplementary Note). We called significant associations according to FDR < 0.01, calculated using the Benjamini-Hochberg approach, and run independently within the broad categories: primary analysis (UKBB self-reported disease and cancer, plus IMD-relevant GWAS); blood cell counts; cytokines; immune cell counts. This was our primary measure of significance. We took the same strategy to independently calculate FDR for each component individually for additional annotation, and traits were considered "component-significant" if they were significant on that component and overall.
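The Benjamini-Hochberg step, run separately within each broad category as described above, can be sketched in plain Python. The category names follow the text, but the p values are invented for illustration and the function name is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p values in the same order as the input."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p values; FDR is computed independently within each category.
pvals_by_category = {
    "primary analysis": [0.001, 0.008, 0.039, 0.041, 0.9],
    "cytokines": [0.0004, 0.3],
}
for category, pvals in pvals_by_category.items():
    fdr = benjamini_hochberg(pvals)
    print(category, [q < 0.01 for q in fdr])
```

Running the adjustment per category means that a very large, highly powered category (such as blood cell counts) does not change the threshold applied to the primary disease analysis.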
Classification of diseases according to autoantibody status was performed by a specialist clinician using available medical literature. This assignment was blinded to the PC1 results.

Proportionality of effects across different datasets for the same trait
We tested the null hypothesis of proportionality using the coloc.test function in the coloc package 58 which takes into account the uncertainty in the projection estimate, assuming different PCs are independent. Small p values in this test correspond to the observed data being unlikely under the null of proportionality, and would suggest that studies of the same disease in different populations were not comparable.

Candidate significant driver SNPs
For each of 10 diseases or subtypes with < 2000 cases and significant on at least one component (Myasthenia gravis, late onset; EGPA, MPO+, ANCA- and combined; JIA, extended oligo (EO), persistent oligo (PO), RF-, and RF+), we selected all driver SNPs on any significant component, and calculated the FDR within this set of SNPs as a subset-selected FDR (ssFDR). 29 We ordered SNPs by increasing values of ssFDR, and deleted any SNPs in the list that were in LD ($r^2$ > 0.1) with a higher placed SNP, leaving a set of unlinked SNPs associated with each trait shown in Table 1. These were annotated through literature searches.
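The ordering and LD-pruning step can be sketched as a greedy pass in Python. Two points are assumptions rather than details from the text: pairwise LD is supplied as a lookup keyed by unordered SNP pairs (missing pairs are treated as unlinked), and "a higher placed SNP" is read as a higher placed SNP that has itself been retained. The SNP identifiers are placeholders.

```python
def prune_by_ld(ssfdr, r2, r2_max=0.1):
    """Order SNPs by increasing ssFDR and drop any SNP in LD with an
    already retained, higher placed SNP."""
    kept = []
    for snp in sorted(ssfdr, key=ssfdr.get):
        if all(r2.get(frozenset((snp, k)), 0.0) <= r2_max for k in kept):
            kept.append(snp)
    return kept


# Placeholder identifiers and made-up values.
ssfdr = {"snpA": 0.001, "snpB": 0.004, "snpC": 0.006}
r2 = {
    frozenset(("snpA", "snpB")): 0.8,
    frozenset(("snpB", "snpC")): 0.5,
}
print(prune_by_ld(ssfdr, r2))
```

Here `snpB` is removed because of its LD with `snpA`, and `snpC` is retained because its only LD partner was itself removed. Under the stricter reading (LD with any higher placed SNP, retained or not) `snpC` would also be dropped.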

Code availability
An R implementation of the method is available from https://github.com/ollyburren/cupcake.
Code to run the analyses presented here is available from https://github.com/ollyburren/imd-basis.
An online tool to allow projection and exploration of additional data into the basis is available at https://grealesm.shinyapps.io/IMDbasisApp/.

Author contributions
Conceived the study, drafted paper: CW, OB.
Wrote the software: OB.